1.5B Accuracy 0 after SFT: Troubleshooting and Solutions
Training and fine-tuning large language models like the 1.5B model can be a complex process, and issues can arise even with seemingly correct configurations. In this article, we examine a specific problem encountered by a user experimenting with the 1.5B model: after Supervised Fine-Tuning (SFT), the accuracy on AIME tasks is 0.
The user trained the 1.5B model with SFT using the script below, setting `gradient_accumulation_steps` to 8 to keep the effective batch size consistent despite limited GPU resources. However, when the resulting checkpoint is evaluated on AIME tasks with the `lm_eval` command, the accuracy is 0.
The training script provided by the user is as follows:
uid="$(date +%Y%m%d_%H%M%S)"
base_model=\"Qwen/Qwen2.5-1.5B-Instruct\"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=1 # -> batch_size will be 16 if 16 gpus
gradient_accumulation_steps=8 # requires more GPU memory, default is 1
max_steps=-1
gpu_count=2 #$(nvidia-smi -L | wc -l)
push_to_hub=false
torchrun --nproc-per-node ${gpu_count} --master_port 12345 \\
train/sft.py \"
--block_size=12000 \
--per_device_train_batch_size=${micro_batch_size} \
--per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} \"
--num_train_epochs=${epochs} \
--train_file_path=\"simplescaling/s1K-1.1_tokenized\" \\
--model_name=${base_model} \\
--warmup_ratio=0.05 \\
--fsdp=\"full_shard auto_wrap\" \\
--fsdp_config=\"train/fsdp_config_qwen.json\" \\
--bf16=True \\
--eval_strategy=\"no\" \\
--logging_steps=1 \
--save_strategy=\"no\" \
--lr_scheduler_type=\"cosine\" \
--learning_rate=${lr} \"
--weight_decay=${weight_decay} \
--adam_beta1=0.9 \
--adam_beta2=0.95 \
--output_dir=\"ckpts/s1-$(basename \"$base_model\")-${uid}\" \"
--push_to_hub=${push_to_hub} \\
--save_only_model=True \\
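The comments in the script imply a target effective batch size of 16 (micro batch 1 on each of 16 GPUs). With only 2 GPUs, the 8 gradient accumulation steps restore that same effective batch size. A minimal check of the arithmetic:

```python
# Effective global batch size = per-device batch size
#                               * gradient accumulation steps
#                               * number of GPUs
micro_batch_size = 1
gradient_accumulation_steps = 8
gpu_count = 2

effective_batch_size = micro_batch_size * gradient_accumulation_steps * gpu_count
print(effective_batch_size)  # 16 -> matches 16 GPUs * batch 1 * no accumulation
```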
The user also provided three evaluation commands using the `lm_eval` tool:
```bash
CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=bfloat16,tensor_parallel_size=1 --tasks aime24_figures,aime24_nofigures --batch_size auto --output_path dummy --log_samples --gen_kwargs "max_gen_toks=12000"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore1wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Wait"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime25_nofigures,aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore2wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=Wait"
```
The evaluation results on AIME tasks are as follows:
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| aime24_figures | 1 | none | 0 | exact_match ↑ | 0 | ± N/A |
| | | none | 0 | extracted_answers ↑ | -1 | ± N/A |
| aime24_nofigures | 1 | none | 0 | exact_match ↑ | 0 | ± N/A |
| | | none | 0 | extracted_answers ↑ | -1 | ± N/A |
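An `exact_match` of 0 together with `extracted_answers` of -1 on every task suggests the generations are not yielding parseable answers at all. Before adjusting any configuration, it is worth loading the checkpoint directly and generating one completion by hand to see what the model actually produces. A minimal sketch with `transformers` (the checkpoint path is an assumption; substitute the actual training `output_dir`):

```python
# Smoke test: load the SFT checkpoint and generate one answer by hand.
# CKPT is a placeholder; point it at the actual output_dir of the training run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "ckpts/s1-Qwen2.5-1.5B-Instruct-20250419_021608"  # assumed path

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(
    CKPT, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

If this produces coherent, on-format answers, the problem is more likely in the evaluation setup than in the training run itself.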
Based on the provided information, there are several potential issues that could cause the AIME accuracy to be 0:

- Incorrect Model Loading: Ensure that the correct checkpoint is being loaded for evaluation. The `pretrained` argument in the `lm_eval` commands points at `/s1-Qwen2.5-1.5B-Instruct-20250419_021608` at the filesystem root, while the training script writes to `ckpts/s1-...`; verify that the path actually resolves to the trained checkpoint (a quick check is sketched after this list).
- Incorrect Model Configuration: Verify that the model configuration is correct for the task. Check the `model_args` in the `lm_eval` command (dtype, `tensor_parallel_size`) against how the model was trained.
- Batch Size and Gradient Accumulation: If the effective batch size was much smaller than intended, training can be noisy and the model may underperform. Try increasing the batch size or gradient accumulation steps and retraining.
- Evaluation Metrics: Ensure that the intended tasks and metrics are being run by checking the `--tasks` argument in the `lm_eval` command.
- Model Weights: Verify that the trained weights were actually saved and are the ones being loaded, by checking the `output_dir` argument in the training script against the evaluation path.
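To rule out the most common failure mode first, a quick path check confirms that the evaluation directory exists and contains a complete Hugging Face checkpoint. A minimal sketch, using the path from the `lm_eval` commands above:

```python
# Sanity check: confirm the path passed to --model_args pretrained=...
# exists and contains a complete Hugging Face checkpoint.
import os

ckpt = "/s1-Qwen2.5-1.5B-Instruct-20250419_021608"  # path from the lm_eval commands

files = os.listdir(ckpt) if os.path.isdir(ckpt) else []
has_config = "config.json" in files
has_weights = any(f.endswith((".safetensors", ".bin")) for f in files)

print("directory exists:", os.path.isdir(ckpt))
print("config.json present:", has_config)
print("weight shards present:", has_weights)
```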
To troubleshoot and solve the issue, we recommend the following steps:

- Verify Model Loading: Confirm that the `pretrained` argument in the `lm_eval` command points at the trained checkpoint rather than the base model or a stale directory.
- Check Model Configuration: Confirm that the `model_args` in the `lm_eval` command (dtype, `tensor_parallel_size`) are appropriate for the checkpoint and hardware.
- Adjust Batch Size and Gradient Accumulation: Try increasing the batch size and gradient accumulation steps, retrain, and see whether accuracy improves.
- Check the Prompt Format: The first evaluation command omits `--apply_chat_template` while the other two include it; for an instruct-tuned model, evaluating without the chat template can yield answers the exact-match metric cannot parse (see the sketch after this list).
- Verify the Saved Weights: Check the `output_dir` argument in the training script and make sure the weights saved there are the ones being evaluated.
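To see what the chat template actually changes, one can render a prompt the way `--apply_chat_template` would and compare it against the raw question text. A minimal sketch using the base model's tokenizer (the question string is a placeholder):

```python
# Render the prompt as --apply_chat_template would, to compare against the
# raw question text that an eval without the flag sends to the model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

question = "Solve the following problem: ..."  # placeholder question
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # shows the special-token wrapper the instruct model was trained on
```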
By following these steps, you should be able to troubleshoot and solve the issue of accuracy being 0 on AIME tasks after using the SFT method.
Q&A: Troubleshooting and Solutions for 1.5B Accuracy 0 after SFT
Q: What is the cause of accuracy being 0 on AIME tasks after using the SFT method?
A: There are several potential causes for accuracy being 0 on AIME tasks after using the SFT method, including incorrect model loading, incorrect model configuration, batch size and gradient accumulation issues, evaluation metrics issues, and model weight loading issues.
Q: How can I verify that the correct model is being loaded for evaluation?
A: Check the `pretrained` argument in the `lm_eval` command to ensure it matches the trained model.
Q: How can I verify that the model configuration is correct for the task?
A: Check the `model_args` argument in the `lm_eval` command to ensure it matches the trained model configuration.
Q: What are some potential issues with batch size and gradient accumulation?
A: Some potential issues with batch size and gradient accumulation include:
- Too small a batch size: very small effective batches give noisy gradient estimates, which can destabilize training and hurt the final model.
- Too few gradient accumulation steps: accumulation is what restores the intended effective batch size when the per-device batch size is limited by GPU memory; with too few steps, the effective batch is smaller than the reference configuration.
Q: How can I adjust batch size and gradient accumulation to improve accuracy?
A: Increase `gradient_accumulation_steps` (or `per_device_train_batch_size`, if memory allows) so that the effective batch size, computed as per-device batch size × accumulation steps × GPU count, matches the reference configuration, then retrain and re-evaluate.
Q: What are some potential issues with evaluation metrics?
A: Some potential issues with evaluation metrics include:
- Incorrect evaluation metrics: running the wrong task, or scoring with a metric that cannot parse the model's output format, will not reflect the model's actual performance.
- Insufficient evaluation metrics: a single exact-match score hides whether the model gave a near-miss answer or no parseable answer at all; the `extracted_answers` value of -1 in the results above suggests the latter.
Q: How can I tell whether the evaluation metrics are the problem?
A: Re-run `lm_eval` with `--log_samples` (as in the commands above) and inspect the logged generations: this shows whether the model is answering incorrectly or answering in a format the metric cannot parse (see the sketch below).
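Since the commands above already pass `--log_samples`, the raw generations are written under each `--output_path`. A minimal sketch for peeking at them (file names and layout vary across `lm_eval` versions, so the glob pattern below is an assumption; adjust it to whatever appears under your output directory):

```python
# Peek at the generations logged by --log_samples to see whether the model
# answered incorrectly or in a format the metric could not parse.
import glob
import json

# "dummy" was the --output_path of the first eval command; adjust the glob
# to match the sample files your lm_eval version writes.
for path in glob.glob("dummy/**/samples_*.jsonl", recursive=True):
    with open(path) as f:
        sample = json.loads(f.readline())  # first logged sample
    print(path)
    print(sample.get("resps"))           # raw model generation(s)
    print(sample.get("filtered_resps"))  # what the metric actually scored
    break
```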
Q: What are some potential issues with model weight loading?
A: Some potential issues with model weight loading include:
- Incorrect weights: evaluating the base model or a stale checkpoint instead of the SFT output, for example because the evaluation path does not match the training `output_dir`.
- Incomplete weights: if the FSDP run did not save a complete checkpoint, the evaluated model may be missing trained parameters.
Q: How can I load model weights correctly to improve accuracy?
A: Check the `output_dir` argument in the training script, confirm that directory contains a complete saved checkpoint, and pass that exact path to `lm_eval`. A quick check that the saved weights actually differ from the base model is sketched below.
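To confirm that the evaluated checkpoint really contains fine-tuned weights rather than an unchanged copy of the base model, one can compare a parameter tensor directly. A minimal sketch, assuming both models fit in CPU memory and the checkpoint path matches the training `output_dir`:

```python
# Check whether SFT actually changed the weights relative to the base model.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16
)
sft = AutoModelForCausalLM.from_pretrained(
    "ckpts/s1-Qwen2.5-1.5B-Instruct-20250419_021608",  # assumed output_dir
    torch_dtype=torch.bfloat16,
)

name, p_base = next(iter(base.named_parameters()))
p_sft = dict(sft.named_parameters())[name]

# Identical tensors suggest the base model was saved unchanged.
print(name, "identical:", torch.equal(p_base, p_sft))
```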
Q: What are some best practices for troubleshooting and solving issues with 1.5B accuracy 0 after SFT?
A: Some best practices for troubleshooting 1.5B accuracy 0 after SFT include:
- Verify model loading: ensure the correct checkpoint is being loaded for evaluation.
- Check model configuration: verify that dtype and parallelism settings match the trained model.
- Adjust batch size and gradient accumulation: keep the effective batch size close to the reference configuration.
- Inspect logged samples: use `--log_samples` to see what the model actually generates and whether the answers are parseable.
- Load model weights correctly: check the `output_dir` argument in the training script and make sure it matches the path passed to `lm_eval`.
By following these best practices, you should be able to diagnose why the accuracy is 0 after SFT, fix the underlying issue, and restore reasonable performance on AIME tasks.