1.5B Accuracy 0 after SFT: Troubleshooting and Solutions
Training and fine-tuning large language models like the 1.5B model can be a complex process, and issues can arise even with seemingly correct configurations. In this article, we examine a specific problem encountered by a user experimenting with the 1.5B model: after Supervised Fine-Tuning (SFT), the accuracy on AIME tasks is 0.
The user trained the 1.5B model with SFT using the script below, setting `gradient_accumulation_steps` to 8 to keep the effective batch size consistent despite limited GPU resources. However, when the resulting checkpoint is evaluated on AIME tasks with the `lm_eval` command, the accuracy is 0.
The training script provided by the user is as follows:
uid="$(date +%Y%m%d_%H%M%S)"
base_model=\"Qwen/Qwen2.5-1.5B-Instruct\"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=1 # -> batch_size will be 16 if 16 gpus
gradient_accumulation_steps=8 # requires more GPU memory, default is 1
max_steps=-1
gpu_count=2 #$(nvidia-smi -L | wc -l)
push_to_hub=false
torchrun --nproc-per-node ${gpu_count} --master_port 12345 \\
train/sft.py \"
--block_size=12000 \
--per_device_train_batch_size=${micro_batch_size} \
--per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} \"
--num_train_epochs=${epochs} \
--train_file_path=\"simplescaling/s1K-1.1_tokenized\" \\
--model_name=${base_model} \\
--warmup_ratio=0.05 \\
--fsdp=\"full_shard auto_wrap\" \\
--fsdp_config=\"train/fsdp_config_qwen.json\" \\
--bf16=True \\
--eval_strategy=\"no\" \\
--logging_steps=1 \
--save_strategy=\"no\" \
--lr_scheduler_type=\"cosine\" \
--learning_rate=${lr} \"
--weight_decay=${weight_decay} \
--adam_beta1=0.9 \
--adam_beta2=0.95 \
--output_dir=\"ckpts/s1-$(basename \"$base_model\")-${uid}\" \"
--push_to_hub=${push_to_hub} \\
--save_only_model=True \\
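The comments in the script imply a target effective batch size of 16 (micro batch 1 on each of 16 GPUs). With only 2 GPUs, the 8 gradient accumulation steps restore that same effective batch size. A minimal check of the arithmetic:

```python
# Effective global batch size = per-device batch size
#                               * gradient accumulation steps
#                               * number of GPUs
micro_batch_size = 1
gradient_accumulation_steps = 8
gpu_count = 2

effective_batch_size = micro_batch_size * gradient_accumulation_steps * gpu_count
print(effective_batch_size)  # 16 -> matches 16 GPUs * batch 1 * no accumulation
```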
The user also provided three evaluation commands using the `lm_eval` tool:
```bash
CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=bfloat16,tensor_parallel_size=1 --tasks aime24_figures,aime24_nofigures --batch_size auto --output_path dummy --log_samples --gen_kwargs "max_gen_toks=12000"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore1wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Wait"

CUDA_VISIBLE_DEVICES=5 lm_eval --model vllm --model_args pretrained=/s1-Qwen2.5-1.5B-Instruct-20250419_021608,dtype=float32,tensor_parallel_size=1 --tasks aime25_nofigures,aime24_nofigures --batch_size auto --apply_chat_template --output_path s1.1forcingignore2wait --log_samples --gen_kwargs "max_gen_toks=12000,max_tokens_thinking=auto,thinking_n_ignore=2,thinking_n_ignore_str=Wait"
```
The evaluation results on AIME tasks are as follows:
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| aime24_figures | 1 | none | 0 | exact_match ↑ | 0 | ± N/A |
| | | none | 0 | extracted_answers ↑ | -1 | ± N/A |
| aime24_nofigures | 1 | none | 0 | exact_match ↑ | 0 | ± N/A |
| | | none | 0 | extracted_answers ↑ | -1 | ± N/A |
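An `exact_match` of 0 together with `extracted_answers` of -1 on every task suggests the generations are not yielding parseable answers at all. Before adjusting any configuration, it is worth loading the checkpoint directly and generating one completion by hand to see what the model actually produces. A minimal sketch with `transformers` (the checkpoint path is an assumption; substitute the actual training `output_dir`):

```python
# Smoke test: load the SFT checkpoint and generate one answer by hand.
# CKPT is a placeholder; point it at the actual output_dir of the training run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "ckpts/s1-Qwen2.5-1.5B-Instruct-20250419_021608"  # assumed path

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(
    CKPT, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

If this produces coherent, on-format answers, the problem is more likely in the evaluation setup than in the training run itself.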
Based on the provided information, there are several potential issues that could cause the AIME accuracy to be 0:

- Incorrect Model Loading: Ensure that the correct checkpoint is being loaded for evaluation. The `pretrained` argument in the `lm_eval` commands points at `/s1-Qwen2.5-1.5B-Instruct-20250419_021608` at the filesystem root, while the training script writes to `ckpts/s1-...`; verify that the path actually resolves to the trained checkpoint (a quick check is sketched after this list).
- Incorrect Model Configuration: Verify that the model configuration is correct for the task. Check the `model_args` in the `lm_eval` command (dtype, `tensor_parallel_size`) against how the model was trained.
- Batch Size and Gradient Accumulation: If the effective batch size was much smaller than intended, training can be noisy and the model may underperform. Try increasing the batch size or gradient accumulation steps and retraining.
- Evaluation Metrics: Ensure that the intended tasks and metrics are being run by checking the `--tasks` argument in the `lm_eval` command.
- Model Weights: Verify that the trained weights were actually saved and are the ones being loaded, by checking the `output_dir` argument in the training script against the evaluation path.
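To rule out the most common failure mode first, a quick path check confirms that the evaluation directory exists and contains a complete Hugging Face checkpoint. A minimal sketch, using the path from the `lm_eval` commands above:

```python
# Sanity check: confirm the path passed to --model_args pretrained=...
# exists and contains a complete Hugging Face checkpoint.
import os

ckpt = "/s1-Qwen2.5-1.5B-Instruct-20250419_021608"  # path from the lm_eval commands

files = os.listdir(ckpt) if os.path.isdir(ckpt) else []
has_config = "config.json" in files
has_weights = any(f.endswith((".safetensors", ".bin")) for f in files)

print("directory exists:", os.path.isdir(ckpt))
print("config.json present:", has_config)
print("weight shards present:", has_weights)
```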
To troubleshoot and solve the issue, we recommend the following steps:

- Verify Model Loading: Confirm that the `pretrained` argument in the `lm_eval` command points at the trained checkpoint rather than the base model or a stale directory.
- Check Model Configuration: Confirm that the `model_args` in the `lm_eval` command (dtype, `tensor_parallel_size`) are appropriate for the checkpoint and hardware.
- Adjust Batch Size and Gradient Accumulation: Try increasing the batch size and gradient accumulation steps, retrain, and see whether accuracy improves.
- Check the Prompt Format: The first evaluation command omits `--apply_chat_template` while the other two include it; for an instruct-tuned model, evaluating without the chat template can yield answers the exact-match metric cannot parse (see the sketch after this list).
- Verify the Saved Weights: Check the `output_dir` argument in the training script and make sure the weights saved there are the ones being evaluated.
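To see what the chat template actually changes, one can render a prompt the way `--apply_chat_template` would and compare it against the raw question text. A minimal sketch using the base model's tokenizer (the question string is a placeholder):

```python
# Render the prompt as --apply_chat_template would, to compare against the
# raw question text that an eval without the flag sends to the model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

question = "Solve the following problem: ..."  # placeholder question
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # shows the special-token wrapper the instruct model was trained on
```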
By following these steps, you should be able to troubleshoot and solve the issue of accuracy being 0 on AIME tasks after using the SFT method.
Q&A: Troubleshooting and Solutions for 1.5B Accuracy 0 after SFT
Q: What is the cause of accuracy being 0 on AIME tasks after using the SFT method?
A: There are several potential causes for accuracy being 0 on AIME tasks after using the SFT method, including incorrect model loading, incorrect model configuration, batch size and gradient accumulation issues, evaluation metrics issues, and model weight loading issues.
Q: How can I verify that the correct model is being loaded for evaluation?
A: Check the `pretrained` argument in the `lm_eval` command to ensure it matches the trained model.
Q: How can I verify that the model configuration is correct for the task?
A: Check the `model_args` argument in the `lm_eval` command to ensure it matches the trained model configuration.
Q: What are some potential issues with batch size and gradient accumulation?
A: Some potential issues with batch size and gradient accumulation include:
- Too small a batch size: very small effective batches give noisy gradient estimates, which can destabilize training and hurt the final model.
- Too few gradient accumulation steps: accumulation is what restores the intended effective batch size when the per-device batch size is limited by GPU memory; with too few steps, the effective batch is smaller than the reference configuration.
Q: How can I adjust batch size and gradient accumulation to improve accuracy?
A: Increase `gradient_accumulation_steps` (or `per_device_train_batch_size`, if memory allows) so that the effective batch size, computed as per-device batch size × accumulation steps × GPU count, matches the reference configuration, then retrain and re-evaluate.
Q: What are some potential issues with evaluation metrics?
A: Some potential issues with evaluation metrics include:
- Incorrect evaluation metrics: running the wrong task, or scoring with a metric that cannot parse the model's output format, will not reflect the model's actual performance.
- Insufficient evaluation metrics: a single exact-match score hides whether the model gave a near-miss answer or no parseable answer at all; the `extracted_answers` value of -1 in the results above suggests the latter.
Q: How can I tell whether the evaluation metrics are the problem?
A: Re-run `lm_eval` with `--log_samples` (as in the commands above) and inspect the logged generations: this shows whether the model is answering incorrectly or answering in a format the metric cannot parse (see the sketch below).
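Since the commands above already pass `--log_samples`, the raw generations are written under each `--output_path`. A minimal sketch for peeking at them (file names and layout vary across `lm_eval` versions, so the glob pattern below is an assumption; adjust it to whatever appears under your output directory):

```python
# Peek at the generations logged by --log_samples to see whether the model
# answered incorrectly or in a format the metric could not parse.
import glob
import json

# "dummy" was the --output_path of the first eval command; adjust the glob
# to match the sample files your lm_eval version writes.
for path in glob.glob("dummy/**/samples_*.jsonl", recursive=True):
    with open(path) as f:
        sample = json.loads(f.readline())  # first logged sample
    print(path)
    print(sample.get("resps"))           # raw model generation(s)
    print(sample.get("filtered_resps"))  # what the metric actually scored
    break
```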
Q: What are some potential issues with model weight loading?
A: Some potential issues with model weight loading include:
- Incorrect weights: evaluating the base model or a stale checkpoint instead of the SFT output, for example because the evaluation path does not match the training `output_dir`.
- Incomplete weights: if the FSDP run did not save a complete checkpoint, the evaluated model may be missing trained parameters.
Q: How can I load model weights correctly to improve accuracy?
A: Check the `output_dir` argument in the training script, confirm that directory contains a complete saved checkpoint, and pass that exact path to `lm_eval`. A quick check that the saved weights actually differ from the base model is sketched below.
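To confirm that the evaluated checkpoint really contains fine-tuned weights rather than an unchanged copy of the base model, one can compare a parameter tensor directly. A minimal sketch, assuming both models fit in CPU memory and the checkpoint path matches the training `output_dir`:

```python
# Check whether SFT actually changed the weights relative to the base model.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16
)
sft = AutoModelForCausalLM.from_pretrained(
    "ckpts/s1-Qwen2.5-1.5B-Instruct-20250419_021608",  # assumed output_dir
    torch_dtype=torch.bfloat16,
)

name, p_base = next(iter(base.named_parameters()))
p_sft = dict(sft.named_parameters())[name]

# Identical tensors suggest the base model was saved unchanged.
print(name, "identical:", torch.equal(p_base, p_sft))
```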
Q: What are some best practices for troubleshooting and solving issues with 1.5B accuracy 0 after SFT?
A: Some best practices for troubleshooting 1.5B accuracy 0 after SFT include:
- Verify model loading: ensure the correct checkpoint is being loaded for evaluation.
- Check model configuration: verify that dtype and parallelism settings match the trained model.
- Adjust batch size and gradient accumulation: keep the effective batch size close to the reference configuration.
- Inspect logged samples: use `--log_samples` to see what the model actually generates and whether the answers are parseable.
- Load model weights correctly: check the `output_dir` argument in the training script and make sure it matches the path passed to `lm_eval`.
By following these best practices, you should be able to diagnose why the accuracy is 0 after SFT, fix the underlying issue, and restore reasonable performance on AIME tasks.