Probably a mistake in DPOTrainer when computing/logging grad_norm #2456
This is an interesting finding!
The issue arises from how the accelerator is configured in `create_accelerator_and_postprocess`. To set the number of gradient accumulation steps, users can either pass `gradient_accumulation_steps` in the training arguments or set it through the accelerator config. However, in both cases, the gradient norm (grad_norm) ends up being computed by an `Accelerator` that is not aware of the accumulation steps. Adding a `gradient_accumulation_steps` argument when the accelerator is instantiated would address this:

```diff
- self.accelerator = Accelerator(**args)
+ self.accelerator = Accelerator(**args, gradient_accumulation_steps=self.args.gradient_accumulation_steps)
```

@muellerzr, could you review this and share your thoughts?
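For intuition, here is a minimal PyTorch sketch (my own toy example, not the Trainer code) of why an accumulation-unaware setup produces a grad_norm that grows with the number of accumulation steps:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
data = torch.randn(32, 8)
target = torch.randn(32, 1)

def grad_norm_for(accum_steps: int) -> float:
    """Accumulate gradients over accum_steps micro-batches covering the same
    32 samples, without dividing the per-micro-batch loss by accum_steps
    (i.e. what happens when the accelerator is unaware of accumulation)."""
    model.zero_grad()
    micro = 32 // accum_steps
    for i in range(accum_steps):
        x = data[i * micro:(i + 1) * micro]
        y = target[i * micro:(i + 1) * micro]
        loss = torch.nn.functional.mse_loss(model(x), y)  # mean over the micro-batch
        loss.backward()  # gradients sum across micro-batches
    return torch.cat([p.grad.flatten() for p in model.parameters()]).norm().item()

# Same effective batch size (32), different accumulation: the summed gradient
# (and therefore the reported norm) grows roughly linearly with accum_steps.
for steps in (8, 16, 32):
    print(steps, grad_norm_for(steps))
```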
Correct, that's not what we want to do, because with the fix to how we calculate the number of items in the batch, the losses will not align and things will be off, so we don't divide the loss by accumulation steps if we know that value. I'd need to play with this a bit, as I'm not 100% sure we can just modify the grads for clipping without modifying the overall loss we just calculated 🤔
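For context, a simplified sketch (toy model and variable names of my own, not the transformers internals) of the two normalization conventions being discussed, and why you don't divide by accumulation steps when the loss is already normalized by the item count:

```python
import torch

def accumulated_grad_norm(normalize_by_items: bool) -> float:
    torch.manual_seed(0)
    model = torch.nn.Linear(8, 1)
    micro_batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]
    num_items = sum(y.numel() for _, y in micro_batches)  # items over the whole accumulation window
    model.zero_grad()
    for x, y in micro_batches:
        if normalize_by_items:
            # Loss already divided by the total item count: the accumulated
            # gradient equals the full-batch gradient, so dividing again by
            # the number of accumulation steps would shrink it incorrectly.
            loss = torch.nn.functional.mse_loss(model(x), y, reduction="sum") / num_items
        else:
            # Per-micro-batch mean: here the division by the number of
            # accumulation steps is what recovers the full-batch gradient.
            loss = torch.nn.functional.mse_loss(model(x), y) / len(micro_batches)
        loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()]).norm().item()

# Both conventions give (numerically almost) the same accumulated gradient;
# applying neither, or both, is what skews the reported grad_norm.
print(accumulated_grad_norm(True), accumulated_grad_norm(False))
```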
@qgallouedec I have a new question: if the problem arises from `create_accelerator_and_postprocess`, which is also used by SFTTrainer, why doesn't the same behavior appear there?
I can't explain it right now. Any idea?
I may have found the solution: huggingface/transformers#35207. Running some experiments...
Does it solve the issue?

Before the fix (same effective batch size of 32): we can see here that the grad_norm is different while it should be the same.

After the fix (same effective batch size of 32): now the grad_norm is the same.

Does it impact the results?

Config 1: grad accumulation = 32 / batch size = 1 (effective batch size = 32). [curves before the fix and after the fix] The only value impacted is the grad_norm, no impact on loss.

Config 2: grad accumulation = 8 / batch size = 4 (effective batch size = 32). [curves before the fix and after the fix] The only value impacted is the grad_norm, no impact on loss.
@qgallouedec Thanks for your work! So this bug actually only affects the reported logs and not the training results, right? :)
That's what the results suggest, yes.
Leaving the issue open until huggingface/transformers#35207 is properly merged.
System Info
Information
Tasks
An officially supported task in the examples folder
Reproduction
dpo script
bash script
```bash
python dpo.py --model_name_or_path AIR/Llama-3.2-1B-ultrachat200k \
    --dataset_name HuggingFaceH4/ultrafeedback_binarized \
    --output_dir test \
    --attn_implementation flash_attention_2 \
    --beta 0.05 \
    --bf16 \
    --dataset_train_split train_prefs \
    --do_train \
    --gradient_checkpointing \
    --gradient_accumulation_steps 16 \
    --learning_rate 0.00001 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --loss_type sigmoid \
    --max_prompt_length 512 \
    --max_length 1024 \
    --max_steps -1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --report_to tensorboard \
    --save_strategy epoch \
    --save_total_limit 1 \
    --save_only_model \
    --torch_dtype bfloat16 \
    --warmup_ratio 0.05
```
Expected behavior
When using DPOTrainer, I found an unexpected behavior of grad_norm. Specifically, I keep global_batch_size=32 and try different combinations of per_device_train_batch_size and gradient_accumulation_steps; the grad_norm is positively correlated with gradient_accumulation_steps, but this does not appear in SFTTrainer. As far as I know, grad_norm shouldn't change so dramatically under the same global_batch_size.

Configurations tested (all with global batch size 32):
- batch_size=4, accumulation=8
- batch_size=2, accumulation=16
- batch_size=1, accumulation=32
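As a back-of-the-envelope check (my own sketch of the arithmetic, not something measured in the thread): if the per-micro-batch mean losses are accumulated without being divided by the number of accumulation steps $a$, the reported norm is roughly

$$
\left\lVert \sum_{i=1}^{a} \nabla_\theta \bar{\mathcal{L}}_i \right\rVert \;\approx\; a \cdot \left\lVert \nabla_\theta \bar{\mathcal{L}} \right\rVert,
$$

where $\bar{\mathcal{L}}_i$ is the mean loss of the $i$-th micro-batch and $\bar{\mathcal{L}}$ is the mean loss over the full effective batch. At a fixed effective batch size of 32, the three configurations above would then be expected to report grad_norm values in roughly the ratio 8 : 16 : 32, which matches the observation that grad_norm grows with gradient_accumulation_steps.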
Checklist