Getting deepspeed error on training completion and failing to save. if self.deepspeed_config["zero_optimization"]["stage"] == 3: AttributeError: 'Accelerator' object has no attribute 'deepspeed_config' #1092
Comments
Thanks for posting. I wonder if this issue needs to be posted to the upstream library.
It has been reported in the DeepSpeed repo: microsoft/DeepSpeed#4143
By any chance did you run accelerate config, or have an existing accelerate configuration yaml that accelerate is picking up?
It happens with or without an existing accelerate config in the /home/user/.cache/huggingface/accelerate folder in my testing.
Related issue
The upstream DeepSpeed issue seems to indicate this is related to the wandb integration. I don't think you should really be uploading your models to wandb unless they are tiny; it takes a lot of time for that step, and it gets expensive fast to store them in wandb.
see also #1156 (comment) |
Yeah, it definitely is a problem with the wandb integration. I guess I'll try without uploading the model to wandb and see if that fixes it. Maybe there could be a comment in the yaml config in the readme saying not to use wandb model saving.
Please check that this issue hasn't been reported before.
Expected Behavior
Running on Windows 10 WSL2 Ubuntu, on 2x RTX 3090 24GB with NVLink and DeepSpeed ZeRO-2.
Expected behavior is for training to complete and save the final checkpoint normally, the same way the run saves between epochs. Saving between epochs and at the end of an epoch works; only the save at the end of the training run fails.
Current behaviour
At the end of the training run, when it tries to save, it gets interrupted by this error and fails to save, but only if `wandb_log_model: checkpoint` is set. The error below is with `wandb_log_model: end`, in which case it manages to save the last checkpoint but still shows the same error, which seems to relate to DeepSpeed. Trying to continue the training from the last checkpoint also fails with a different error.
Steps to reproduce
I have narrowed the failure to save at the end of the run down to setting `wandb_log_model: checkpoint`. The training run will save at the end if I set it to `wandb_log_model: end`. However, neither option changes whether the run can be resumed from checkpoint; both fail. Also, changing the epochs option makes no difference to whether it fails to save at the end of a run.
Config yaml
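The full config was not captured in this thread; the fragment below only illustrates the settings discussed in the issue, using standard axolotl config keys. The DeepSpeed JSON path and project name are placeholders, not the reporter's actual values.

```yaml
# Illustrative fragment only; not the reporter's full config.
deepspeed: deepspeed/zero2.json    # ZeRO-2 setup described under "Expected Behavior" (path is a placeholder)
wandb_project: my-project          # placeholder
wandb_log_model: checkpoint        # triggers the failure at end of training; "end" avoids it
```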
Possible solution
The error on saving at the end seems to come from wandb when setting `wandb_log_model: checkpoint`, and it can be resolved by setting `wandb_log_model: end`, which is what the error log shows. It still fails when trying to resume from checkpoint, and I have no clue why.
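For anyone hitting the same AttributeError, a guarded version of the failing check could look like the sketch below. This is only an illustration of tolerating a missing `deepspeed_config` attribute on the Accelerator; it is not the actual upstream code or fix (see microsoft/DeepSpeed#4143 for that discussion), and the helper name is made up.

```python
# Sketch only: a defensive variant of the check from the traceback.
# Assumes `accelerator` is an accelerate Accelerator; when no DeepSpeed config
# is attached, the `deepspeed_config` attribute may be missing, which is what
# raises the AttributeError reported above.
def zero_stage(accelerator):
    """Return the DeepSpeed ZeRO stage, or None if no DeepSpeed config is attached."""
    ds_config = getattr(accelerator, "deepspeed_config", None)
    if ds_config is None:
        return None
    return ds_config.get("zero_optimization", {}).get("stage")

# Hypothetical usage in place of the unguarded check:
# if zero_stage(trainer.accelerator) == 3:
#     ...  # ZeRO-3 specific checkpoint handling
```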
Which Operating Systems are you using?
Linux (Windows 10 WSL2 Ubuntu)
Python Version
3.11
axolotl branch-commit
main/9032e610b1af7565eec1908a298e53ae8e5252e7
Acknowledgements