I am currently using the FSDP (Fully Sharded Data Parallel) approach with the Llama 2 70B model. Training starts, but an error is raised when attempting to save the checkpoint at each save_step; I have set save_step to 50.

System: 1 node with 2 A100 80 GB GPUs

Here are the supporting screenshots.

@pacman100
Hi @keval2415, just posting this in case anyone else runs into this issue. I found that it was most likely related to checkpointing the optimizer states in FSDP (described in this issue and solved in this PR).
I solved it by upgrading my PyTorch version from 2.0.1 to 2.1.0.
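For anyone hitting the same thing, here is a minimal sketch of full-state-dict checkpointing for both the model and the optimizer with the PyTorch 2.1 FSDP API. The `model`, `optimizer`, and output path names are placeholders, not the exact code from this setup:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    FullOptimStateDictConfig,
    StateDictType,
)

def save_fsdp_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Gather the full (unsharded) model and optimizer state dicts,
    # offloading to CPU and materializing them only on rank 0.
    with FSDP.state_dict_type(
        model,
        StateDictType.FULL_STATE_DICT,
        FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
        FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
    ):
        model_state = model.state_dict()
        optim_state = FSDP.optim_state_dict(model, optimizer)

    # Only rank 0 holds the full state dicts, so only it writes the file.
    if dist.get_rank() == 0:
        torch.save({"model": model_state, "optimizer": optim_state}, path)
```

The rank0_only/offload_to_cpu flags keep the gathered 70B checkpoint from blowing up GPU memory on the two A100s, which is where older PyTorch versions tended to fail when saving optimizer states.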
Thanks @sachalevy. Could you please share the complete working code with me? I am still getting other errors, such as LoRA not being supported with FSDP.