
Error on save_steps using FSDP #6

Open
ghost opened this issue Sep 28, 2023 · 3 comments

Comments


ghost commented Sep 28, 2023

I am currently using FSDP (Fully Sharded Data Parallel) with the Llama 2 70B model. Training starts without problems, but an error is raised every time a checkpoint is saved; I have set save_steps to 50.

System: 1 node with 2 A100 80 GB GPUs
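For reference, the setup described above corresponds roughly to the following (a minimal sketch assuming fine-tuning via the Hugging Face Trainer; the actual script is not shown, so everything besides save_steps=50 and the use of FSDP is an illustrative assumption):

```python
# Minimal sketch of the reported setup. Only save_steps=50 and the use
# of FSDP come from the report above; all other names and values are
# illustrative assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama2-70b-finetune",  # hypothetical output path
    per_device_train_batch_size=1,
    bf16=True,
    save_strategy="steps",
    save_steps=50,                # checkpoint every 50 steps, as reported
    fsdp="full_shard auto_wrap",  # enable FSDP sharding with auto wrapping
)
```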

Here are the supporting screenshots: [MicrosoftTeams-image (1)], [MicrosoftTeams-image (2)]

@pacman100

@sachalevy

Hey @keval2415, I'm seeing the same thing on my end, except I'm running Llama 2 7B on 2 A100 40 GB GPUs. Have you been able to solve the issue?

@sachalevy

Hi @keval2415, just posting this in case anyone else runs into this issue. I found that it was most likely related to checkpointing the optimizer states in FSDP (described in this issue and solved in this PR).

I solved it by upgrading my PyTorch version from 2.0.1 to 2.1.0.
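For anyone hitting this on an older PyTorch, here is a minimal sketch of the checkpointing pattern involved (assuming model is an FSDP-wrapped module and optimizer its optimizer; this illustrates the general pattern, not the exact code from the training script):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
    FullOptimStateDictConfig,
)

def save_full_checkpoint(model, optimizer, path):
    # Gather full (unsharded) model and optimizer state dicts,
    # offloaded to CPU and materialized only on rank 0. Collecting the
    # optimizer state is the step that failed here before upgrading
    # from PyTorch 2.0.1 to 2.1.0.
    with FSDP.state_dict_type(
        model,
        StateDictType.FULL_STATE_DICT,
        FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
        FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
    ):
        model_state = model.state_dict()
        optim_state = FSDP.optim_state_dict(model, optimizer)
    if dist.get_rank() == 0:
        torch.save({"model": model_state, "optim": optim_state}, path)
```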

@kevaldekivadiya2415

Thanks @sachalevy. Could you please share the complete, cleaned-up code with me? I am still getting other errors, such as LoRA not being supported with FSDP.
[Screenshot: MicrosoftTeams-image (3)]
