
Error on save_steps using FSDP #6

Open
ghost opened this issue Sep 28, 2023 · 3 comments

Comments


ghost commented Sep 28, 2023

I am currently using FSDP (Fully Sharded Data Parallel) with the Llama 2 70B model. Training starts without problems, but an error is raised every time a checkpoint is saved; I have set save_steps to 50.

System: 1 node with 2 A100 80 GB GPUs
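For reference, the setup described above corresponds roughly to the following (a minimal sketch assuming fine-tuning via the Hugging Face Trainer; the actual script is not shown, so everything besides save_steps=50 and the use of FSDP is an illustrative assumption):

```python
# Minimal sketch of the reported setup. Only save_steps=50 and the use
# of FSDP come from the report above; all other names and values are
# illustrative assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama2-70b-finetune",  # hypothetical output path
    per_device_train_batch_size=1,
    bf16=True,
    save_strategy="steps",
    save_steps=50,                # checkpoint every 50 steps, as reported
    fsdp="full_shard auto_wrap",  # enable FSDP sharding with auto wrapping
)
```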

Here are the supporting screenshots: [MicrosoftTeams-image (1)], [MicrosoftTeams-image (2)]

@pacman100

@sachalevy

Hey @keval2415, I'm seeing the same thing on my end, except I'm running Llama 2 7B on 2 A100 40 GB GPUs. Have you been able to solve the issue?

@sachalevy

Hi @keval2415, just posting this in case anyone else runs into this issue. I found that it was most likely related to checkpointing the optimizer states in FSDP (described in this issue and solved in this PR).

I solved it by upgrading my PyTorch version from 2.0.1 to 2.1.0.
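For anyone hitting this on an older PyTorch, here is a minimal sketch of the checkpointing pattern involved (assuming model is an FSDP-wrapped module and optimizer its optimizer; this illustrates the general pattern, not the exact code from the training script):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
    FullOptimStateDictConfig,
)

def save_full_checkpoint(model, optimizer, path):
    # Gather full (unsharded) model and optimizer state dicts,
    # offloaded to CPU and materialized only on rank 0. Collecting the
    # optimizer state is the step that failed here before upgrading
    # from PyTorch 2.0.1 to 2.1.0.
    with FSDP.state_dict_type(
        model,
        StateDictType.FULL_STATE_DICT,
        FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
        FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
    ):
        model_state = model.state_dict()
        optim_state = FSDP.optim_state_dict(model, optimizer)
    if dist.get_rank() == 0:
        torch.save({"model": model_state, "optim": optim_state}, path)
```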

@kevaldekivadiya2415

Thanks @sachalevy. Could you please share the complete, cleaned-up code with me? I am still getting other errors, such as LoRA not being supported with FSDP.
[Screenshot: MicrosoftTeams-image (3)]
