
[Epic] Replace DeepSpeed with PyTorch FSDP for Model Training #197

Closed · 8 tasks
ktam3 opened this issue Sep 6, 2024 · 9 comments
Labels: enhancement, epic, jira

ktam3 commented Sep 6, 2024

Feature Overview
This Feature card is for transitioning our model training infrastructure from DeepSpeed to PyTorch's Fully Sharded Data Parallel (FSDP) to enhance training metrics visibility, broaden accelerator support, and maintain performance parity.

Goals

  • Improve training metrics visibility for ML engineers and data scientists through integration with Weights & Biases.
  • Expand accelerator support for hardware flexibility.
  • Maintain or improve training performance across GPU configurations.

Requirements

  1. Implement PyTorch FSDP as the primary distributed training framework, replacing DeepSpeed.
  2. Integrate PyTorch FSDP with Weights & Biases for comprehensive training metrics collection and visualization (see the logging sketch after this list).
  3. Ensure compatibility with a broad range of accelerators (e.g., NVIDIA GPUs, AMD GPUs, TPUs).
  4. Achieve performance parity or improvement compared to DeepSpeed on GPU configurations.
  5. Implement and test CPU offload capabilities.
  6. Update all relevant training scripts and documentation to reflect the transition to PyTorch FSDP.
  7. Ensure security measures are in place for data handling during distributed training.
  8. Maintain or improve the scalability of the training process.
  9. (if applicable) Provide clear documentation on how to use the new PyTorch FSDP setup for different training scenarios.
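
A minimal sketch of the Weights & Biases integration from requirement 2, assuming a standard PyTorch training loop; `model`, `dataloader`, `train_step`, and the project name are hypothetical stand-ins for the real training code:

```python
import torch.distributed as dist
import wandb

# Assumes torch.distributed is already initialized (e.g. via torchrun)
# and the model is wrapped in FSDP by the surrounding training code.
if dist.get_rank() == 0:
    wandb.init(project="fsdp-training")  # project name is illustrative

for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)  # hypothetical helper returning a scalar loss
    if dist.get_rank() == 0:  # log from one rank to avoid duplicate runs
        wandb.log({"train/loss": loss.item()}, step=step)
```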

Completion Checklist:

  • PyTorch FSDP implementation complete
  • Weights & Biases integration tested and functional
  • Performance benchmarks conducted across various accelerators
  • CPU offload capabilities evaluated and implemented if beneficial
  • All training scripts updated
  • Documentation updated
  • Security audit completed
  • Scalability tests passed

Questions to Answer

  1. What is the performance impact of PyTorch FSDP on our specific model architectures?
  2. How does the CPU offload capability of PyTorch FSDP compare to DeepSpeed? (A sketch of FSDP's offload knob follows this list.)
  3. Are there any specific optimizations needed for different accelerator types?
  4. What changes are required in our CI/CD pipeline to accommodate this transition?
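
For context on question 2, FSDP exposes parameter offload through a single constructor argument. A minimal sketch, assuming an initialized process group and a `model` placeholder from the real training code:

```python
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# offload_params=True keeps sharded parameters (and their gradients)
# on CPU between uses, trading step latency for GPU memory headroom.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```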

Out of Scope

  • Modifications to model architectures
  • Changes to data preprocessing pipelines
  • Alterations to evaluation metrics or procedures

Background
Our current training infrastructure uses DeepSpeed for distributed training. While effective, transitioning to PyTorch FSDP offers strategic advantages in terms of metrics visibility, accelerator support, and potential performance improvements.

User Considerations

  • Ensure that the transition is seamless for end-users of our models.
  • Communicate any changes in training times or resource requirements to relevant stakeholders.

Documentation Considerations

  • Update all training-related documentation to reflect the use of PyTorch FSDP.
  • Provide migration guides for users transitioning from DeepSpeed to PyTorch FSDP.
  • (if applicable) Document any changes in command-line arguments or configuration files needed for PyTorch FSDP.

Additional notes with regard to FSDP:

  • We would like to feature-gate FSDP support, so RHEL AI 1.2 will use DeepSpeed by default, but FSDP support will be available if enabled via a feature gate (a sketch of one possible gate follows this list).
  • As a feature-gated capability, FSDP support could be considered tech preview for RHEL AI 1.2.
  • FSDP may not support all hardware. We will aim for broad coverage, but at a minimum it will support NVIDIA.
  • FSDP is a high-priority feature because it is needed by OpenShift AI to deliver FSDP to watsonx.ai.
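
One shape such a feature gate could take; this is a hypothetical sketch, not the actual RHEL AI or InstructLab mechanism, and the environment variable name is invented:

```python
import os

# Hypothetical gate: DeepSpeed stays the default; FSDP is opt-in.
# The real flag/variable name in the product may differ.
def select_backend() -> str:
    if os.getenv("ILAB_FEATURE_FSDP", "0") == "1":
        return "fsdp"
    return "deepspeed"
```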
ktam3 added the enhancement label Sep 6, 2024
ktam3 changed the title from "Replace DeepSpeed with PyTorch FSDP for Model Training" to "[Epic] Replace DeepSpeed with PyTorch FSDP for Model Training" Sep 6, 2024

ktam3 commented Sep 9, 2024

Summary from discussion:

  • Additional discussion is needed to break this work down into a step-by-step list of tasks.
  • The team needs to identify the following scope:
    • Hardware availability
    • Dependencies
    • What can the team actually commit to in the next 3 weeks?

@RobotSail @JamesKunstle @Maxusmusti - to follow up, create issues, and link them to this epic as the work is being done


ktam3 commented Sep 10, 2024

@RobotSail - additional notes with regard to support:

  • We will build Intel and AMD for FSDP only. That lets us avoid making DeepSpeed compile for Gaudi, which should cut a significant amount of work.
  • If we hit issues with FSDP for either variant, we will either declare that variant's release a preview or not deliver it at all.
  • For NVIDIA, to maintain backwards compatibility, we will keep DeepSpeed as the default but also provide a flag to let the user enable FSDP (a sketch of such a flag follows this list).
  • There is work in InstructLab to enable FSDP, add the flag, etc., and there is work to build Torch with FSDP (we think it's on by default, but we need to confirm).
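
A minimal sketch of what that opt-in flag could look like; the option name is hypothetical and the real InstructLab CLI may expose it differently:

```python
import argparse

# Hypothetical CLI option: DeepSpeed remains the default backend,
# and users opt in to FSDP explicitly.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--distributed-backend",
    choices=["deepspeed", "fsdp"],
    default="deepspeed",
    help="Distributed training backend; pass 'fsdp' to opt in.",
)
args = parser.parse_args()
```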

ktam3 added the jira label Sep 10, 2024
RobotSail (Member) commented:

With regard to FSDP, the main risks we still need to overcome are:

LoRA

Getting LoRA to work properly will take effort on our end: DeepSpeed was very compatible with running PEFT models, whereas FSDP will require more work to support them.
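
A minimal sketch of one way to wrap a PEFT/LoRA model in FSDP; `base_model` is a placeholder for the model loaded by the real training code, the Llama layer class assumes a Llama-style architecture, and this assumes a process group initialized via torchrun on PyTorch >= 2.0:

```python
import functools
from peft import LoraConfig, get_peft_model
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, lora_cfg)  # base_model is a placeholder

# use_orig_params=True lets an FSDP flat parameter mix frozen base
# weights with trainable LoRA weights instead of requiring uniform
# requires_grad across the whole flat parameter.
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    ),
    use_orig_params=True,
)
```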

Checkpointing

In our current implementation, we run DeepSpeed with ZeRO stage-2, which lets us save a model checkpoint by taking the state from any single GPU, because the model parameters are replicated across all GPUs. DeepSpeed implements all ZeRO stages, but we are only using stage-2 at the moment.

ZeRO stages, listed for reference (a minimal stage-2 config sketch follows the list):

  • Stage 1: Partitions optimizer states.
  • Stage 2: Partitions optimizer states and gradients.
  • Stage 3: Partitions optimizer states, gradients, and model parameters.
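
The stage is selected in the DeepSpeed config. A minimal sketch of a stage-2 configuration as a Python dict, as it would be passed to `deepspeed.initialize`; the values are illustrative, not our production settings:

```python
# Illustrative ZeRO stage-2 config; real training settings will differ.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {
        "stage": 2,  # shard optimizer states and gradients; params stay replicated
    },
    "bf16": {"enabled": True},
}
```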

FSDP, on the other hand, only supports full ZeRO stage-3-style sharding or no sharding at all. For this reason, it wouldn't be straightforward to feature-gate DeepSpeed as-is without also providing ZeRO-3 support there as well.
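
Under full sharding, getting a ZeRO-2-style single-file checkpoint back means gathering the sharded parameters first. A minimal sketch, assuming an FSDP-wrapped `model` and an initialized process group:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

# Gather the full, unsharded state dict onto rank 0 (offloaded to CPU
# to avoid GPU OOM), then save it like a normal single-GPU checkpoint.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    state_dict = model.state_dict()
if dist.get_rank() == 0:
    torch.save(state_dict, "checkpoint.pt")
```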

We'll need to make sure that this is all tested against the full matrix of devices we intend to support as well.

nathan-weinberg added the epic label Sep 10, 2024
JamesKunstle self-assigned this Sep 14, 2024
JamesKunstle (Contributor) commented:

Adding the following general issue: I'll be working on converting and testing this code on Gaudi 2 cards in the multi-GPU case as well.

JamesKunstle (Contributor) commented:

@Maxusmusti @RobotSail It sounds like this is being solved by @aldopareja's PR that uses Accelerate, and Mustafa's work that enables LoRA checkpointing. What do we need to do to finish this and get it tested?

Maxusmusti (Contributor) commented:

@JamesKunstle we should sync on this tomorrow, either before or after meetings, to make sure we have everything. Checkpoint resuming and a lot of FSDP testing will definitely be needed this week, and we still need to bring back padding-free support via the Hugging Face Transformers Granite model class.


ktam3 commented Sep 30, 2024

@JamesKunstle - can we close this epic if it's done?


ktam3 commented Sep 30, 2024

Closing this as discussed in chat. Feel free to reopen if this is incorrect.
