
[Epic] Replace DeepSpeed with PyTorch FSDP for Model Training #197

Closed · 8 tasks
ktam3 opened this issue Sep 6, 2024 · 9 comments
Labels: enhancement, epic, jira

ktam3 commented Sep 6, 2024

Feature Overview
This Feature card is for transitioning our model training infrastructure from DeepSpeed to PyTorch's Fully Sharded Data Parallel (FSDP) to enhance training metrics visibility, broaden accelerator support, and maintain performance parity.

Goals

  • Improve training metrics visibility for ML engineers and data scientists through integration with Weights & Biases.
  • Expand accelerator support for hardware flexibility.
  • Maintain or improve training performance across GPU configurations.

Requirements

  1. Implement PyTorch FSDP as the primary distributed training framework, replacing DeepSpeed.
  2. Integrate PyTorch FSDP with Weights & Biases for comprehensive training metrics collection and visualization (see the logging sketch after this list).
  3. Ensure compatibility with a broad range of accelerators (e.g., NVIDIA GPUs, AMD GPUs, TPUs).
  4. Achieve performance parity or improvement compared to DeepSpeed on GPU configurations.
  5. Implement and test CPU offload capabilities.
  6. Update all relevant training scripts and documentation to reflect the transition to PyTorch FSDP.
  7. Ensure security measures are in place for data handling during distributed training.
  8. Maintain or improve the scalability of the training process.
  9. (if applicable) Provide clear documentation on how to use the new PyTorch FSDP setup for different training scenarios.
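
A minimal sketch of the Weights & Biases integration from requirement 2, assuming a standard PyTorch training loop; `model`, `dataloader`, `train_step`, and the project name are hypothetical stand-ins for the real training code:

```python
import torch.distributed as dist
import wandb

# Assumes torch.distributed is already initialized (e.g. via torchrun)
# and the model is wrapped in FSDP by the surrounding training code.
if dist.get_rank() == 0:
    wandb.init(project="fsdp-training")  # project name is illustrative

for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)  # hypothetical helper returning a scalar loss
    if dist.get_rank() == 0:  # log from one rank to avoid duplicate runs
        wandb.log({"train/loss": loss.item()}, step=step)
```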

Completion Checklist:

  • PyTorch FSDP implementation complete
  • Weights & Biases integration tested and functional
  • Performance benchmarks conducted across various accelerators
  • CPU offload capabilities evaluated and implemented if beneficial
  • All training scripts updated
  • Documentation updated
  • Security audit completed
  • Scalability tests passed

Questions to Answer

  1. What is the performance impact of PyTorch FSDP on our specific model architectures?
  2. How does the CPU offload capability of PyTorch FSDP compare to DeepSpeed? (A sketch of FSDP's offload knob follows this list.)
  3. Are there any specific optimizations needed for different accelerator types?
  4. What changes are required in our CI/CD pipeline to accommodate this transition?
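
For context on question 2, FSDP exposes parameter offload through a single constructor argument. A minimal sketch, assuming an initialized process group and a `model` placeholder from the real training code:

```python
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# offload_params=True keeps sharded parameters (and their gradients)
# on CPU between uses, trading step latency for GPU memory headroom.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```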

Out of Scope

  • Modifications to model architectures
  • Changes to data preprocessing pipelines
  • Alterations to evaluation metrics or procedures

Background
Our current training infrastructure uses DeepSpeed for distributed training. While effective, transitioning to PyTorch FSDP offers strategic advantages in terms of metrics visibility, accelerator support, and potential performance improvements.

User Considerations

  • Ensure that the transition is seamless for end-users of our models.
  • Communicate any changes in training times or resource requirements to relevant stakeholders.

Documentation Considerations

  • Update all training-related documentation to reflect the use of PyTorch FSDP.
  • Provide migration guides for users transitioning from DeepSpeed to PyTorch FSDP.
  • (if applicable) Document any changes in command-line arguments or configuration files needed for PyTorch FSDP.

Additional notes with regard to FSDP:

  • We would like to feature-gate FSDP support, so RHEL AI 1.2 will use DeepSpeed by default, but FSDP support will be available if enabled via a feature gate (a sketch of one possible gate follows this list).
  • As a feature-gated capability, FSDP support could be considered tech preview for RHEL AI 1.2.
  • FSDP may not support all hardware. We will aim for broad coverage, but at a minimum it will support NVIDIA.
  • FSDP is a high-priority feature because it is needed by OpenShift AI to deliver FSDP to watsonx.ai.
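
One shape such a feature gate could take; this is a hypothetical sketch, not the actual RHEL AI or InstructLab mechanism, and the environment variable name is invented:

```python
import os

# Hypothetical gate: DeepSpeed stays the default; FSDP is opt-in.
# The real flag/variable name in the product may differ.
def select_backend() -> str:
    if os.getenv("ILAB_FEATURE_FSDP", "0") == "1":
        return "fsdp"
    return "deepspeed"
```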
ktam3 added the enhancement label Sep 6, 2024
ktam3 changed the title from "Replace DeepSpeed with PyTorch FSDP for Model Training" to "[Epic] Replace DeepSpeed with PyTorch FSDP for Model Training" Sep 6, 2024

ktam3 commented Sep 9, 2024

Summary from discussion:

  • Additional discussion is needed to break this work down into a step-by-step list of tasks.
  • The team needs to identify the following scope:
    • Hardware availability
    • Dependencies
    • What can the team actually commit to in the next 3 weeks?

@RobotSail @JamesKunstle @Maxusmusti - to follow up, create issues, and link them to this epic as the work is being done


ktam3 commented Sep 10, 2024

@RobotSail - additional notes with regard to support:

  • We will build Intel and AMD for FSDP only. That lets us avoid making DeepSpeed compile for Gaudi, which should cut a significant amount of work.
  • If we hit issues with FSDP for either variant, we will either declare that variant's release a preview or not deliver it at all.
  • For NVIDIA, to maintain backwards compatibility, we will keep DeepSpeed as the default but also provide a flag to let the user enable FSDP (a sketch of such a flag follows this list).
  • There is work in InstructLab to enable FSDP, add the flag, etc., and there is work to build Torch with FSDP (we think it's on by default, but we need to confirm).
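
A minimal sketch of what that opt-in flag could look like; the option name is hypothetical and the real InstructLab CLI may expose it differently:

```python
import argparse

# Hypothetical CLI option: DeepSpeed remains the default backend,
# and users opt in to FSDP explicitly.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--distributed-backend",
    choices=["deepspeed", "fsdp"],
    default="deepspeed",
    help="Distributed training backend; pass 'fsdp' to opt in.",
)
args = parser.parse_args()
```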

ktam3 added the jira label Sep 10, 2024
RobotSail (Member) commented:

With regard to FSDP, the main risks we still need to overcome are:

LoRA

Getting LoRA to work properly will take effort on our end: DeepSpeed was very compatible with running PEFT models, whereas FSDP will require more work to support them.
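
A minimal sketch of one way to wrap a PEFT/LoRA model in FSDP; `base_model` is a placeholder for the model loaded by the real training code, the Llama layer class assumes a Llama-style architecture, and this assumes a process group initialized via torchrun on PyTorch >= 2.0:

```python
import functools
from peft import LoraConfig, get_peft_model
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, lora_cfg)  # base_model is a placeholder

# use_orig_params=True lets an FSDP flat parameter mix frozen base
# weights with trainable LoRA weights instead of requiring uniform
# requires_grad across the whole flat parameter.
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    ),
    use_orig_params=True,
)
```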

Checkpointing

In our current implementation, we run DeepSpeed with ZeRO stage-2, which lets us save a model checkpoint by taking the state from any single GPU, because the model parameters are replicated across all GPUs. DeepSpeed implements all ZeRO stages, but we are only using stage-2 at the moment.

ZeRO stages, listed for reference (a minimal stage-2 config sketch follows the list):

  • Stage 1: Partitions optimizer states.
  • Stage 2: Partitions optimizer states and gradients.
  • Stage 3: Partitions optimizer states, gradients, and model parameters.
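
The stage is selected in the DeepSpeed config. A minimal sketch of a stage-2 configuration as a Python dict, as it would be passed to `deepspeed.initialize`; the values are illustrative, not our production settings:

```python
# Illustrative ZeRO stage-2 config; real training settings will differ.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {
        "stage": 2,  # shard optimizer states and gradients; params stay replicated
    },
    "bf16": {"enabled": True},
}
```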

FSDP, on the other hand, only supports full ZeRO stage-3-style sharding or no sharding at all. For this reason, it wouldn't be straightforward to feature-gate DeepSpeed as-is without also providing ZeRO-3 support there as well.
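
Under full sharding, getting a ZeRO-2-style single-file checkpoint back means gathering the sharded parameters first. A minimal sketch, assuming an FSDP-wrapped `model` and an initialized process group:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

# Gather the full, unsharded state dict onto rank 0 (offloaded to CPU
# to avoid GPU OOM), then save it like a normal single-GPU checkpoint.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    state_dict = model.state_dict()
if dist.get_rank() == 0:
    torch.save(state_dict, "checkpoint.pt")
```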

We'll need to make sure that this is all tested against the full matrix of devices we intend to support as well.

nathan-weinberg added the epic label Sep 10, 2024
JamesKunstle self-assigned this Sep 14, 2024
JamesKunstle (Contributor) commented:

Adding the following general issue: I'll be working on converting and testing this code on Gaudi 2 cards in the multi-GPU case as well.

JamesKunstle (Contributor) commented:

@Maxusmusti @RobotSail It sounds like this is being solved by @aldopareja's PR that uses Accelerate, and Mustafa's work that enables LoRA checkpointing. What do we need to do to finish this and get it tested?

Maxusmusti (Contributor) commented:

@JamesKunstle we should sync on this tomorrow, either before or after meetings, to make sure we have everything. Checkpoint resuming and a lot of FSDP testing will definitely be needed this week, and we still need to bring back padding-free support via the Hugging Face Transformers Granite model class.


ktam3 commented Sep 30, 2024

@JamesKunstle - can we close this epic if it's done?


ktam3 commented Sep 30, 2024

Closing this as discussed in chat. Feel free to reopen if this is incorrect.
