gaudi support training #330

Draft
wants to merge 2 commits into base: main

JamesKunstle
Contributor

  • reorg main function; add 'hpu' option w/o implementation
  • implement training for HPU

@JamesKunstle
Contributor Author

Builds on changes from #329. Still needs to be tested.

The `main` training function needed to be broken down into smaller
functions for readability/testability.

WORLD_SIZE, LOCAL_RANK, and RANK have also been extracted into
global constants, since they are set by the multiprocessing
launcher (torchrun, in our case).
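
As a rough illustration of the constants described above, the values torchrun exports can be read once at import time. This is a minimal sketch, not the code in this PR: the helper name `read_dist_env` and the single-process fallback defaults are assumptions made for the example.

```python
import os
from typing import Mapping, Tuple


def read_dist_env(env: Mapping[str, str] = os.environ) -> Tuple[int, int, int]:
    """Return (WORLD_SIZE, LOCAL_RANK, RANK) from the launcher's environment.

    Hypothetical helper: the fallbacks (1, 0, 0) are an assumption so that
    the module can still be imported outside of torchrun.
    """
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    rank = int(env.get("RANK", "0"))
    return world_size, local_rank, rank


# Module-level constants, read once when the module is imported.
WORLD_SIZE, LOCAL_RANK, RANK = read_dist_env()
```

Reading the values in one place means the smaller functions split out of `main` do not each have to query the environment.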

HPU configuration options and checks are also added.

Signed-off-by: James Kunstle <[email protected]>

HPU cards (Gaudi 2 and 3) can't use the Accelerate code path. This
contribution adds the training setup and loop for FSDP-only training.

Minor HPU-specific modifications are required.
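
The split between the two code paths could be sketched as a small dispatch on the device option. Everything here is hypothetical and only illustrates the idea: the function name `select_training_path`, the returned labels, and the set of accepted device strings are assumptions, not names taken from this PR.

```python
def select_training_path(device: str) -> str:
    """Pick a training code path for the requested device.

    Hypothetical sketch: "hpu" goes to the FSDP-only path because Gaudi
    cards can't use the Accelerate path; "cuda" keeps the Accelerate path.
    """
    normalized = device.strip().lower()
    if normalized == "hpu":
        return "fsdp"
    if normalized == "cuda":
        return "accelerate"
    # Fail early on unknown options instead of silently choosing a path.
    raise ValueError(f"unsupported device option: {device!r}")
```

For example, `select_training_path("hpu")` returns `"fsdp"`, while an unrecognized option raises `ValueError` before any training setup starts.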

Signed-off-by: James Kunstle <[email protected]>
@JamesKunstle JamesKunstle force-pushed the gaudi-support-training branch from f9192d1 to aa192e8 Compare November 12, 2024 23:06
Development

Successfully merging this pull request may close these issues.

  • Implement multi-hpu training with FSDP for Gaudi 3 cards
  • Intel Gaudi Multi-GPU, single-node training