Tensor Parallelism #1521
Conversation
Currently, the ffn strategy gives different results when we train fsdp vs fsdp-tp. See mcli runs:
and here are their losses, which are visibly different. Currently investigating, though I think this has more to do with my specific layer plan/strategy than with anything else.
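For context on what such a layer plan might look like, here is a minimal sketch. The module names and the string-valued plan format are illustrative assumptions, loosely modeled on PyTorch's `ColwiseParallel`/`RowwiseParallel` parallel styles, and are not foundry's actual API:

```python
# Hypothetical sketch of an FFN tensor-parallel layer plan.
# A column-parallel up-projection followed by a row-parallel down-projection
# keeps the intermediate activation sharded across ranks and combines the
# partial matmul results with a single all-reduce per FFN block.

def build_ffn_layer_plan(num_layers: int) -> dict:
    """Map FFN submodule names to parallel styles (names are illustrative)."""
    plan = {}
    for i in range(num_layers):
        # Shard the up-projection weight by output columns...
        plan[f"blocks.{i}.ffn.up_proj"] = "colwise"
        # ...and the down-projection weight by input rows.
        plan[f"blocks.{i}.ffn.down_proj"] = "rowwise"
    return plan

plan = build_ffn_layer_plan(2)
```

Getting the colwise/rowwise pairing wrong (or applying it to the wrong submodules) is exactly the kind of mistake that produces a plausible-looking but numerically different loss curve, which is why the strategy itself is the first suspect here.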
Will review once the loss discrepancy has been addressed. Good to see it's at least mechanically working, though.
Should we include a
@eitanturok does checkpointing work now?
LGTM!
No, and it won't with FSDPv1.
@mvpatel2000 @eitanturok ok, let's leave the yaml out then.
Also, can we log a warning when using TP that checkpointing is known to not work?
@dakinggg I just added a warning that checkpointing does not work, with a link to the exact PyTorch issue. One of the tests verifies that the trainer works, but it takes too long because it downloads a dataset. I will fix this, and then I think we will be good to go.
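The warning described above could be emitted along these lines. This is a minimal sketch; the function name and message text are illustrative assumptions, not foundry's actual code:

```python
import warnings

def warn_tp_checkpointing(tp_enabled: bool) -> None:
    """Warn that checkpointing is known not to work with TP (illustrative)."""
    if tp_enabled:
        warnings.warn(
            "Checkpointing is known to not work with tensor parallelism; "
            "see the linked PyTorch issue for details.",
            UserWarning,
        )

# Capture the warning to show it fires when TP is enabled.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_tp_checkpointing(tp_enabled=True)
```

Emitting this at trainer-construction time (rather than at save time) gives users the heads-up before they spend compute on a run whose checkpoints won't load.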
Implement Tensor Parallelism (TP) in foundry.
To do:
Updates:
I compared training 125M-parameter models for 100 steps on C4 with tp-fsdp vs fsdp, across these metrics:
- loss_train_total
- throughput_batches_per_sec
- memory_peak_reserved_mem
It is okay that we don't see performance improvements here yet; we'll get those later, in follow-up PRs.