Tensor Parallelism #1521
Conversation
Currently, the ffn strategy gives different results when we train fsdp vs fsdp-tp. See mcli runs:
and here are their losses, which are visibly different. Currently investigating, though I think this has more to do with my specific layer plan/strategy than with anything else.
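For context on what such a layer plan might look like, here is a minimal sketch. The module names and the string-valued plan format are illustrative assumptions, loosely modeled on PyTorch's `ColwiseParallel`/`RowwiseParallel` parallel styles, and are not foundry's actual API:

```python
# Hypothetical sketch of an FFN tensor-parallel layer plan.
# A column-parallel up-projection followed by a row-parallel down-projection
# keeps the intermediate activation sharded across ranks and combines the
# partial matmul results with a single all-reduce per FFN block.

def build_ffn_layer_plan(num_layers: int) -> dict:
    """Map FFN submodule names to parallel styles (names are illustrative)."""
    plan = {}
    for i in range(num_layers):
        # Shard the up-projection weight by output columns...
        plan[f"blocks.{i}.ffn.up_proj"] = "colwise"
        # ...and the down-projection weight by input rows.
        plan[f"blocks.{i}.ffn.down_proj"] = "rowwise"
    return plan

plan = build_ffn_layer_plan(2)
```

Getting the colwise/rowwise pairing wrong (or applying it to the wrong submodules) is exactly the kind of mistake that produces a plausible-looking but numerically different loss curve, which is why the strategy itself is the first suspect here.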
Will review once the loss discrepancy has been addressed. Good to see it's at least mechanically working, though.
Should we include a
@eitanturok does checkpointing work now?
LGTM!
No, and it won't with FSDPv1.
@mvpatel2000 @eitanturok ok, let's leave the yaml out then.
Also, can we log a warning when using TP that checkpointing is known to not work?
@dakinggg I just added a warning that checkpointing does not work, with a link to the exact PyTorch issue. One of the tests verifies that the trainer works, but it takes too long because it downloads a dataset. I will fix this, and then I think we will be good to go.
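The warning described above could be emitted along these lines. This is a minimal sketch; the function name and message text are illustrative assumptions, not foundry's actual code:

```python
import warnings

def warn_tp_checkpointing(tp_enabled: bool) -> None:
    """Warn that checkpointing is known not to work with TP (illustrative)."""
    if tp_enabled:
        warnings.warn(
            "Checkpointing is known to not work with tensor parallelism; "
            "see the linked PyTorch issue for details.",
            UserWarning,
        )

# Capture the warning to show it fires when TP is enabled.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_tp_checkpointing(tp_enabled=True)
```

Emitting this at trainer-construction time (rather than at save time) gives users the heads-up before they spend compute on a run whose checkpoints won't load.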
Implement Tensor Parallelism (TP) in foundry.
To do:
Updates:
I compared training 125M-parameter models for 100 steps on C4 with tp-fsdp vs fsdp, across these metrics:
- loss_train_total
- throughput_batches_per_sec
- memory_peak_reserved_mem
It is okay that we don't see performance improvements here yet; we'll get those later, in follow-up PRs.