[CPU] Support SHM based inference_all_reduce in TorchBackend #5391

Merged: 10 commits into microsoft:master on Apr 17, 2024

Conversation

@delock (Collaborator) commented on Apr 10, 2024

This PR adds an SHM-based `inference_all_reduce` kernel to the `TorchBackend` communication backend. When running inference on a CPU server, this path replaces the default `torch.distributed.all_reduce`, which eventually falls back to the gloo backend. This PR improves AutoTP inference performance when only stock PyTorch is installed, without Intel Extension for PyTorch.

Compared with the gloo backend, the SHM-based `inference_all_reduce` kernel takes a more direct path and performs much better on a single node (see the benchmark sketch after the table below).

| message size | gloo all_reduce (ms) | SHM all_reduce (ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |
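
For context, here is a minimal latency-comparison sketch along these lines. It assumes `deepspeed.comm` exposes `inference_all_reduce` after `deepspeed.init_distributed()`; the file name, launch command, and iteration counts are illustrative rather than part of this PR.

```python
# bench_allreduce.py: rough gloo vs. SHM all_reduce timing sketch (illustrative).
# Launch with the deepspeed launcher on a single CPU node, e.g.:
#   deepspeed --bind_cores_to_rank bench_allreduce.py   (exact flags may vary)
import time
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed(dist_backend="gloo")  # CPU path goes through TorchBackend

def time_ms(fn, tensor, iters=50, warmup=5):
    """Average per-call latency of an all-reduce style op, in milliseconds."""
    for _ in range(warmup):
        fn(tensor)
    dist.barrier()
    start = time.perf_counter()
    for _ in range(iters):
        fn(tensor)
    dist.barrier()
    return (time.perf_counter() - start) / iters * 1e3

# 64KB and 32MB fp32 buffers, matching the message sizes in the table above.
for size_bytes in (64 * 1024, 32 * 1024 * 1024):
    x = torch.ones(size_bytes // 4, dtype=torch.float32)
    gloo_ms = time_ms(dist.all_reduce, x.clone())
    shm_ms = time_ms(dist.inference_all_reduce, x.clone())  # SHM kernel added by this PR
    if dist.get_rank() == 0:
        print(f"{size_bytes // 1024}KB: gloo {gloo_ms:.3f} ms, SHM {shm_ms:.3f} ms")
```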

In bloom-3b text generation with AutoTP, average token latency improved 1.45x with this PR on a 2-socket (2S) Xeon node.
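
For illustration, a rough sketch of the AutoTP generation setup behind such a measurement (the model name, `tp_size`, dtype, and generation settings are placeholders, and the launcher flags may differ per machine):

```python
# run_bloom_autotp.py: AutoTP text-generation sketch on CPU (illustrative).
# Launch with the deepspeed launcher, e.g. two ranks on one node:
#   deepspeed --bind_cores_to_rank run_bloom_autotp.py   (exact flags may vary)
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# AutoTP shards the linear layers across ranks; on a CPU-only install the
# per-layer reduction goes through inference_all_reduce, which this PR
# routes to the SHM kernel instead of gloo.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```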

@delock changed the title from "Support SHM based inference_all_reduce in TorchBackend" to "[CPU] Support SHM based inference_all_reduce in TorchBackend" on Apr 10, 2024
@delock (Collaborator, Author) commented on Apr 11, 2024

Hi @loadams, the formatting error has been fixed, thanks!

Review threads on csrc/cpu/comm/ccl.cpp and op_builder/cpu/comm.py (outdated, resolved)
@tjruwase added this pull request to the merge queue on Apr 17, 2024
Merged via the queue into microsoft:master with commit b22706a Apr 17, 2024
13 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024
Labels: none yet · Projects: none yet · 3 participants