[CPU] Support SHM based inference_all_reduce in TorchBackend #5391

Merged: 10 commits into microsoft:master on Apr 17, 2024

Conversation

@delock (Collaborator) commented on Apr 10, 2024

This PR adds an SHM-based `inference_all_reduce` kernel to the `TorchBackend` communication backend. When running inference on a CPU server, this path replaces the default `torch.distributed.all_reduce`, which eventually falls back to the gloo backend. This PR improves AutoTP inference performance when only stock PyTorch is installed, without Intel Extension for PyTorch.

Compared with the gloo backend, the SHM-based `inference_all_reduce` kernel takes a more direct path and performs much better on a single node (see the benchmark sketch after the table below).

| message size | gloo all_reduce (ms) | SHM all_reduce (ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |
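
For context, here is a minimal latency-comparison sketch along these lines. It assumes `deepspeed.comm` exposes `inference_all_reduce` after `deepspeed.init_distributed()`; the file name, launch command, and iteration counts are illustrative rather than part of this PR.

```python
# bench_allreduce.py: rough gloo vs. SHM all_reduce timing sketch (illustrative).
# Launch with the deepspeed launcher on a single CPU node, e.g.:
#   deepspeed --bind_cores_to_rank bench_allreduce.py   (exact flags may vary)
import time
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed(dist_backend="gloo")  # CPU path goes through TorchBackend

def time_ms(fn, tensor, iters=50, warmup=5):
    """Average per-call latency of an all-reduce style op, in milliseconds."""
    for _ in range(warmup):
        fn(tensor)
    dist.barrier()
    start = time.perf_counter()
    for _ in range(iters):
        fn(tensor)
    dist.barrier()
    return (time.perf_counter() - start) / iters * 1e3

# 64KB and 32MB fp32 buffers, matching the message sizes in the table above.
for size_bytes in (64 * 1024, 32 * 1024 * 1024):
    x = torch.ones(size_bytes // 4, dtype=torch.float32)
    gloo_ms = time_ms(dist.all_reduce, x.clone())
    shm_ms = time_ms(dist.inference_all_reduce, x.clone())  # SHM kernel added by this PR
    if dist.get_rank() == 0:
        print(f"{size_bytes // 1024}KB: gloo {gloo_ms:.3f} ms, SHM {shm_ms:.3f} ms")
```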

In bloom-3b text generation with AutoTP, average token latency improved 1.45x with this PR on a 2-socket (2S) Xeon node.
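
For illustration, a rough sketch of the AutoTP generation setup behind such a measurement (the model name, `tp_size`, dtype, and generation settings are placeholders, and the launcher flags may differ per machine):

```python
# run_bloom_autotp.py: AutoTP text-generation sketch on CPU (illustrative).
# Launch with the deepspeed launcher, e.g. two ranks on one node:
#   deepspeed --bind_cores_to_rank run_bloom_autotp.py   (exact flags may vary)
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# AutoTP shards the linear layers across ranks; on a CPU-only install the
# per-layer reduction goes through inference_all_reduce, which this PR
# routes to the SHM kernel instead of gloo.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```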

@delock changed the title from "Support SHM based inference_all_reduce in TorchBackend" to "[CPU] Support SHM based inference_all_reduce in TorchBackend" on Apr 10, 2024
@delock (Collaborator, Author) commented on Apr 11, 2024

Hi @loadams, the formatting error has been fixed, thanks!

Review threads on csrc/cpu/comm/ccl.cpp and op_builder/cpu/comm.py (outdated, resolved)
@tjruwase added this pull request to the merge queue on Apr 17, 2024
Merged via the queue into microsoft:master with commit b22706a Apr 17, 2024
13 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024
Labels: none yet · Projects: none yet · 3 participants