-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
22 changed files
with
1,791 additions
and
15 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
name: nv-flash-attn | ||
|
||
on: | ||
workflow_dispatch: | ||
pull_request: | ||
paths: | ||
- 'deepspeed/sequence/**' | ||
- 'tests/unit/sequence_parallelism/**' | ||
- '.github/workflows/nv-flash-attn.yml' | ||
schedule: | ||
- cron: "0 0 * * *" | ||
|
||
concurrency: | ||
group: ${{ github.workflow }}-${{ github.ref }} | ||
cancel-in-progress: true | ||
|
||
jobs: | ||
unit-tests: | ||
runs-on: [self-hosted, nvidia, a6000] | ||
container: | ||
image: nvcr.io/nvidia/pytorch:24.03-py3 | ||
ports: | ||
- 80 | ||
options: --gpus all --shm-size "8G" | ||
|
||
steps: | ||
- uses: actions/checkout@v4 | ||
|
||
- name: Check container state | ||
run: | | ||
ldd --version | ||
nvcc --version | ||
nvidia-smi | ||
python -c "import torch; print('torch:', torch.__version__, torch)" | ||
python -c "import torch; print('CUDA available:', torch.cuda.is_available())" | ||
- name: Install transformers | ||
run: | | ||
git clone --depth=1 https://github.com/huggingface/transformers | ||
cd transformers | ||
git rev-parse --short HEAD | ||
python -m pip install . | ||
- name: Install deepspeed | ||
run: | | ||
python -m pip install .[dev] | ||
ds_report | ||
- name: Install FlashAttention | ||
run: | | ||
python -m pip install flash-attn | ||
- name: Python environment | ||
run: | | ||
python -m pip list | ||
- name: Unit tests | ||
run: | | ||
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch | ||
cd tests | ||
python -m pytest --color=yes --durations=0 --verbose -rF unit/sequence_parallelism/test_ulysses.py --torch_ver="2.3" --cuda_ver="12" | ||
- name: Open GitHub issue if nightly CI fails | ||
if: ${{ failure() && (github.event_name == 'schedule') }} | ||
uses: JasonEtco/create-an-issue@v2 | ||
env: | ||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
with: | ||
filename: .github/ISSUE_TEMPLATE/ci_failure_report.md | ||
update_existing: true |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,263 @@ | ||
# Ulysses-Offload: Democratizing Long Context LLM Training | ||
|
||
<img src="./media/image1.png" style="width:6.5in;height:3.34583in" | ||
alt="A screenshot of a computer Description automatically generated" /> | ||
|
||
Figure 1: Ulysses-Offload supports 16x longer sequence lengths at 55% | ||
Model FLOPs Utilization (MFU) than NVIDIA Megatron-SP and DeepSpeed Ulysses. | ||
|
||
|
||
To cite and for more technical in depth of this release, please see | ||
our [arxiv report](https://arxiv.org/abs/2408.16978): | ||
|
||
@article{yao2024ulysses, | ||
|
||
title={ Training Ultra Long Context Language Model with Fully Pipelined | ||
Distributed Transformer}, | ||
|
||
author={Jinghan Yao and Sam Ade Jacobs and Masahiro Tanaka and Olatunji | ||
Ruwase and Aamir Shafi and Hari Subramoni and Dhabaleswar K. (DK) Panda | ||
}, | ||
|
||
journal={https://arxiv.org/abs/2408.16978}, | ||
|
||
year={2024} | ||
|
||
} | ||
|
||
## Introduction | ||
|
||
In the rapidly evolving field of generative AI and scientific ML, the | ||
ability to train large (language) models with ultra-long context | ||
capabilities is becoming increasingly important. These models are | ||
essential for a variety of complex tasks, such as understanding | ||
lengthy documents, generating images and videos, and processing extensive | ||
sequences in computational biology. However, training such models | ||
efficiently poses significant challenges due to the enormous GPU | ||
memory required. | ||
|
||
Building DeepSpeed Ulysses, our previous project, which developed | ||
system optimizations for training extremely long sequence transformer | ||
models, we are excited to present Ulysses-Offload, in this release. Ulysses-Offload | ||
is an innovative, resource-efficient technique that offers comparable | ||
benefits to DeepSpeed Ulysses and other previous long-context | ||
optimization methods, but with a lower hardware budget. Ulysses-Offload makes | ||
ultra long-context large language models (LLM) training and finetuning | ||
accessible to everyone, including those with limited GPU resources. Ulysses-Offload enables | ||
training with context lengths of up to 2 million tokens using just 4 | ||
NVIDIA A100-40GB GPUs. Ulysses-Offload supports 16x longer sequence lengths at 55% | ||
Model FLOPs Utilization (MFU) than NVIDIA Megatron-SP and DeepSpeed Ulysses | ||
(see Figure 1). The next section highlights the key innovations of Ulysses-Offload, | ||
and subsequent sections provide additional details on the design and | ||
usability of Ulysses-Offload, followed by experimental results. | ||
|
||
## Key Innovations | ||
|
||
### 1. Fully Pipelined Distributed Transformer (FPDT) | ||
|
||
The core innovation of our work is the Fully Pipelined Distributed | ||
Transformer (FPDT). This approach leverages a pipelined sequence | ||
chunking, which allows for the training of LLMs with sequence lengths up | ||
to 2 million tokens on just 4 A100-40GB GPUs. By breaking down the | ||
sequence into manageable chunks and processing them in a pipelined | ||
manner, Ulysses-Offload significantly reduces the memory footprint while | ||
maintaining high computational efficiency. This method ensures that the | ||
GPUs are utilized effectively, even when dealing with extremely long | ||
sequences. | ||
|
||
### 2. Memory Optimization | ||
|
||
One of the critical aspects of our approach is the comprehensive | ||
analysis and optimization of the memory footprint during LLM training. | ||
We target the reduction of redundant intermediate buffers in both the | ||
forward and backward passes of the training process. By optimizing the | ||
use of GPU and host CPU memory, we can train larger models with longer | ||
sequences without running into GPU memory limitations. This optimization | ||
is crucial for enabling the training of ultra-long context models on a | ||
limited number of GPUs. It is worth noting that Ulysses-Offload memory optimization | ||
is orthogonal and complementary to model- parameter-focused memory | ||
optimization techniques used by DeepSpeed ZeRO and PyTorch FSDP. Ulysses-Offload optimizes memory footprint of activations associated with long sequences while ZeRO and FSDP optimize memory footprint of model parameters. | ||
|
||
### 3. Compatibility and Flexibility | ||
|
||
Ulysses-Offload is designed to be agnostic to existing training techniques and | ||
works efficiently across different LLM models, including popular | ||
architecture like GPT and Llama. This flexibility ensures that our | ||
approach can be easily integrated into various training workflows. | ||
Additionally, Ulysses-Offload is compatible with advanced memory optimization | ||
techniques such as DeepSpeed ZeRO and PyTorch FSDP, further enhancing | ||
its usability and performance. | ||
|
||
## Core Design of Ulysses-Offload | ||
|
||
Figure 2 illustrates the core structure of Ulysses-Offload. Ulysses-Offload leverages multiple | ||
memory hierarchies in modern GPU clusters, thus boosting hardware | ||
efficiency and cost-effectiveness while achieving very high model FLOP | ||
utilization (MFU). The design of Ulysses-Offload centers around pipelining, | ||
scheduling, and memory management. These well-known optimization | ||
techniques are essential for scaling LLM context length to a million | ||
scale with a few GPUs and will be discussed in the subsequent | ||
subsections. | ||
|
||
<img src="./media/image2.png" style="width:6.5in;height:2.68634in" | ||
alt="A screenshot of a computer Description automatically generated" /> | ||
|
||
Figure 2: Core design | ||
|
||
### | ||
|
||
### Pipelining and Scheduling | ||
|
||
Ulysses-Offload employs sequence chunking and pipelined computation design to manage the memory | ||
and computational load efficiently. In traditional Transformer model, | ||
input (hidden state) tensor is projected to q, k, v tensors. Each of these tensors can be denoted *\[B, S, H, D\]*, where *B* is batch | ||
size, *S* is sequence length, *H* is number of heads and *D* is hidden | ||
dimension per head. With sequence parallelism such as DeepSpeed Ulysses, | ||
input tensor is partitioned along sequence dimension across sequence | ||
parallel group P, that is *\[B, S/P, H, D\]* prior to alltoall collective | ||
communication. The alltoall collective communication gathers partitioned tensors | ||
along sequence dimension and scatter them along head dimension essentially | ||
transforming tensor from *\[B, S/P, H, D\]* to *\[B, S, H/P, D\]*. Post attention computation, a second alltoall communication transforms *\[B, S, H/P, D\]* back to *\[B, S/P, H, D\]* | ||
|
||
In our Ulysses-Offload design, input sequence are partitioned at a much finer granularity than DeepSpeed Ulysses. In other words, we made changes to sequence partitioning such that we further subdivide per GPU *S/P* sequence into smaller *u* | ||
chunks. Thus, the input tensors are now represented as \[*B, S/uP, H, | ||
D*\]. We denote these chunks as *T<sub>i</sub>*, | ||
where$\ i\ \in \ 0,1,\ldots,\ u - 1.$ As shown in Figure 1, | ||
*T<sub>i</sub>* is projected to query *q<sub>i</sub>*, key | ||
*k<sub>i</sub>*, and value *v<sub>i</sub>*. Then, similar to DeepSpeed Ulysses, an alltoall collective communication gathers partitioned tensor | ||
along sequence dimension and scatter them along head dimension. In our chunk | ||
design, the sequence length for each chunk is reduced by a factor of *u* | ||
compared to Ulysses. Please note that our Ulysses-Offload chunking procedure is generally applicable to other sequence parallelism techniques. | ||
|
||
<img src="./media/image3.png" style="width:6.5in;height:5.36042in" | ||
alt="A screenshot of a computer Description automatically generated" /> | ||
|
||
Figure 3: Core design with offload description | ||
|
||
Figure 3 gives an example of how to perform the computation of chunk | ||
*T<sub>m</sub>*. After the alltoall collective communication, | ||
*GPU<sub>j</sub>* receives | ||
$\widehat{q}m,\ \widehat{k}m,\ and\ \widehat{v}m$*.* We then fetch the | ||
previous sequence chunk by chunk from the host memory to | ||
GPU<sub>j</sub>, and perform online attention with the current | ||
$\widehat{q}m$ and update the output chunk accordingly. Note that, in a | ||
strict manner, at any given time, only one set of chunks | ||
$\widehat{k}i,\ and\ \widehat{v}i$ is placed on GPU's HBM, reducing the | ||
memory footprint to $\frac{1}{u}$ compared to the non-offloading version | ||
without double buffering. With double buffering, memory footprint is | ||
reduced by *2/u*. | ||
|
||
### Memory Management | ||
|
||
Ulysses-Offload optimizes memory usage by carefully managing the allocation and | ||
deallocation of buffers during training. This involves: | ||
|
||
1. Double Buffering: | ||
|
||
- Two sets of buffers are maintained to overlap computation with | ||
data transfer. | ||
|
||
- While one set of buffers is used for computation, the other set is | ||
preloaded with the next chunk of data. | ||
|
||
2. Hierarchical Memory Utilization: | ||
|
||
- GPU High Bandwidth Memory (HBM) is used for active computation. | ||
|
||
- Host memory is used to store intermediate results that are not | ||
immediately needed, reducing the pressure on GPU memory. | ||
|
||
## Integration with Existing Frameworks | ||
|
||
Ulysses-Offload is designed to integrate seamlessly with popular deep learning | ||
frameworks such as PyTorch. Ulysses-Offload provides user-friendly APIs that | ||
abstract the complexities of pipelined training and memory management. | ||
Users can adopt Ulysses-Offload with minimal changes to existing codebases. | ||
|
||
## Experimental Results | ||
|
||
<img src="./media/image4.png" style="width:6.5in;height:3.37431in" | ||
alt="A collage of graphs Description automatically generated" /> | ||
|
||
Figure 4: Supported sequence lengths and corresponding Model FLOPs | ||
Utilization (MFU) using Megatron-SP, Ulysses, and our proposed Ulysses-Offload (FPDT). OOM | ||
denotes the point where increasing sequence length will cause memory | ||
issues. We show Ulysses-Offload's performance when the sequence length is larger | ||
than 128K, as shorter sequences can be properly handled by existing | ||
strategies. | ||
|
||
### Extended Sequence Lengths | ||
|
||
In our experimental setup, we compare Ulysses-Offload with two existing methods: | ||
Microsoft DeepSpeed Ulysses and NVIDIA Megatron-SP. Both DeepSpeed | ||
Ulysses and Megatron-SP employ similar approaches to sequence | ||
parallelism but differ in the collective communication used for | ||
gathering sequences before the attention block. The former utilizes | ||
alltoall communication, whereas the latter employs allgather. Ulysses-Offload | ||
builds upon the DeepSpeed Ulysses approach. The primary advantage of | ||
Ulysses-Offload is its capability to support the training of large language models | ||
(LLMs) with ultra-long sequence lengths using fewer GPUs. As shown in | ||
Figure 4, our method enables the training of 8B parameter models with | ||
sequence lengths of 2 million tokens using only 4 GPUs. For even larger | ||
models, such as GPT-30B and Llama-70B parameter models, Ulysses-Offload supports | ||
sequence lengths up to 3 million and 4 million tokens using 16 GPUs and | ||
32 GPUs respectively. This represents a 16x increase in sequence length | ||
compared to current state-of-the-art solutions (see Figure 5), making | ||
Ulysses-Offload a game-changer for tasks that require processing long sequences. | ||
|
||
### High Hardware Efficiency | ||
|
||
As shown in Figure 4 with different model sizes ranging from GPT-2.7B to | ||
Llama-80B parameters, Ulysses-Offload achieves over 55% Model FLOPs Utilization | ||
(MFU), ensuring that the hardware resources are utilized effectively. | ||
This high level of efficiency is maintained even when dealing with | ||
extremely long sequences (up to 4 million context length), making Ulysses-Offload | ||
an ideal solution for training large-scale LLMs. By maximizing the use | ||
of available hardware, Ulysses-Offload reduces the overall cost and complexity of | ||
training long-context models. Our [technical report](https://arxiv.org/abs/2408.16978) offers | ||
further insights into optimizing sequence chunks to balance the | ||
trade-off between memory usage and MFU. | ||
|
||
<img src="./media/image5.png" style="width:6.5in;height:2.01667in" /> | ||
|
||
Figure 5: A comprehensive analysis on long-context LLM training with | ||
different training techniques: tensor parallelism (TP), activation | ||
checkpoint (AC), activation checkpoint with CPU offloading (OC), Ulysses | ||
(UL), and our approach Ulysses-Offload (FPDT). | ||
|
||
## Implementation and Usability | ||
|
||
Ulysses-Offload is designed to be easily integrated with popular deep learning | ||
frameworks such as DeepSpeed, Megatron-DeepSpeed and PyTorch. Users can | ||
adopt our approach with minimal changes to their existing training | ||
pipeline, making it accessible to a broad audience. The integration | ||
process involves setting up the sequence chunk pipeline and configuring | ||
the memory optimization techniques, both of which are straightforward | ||
and well-documented (see tutorial). | ||
|
||
Our pipeline design and memory optimization techniques are | ||
straightforward to implement, making Ulysses-Offload accessible to researchers and | ||
practitioners aiming to train long-context LLMs efficiently. We provide | ||
detailed [technical report](https://arxiv.org/abs/2408.16978), | ||
documentation and examples to guide users through the setup process, | ||
ensuring a smooth transition to using Ulysses-Offload. Additionally, Ulysses-Offload, in the | ||
tradition of DeepSpeed provides user-friendly API which abstracts the | ||
complexities of mixed precision training and memory optimization, | ||
allowing users to focus on their research and development tasks. | ||
|
||
## General Availability of DeepSpeed Ulysses-Offload | ||
|
||
We are excited to release Ulysses-Offload. Ulysses-Offload has been | ||
fully integrated with Megatron-DeepSpeed and accessible through both | ||
DeepSpeed and Megatron-DeepSpeed GitHub repos. Click here for detailed | ||
[tutorial](https://www.deepspeed.ai/tutorials/ulysses-offload/) on usage. | ||
|
||
We invite the community to explore our implementation, contribute to | ||
further advancements, and join us in pushing the boundaries of what is | ||
possible in LLM and AI. This release is part of the bigger DeepSpeed | ||
ecosystem of large-scale AI training, finetuning and inference. For more | ||
details on all DeepSpeed technologies and innovations, please visit our | ||
[website]((https://www.deepspeed.ai/)) and follow us | ||
on X, formerly Twitter, ([English](https://twitter.com/MSFTDeepSpeed), | ||
[Japanese](https://twitter.com/MSFTDeepSpeedJP)) and | ||
[Chinese Zhihu](https://www.zhihu.com/people/deepspeed). |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Oops, something went wrong.