Ulyssess offload blog #6814

Merged
merged 13 commits into from
Dec 5, 2024
263 changes: 263 additions & 0 deletions blogs/ulysses-offload/README.md
@@ -0,0 +1,263 @@
# Ulysses-Offload: Democratizing Long Context LLM Training

<img src="./media/image1.png" style="width:6.5in;height:3.34583in"
alt="A screenshot of a computer Description automatically generated" />

Figure 1: Ulysses-Offload supports 16x longer sequence lengths at 55%
Model FLOPs Utilization (MFU) than NVIDIA Megatron-SP and DeepSpeed Ulysses.


To cite this release, and for more in-depth technical details, please
see our [arXiv report](https://arxiv.org/abs/2408.16978):

@article{yao2024ulysses,

title={Training Ultra Long Context Language Model with Fully Pipelined
Distributed Transformer},

author={Jinghan Yao and Sam Ade Jacobs and Masahiro Tanaka and Olatunji
Ruwase and Aamir Shafi and Hari Subramoni and Dhabaleswar K. (DK) Panda},

journal={https://arxiv.org/abs/2408.16978},

year={2024}

}

## Introduction

In the rapidly evolving field of generative AI and scientific ML, the
ability to train large (language) models with ultra-long context
capabilities is becoming increasingly important. These models are
essential for a variety of complex tasks, ranging from understanding
lengthy documents to image and video generation to processing extensive
sequences in computational biology. However, training such models
efficiently poses significant challenges due to the enormous GPU
resources and memory required.

Building on our previous project, DeepSpeed Ulysses, which focuses on
system optimizations for training extremely long sequence transformer
models, we are excited to present the Fully Pipelined Distributed
Transformer (FPDT), also known as Ulysses-Offload, in this release. FPDT
is a resource-efficient technique that offers benefits comparable to
DeepSpeed Ulysses and other previous long-context optimization methods,
but on a modest hardware budget. FPDT makes ultra-long-context large
language model (LLM) training and finetuning accessible to users with
limited GPU resources. FPDT enables training with context lengths of up
to 2 million tokens using just 4 NVIDIA A100-40GB GPUs. At 55% Model
FLOPs Utilization (MFU), FPDT supports sequence lengths 16x longer than
NVIDIA Megatron-SP and DeepSpeed Ulysses (see Figure 1). The next
section highlights the key innovations of FPDT, and subsequent sections
provide additional details on the design and usability of FPDT, followed
by experimental results.

## Key Innovations

### 1. Fully Pipelined Distributed Transformer (FPDT)

The core innovation of our work is the Fully Pipelined Distributed
Transformer (FPDT). This approach uses pipelined sequence chunking,
which allows LLMs to be trained with sequence lengths of up to 2 million
tokens on just 4 A100-40GB GPUs. By breaking the sequence into
manageable chunks and processing them in a pipelined manner, FPDT
significantly reduces the memory footprint while maintaining high
computational efficiency. This keeps the GPUs well utilized even when
dealing with extremely long sequences.

### 2. Memory Optimization

One of the critical aspects of our approach is the comprehensive
analysis and optimization of the memory footprint during LLM training.
We target the reduction of redundant intermediate buffers in both the
forward and backward passes of the training process. By optimizing the
use of GPU and host CPU memory, we can train larger models with longer
sequences without running into GPU memory limitations. This optimization
is crucial for enabling the training of ultra-long context models on a
limited number of GPUs. It is worth noting that FPDT memory optimization
is orthogonal and complementary to the model-parameter-focused memory
optimization techniques used by DeepSpeed ZeRO and PyTorch FSDP.

### 3. Compatibility and Flexibility

FPDT is designed to be agnostic to existing training techniques and
works efficiently across different LLMs, including popular
architectures such as GPT and Llama. This flexibility ensures that our
approach can be easily integrated into various training workflows.
Additionally, FPDT is compatible with advanced memory optimization
techniques such as DeepSpeed ZeRO and PyTorch FSDP, further enhancing
its usability and performance.

## Core Design of Fully Pipelined Distributed Transformer

Figure 2 illustrates the core structure of FPDT. FPDT leverages multiple
levels of the memory hierarchy in modern GPU clusters, boosting hardware
efficiency and cost-effectiveness while achieving very high Model FLOPs
Utilization (MFU). The design of FPDT centers on pipelining, scheduling,
and memory management. These well-known optimization techniques are
essential for scaling LLM context length to millions of tokens with only
a few GPUs, and are discussed in the following subsections.

<img src="./media/image2.png" style="width:6.5in;height:2.68634in"
alt="A screenshot of a computer Description automatically generated" />

Figure 2: FPDT core design

### Pipelining and Scheduling

FPDT employs a pipelined sequence chunking design to manage memory and
computational load efficiently. In a traditional Transformer model, the
input QKV tensor can be denoted *\[B, S, H, D\]*, where *B* is the batch
size, *S* is the sequence length, *H* is the number of heads, and *D* is
the hidden dimension per head. With sequence parallelism such as
DeepSpeed Ulysses, the input tensor is partitioned along the sequence
dimension across a sequence parallel group of size *P*, giving
*\[B, S/P, H, D\]* per GPU prior to the alltoall collective
communication. The alltoall communication gathers the partitioned
tensors along the sequence dimension and scatters them along the head
dimension, essentially transforming the tensor from *\[B, S/P, H, D\]*
to *\[B, S, H/P, D\]*. In our FPDT design, we further subdivide the
per-GPU *S/P* sequence into *u* chunks. Thus, the input tensor is now
represented as *\[B, S/(uP), H, D\]*. We denote these chunks as
*T<sub>i</sub>*, where $i = 0, 1, \ldots, u - 1$. As shown in Figure 2,
*T<sub>i</sub>* is projected to query *q<sub>i</sub>*, key
*k<sub>i</sub>*, and value *v<sub>i</sub>*. Then, we perform the
alltoall communication among the sequence parallel group. In our chunk
design, the sequence length for each chunk is reduced by a factor of *u*
compared to Ulysses.
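
As a rough illustration of the shape bookkeeping described above, the
following Python sketch computes the per-GPU tensor shapes before and
after the alltoall, with and without chunking. The `Shape` tuple, the
function names, and the example values of *H*, *D*, and *u* are
illustrative assumptions and are not part of the FPDT API; only the
2-million-token sequence and *P* = 4 GPUs echo the configuration
mentioned in this post.

```python
from typing import NamedTuple, Tuple


class Shape(NamedTuple):
    batch: int
    seq: int
    heads: int
    head_dim: int


def ulysses_shapes(B: int, S: int, H: int, D: int, P: int) -> Tuple[Shape, Shape]:
    """Per-GPU QKV shapes before and after the alltoall (no chunking)."""
    assert S % P == 0 and H % P == 0, "S and H must be divisible by P"
    before = Shape(B, S // P, H, D)  # sequence sharded across P GPUs
    after = Shape(B, S, H // P, D)   # full sequence, heads sharded
    return before, after


def fpdt_chunk_shapes(B: int, S: int, H: int, D: int, P: int, u: int) -> Tuple[Shape, Shape]:
    """Shapes of one chunk T_i when each GPU's S/P tokens are split into u chunks."""
    assert S % (u * P) == 0 and H % P == 0, "S must be divisible by u*P, H by P"
    before = Shape(B, S // (u * P), H, D)  # chunk T_i held by one GPU
    after = Shape(B, S // u, H // P, D)    # the same chunk after the alltoall
    return before, after


if __name__ == "__main__":
    # Hypothetical configuration: H, D, and u are made-up example values.
    B, S, H, D, P, u = 1, 2 * 1024 * 1024, 32, 128, 4, 8
    print(ulysses_shapes(B, S, H, D, P))
    print(fpdt_chunk_shapes(B, S, H, D, P, u))
```

With these example values, each GPU holds 524,288 tokens under plain
sequence parallelism, while one FPDT chunk holds 65,536 tokens before
the alltoall and 262,144 tokens (with 8 of the 32 heads) after it, a
factor of *u* = 8 shorter than the post-alltoall Ulysses sequence.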

<img src="./media/image3.png" style="width:6.5in;height:5.36042in"
alt="A screenshot of a computer Description automatically generated" />

Figure 3: FPDT with Offload

Figure 3 gives an example of how the computation of chunk
*T<sub>m</sub>* is performed. After the alltoall collective
communication, *GPU<sub>j</sub>* receives $\widehat{q}_m$,
$\widehat{k}_m$, and $\widehat{v}_m$. We then fetch the previous
sequence chunks one at a time from host memory to *GPU<sub>j</sub>*,
perform online attention with the current $\widehat{q}_m$, and update
the output chunk accordingly. Note that, strictly, only one set of
chunks $\widehat{k}_i$ and $\widehat{v}_i$ resides in the GPU's HBM at
any given time, reducing the memory footprint to $\frac{1}{u}$ of the
non-offloading version when double buffering is not used. With double
buffering, the memory footprint is reduced to $\frac{2}{u}$.
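
The chunk-by-chunk attention described above relies on the fact that
softmax attention can be accumulated incrementally. The sketch below is
a minimal pure-Python illustration of that online-softmax idea for a
single query vector and a single head. It is not the FPDT kernel: the
function names are hypothetical, and causal masking, batching, and GPU
execution are all omitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _scores(q: Vector, keys: Sequence[Vector]) -> List[float]:
    """Scaled dot products between one query and a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def full_attention(q: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> List[float]:
    """Reference result: softmax over all keys at once."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def online_attention(
    q: Vector, kv_chunks: Sequence[Tuple[Sequence[Vector], Sequence[Vector]]]
) -> List[float]:
    """Visit (keys, values) chunks one at a time, keeping only running statistics."""
    running_max = -math.inf
    running_sum = 0.0
    acc: List[float] = []
    for keys, values in kv_chunks:
        scores = _scores(q, keys)
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        if not acc:
            acc = [0.0] * len(values[0])
        acc = [
            a * rescale + sum(w * v[d] for w, v in zip(weights, values))
            for d, a in enumerate(acc)
        ]
        running_sum = running_sum * rescale + sum(weights)
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because only the running maximum, the running normalizer, and the output
accumulator are carried between iterations, each key/value chunk can be
discarded (or returned to host memory) as soon as it has been consumed,
which is what allows a single set of chunks to be resident at a time.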

### Memory Management

FPDT optimizes memory usage by carefully managing the allocation and
deallocation of buffers during training. This involves:

1. Double Buffering:

- Two sets of buffers are maintained to overlap computation with
data transfer.

- While one set of buffers is used for computation, the other set is
preloaded with the next chunk of data.

2. Hierarchical Memory Utilization:

- GPU High Bandwidth Memory (HBM) is used for active computation.

- Host memory is used to store intermediate results that are not
immediately needed, reducing the pressure on GPU memory.
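
The double-buffering pattern listed above can be sketched in plain
Python as follows. This is only a conceptual illustration: a worker
thread stands in for the asynchronous host-to-GPU copy, Python lists
stand in for tensors, and the names `fetch_chunk`,
`process_with_double_buffering`, and `resident_fraction` are hypothetical
helpers rather than DeepSpeed functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fetch_chunk(host_chunks: Sequence[Sequence[float]], index: int) -> List[float]:
    """Stand-in for a host-to-GPU copy; here it simply copies a list."""
    return list(host_chunks[index])


def process_with_double_buffering(
    host_chunks: Sequence[Sequence[float]],
    compute: Callable[[List[float]], float],
) -> List[float]:
    """Overlap compute on chunk i with the fetch of chunk i + 1."""
    results: List[float] = []
    if not host_chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(fetch_chunk, host_chunks, 0)
        for i in range(len(host_chunks)):
            current = pending.result()  # first buffer: ready for compute
            if i + 1 < len(host_chunks):
                # Second buffer: start loading the next chunk now.
                pending = copier.submit(fetch_chunk, host_chunks, i + 1)
            results.append(compute(current))
    return results


def resident_fraction(u: int, double_buffering: bool) -> float:
    """Fraction of the non-offloaded footprint kept on the GPU: 1/u or 2/u."""
    return (2 if double_buffering else 1) / u


if __name__ == "__main__":
    chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
    print(process_with_double_buffering(chunks, sum))
    print(resident_fraction(8, double_buffering=True))
```

At most two chunks are alive at any moment (the one being computed on
and the one being prefetched), which corresponds to the 2/*u* footprint
with double buffering versus 1/*u* without it.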

## Integration with Existing Frameworks

FPDT is designed to integrate seamlessly with popular deep learning
frameworks such as PyTorch. FPDT provides user-friendly APIs that
abstract the complexities of pipelined training and memory management.
Users can adopt FPDT with minimal changes to existing codebases.

## Experimental Results

<img src="./media/image4.png" style="width:6.5in;height:3.37431in"
alt="A collage of graphs Description automatically generated" />

Figure 4: Supported sequence lengths and corresponding Model FLOPs
Utilization (MFU) using Megatron-SP, Ulysses, and our proposed FPDT. OOM
denotes the point where increasing sequence length will cause memory
issues. We show FPDT's performance when the sequence length is larger
than 128K, as shorter sequences can be properly handled by existing
strategies.

### Extended Sequence Lengths

In our experimental setup, we compare FPDT with two existing methods:
Microsoft DeepSpeed Ulysses and NVIDIA Megatron-SP. Both employ similar
approaches to sequence parallelism but differ in the collective
communication used for gathering sequences before the attention block:
the former uses alltoall communication, whereas the latter uses
allgather. FPDT builds upon the DeepSpeed Ulysses approach. The primary
advantage of FPDT is its ability to support training of large language
models (LLMs) with ultra-long sequence lengths using fewer GPUs. As
shown in Figure 4, our method enables training of 8B-parameter models
with sequence lengths of 2 million tokens using only 4 GPUs. For even
larger models, such as the GPT-30B and Llama-70B models, FPDT supports
sequence lengths of up to 3 million and 4 million tokens using 16 and 32
GPUs, respectively. This represents a 16x increase in sequence length
compared to current state-of-the-art solutions (see Figure 5), making
FPDT well suited to tasks that require processing long sequences.

### High Hardware Efficiency

As shown in Figure 4, across model sizes ranging from GPT-2.7B to
Llama-70B, FPDT achieves over 55% Model FLOPs Utilization (MFU),
ensuring that hardware resources are used effectively. This level of
efficiency is maintained even with extremely long sequences (up to 4
million tokens of context), making FPDT well suited to training
large-scale LLMs. By maximizing the use of available hardware, FPDT
reduces the overall cost and complexity of training long-context models.
Our [technical report](https://arxiv.org/abs/2408.16978) offers further
insights into choosing the sequence chunk size to balance the trade-off
between memory usage and MFU.

<img src="./media/image5.png" style="width:6.5in;height:2.01667in" />

Figure 5: A comprehensive analysis on long-context LLM training with
different training techniques: tensor parallelism (TP), activation
checkpoint (AC), activation checkpoint with CPU offloading (OC), Ulysses
(UL), and our approach FPDT.

## Implementation and Usability

FPDT is designed to be easily integrated with popular deep learning
frameworks such as DeepSpeed, Megatron-DeepSpeed, and PyTorch. Users can
adopt our approach with minimal changes to their existing training
pipeline, making it accessible to a broad audience. The integration
process involves setting up the sequence chunk pipeline and configuring
the memory optimization techniques, both of which are straightforward
and documented in the [tutorial](https://www.deepspeed.ai/tutorials/fpdt/).

Our pipeline design and memory optimization techniques are
straightforward to implement, making FPDT accessible to researchers and
practitioners aiming to train long-context LLMs efficiently. We provide
a detailed [technical report](https://arxiv.org/abs/2408.16978),
documentation, and examples to guide users through the setup process.
Additionally, in the tradition of DeepSpeed, FPDT provides a
user-friendly API that abstracts the complexities of mixed precision
training and memory optimization, allowing users to focus on their
research and development tasks.

## General Availability of DeepSpeed Ulysses-Offload

We are excited to release FPDT (aka Ulysses-Offload). FPDT has been
fully integrated with Megatron-DeepSpeed and is accessible through both
the DeepSpeed and Megatron-DeepSpeed GitHub repos. See the
[tutorial](https://www.deepspeed.ai/tutorials/fpdt/) for detailed usage
instructions.

We invite the community to explore our implementation, contribute to
further advancements, and join us in pushing the boundaries of what is
possible in LLMs and AI. This release is part of the bigger DeepSpeed
ecosystem of large-scale AI training, finetuning, and inference. For
more details on all DeepSpeed technologies and innovations, please visit
our [website](https://www.deepspeed.ai/) and follow us on X, formerly
Twitter ([English](https://twitter.com/MSFTDeepSpeed),
[Japanese](https://twitter.com/MSFTDeepSpeedJP)), and
[Chinese Zhihu](https://www.zhihu.com/people/deepspeed).
Binary file added blogs/ulysses-offload/media/image1.png
Binary file added blogs/ulysses-offload/media/image2.png
Binary file added blogs/ulysses-offload/media/image3.png
Binary file added blogs/ulysses-offload/media/image4.png
Binary file added blogs/ulysses-offload/media/image5.png