
Commit

update ulysses-offload tutorial
Jinghan Yao committed Dec 3, 2024
1 parent de18e22 commit 39a97cd
Showing 1 changed file with 1 addition and 10 deletions.
11 changes: 1 addition & 10 deletions in docs/_tutorials/ulysses-offload.md
@@ -1,5 +1,5 @@
---
title: "Fully Pipeliend Distributed Transformer"
title: "DeepSpeed Ulysses-Offload"
tags: training ultra long context language model with fully pipelined distributed transformer
---

@@ -11,25 +11,16 @@ We recommend that you read the tutorials on [Getting Started](/getting-started/)
## Design of Ulysses-Offload
Ulysses-Offload is a chunking- and offloading-based transformer implementation that retains the full precision of the vanilla transformer while significantly reducing the activation memory required during long-context model training. FPDT breaks the long sequence input into smaller chunks and moves them between host and GPU memory, achieving superior memory efficiency while reaching over 50% MFU. FPDT adopts a double-buffer design that overlaps the fetching/offloading with the attention computation. FPDT also allows users to configure the chunk size to match the expected memory budget.
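
The chunking and double-buffering pattern described above can be illustrated with a short PyTorch sketch. This is an illustration of the general idea, not DeepSpeed's FPDT implementation; names such as `compute_chunk` and `chunk_size` are assumptions, and a CUDA-capable GPU is assumed. Sequence chunks live in pinned host memory, chunk i+1 is prefetched on a side stream while chunk i is computed, and results are offloaded back to the host.

```python
import torch

def compute_chunk(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real per-chunk attention/MLP work.
    return x * 2.0

def process_sequence_double_buffered(seq_cpu: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Walk over a long sequence chunk by chunk, overlapping host<->GPU copies with compute."""
    assert seq_cpu.shape[0] % chunk_size == 0, "sequence length must be divisible by chunk_size"
    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()                  # side stream for async copies
    chunks = seq_cpu.pin_memory().split(chunk_size)    # pinned host memory enables async H2D/D2H

    outputs = []
    with torch.cuda.stream(copy_stream):               # prefetch the first chunk
        next_on_gpu = chunks[0].to(device, non_blocking=True)

    for i in range(len(chunks)):
        torch.cuda.current_stream().wait_stream(copy_stream)  # make sure chunk i has arrived
        current = next_on_gpu
        if i + 1 < len(chunks):                        # fetch chunk i+1 in the background
            with torch.cuda.stream(copy_stream):
                next_on_gpu = chunks[i + 1].to(device, non_blocking=True)
        out = compute_chunk(current)                   # compute on chunk i while the copy runs
        current.record_stream(torch.cuda.current_stream())    # tell the allocator it was used here
        host_out = torch.empty(out.shape, dtype=out.dtype, pin_memory=True)
        host_out.copy_(out, non_blocking=True)         # offload the result back to host memory
        outputs.append(host_out)

    torch.cuda.synchronize()                           # wait for all outstanding copies
    return torch.cat(outputs)
```

With a 262144-token sequence and a chunk size of 32768, for example, only the current and prefetched chunks reside on the GPU at any time, which keeps the activation footprint roughly proportional to the chunk size rather than to the full sequence length.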

<<<<<<< HEAD:docs/_tutorials/ulysses-offload.md
Ulysses-Offload supports ZeRO, which shards the model and tensors across GPU memory, further pushing the limit of long-context model training with state-of-the-art hardware efficiency.
=======
FPDT supports ZeRO and DeepSpeed-Ulysses, which shard the model and tensors across GPU memory, further pushing the limit of long-context model training with state-of-the-art hardware efficiency.
>>>>>>> 681a83f450f0b1ac3f1c28885e2cc4f1145aa2af:docs/_tutorials/fpdt.md
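
As a reference point, ZeRO sharding is turned on through the DeepSpeed configuration. Below is a minimal ZeRO-3 sketch in the Python-dict form that `deepspeed.initialize(config=...)` accepts; it is not the exact configuration used by the Megatron-DeepSpeed script in this tutorial, and the micro-batch size and CPU offload targets are placeholder assumptions.

```python
# Minimal ZeRO-3 config sketch (illustrative values, not the tutorial's exact script settings).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # shard optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},   # optionally keep optimizer states in host memory
        "offload_param": {"device": "cpu"},       # optionally keep parameters in host memory
    },
}
```

The dict can be passed directly as the `config` argument to `deepspeed.initialize`, or written out as a JSON file and supplied via `--deepspeed_config`.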


## Training Environment

For this tutorial, Flash Attention (CUDA) is required. We will configure an 8 billion parameter LLaMA model using the DeepSpeed [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/tree/master/) code. We will use 1 node of 4x [NVIDIA Tesla A100-SXM4 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/a100/).


<<<<<<< HEAD:docs/_tutorials/ulysses-offload.md
## Training a 6.7B parameter GPT with Ulysses-Offload
Users can set the context size at the beginning of the script. For this exercise, we will use a 256K context and a mini-batch size of one.
=======
## Training a 6.7B parameter GPT with FPDT
Users can set the context size at the beginning of the script. For this exercise, we will use a 256K context and a mini-batch size of one.
>>>>>>> 681a83f450f0b1ac3f1c28885e2cc4f1145aa2af:docs/_tutorials/fpdt.md
```
### Main configs
seq_len=262144 # needs to be a power of 2
```
