Llama 3.1 Configs and Code (#1208)

pytorch · Jul 24, 2024 · 403c7f3
1 parent 9ce0c32
commit 403c7f3

Showing 15 changed files with 1,530 additions and 35 deletions.
diff --git a/README.md b/README.md
@@ -1,19 +1,15 @@
+# torchtune
 
 [![Unit Test](https://github.com/pytorch/torchtune/actions/workflows/unit_test.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtune/actions/workflows/unit_test.yaml)
 ![Recipe Integration Test](https://github.com/pytorch/torchtune/actions/workflows/recipe_test.yaml/badge.svg)
 [![](https://dcbadge.vercel.app/api/server/4Xsdn8Rr9Q?style=flat)](https://discord.gg/4Xsdn8Rr9Q)
 
-&nbsp;
-&nbsp;
-
-torchtune now officially supports Meta Llama3! Check out our recipes for Llama3-8B-Instruct with LoRA, QLoRA and Full fine-tune in the [Llama3](#llama3) section! We also support 70B fine-tuning with LoRA! 🚀 🦙
-
-# torchtune
-
 [**Introduction**](#introduction) | [**Installation**](#installation) | [**Get Started**](#get-started) |  [**Documentation**](https://pytorch.org/torchtune/main/index.html) | [**Design Principles**](#design-principles) | [**Community Contributions**](#community-contributions) | [**License**](#license)
 
 &nbsp;
 
+> **July 2024**: torchtune has updated model weights for Llama3.1 in source and nightly builds! Check out our configs for both the [8B and 70B versions](recipes/configs/llama3_1/) of the model. LoRA, QLoRA, and full finetune methods are supported. Support for QLoRA 405B will be added soon.
+
 ## Introduction
 
 torchtune is a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs. We're excited to announce our alpha release!
@@ -44,14 +40,15 @@ torchtune currently supports the following models.
 
 | Model                                         | Sizes     |
 |-----------------------------------------------|-----------|
+| [Llama3.1](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1)    | 8B, 70B [[models](torchtune/models/llama3_1/_model_builders.py), [configs](recipes/configs/llama3_1/)]        |
 | [Llama3](https://llama.meta.com/llama3)    | 8B, 70B [[models](torchtune/models/llama3/_model_builders.py), [configs](recipes/configs/llama3/)]        |
 | [Llama2](https://llama.meta.com/llama2/)   | 7B, 13B, 70B [[models](torchtune/models/llama2/_model_builders.py), [configs](recipes/configs/llama2/)]        |
 | [Code-Llama2](https://ai.meta.com/blog/code-llama-large-language-model-coding/)   | 7B, 13B, 70B [[model](torchtune/models/code_llama2/_model_builders.py), [configs](recipes/configs/code_llama2/)] |
 | [Mistral](https://huggingface.co/mistralai)   | 7B [[model](torchtune/models/mistral/_model_builders.py), [configs](recipes/configs/mistral/)] |
 | [Gemma](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b)   | 2B, 7B [[model](torchtune/models/gemma/_model_builders.py), [configs](recipes/configs/gemma/)] |
 | [Microsoft Phi3](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3) | Mini [[model](torchtune/models/phi3/), [configs](recipes/configs/phi3/)]
 
-We'll be adding a number of new models in the coming weeks, including support for 70B versions and MoEs.
+We're always adding new models, but feel free to [file an Issue](https://github.com/pytorch/torchtune/issues/new) if there's a new one you would love to see in torchtune!
 
 &nbsp;
 
@@ -91,12 +88,12 @@ This table captures the peak memory usage and training speed for recipes in torc
 
 &nbsp;
 
-## Llama3
+## Llama3 and Llama3.1
 
 torchtune supports fine-tuning for the Llama3 8B and 70B size models. We currently support LoRA, QLoRA and full fine-tune on a single GPU as well as LoRA and full fine-tune on multiple devices for the 8B model, and LoRA on multiple devices for the 70B model. For all the details, take a look at our [tutorial](https://pytorch.org/torchtune/main/tutorials/llama3.html).
 
-**Note**: our Llama3 LoRA and QLoRA configs default to the instruct fine-tuned models.
-This is because not all special token embeddings are initialized in the base 8B and 70B models.
+> [!NOTE]
+> Our Llama3 and Llama3.1 LoRA and QLoRA configs default to the instruct fine-tuned models. This is because not all special token embeddings are initialized in the base 8B and 70B models.
 
 In our initial experiments for Llama3-8B, QLoRA has a peak allocated memory of ``~9GB`` while LoRA on a single GPU has a peak allocated memory of ``~19GB``. To get started, you can use our default configs to kick off training.
 
@@ -105,49 +102,49 @@ In our initial experiments for Llama3-8B, QLoRA has a peak allocated memory of `
 LoRA 8B
 
 ```bash
-tune run lora_finetune_single_device --config llama3/8B_lora_single_device
+tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
 ```
 
 QLoRA 8B
 
 ```bash
-tune run lora_finetune_single_device --config llama3/8B_qlora_single_device
+tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device
 ```
 
 Full 8B
 
 ```bash
-tune run full_finetune_single_device --config llama3/8B_full_single_device
+tune run full_finetune_single_device --config llama3_1/8B_full_single_device
 ```
 
 ### Multi GPU
 
 Full 8B
 
 ```bash
-tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full
+tune run --nproc_per_node 4 full_finetune_distributed --config llama3_1/8B_full
 ```
 
 LoRA 8B
 
 ```bash
-tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora
+tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_1/8B_lora
 ```
 
 LoRA 70B
 
 Note that the download command for the Meta-Llama3 70B model slightly differs from download commands for the 8B models. This is because we use the HuggingFace [safetensor](https://huggingface.co/docs/safetensors/en/index) model format to load the model. To download the 70B model, run
 ```bash
-tune download meta-llama/Meta-Llama-3-70b --hf-token <> --output-dir /tmp/Meta-Llama-3-70b --ignore-patterns "original/consolidated*"
+tune download meta-llama/Meta-Llama-3.1-70b --hf-token <> --output-dir /tmp/Meta-Llama-3.1-70b --ignore-patterns "original/consolidated*"
 ```
 
 Then, a finetune can be kicked off:
 
 ```bash
-tune run --nproc_per_node 8 lora_finetune_distributed --config recipes/configs/llama3/70B_lora.yaml
+tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_1/70B_lora.yaml
 ```
 
-You can find a full list of all our Llama3 configs [here.](recipes/configs/llama3)
+You can find a full list of all our Llama3 configs [here](recipes/configs/llama3) and Llama3.1 configs [here.](recipes/configs/llama3_1)
 
 
 &nbsp;
@@ -199,12 +196,6 @@ To get started with fine-tuning your first LLM with torchtune, see our tutorial
 
 Follow the instructions on the official [`meta-llama`](https://huggingface.co/meta-llama) repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.
 
-### Llama2 download
-```bash
-tune download meta-llama/Llama-2-7b-hf \
---output-dir /tmp/Llama-2-7b-hf \
---hf-token <HF_TOKEN> \
-```
 
 ### Llama3 download
 ```bash
@@ -213,28 +204,28 @@ tune download meta-llama/Meta-Llama-3-8B \
 --hf-token <HF_TOKEN> \
 ```
 
-
-> Tip: Set your environment variable `HF_TOKEN` or pass in `--hf-token` to the command in order to validate your access.
-You can find your token at https://huggingface.co/settings/tokens
+> [!Tip]
+> Set your environment variable `HF_TOKEN` or pass in `--hf-token` to the command in order to validate your access. You can find your token at https://huggingface.co/settings/tokens
 
 &nbsp;
 
 ### Running fine-tuning recipes
 
-Llama2 7B + LoRA on single GPU:
+Llama3 8B + LoRA on single GPU:
 
 ```bash
 tune run lora_finetune_single_device --config llama2/7B_lora_single_device
 ```
 
 For distributed training, tune CLI integrates with [torchrun](https://pytorch.org/docs/stable/elastic/run.html).
-Llama2 7B + LoRA on two GPUs:
+Llama3 8B + LoRA on two GPUs:
 
 ```bash
 tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full
 ```
 
-> Tip: Make sure to place any torchrun commands **before** the recipe specification. Any CLI args after this will override the config and not impact distributed training.
+> [!Tip]
+> Make sure to place any torchrun commands **before** the recipe specification. Any CLI args after this will override the config and not impact distributed training.
 
 &nbsp;
 

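The README tip in this diff says that torchrun options go before the recipe name and that any arguments after the config override config values. The sketch below illustrates that ordering for the command shapes shown in the README. It is not how the `tune` CLI parses its arguments: `split_tune_run` and `apply_overrides` are hypothetical helpers, the recipe is assumed to be the token immediately before `--config`, and override values are kept as plain strings, whereas a real config system would likely convert types.

```python
import copy
import shlex


def split_tune_run(command: str) -> dict:
    """Split a `tune run` command into its parts (illustrative only).

    Assumes the layout used in the README examples:
    tune run [torchrun options] <recipe> --config <config> [key=value ...]
    """
    tokens = shlex.split(command)
    if tokens[:2] != ["tune", "run"]:
        raise ValueError("expected a command starting with 'tune run'")
    args = tokens[2:]
    if "--config" not in args:
        raise ValueError("expected a --config argument")
    config_idx = args.index("--config")
    if config_idx == 0 or config_idx + 1 >= len(args):
        raise ValueError("expected '<recipe> --config <config>'")
    return {
        # Everything before the recipe name is treated as torchrun options.
        "torchrun_args": args[: config_idx - 1],
        "recipe": args[config_idx - 1],
        "config": args[config_idx + 1],
        # Everything after the config name is treated as a config override.
        "overrides": args[config_idx + 2:],
    }


def apply_overrides(config: dict, overrides: list) -> dict:
    """Apply dotted key=value overrides to a copy of a nested dict."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"override {item!r} is not of the form key=value")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value  # kept as a string in this sketch
    return result


parts = split_tune_run(
    "tune run --nproc_per_node 2 lora_finetune_distributed "
    "--config llama3_1/8B_lora checkpointer.checkpoint_dir=/tmp/ckpt"
)
base = {"checkpointer": {"checkpoint_dir": "/tmp/Meta-Llama-3.1-8B-Instruct/"}}
print(parts["torchrun_args"], parts["recipe"], parts["config"])
print(apply_overrides(base, parts["overrides"]))
```

Placing `--nproc_per_node 2` after the config name would, under this reading, put it in the override list instead of passing it to torchrun, which is the failure mode the tip warns about.
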
diff --git a/docs/source/api_ref_models.rst b/docs/source/api_ref_models.rst
@@ -6,8 +6,8 @@ torchtune.models
 
 .. currentmodule:: torchtune.models
 
-llama3
-------
+llama3 & llama3.1
+-----------------
 
 All models from the `Llama3 family <https://llama.meta.com/llama3/>`_.
 
@@ -21,9 +21,10 @@ To download the Llama3-70B-Instruct model:
 
 .. code-block:: bash
 
-    tune download meta-llama/Meta-Llama-3-70B-Instruct --hf-token <HF_TOKEN>
-    --ignore-patterns "original/consolidated*"
+    tune download meta-llama/Meta-Llama-3-70B-Instruct --hf-token <HF_TOKEN> --ignore-patterns "original/consolidated*"
 
+To download the Llama3.1 weights of the above models, you can instead download from `Meta-Llama-3.1-8B-Instruct`
+or `Meta-Llama-3.1-70B-Instruct`.
 
 .. autosummary::
     :toctree: generated/
@@ -40,6 +41,21 @@ To download the Llama3-70B-Instruct model:
     llama3.llama3_tokenizer
     llama3.Llama3Tokenizer
 
+    |
+
+    llama3_1.llama3_1
+    llama3_1.lora_llama3_1
+    llama3_1.llama3_1_8b
+    llama3_1.lora_llama3_1_8b
+    llama3_1.qlora_llama3_1_8b
+    llama3_1.llama3_1_70b
+    llama3_1.lora_llama3_1_70b
+    llama3_1.qlora_llama3_1_70b
+
+
+.. note::
+
+    The Llama3.1 tokenizer reuses the `llama3.llama3_tokenizer` builder class.
 
 llama2
 ------

diff --git a/recipes/configs/generation.yaml b/recipes/configs/generation.yaml
@@ -34,6 +34,7 @@ chat_format: null
 max_new_tokens: 300
 temperature: 0.6 # 0.8 and 0.6 are popular values to try
 top_k: 300
+# It is recommended to set enable_kv_cache=False for long-context models like Llama3.1
 enable_kv_cache: True
 
 quantizer: null
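
The new comment recommends `enable_kv_cache=False` for long-context models but does not say why. One plausible reason is memory: a key/value cache sized for the full context grows linearly with sequence length. The sketch below is a back-of-the-envelope estimate, not torchtune's actual allocation logic. The model dimensions (32 layers, 8 key/value heads, head dimension 128, a 131072-token context, 2 bytes per bf16 element) are assumptions for a Llama3.1 8B style model and do not come from this diff.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,
) -> int:
    """Estimate the size of a KV cache pre-allocated for max_seq_len tokens.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * max_seq_len
        * batch_size
        * bytes_per_elem
    )


GIB = 1024 ** 3

# Assumed dimensions (not taken from this commit).
assumed_dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)

short_ctx = kv_cache_bytes(max_seq_len=8192, **assumed_dims)
long_ctx = kv_cache_bytes(max_seq_len=131072, **assumed_dims)

print(f"8192-token cache:   {short_ctx / GIB:.1f} GiB")
print(f"131072-token cache: {long_ctx / GIB:.1f} GiB")
```

Under these assumptions the cache for the long context is 16 times larger than for an 8192-token context, which may matter on a single GPU that also holds the model weights.
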
diff --git a/recipes/configs/llama3_1/70B_full.yaml b/recipes/configs/llama3_1/70B_full.yaml
@@ -0,0 +1,109 @@
+# Config for multi-device full finetuning in full_finetune_distributed.py
+# using a Llama3.1 70B Instruct model
+#
+# This config assumes that you've run the following command before launching
+# this run:
+#   tune download meta-llama/Meta-Llama-3.1-70B-Instruct --output-dir /tmp/Meta-Llama-3.1-70B-Instruct --ignore-patterns "original/consolidated*"
+#
+# To launch on 8 devices, run the following command from root:
+#   tune run --nproc_per_node 8 full_finetune_distributed --config llama3_1/70B_full
+#
+# You can add specific overrides through the command line. For example
+# to override the checkpointer directory while launching training
+# you can run:
+#   tune run --nproc_per_node 8 full_finetune_distributed --config llama3_1/70B_full checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
+#
+# This config is only tested on an 8xA100 machine.
+
+
+# Tokenizer
+tokenizer:
+  _component_: torchtune.models.llama3.llama3_tokenizer
+  path: /tmp/Meta-Llama-3.1-70B-Instruct/original/tokenizer.model
+
+# Dataset
+dataset:
+  _component_: torchtune.datasets.alpaca_dataset
+seed: null
+shuffle: True
+
+# Model Arguments
+model:
+  _component_: torchtune.models.llama3_1.llama3_1_70b
+
+checkpointer:
+  _component_: torchtune.utils.FullModelHFCheckpointer
+  checkpoint_dir: /tmp/Meta-Llama-3.1-70B-Instruct/
+  checkpoint_files: [
+    model-00001-of-00030.safetensors,
+    model-00002-of-00030.safetensors,
+    model-00003-of-00030.safetensors,
+    model-00004-of-00030.safetensors,
+    model-00005-of-00030.safetensors,
+    model-00006-of-00030.safetensors,
+    model-00007-of-00030.safetensors,
+    model-00008-of-00030.safetensors,
+    model-00009-of-00030.safetensors,
+    model-00010-of-00030.safetensors,
+    model-00011-of-00030.safetensors,
+    model-00012-of-00030.safetensors,
+    model-00013-of-00030.safetensors,
+    model-00014-of-00030.safetensors,
+    model-00015-of-00030.safetensors,
+    model-00016-of-00030.safetensors,
+    model-00017-of-00030.safetensors,
+    model-00018-of-00030.safetensors,
+    model-00019-of-00030.safetensors,
+    model-00020-of-00030.safetensors,
+    model-00021-of-00030.safetensors,
+    model-00022-of-00030.safetensors,
+    model-00023-of-00030.safetensors,
+    model-00024-of-00030.safetensors,
+    model-00025-of-00030.safetensors,
+    model-00026-of-00030.safetensors,
+    model-00027-of-00030.safetensors,
+    model-00028-of-00030.safetensors,
+    model-00029-of-00030.safetensors,
+    model-00030-of-00030.safetensors,
+  ]
+  recipe_checkpoint: null
+  output_dir: /tmp/Meta-Llama-3.1-70B-Instruct/
+  model_type: LLAMA3
+resume_from_checkpoint: False
+
+# Fine-tuning arguments
+batch_size: 2
+epochs: 3
+
+optimizer:
+  _component_: torch.optim.AdamW
+  lr: 2e-5
+  foreach: False
+  # Note: highly recommended to use fused=True optimizer flag
+  # with CPU offload for faster optimizer step.
+  fused: True
+
+loss:
+  _component_: torch.nn.CrossEntropyLoss
+max_steps_per_epoch: null
+gradient_accumulation_steps: 1
+
+
+# Training env
+device: cuda
+
+# Memory management
+enable_activation_checkpointing: True
+memory_efficient_fsdp_wrap: True
+fsdp_cpu_offload: True
+
+# Reduced precision
+dtype: bf16
+
+# Logging
+metric_logger:
+  _component_: torchtune.utils.metric_logging.DiskLogger
+  log_dir: ${output_dir}
+output_dir: /tmp/alpaca-llama3-finetune
+log_every_n_steps: 1
+log_peak_memory_stats: False
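
The `checkpoint_files` list in this config enumerates 30 shards that follow a single naming pattern. A small helper can regenerate the list and check a download directory against it before launching a run. This is a sketch only: `shard_filenames` and `missing_shards` are hypothetical names, not torchtune utilities, and the five-digit zero padding is inferred from the filenames in the config above.

```python
from pathlib import Path


def shard_filenames(num_shards: int, prefix: str = "model",
                    suffix: str = ".safetensors", width: int = 5) -> list:
    """Build names like model-00001-of-00030.safetensors for every shard."""
    return [
        f"{prefix}-{i:0{width}d}-of-{num_shards:0{width}d}{suffix}"
        for i in range(1, num_shards + 1)
    ]


def missing_shards(checkpoint_dir: str, num_shards: int) -> list:
    """Return the expected shard names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in shard_filenames(num_shards)
            if not (root / name).is_file()]


files = shard_filenames(30)
print(files[0])
print(files[-1])
print(len(files))
```

Calling `missing_shards("/tmp/Meta-Llama-3.1-70B-Instruct/", 30)` after `tune download` would list any shard the config expects that is not on disk, assuming the download used the output directory named in the config header.
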