From ec310ede85128c9fdb533cb54b05e710132c06ca Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Mon, 2 Dec 2024 21:32:20 +0000
Subject: [PATCH 1/7] initial recipe

---
 docs/recipes/train-llama-8b.md | 85 +++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index d76f2822..c7fe2509 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,6 +2,87 @@
 title: Training Llama 3.1 8B
 ---
 
-!!! warning
+In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1 8B on The Stack 🦙.
 
-    Heads up! This guide isn't ready yet. Check back soon.
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download Llama-3.1-8B:
+```bash
+git lfs install
+git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+```
+
+# Training
+This is not much different from a pretraining config. We will:
+- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+
+```yaml
+training:
+  train_iters: 100_000
+  logs:
+    interval: 10
+  validation:
+    iterations: 25
+    interval: 1000
+  checkpoint:
+    interval: 1000
+    keep: 5
+  test_iters: 0
+  export: # (1)!
+    format: llama
+    interval: 20_000
+batch:
+  micro_batch_size: 2
+  sequence_length: 4096
+  batch_size: 256
+data:
+  format: file
+  path: fast-llm-tutorial/dataset.json # (2)!
+  split: [99, 1, 0]
+optimizer:
+  weight_decay: 0.1
+  beta_1: 0.9
+  beta_2: 0.95
+  learning_rate:
+    base: 6.0e-04
+    minimum: 6.0e-05
+    decay_style: cosine
+    decay_iterations: 100_000
+    warmup_iterations: 2000
+pretrained: # (3)!
+  format: llama
+  path: fast-llm-tutorial/pretrained-model
+  model_weights: yes
+model:
+  base_model:
+    transformer:
+      use_flash_attention: yes
+    cross_entropy_impl: fused
+  multi_stage:
+    zero_stage: 2
+  distributed:
+    training_dtype: bf16
+run:
+  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+```
+
+1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
+2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
+3. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+
+# Checkpoint usage
+Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
+You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at coding!
+
+```python
+from transformers import pipeline, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+
+```

From c6768e6133d78bc2ba1cb47f586ef91acff2abf3 Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 16:33:58 +0000
Subject: [PATCH 2/7] more guidance

---
 docs/recipes/train-llama-8b.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index c7fe2509..7a49a352 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -49,15 +49,15 @@ This is not much different from a pretraining config. We will:
   beta_1: 0.9
   beta_2: 0.95
   learning_rate:
-    base: 6.0e-04
-    minimum: 6.0e-05
+    base: 1.0e-04 # (3)!
+    minimum: 1.0e-05
     decay_style: cosine
     decay_iterations: 100_000
     warmup_iterations: 2000
-pretrained: # (3)!
+pretrained: # (4)!
   format: llama
   path: fast-llm-tutorial/pretrained-model
-  model_weights: yes
+  model_weights: yes # (5)!
 model:
   base_model:
     transformer:
       use_flash_attention: yes
     cross_entropy_impl: fused
   multi_stage:
     zero_stage: 2
   distributed:
     training_dtype: bf16
 run:
   experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
 ```
 
 1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
 2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
-3. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+3. The learning rate controls the trade-off between learning and forgetting: a higher learning rate adapts quickly to the new dataset but causes more forgetting, while a lower learning rate retains more of the pretrained model's knowledge but slows down adaptation to the new domain.
+4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
 
 # Checkpoint usage
 Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.

From 963e2b2a0bb1ac414f3dc4e6848c3c7a025570dd Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 21:32:57 +0000
Subject: [PATCH 3/7] swap

---
 docs/recipes/continue-training-llama-8b.md | 87 +++++++++++++++++++++-
 docs/recipes/train-llama-8b.md             | 85 +--------------------
 2 files changed, 87 insertions(+), 85 deletions(-)

diff --git a/docs/recipes/continue-training-llama-8b.md b/docs/recipes/continue-training-llama-8b.md
index 159be53e..3e31c04f 100644
--- a/docs/recipes/continue-training-llama-8b.md
+++ b/docs/recipes/continue-training-llama-8b.md
@@ -2,6 +2,89 @@
 title: Continual Pretraining of Llama 3.1 8B
 ---
 
-!!! warning
-    This recipe’s still in the oven. Check back soon for the full details!
+In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1 8B on The Stack 🦙.
+
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download Llama-3.1-8B:
+```bash
+git lfs install
+git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+```
+
+# Training
+This is not much different from a pretraining config. We will:
+- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+
+```yaml
+training:
+  train_iters: 100_000
+  logs:
+    interval: 10
+  validation:
+    iterations: 25
+    interval: 1000
+  checkpoint:
+    interval: 1000
+    keep: 5
+  test_iters: 0
+  export: # (1)!
+    format: llama
+    interval: 20_000
+batch:
+  micro_batch_size: 2
+  sequence_length: 4096
+  batch_size: 256
+data:
+  format: file
+  path: fast-llm-tutorial/dataset.json # (2)!
+  split: [99, 1, 0]
+optimizer:
+  weight_decay: 0.1
+  beta_1: 0.9
+  beta_2: 0.95
+  learning_rate:
+    base: 1.0e-04 # (3)!
+    minimum: 1.0e-05
+    decay_style: cosine
+    decay_iterations: 100_000
+    warmup_iterations: 2000
+pretrained: # (4)!
+  format: llama
+  path: fast-llm-tutorial/pretrained-model
+  model_weights: yes # (5)!
+model:
+  base_model:
+    transformer:
+      use_flash_attention: yes
+    cross_entropy_impl: fused
+  multi_stage:
+    zero_stage: 2
+  distributed:
+    training_dtype: bf16
+run:
+  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+```
+
+1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
+2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
+3. The learning rate controls the trade-off between learning and forgetting: a higher learning rate adapts quickly to the new dataset but causes more forgetting, while a lower learning rate retains more of the pretrained model's knowledge but slows down adaptation to the new domain.
+4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
+
+# Checkpoint usage
+Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
+You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at coding!
+
+```python
+from transformers import pipeline, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+```
diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 7a49a352..8bc18275 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,89 +2,8 @@
 title: Training Llama 3.1 8B
 ---
 
-In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1 8B on The Stack 🦙.
-
-# Preliminary steps
-- [Quick Start](quick-start.md)
-- [Data preparation](data-preparation.md)
+!!! warning
 
-# Download the Pretrained Model
-Let's download Llama-3.1-8B:
-```bash
-git lfs install
-git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
-```
+    Coming soon!
 
-# Training
-This is not much different from a pretraining config. We will:
-- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
-- adapt some of the training parameters for our needs.
-- and that's it!
-
-```yaml
-training:
-  train_iters: 100_000
-  logs:
-    interval: 10
-  validation:
-    iterations: 25
-    interval: 1000
-  checkpoint:
-    interval: 1000
-    keep: 5
-  test_iters: 0
-  export: # (1)!
-    format: llama
-    interval: 20_000
-batch:
-  micro_batch_size: 2
-  sequence_length: 4096
-  batch_size: 256
-data:
-  format: file
-  path: fast-llm-tutorial/dataset.json # (2)!
-  split: [99, 1, 0]
-optimizer:
-  weight_decay: 0.1
-  beta_1: 0.9
-  beta_2: 0.95
-  learning_rate:
-    base: 1.0e-04 # (3)!
-    minimum: 1.0e-05
-    decay_style: cosine
-    decay_iterations: 100_000
-    warmup_iterations: 2000
-pretrained: # (4)!
-  format: llama
-  path: fast-llm-tutorial/pretrained-model
-  model_weights: yes # (5)!
-model:
-  base_model:
-    transformer:
-      use_flash_attention: yes
-    cross_entropy_impl: fused
-  multi_stage:
-    zero_stage: 2
-  distributed:
-    training_dtype: bf16
-run:
-  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
-```
-
-1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
-2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
-3. The learning rate controls the trade-off between learning and forgetting: a higher learning rate adapts quickly to the new dataset but causes more forgetting, while a lower learning rate retains more of the pretrained model's knowledge but slows down adaptation to the new domain.
-4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
-5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
-
-# Checkpoint usage
-Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
-You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at coding!
-
-```python
-from transformers import pipeline, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
-pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
-
-```

From e5e886ebf585494852763b13a359cabde589b333 Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 22:20:23 +0000
Subject: [PATCH 4/7] add training from scratch

---
 docs/recipes/train-llama-8b.md | 106 ++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 2 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 8bc18275..60022e1d 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,8 +2,110 @@
 title: Training Llama 3.1 8B
 ---
 
+Follow this guide to train a Llama-3.1-like model from scratch!
 
-!!! warning
 
-    Coming soon!
 
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+
+# Training configuration
+In this guide, we show you how to configure a model architecture and train a model from scratch.
+Let's start from the following training configuration:
+
+```yaml
+training:
+  train_iters: 100_000
+  logs:
+    interval: 10
+  validation:
+    iterations: 25
+    interval: 1000
+  checkpoint:
+    interval: 1000
+    keep: 5
+  test_iters: 0
+  export:
+    format: llama
+    interval: 20_000
+batch:
+  micro_batch_size: 4
+  sequence_length: 4096
+  batch_size: 480
+data:
+  format: file
+  path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+  split: [99, 1, 0]
+optimizer:
+  weight_decay: 0.1
+  beta_1: 0.9
+  beta_2: 0.95
+  learning_rate:
+    base: 6.0e-04
+    minimum: 6.0e-05
+    decay_style: cosine
+    decay_iterations: 100_000
+    warmup_iterations: 2000
+model:
+  base_model:
+    cross_entropy_impl: fused
+  multi_stage:
+    zero_stage: 2
+  distributed:
+    training_dtype: bf16
+run:
+  experiment_dir: fast-llm-tutorial/experiment
+```
+This configuration will not work yet, because it is missing the arguments that define the model architecture.
+There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrained model config, or define the model architecture ourselves.
+
+=== "Pretrained configuration"
+    This step is similar to what is done in the [Quick Start guide](quick-start.md).
+    First, download the model configuration:
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+    ```
+    When we specify a pretrained model from the Hugging Face Hub, Fast-LLM automatically converts its config to load the model.
+    **Only the configuration is loaded, not the weights**, because of `model_weights: no`.
+
+    ```yaml
+    pretrained:
+      format: llama
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: no
+    ```
+
+=== "From-scratch configuration"
+    In this step, we specify the model architecture as follows:
+
+    ```yaml
+    model:
+      base_model:
+        tie_word_embeddings: false
+        transformer:
+          activation_type: silu
+          add_linear_biases: false
+          ffn_hidden_size: 14336
+          gated: true
+          head_groups: 8
+          hidden_size: 4096 # (1)!
+          kv_channels: 128
+          normalization:
+            type: rms_norm
+          num_attention_heads: 32
+          num_layers: 32
+          rotary:
+            scaling_type: llama3
+            rotary_embedding_scale: -13.122363377404328 # (2)!
+            use_rotary_embeddings: true
+          use_position_embeddings: false
+          vocab_size: 128256
+    ```
+
+    1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
+    2. -ln(500_000)
+
+    Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives you an idea of how you configure a Llama-3.1-8B-like model with Fast-LLM.

From 303fcb55bfc46590cf32fc56f81904f95cf0f465 Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 22:22:04 +0000
Subject: [PATCH 5/7] reorder

---
 docs/recipes/train-llama-8b.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 60022e1d..77015b94 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -84,6 +84,8 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
     model:
       base_model:
         tie_word_embeddings: false
+        use_position_embeddings: false
+        vocab_size: 128256
         transformer:
           activation_type: silu
           add_linear_biases: false
@@ -100,8 +102,6 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
             scaling_type: llama3
             rotary_embedding_scale: -13.122363377404328 # (2)!
             use_rotary_embeddings: true
-          use_position_embeddings: false
-          vocab_size: 128256
     ```
 
     1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
     2. -ln(500_000)

From ae3de628a16a622107ccb5a8320899f64858320e Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 22:34:01 +0000
Subject: [PATCH 6/7] adjust

---
 docs/recipes/train-llama-8b.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 77015b94..ec697a9d 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -30,9 +30,9 @@ Let's start from the following training configuration:
     format: llama
     interval: 20_000
 batch:
-  micro_batch_size: 4
+  micro_batch_size: 2
   sequence_length: 4096
-  batch_size: 480
+  batch_size: 256
 data:
   format: file
   path: fast-llm-tutorial/dataset/fast_llm_dataset.json
@@ -107,5 +107,5 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
     1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
     2. -ln(500_000)
 
-    Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives you an idea of how you configure a Llama-3.1-8B-like model with Fast-LLM.
+    Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
From f19ff8bcd1247eb2730d4f3f62aa4173c807a84c Mon Sep 17 00:00:00 2001 From: Toolkit User Date: Wed, 11 Dec 2024 00:57:23 +0000 Subject: [PATCH 7/7] adjust --- docs/recipes/train-llama-8b.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md index ec697a9d..c33db394 100644 --- a/docs/recipes/train-llama-8b.md +++ b/docs/recipes/train-llama-8b.md @@ -99,13 +99,11 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai num_attention_heads: 32 num_layers: 32 rotary: - scaling_type: llama3 - rotary_embedding_scale: -13.122363377404328 # (2)! - use_rotary_embeddings: true + type: llama3 + theta: 500_000 ``` 1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std. - 2. -ln(500_000) Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
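As a usage follow-up to the checkpoint-usage snippet introduced in patches 1 and 3, here is a minimal generation sketch with the exported checkpoint. The prompt string, `max_new_tokens`, and greedy decoding below are illustrative assumptions rather than part of the recipe; the paths are the ones used throughout the guide.

```python
from transformers import AutoTokenizer, pipeline

# Tokenizer from the originally downloaded Llama 3.1 repository, as in the recipe.
tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")

# Hugging Face export written after 20,000 training iterations.
pipe = pipeline(
    "text-generation",
    model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/",
    tokenizer=tokenizer,
)

# Illustrative smoke test: the model was continued-pretrained on The Stack,
# so a code-completion prompt is a natural check.
result = pipe("def fibonacci(n):", max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"])
```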