From ec310ede85128c9fdb533cb54b05e710132c06ca Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Mon, 2 Dec 2024 21:32:20 +0000
Subject: [PATCH 1/7] initial recipe

---
 docs/recipes/train-llama-8b.md | 85 +++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 2 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index d76f2822..c7fe2509 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,6 +2,87 @@
 title: Training Llama 3.1 8B
 ---
 
-!!! warning
+In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1 8B on The Stack 🦙.
 
-    Heads up! This guide isn't ready yet. Check back soon.
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download Llama-3.1-8B:
+```bash
+git lfs install
+git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+```
+
+# Training
+This is not much different from a pretraining config. We will:
+- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+
+```yaml
+training:
+  train_iters: 100_000
+  logs:
+    interval: 10
+  validation:
+    iterations: 25
+    interval: 1000
+  checkpoint:
+    interval: 1000
+    keep: 5
+  test_iters: 0
+  export: # (1)!
+    format: llama
+    interval: 20_000
+batch:
+  micro_batch_size: 2
+  sequence_length: 4096
+  batch_size: 256
+data:
+  format: file
+  path: fast-llm-tutorial/dataset.json # (2)!
+  split: [99, 1, 0]
+optimizer:
+  weight_decay: 0.1
+  beta_1: 0.9
+  beta_2: 0.95
+  learning_rate:
+    base: 6.0e-04
+    minimum: 6.0e-05
+    decay_style: cosine
+    decay_iterations: 100_000
+    warmup_iterations: 2000
+pretrained: # (3)!
+  format: llama
+  path: fast-llm-tutorial/pretrained-model
+  model_weights: yes
+model:
+  base_model:
+    transformer:
+      use_flash_attention: yes
+    cross_entropy_impl: fused
+  multi_stage:
+    zero_stage: 2
+  distributed:
+    training_dtype: bf16
+run:
+  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+```
+
+1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
+2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
+3. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+
+# Checkpoint usage
+Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
+You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at coding!
+
+```python
+from transformers import pipeline, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+
+```

From c6768e6133d78bc2ba1cb47f586ef91acff2abf3 Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 16:33:58 +0000
Subject: [PATCH 2/7] more guidance

---
 docs/recipes/train-llama-8b.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index c7fe2509..7a49a352 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -49,15 +49,15 @@ This is not much different from a pretraining config. We will:
   beta_1: 0.9
   beta_2: 0.95
   learning_rate:
-    base: 6.0e-04
-    minimum: 6.0e-05
+    base: 1.0e-04 # (3)!
+    minimum: 1.0e-05
     decay_style: cosine
     decay_iterations: 100_000
     warmup_iterations: 2000
-pretrained: # (3)!
+pretrained: # (4)!
   format: llama
   path: fast-llm-tutorial/pretrained-model
-  model_weights: yes
+  model_weights: yes # (5)!
 model:
   base_model:
     transformer:
       use_flash_attention: yes
     cross_entropy_impl: fused
   multi_stage:
     zero_stage: 2
   distributed:
     training_dtype: bf16
 run:
   experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
 ```
 
 1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
 2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
-3. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+3. The learning rate controls the trade-off between learning and forgetting: a higher learning rate adapts quickly to the new dataset but causes more forgetting, while a lower learning rate retains more of the pretrained model's knowledge but slows down adaptation to the new domain.
+4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
 
 # Checkpoint usage
 Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.

From 963e2b2a0bb1ac414f3dc4e6848c3c7a025570dd Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 21:32:57 +0000
Subject: [PATCH 3/7] swap

---
 docs/recipes/continue-training-llama-8b.md | 87 +++++++++++++++++++++-
 docs/recipes/train-llama-8b.md             | 85 +--------------------
 2 files changed, 87 insertions(+), 85 deletions(-)

diff --git a/docs/recipes/continue-training-llama-8b.md b/docs/recipes/continue-training-llama-8b.md
index 159be53e..3e31c04f 100644
--- a/docs/recipes/continue-training-llama-8b.md
+++ b/docs/recipes/continue-training-llama-8b.md
@@ -2,6 +2,89 @@
 title: Continual Pretraining of Llama 3.1 8B
 ---
 
-!!! warning
-    This recipe’s still in the oven. Check back soon for the full details!
+In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1 8B on The Stack 🦙.
+
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+# Download the Pretrained Model
+Let's download Llama-3.1-8B:
+```bash
+git lfs install
+git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+```
+
+# Training
+This is not much different from a pretraining config. We will:
+- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
+- adapt some of the training parameters for our needs.
+- and that's it!
+
+```yaml
+training:
+  train_iters: 100_000
+  logs:
+    interval: 10
+  validation:
+    iterations: 25
+    interval: 1000
+  checkpoint:
+    interval: 1000
+    keep: 5
+  test_iters: 0
+  export: # (1)!
+    format: llama
+    interval: 20_000
+batch:
+  micro_batch_size: 2
+  sequence_length: 4096
+  batch_size: 256
+data:
+  format: file
+  path: fast-llm-tutorial/dataset.json # (2)!
+  split: [99, 1, 0]
+optimizer:
+  weight_decay: 0.1
+  beta_1: 0.9
+  beta_2: 0.95
+  learning_rate:
+    base: 1.0e-04 # (3)!
+    minimum: 1.0e-05
+    decay_style: cosine
+    decay_iterations: 100_000
+    warmup_iterations: 2000
+pretrained: # (4)!
+  format: llama
+  path: fast-llm-tutorial/pretrained-model
+  model_weights: yes # (5)!
+model:
+  base_model:
+    transformer:
+      use_flash_attention: yes
+    cross_entropy_impl: fused
+  multi_stage:
+    zero_stage: 2
+  distributed:
+    training_dtype: bf16
+run:
+  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
+```
+
+1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
+2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
+3. The learning rate controls the trade-off between learning and forgetting: a higher learning rate adapts quickly to the new dataset but causes more forgetting, while a lower learning rate retains more of the pretrained model's knowledge but slows down adaptation to the new domain.
+4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
+5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
+
+# Checkpoint usage
+Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
+You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at coding!
+
+```python
+from transformers import pipeline, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
+pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
+```
diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 7a49a352..8bc18275 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,89 +2,8 @@
 title: Training Llama 3.1 8B
 ---
 
-In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1 8B on The Stack 🦙.
-
-# Preliminary steps
-- [Quick Start](quick-start.md)
-- [Data preparation](data-preparation.md)
+!!! warning
 
-# Download the Pretrained Model
-Let's download Llama-3.1-8B:
-```bash
-git lfs install
-git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
-```
+    Coming soon!
 
-# Training
-This is not much different from a pretraining config. We will:
-- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
-- adapt some of the training parameters for our needs.
-- and that's it!
-
-```yaml
-training:
-  train_iters: 100_000
-  logs:
-    interval: 10
-  validation:
-    iterations: 25
-    interval: 1000
-  checkpoint:
-    interval: 1000
-    keep: 5
-  test_iters: 0
-  export: # (1)!
-    format: llama
-    interval: 20_000
-batch:
-  micro_batch_size: 2
-  sequence_length: 4096
-  batch_size: 256
-data:
-  format: file
-  path: fast-llm-tutorial/dataset.json # (2)!
-  split: [99, 1, 0]
-optimizer:
-  weight_decay: 0.1
-  beta_1: 0.9
-  beta_2: 0.95
-  learning_rate:
-    base: 1.0e-04 # (3)!
-    minimum: 1.0e-05
-    decay_style: cosine
-    decay_iterations: 100_000
-    warmup_iterations: 2000
-pretrained: # (4)!
-  format: llama
-  path: fast-llm-tutorial/pretrained-model
-  model_weights: yes # (5)!
-model:
-  base_model:
-    transformer:
-      use_flash_attention: yes
-    cross_entropy_impl: fused
-  multi_stage:
-    zero_stage: 2
-  distributed:
-    training_dtype: bf16
-run:
-  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
-```
-
-1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
-2. Location of the dataset metadata file generated during the [Data preparation](data-preparation.md) step.
-3. The learning rate controls the trade-off between learning and forgetting: a higher learning rate adapts quickly to the new dataset but causes more forgetting, while a lower learning rate retains more of the pretrained model's knowledge but slows down adaptation to the new domain.
-4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
-5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
-
-# Checkpoint usage
-Checkpoints will be saved regularly, and every 20k steps a checkpoint will be exported in the HF format.
-You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at coding!
-
-```python
-from transformers import pipeline, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
-pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
-
-```

From e5e886ebf585494852763b13a359cabde589b333 Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 22:20:23 +0000
Subject: [PATCH 4/7] add training from scratch

---
 docs/recipes/train-llama-8b.md | 106 ++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 2 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 8bc18275..60022e1d 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -2,8 +2,110 @@
 title: Training Llama 3.1 8B
 ---
 
+Follow this guide to train a Llama-3.1-like model from scratch!
 
-!!! warning
 
-    Coming soon!
 
+# Preliminary steps
+- [Quick Start](quick-start.md)
+- [Data preparation](data-preparation.md)
+
+
+# Training configuration
+In this guide, we show you how to configure a model architecture and train a model from scratch.
+Let's start from the following training configuration:
+
+```yaml
+training:
+  train_iters: 100_000
+  logs:
+    interval: 10
+  validation:
+    iterations: 25
+    interval: 1000
+  checkpoint:
+    interval: 1000
+    keep: 5
+  test_iters: 0
+  export:
+    format: llama
+    interval: 20_000
+batch:
+  micro_batch_size: 4
+  sequence_length: 4096
+  batch_size: 480
+data:
+  format: file
+  path: fast-llm-tutorial/dataset/fast_llm_dataset.json
+  split: [99, 1, 0]
+optimizer:
+  weight_decay: 0.1
+  beta_1: 0.9
+  beta_2: 0.95
+  learning_rate:
+    base: 6.0e-04
+    minimum: 6.0e-05
+    decay_style: cosine
+    decay_iterations: 100_000
+    warmup_iterations: 2000
+model:
+  base_model:
+    cross_entropy_impl: fused
+  multi_stage:
+    zero_stage: 2
+  distributed:
+    training_dtype: bf16
+run:
+  experiment_dir: fast-llm-tutorial/experiment
+```
+This configuration will not work yet, because it is missing the arguments that define the model architecture.
+There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrained model config, or define the model architecture ourselves.
+
+=== "Pretrained configuration"
+    This step is similar to what is done in the [Quick Start guide](quick-start.md).
+    First, download the model configuration:
+    ```bash
+    git lfs install
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
+    ```
+    When we specify a pretrained model from the Hugging Face Hub, Fast-LLM automatically converts its config to load the model.
+    **Only the configuration is loaded, not the weights**, because of `model_weights: no`.
+
+    ```yaml
+    pretrained:
+      format: llama
+      path: fast-llm-tutorial/pretrained-model
+      model_weights: no
+    ```
+
+=== "From-scratch configuration"
+    In this step, we specify the model architecture as follows:
+
+    ```yaml
+    model:
+      base_model:
+        tie_word_embeddings: false
+        transformer:
+          activation_type: silu
+          add_linear_biases: false
+          ffn_hidden_size: 14336
+          gated: true
+          head_groups: 8
+          hidden_size: 4096 # (1)!
+          kv_channels: 128
+          normalization:
+            type: rms_norm
+          num_attention_heads: 32
+          num_layers: 32
+          rotary:
+            scaling_type: llama3
+            rotary_embedding_scale: -13.122363377404328 # (2)!
+            use_rotary_embeddings: true
+          use_position_embeddings: false
+          vocab_size: 128256
+    ```
+
+    1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
+    2. -ln(500_000)
+
+    Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives you an idea of how you configure a Llama-3.1-8B-like model with Fast-LLM.

From 303fcb55bfc46590cf32fc56f81904f95cf0f465 Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 22:22:04 +0000
Subject: [PATCH 5/7] reorder

---
 docs/recipes/train-llama-8b.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 60022e1d..77015b94 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -84,6 +84,8 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
     model:
       base_model:
         tie_word_embeddings: false
+        use_position_embeddings: false
+        vocab_size: 128256
         transformer:
           activation_type: silu
           add_linear_biases: false
@@ -100,8 +102,6 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
             scaling_type: llama3
             rotary_embedding_scale: -13.122363377404328 # (2)!
             use_rotary_embeddings: true
-          use_position_embeddings: false
-          vocab_size: 128256
     ```
 
     1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
     2. -ln(500_000)

From ae3de628a16a622107ccb5a8320899f64858320e Mon Sep 17 00:00:00 2001
From: Toolkit User
Date: Tue, 3 Dec 2024 22:34:01 +0000
Subject: [PATCH 6/7] adjust

---
 docs/recipes/train-llama-8b.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md
index 77015b94..ec697a9d 100644
--- a/docs/recipes/train-llama-8b.md
+++ b/docs/recipes/train-llama-8b.md
@@ -30,9 +30,9 @@ Let's start from the following training configuration:
     format: llama
     interval: 20_000
 batch:
-  micro_batch_size: 4
+  micro_batch_size: 2
   sequence_length: 4096
-  batch_size: 480
+  batch_size: 256
 data:
   format: file
   path: fast-llm-tutorial/dataset/fast_llm_dataset.json
@@ -107,5 +107,5 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai
     1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std.
     2. -ln(500_000)
 
-    Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives you an idea of how you configure a Llama-3.1-8B-like model with Fast-LLM.
+    Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
From f19ff8bcd1247eb2730d4f3f62aa4173c807a84c Mon Sep 17 00:00:00 2001 From: Toolkit User Date: Wed, 11 Dec 2024 00:57:23 +0000 Subject: [PATCH 7/7] adjust --- docs/recipes/train-llama-8b.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/recipes/train-llama-8b.md b/docs/recipes/train-llama-8b.md index ec697a9d..c33db394 100644 --- a/docs/recipes/train-llama-8b.md +++ b/docs/recipes/train-llama-8b.md @@ -99,13 +99,11 @@ There are 2 ways of instantiating our Llama-3.1-8B model. We could use a pretrai num_attention_heads: 32 num_layers: 32 rotary: - scaling_type: llama3 - rotary_embedding_scale: -13.122363377404328 # (2)! - use_rotary_embeddings: true + type: llama3 + theta: 500_000 ``` 1. Hidden-size/num-layers will be used to provide good defaults for weight initialization std. - 2. -ln(500_000) Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
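As a usage follow-up to the checkpoint-usage snippet introduced in patches 1 and 3, here is a minimal generation sketch with the exported checkpoint. The prompt string, `max_new_tokens`, and greedy decoding below are illustrative assumptions rather than part of the recipe; the paths are the ones used throughout the guide.

```python
from transformers import AutoTokenizer, pipeline

# Tokenizer from the originally downloaded Llama 3.1 repository, as in the recipe.
tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")

# Hugging Face export written after 20,000 training iterations.
pipe = pipeline(
    "text-generation",
    model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/",
    tokenizer=tokenizer,
)

# Illustrative smoke test: the model was continued-pretrained on The Stack,
# so a code-completion prompt is a natural check.
result = pipe("def fibonacci(n):", max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"])
```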