[docs - WIP]: llama3 CPT recipe #80

Open · wants to merge 8 commits into `main`
87 changes: 85 additions & 2 deletions docs/recipes/continue-training-llama-8b.md
title: Continual Pretraining of Llama 3.1 8B
---

!!! warning

    This recipe's still in the oven. Check back soon for the full details!

In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1-8B 🦙 on The Stack.

# Preliminary steps
- [Quick Start](quick-start.md)
- [Data preparation](data-preparation.md)

# Download the Pretrained Model
Let's download Llama-3.1-8B:
```bash
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
```
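Note that `meta-llama/Llama-3.1-8B` is a gated repository, so you will need to request access and authenticate with your Hugging Face account before cloning. If you prefer, the checkpoint can also be downloaded with the `huggingface_hub` Python API (a sketch; the target directory matches the clone command above):

```python
from huggingface_hub import login, snapshot_download

login()  # paste a Hugging Face token that has been granted access to the gated repository
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B",
    local_dir="fast-llm-tutorial/pretrained-model",
)
```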

# Training
This is not much different from a pretraining config. We will:
- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
- adapt some of the training parameters for our needs.
- and that's it!

```yaml
training:
  train_iters: 100_000
  logs:
    interval: 10
  validation:
    iterations: 25
    interval: 1000
  checkpoint:
    interval: 1000
    keep: 5
  test_iters: 0
  export: # (1)!
    format: llama
    interval: 20_000
batch:
  micro_batch_size: 2
  sequence_length: 4096
  batch_size: 256
data:
  format: file
  path: fast-llm-tutorial/dataset.json # (2)!
  split: [99, 1, 0]
optimizer:
  weight_decay: 0.1
  beta_1: 0.9
  beta_2: 0.95
  learning_rate:
    base: 1.0e-04 # (3)!
    minimum: 1.0e-05
    decay_style: cosine
    decay_iterations: 100_000
    warmup_iterations: 2000
pretrained: # (4)!
  format: llama
  path: fast-llm-tutorial/pretrained-model
  model_weights: yes # (5)!
model:
  base_model:
    transformer:
      use_flash_attention: yes
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 2
  distributed:
    training_dtype: bf16
run:
  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
```

1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
2. Location of the dataset metadata file generated during [data preparation](data-preparation.md).
3. The learning rate can be used to trade off between learning and forgetting. A higher learning rate learns quickly on the new dataset but causes more forgetting; a lower learning rate retains more of the pretrained model's knowledge, but slows down adaptation to the new domain.
4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
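To build intuition for annotation (3), here is a rough sketch of how the learning rate evolves under this warmup-plus-cosine schedule (an illustrative re-implementation of the schedule shape, not Fast-LLM's internal code; the `base`, `minimum`, `warmup`, and `decay` values come from the config above):

```python
import math

def lr_at(step, base=1.0e-4, minimum=1.0e-5, warmup=2000, decay=100_000):
    """Illustrative warmup + cosine-decay schedule (not Fast-LLM's exact implementation)."""
    if step < warmup:
        return base * step / warmup  # linear warmup from 0 to the base learning rate
    progress = min((step - warmup) / (decay - warmup), 1.0)
    return minimum + 0.5 * (base - minimum) * (1 + math.cos(math.pi * progress))

for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```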

# Checkpoint usage
Checkpoints will be saved every 1,000 steps, and every 20,000 steps a checkpoint will also be exported in the Hugging Face format.
You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at code!

```python
from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
```
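For example, you could prompt the exported model as follows (a minimal sketch; the prompt and generation settings are arbitrary placeholders):

```python
# Generate a completion from the continually-pretrained checkpoint.
output = pipe("def fibonacci(n):", max_new_tokens=64, do_sample=False)
print(output[0]["generated_text"])
```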
106 changes: 104 additions & 2 deletions docs/recipes/train-llama-8b.md
title: Training Llama 3.1 8B
---

!!! warning

    Heads up! This guide isn't ready yet. Check back soon.

Follow this guide to train a Llama-3.1-like model from scratch!


# Preliminary steps
- [Quick Start](quick-start.md)
- [Data preparation](data-preparation.md)


# Training configuration
In this guide, we show you how to configure a model architecture and train a model from scratch.
Let's start from the following training configuration:

```yaml
training:
  train_iters: 100_000
  logs:
    interval: 10
  validation:
    iterations: 25
    interval: 1000
  checkpoint:
    interval: 1000
    keep: 5
  test_iters: 0
  export:
    format: llama
    interval: 20_000
batch:
  micro_batch_size: 2
  sequence_length: 4096
  batch_size: 256
data:
  format: file
  path: fast-llm-tutorial/dataset/fast_llm_dataset.json
  split: [99, 1, 0]
optimizer:
  weight_decay: 0.1
  beta_1: 0.9
  beta_2: 0.95
  learning_rate:
    base: 6.0e-04
    minimum: 6.0e-05
    decay_style: cosine
    decay_iterations: 100_000
    warmup_iterations: 2000
model:
  base_model:
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 2
  distributed:
    training_dtype: bf16
run:
  experiment_dir: fast-llm-tutorial/experiment
```
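As a quick sanity check on the scale of this run, here is a back-of-the-envelope token count implied by the batch settings above (a hand calculation for illustration, not something Fast-LLM reports):

```python
batch_size = 256        # sequences per optimizer step
sequence_length = 4096  # tokens per sequence
train_iters = 100_000   # optimizer steps

tokens_per_step = batch_size * sequence_length  # ~1.05M tokens per step
total_tokens = tokens_per_step * train_iters    # ~105B tokens over the full run
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
```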
On its own, this configuration will not work: it is missing the arguments that define the model architecture.
There are two ways to instantiate our Llama-3.1-8B model: use a pretrained model configuration, or define the model architecture ourselves.

=== "Pretrained configuration"
This step is similar to what is done in the [Quick Start guide](quick-start.md).
First download the model configuration:
```bash
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
```
By specifying a pretrained model from the HuggingFace hub, Fast-LLM automatically converts the config to load the model.
**Only the configuration is loaded, not the weights**, because of `model_weights: no`.

```yaml
pretrained:
format: llama
path: fast-llm-tutorial/pretrained_model
model_weights: no
```

=== "From-scratch configuration"
In this step, we specify the model architecture as follows:

```yaml
model:
base_model:
tie_word_embeddings: false
use_position_embeddings: false
vocab_size: 128256
transformer:
activation_type: silu
add_linear_biases: false
ffn_hidden_size: 14336
gated: true
head_groups: 8
hidden_size: 4096 # (1)!
kv_channels: 128
normalization:
type: rms_norm
num_attention_heads: 32
num_layers: 32
rotary:
            type: llama3
            theta: 500_000
    ```

    1. The hidden size and number of layers are used to derive good defaults for the weight-initialization standard deviation.

Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
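If you downloaded the Hugging Face configuration in the "Pretrained configuration" tab, you can cross-check the from-scratch values against it (a sketch using `transformers`; the attribute names are those of the standard Llama config):

```python
from transformers import AutoConfig

# Load only the configuration of the downloaded checkpoint (no weights).
cfg = AutoConfig.from_pretrained("fast-llm-tutorial/pretrained-model")

# These should line up with the from-scratch YAML above:
# hidden_size=4096, num_layers=32, num_attention_heads=32, head_groups=8,
# ffn_hidden_size=14336, vocab_size=128256, rotary theta=500_000.
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
print(cfg.num_key_value_heads, cfg.intermediate_size, cfg.vocab_size, cfg.rope_theta)
```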
