[docs - WIP]: llama3 CPT recipe #80

Open · wants to merge 8 commits into `main`
87 changes: 85 additions & 2 deletions docs/recipes/continue-training-llama-8b.md
title: Continual Pretraining of Llama 3.1 8B
---

!!! warning

    This recipe's still in the oven. Check back soon for the full details!

In this guide, we provide step-by-step instructions for continued pretraining of Llama 3.1-8B 🦙 on The Stack.

# Preliminary steps
- [Quick Start](quick-start.md)
- [Data preparation](data-preparation.md)

# Download the Pretrained Model
Let's download Llama-3.1-8B:
```bash
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
```
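Note that `meta-llama/Llama-3.1-8B` is a gated repository, so you will need to request access and authenticate with your Hugging Face account before cloning. If you prefer, the checkpoint can also be downloaded with the `huggingface_hub` Python API (a sketch; the target directory matches the clone command above):

```python
from huggingface_hub import login, snapshot_download

login()  # paste a Hugging Face token that has been granted access to the gated repository
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B",
    local_dir="fast-llm-tutorial/pretrained-model",
)
```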

# Training
This is not much different from a pretraining config. We will:
- specify the Llama 3.1 checkpoint to load. Fast-LLM will automatically infer the corresponding model architecture.
- adapt some of the training parameters for our needs.
- and that's it!

```yaml
training:
  train_iters: 100_000
  logs:
    interval: 10
  validation:
    iterations: 25
    interval: 1000
  checkpoint:
    interval: 1000
    keep: 5
  test_iters: 0
  export: # (1)!
    format: llama
    interval: 20_000
batch:
  micro_batch_size: 2
  sequence_length: 4096
  batch_size: 256
data:
  format: file
  path: fast-llm-tutorial/dataset.json # (2)!
  split: [99, 1, 0]
optimizer:
  weight_decay: 0.1
  beta_1: 0.9
  beta_2: 0.95
  learning_rate:
    base: 1.0e-04 # (3)!
    minimum: 1.0e-05
    decay_style: cosine
    decay_iterations: 100_000
    warmup_iterations: 2000
pretrained: # (4)!
  format: llama
  path: fast-llm-tutorial/pretrained-model
  model_weights: yes # (5)!
model:
  base_model:
    transformer:
      use_flash_attention: yes
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 2
  distributed:
    training_dtype: bf16
run:
  experiment_dir: fast-llm-tutorial/Llama-3.1-8B-cpt
```

1. A Llama model will be exported in Hugging Face format to the experiment directory (`fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama`) every 20,000 iterations.
2. Location of the dataset metadata file generated during [data preparation](data-preparation.md).
3. The learning rate can be used to trade off between learning and forgetting. A higher learning rate learns quickly on the new dataset but causes more forgetting; a lower learning rate retains more of the pretrained model's knowledge, but slows down adaptation to the new domain.
4. Config of the pretrained model. We load a `llama` model from the repository downloaded earlier.
5. This tells Fast-LLM to load the weights of the pretrained model. If we wanted to use Llama-3.1's configuration, but train from scratch, we could use the same config but set this to `no`.
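To build intuition for annotation (3), here is a rough sketch of how the learning rate evolves under this warmup-plus-cosine schedule (an illustrative re-implementation of the schedule shape, not Fast-LLM's internal code; the `base`, `minimum`, `warmup`, and `decay` values come from the config above):

```python
import math

def lr_at(step, base=1.0e-4, minimum=1.0e-5, warmup=2000, decay=100_000):
    """Illustrative warmup + cosine-decay schedule (not Fast-LLM's exact implementation)."""
    if step < warmup:
        return base * step / warmup  # linear warmup from 0 to the base learning rate
    progress = min((step - warmup) / (decay - warmup), 1.0)
    return minimum + 0.5 * (base - minimum) * (1 + math.cos(math.pi * progress))

for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```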

# Checkpoint usage
Checkpoints will be saved every 1,000 steps, and every 20,000 steps a checkpoint will also be exported in the Hugging Face format.
You can use it in `transformers` as you would use the pretrained Llama 3.1 model, except this one will be much stronger at code!

```python
from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fast-llm-tutorial/pretrained-model")
pipe = pipeline("text-generation", model="fast-llm-tutorial/Llama-3.1-8B-cpt/export/llama/20000/", tokenizer=tokenizer)
```
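For example, you could prompt the exported model as follows (a minimal sketch; the prompt and generation settings are arbitrary placeholders):

```python
# Generate a completion from the continually-pretrained checkpoint.
output = pipe("def fibonacci(n):", max_new_tokens=64, do_sample=False)
print(output[0]["generated_text"])
```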
106 changes: 104 additions & 2 deletions docs/recipes/train-llama-8b.md
title: Training Llama 3.1 8B
---

!!! warning

    Heads up! This guide isn't ready yet. Check back soon.

Follow this guide to train a Llama-3.1-like model from scratch!


# Preliminary steps
- [Quick Start](quick-start.md)
- [Data preparation](data-preparation.md)


# Training configuration
In this guide, we show you how to configure a model architecture and train a model from scratch.
Let's start from the following training configuration:

```yaml
training:
  train_iters: 100_000
  logs:
    interval: 10
  validation:
    iterations: 25
    interval: 1000
  checkpoint:
    interval: 1000
    keep: 5
  test_iters: 0
  export:
    format: llama
    interval: 20_000
batch:
  micro_batch_size: 2
  sequence_length: 4096
  batch_size: 256
data:
  format: file
  path: fast-llm-tutorial/dataset/fast_llm_dataset.json
  split: [99, 1, 0]
optimizer:
  weight_decay: 0.1
  beta_1: 0.9
  beta_2: 0.95
  learning_rate:
    base: 6.0e-04
    minimum: 6.0e-05
    decay_style: cosine
    decay_iterations: 100_000
    warmup_iterations: 2000
model:
  base_model:
    cross_entropy_impl: fused
  multi_stage:
    zero_stage: 2
  distributed:
    training_dtype: bf16
run:
  experiment_dir: fast-llm-tutorial/experiment
```
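As a quick sanity check on the scale of this run, here is a back-of-the-envelope token count implied by the batch settings above (a hand calculation for illustration, not something Fast-LLM reports):

```python
batch_size = 256        # sequences per optimizer step
sequence_length = 4096  # tokens per sequence
train_iters = 100_000   # optimizer steps

tokens_per_step = batch_size * sequence_length  # ~1.05M tokens per step
total_tokens = tokens_per_step * train_iters    # ~105B tokens over the full run
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
```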
On its own, this configuration will not work: it is missing the arguments that define the model architecture.
There are two ways to instantiate our Llama-3.1-8B model: use a pretrained model configuration, or define the model architecture ourselves.

=== "Pretrained configuration"
This step is similar to what is done in the [Quick Start guide](quick-start.md).
First download the model configuration:
```bash
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-3.1-8B ./fast-llm-tutorial/pretrained-model
```
By specifying a pretrained model from the HuggingFace hub, Fast-LLM automatically converts the config to load the model.
**Only the configuration is loaded, not the weights**, because of `model_weights: no`.

```yaml
pretrained:
format: llama
path: fast-llm-tutorial/pretrained_model
model_weights: no
```

=== "From-scratch configuration"
In this step, we specify the model architecture as follows:

```yaml
model:
base_model:
tie_word_embeddings: false
use_position_embeddings: false
vocab_size: 128256
transformer:
activation_type: silu
add_linear_biases: false
ffn_hidden_size: 14336
gated: true
head_groups: 8
hidden_size: 4096 # (1)!
kv_channels: 128
normalization:
type: rms_norm
num_attention_heads: 32
num_layers: 32
rotary:
            type: llama3
            theta: 500_000
    ```

    1. The hidden size and number of layers are used to derive good defaults for the weight-initialization standard deviation.

Configuring the model this way is a bit more verbose than using the pretrained configuration, but gives an idea of how to configure a Llama-3.1-8B-like model with Fast-LLM.
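If you downloaded the Hugging Face configuration in the "Pretrained configuration" tab, you can cross-check the from-scratch values against it (a sketch using `transformers`; the attribute names are those of the standard Llama config):

```python
from transformers import AutoConfig

# Load only the configuration of the downloaded checkpoint (no weights).
cfg = AutoConfig.from_pretrained("fast-llm-tutorial/pretrained-model")

# These should line up with the from-scratch YAML above:
# hidden_size=4096, num_layers=32, num_attention_heads=32, head_groups=8,
# ffn_hidden_size=14336, vocab_size=128256, rotary theta=500_000.
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
print(cfg.num_key_value_heads, cfg.intermediate_size, cfg.vocab_size, cfg.rope_theta)
```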
