Tiny Llama

This is a side project that follows all the acceleration tricks in tinyllama, with minimal modifications to the Hugging Face transformers code. This means you can pretrain TinyLlama with the Hugging Face Trainer on an RTX3090 / RTX4090 / A6000 / A100 without gradient checkpointing, at a training speed comparable to (or even higher than) that of the original tinyllama code.

I use the latest FlashAttention code and the accelerate library to speed up training.

News

  • [24/08/31] Updated to support the latest transformers library; fixed some minor issues in the fused SwiGLU implementation.
  • [24/03/10] Changed the backend to accelerate. The deepspeed backend version is still available here.
  • [23/12/06] Initial release.

Benchmark

Note

The results below are based on an older version of the code. The latest version may have slightly different results.

Model      GPU          Batch Size Per GPU   GPU Memory   Speed (tokens/s)
tinyllama  1*RTX3090    4                    22.3GiB      8.2k
tinyllama  8*RTX3090    3                    20.8GiB      32.7k
tinyllama  1*RTX4090    4                    23.1GiB      17k
tinyllama  1*A6000      8                    43.8GiB      12.2k
tinyllama  4*A6000      8                    44GiB        40.4k
tinyllama  8*A6000      8                    44.1GiB      73.3k
tinyllama  1*A100-80G   16                   76.2GiB      26.6k
tinyllama  8*A100-80G   16                   76.9GiB      204.5k
tinyllama  8*A100-80G   16*8                 77.0GiB      212.9k
llama-7b   8*A100-80G   1                    78.8GiB      22.4k
Deepspeed Results

Model      GPU          Batch Size Per GPU   GPU Memory   Speed (tokens/s)
tinyllama  8*RTX3090    4                    16.3G        36k
tinyllama  4*A6000      8                    30G          35k
tinyllama  4*A6000      12                   39G          40k
tinyllama  8*A40        8                    30G          86k
tinyllama  8*A40        12                   39G          92k
llama-7b   8*A40        1                    39.5G        4.7k
llama-7b   8*A100-80G   4                    60G          18k

Here, 16*8 means a per-GPU batch size of 16 with 8 gradient accumulation steps; this configuration reaches a throughput of 26.6k tokens per second per A100-80G GPU (212.9k / 8).

That means you could train a chinchilla-optimal TinyLlama (1.1B parameters, 22B tokens) in about one week with 4 A6000s or 8 RTX3090s, or in 28.7 hours with 8 A100-80Gs.
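
As a sanity check, these training-time estimates follow directly from the throughput figures in the table. The short sketch below (plain arithmetic, nothing repo-specific) reproduces them:

# Back-of-the-envelope check of the training-time estimates above.
# 22B tokens is the chinchilla-optimal budget for a 1.1B-parameter model.
TOKEN_BUDGET = 22e9

throughputs = {  # tokens/s, taken from the benchmark table
    "4*A6000": 40.4e3,
    "8*RTX3090": 32.7e3,
    "8*A100-80G": 212.9e3,
}

for setup, tokens_per_s in throughputs.items():
    hours = TOKEN_BUDGET / tokens_per_s / 3600
    print(f"{setup}: {hours:.1f} hours (~{hours / 24:.1f} days)")
# 4*A6000: ~151 hours (6.3 days); 8*RTX3090: ~187 hours (7.8 days);
# 8*A100-80G: ~28.7 hours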

Installation

Use an appropriate Python version, then install PyTorch, xFormers, and the other dependencies.

# for example, use python 3.10
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
conda install xformers -c xformers
pip install -r requirements.txt

Change the CUDA version if it is not compatible with your setup.
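
After installation, a quick sanity check (a minimal sketch, not part of the repo) can confirm that PyTorch sees your GPU and that xFormers imports cleanly:

# Post-install sanity check: verify PyTorch, CUDA, and xFormers.
import torch
import xformers

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("xformers:", xformers.__version__)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))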

Development Environment

The current implementation is developed with Python 3.12.4.

# install pytorch (2.5.0 nightly)
conda install pytorch pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# install xFormers
# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)

# install other dependencies like transformers, accelerate, etc.
pip install -r requirements.txt

[!NOTE] I am using a PyTorch nightly build because I want to use FlexAttention. Feel free to use the stable version of PyTorch; in that case, you can install xFormers with conda install xformers -c xformers.
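
If you do stay on the nightly build, you can check whether your install ships FlexAttention with a snippet along these lines (purely an illustrative check; the import path assumes PyTorch >= 2.5):

# Check whether this PyTorch build provides FlexAttention (PyTorch >= 2.5).
import torch

try:
    from torch.nn.attention.flex_attention import flex_attention  # noqa: F401
    print(f"FlexAttention is available (torch {torch.__version__})")
except ImportError:
    print(f"FlexAttention not found in torch {torch.__version__}; "
          "install a nightly / >= 2.5 build if you need it")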

Usage

Option 1: Use our scripts

We already provide scripts for training and inference. Simply run the following command.

bash run_clm.sh

See run_clm.sh for more details.

Option 2: Modify your own code

In your Python code, import the tinyllama module. The import will automatically replace the transformers modules with the optimized ones.

FlashAttention-2 is supported natively by Hugging Face transformers; just load the model with flash_attention_2 enabled.

+ import tinyllama
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T")
- model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T")
+ model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T", attn_implementation="flash_attention_2")

...

That's not all! You also need to switch on the optimization strategies via environment variables.

+ export TINY_FUSED_RMSNORM=1
+ export TINY_FUSED_CROSSENTROPY=1
+ export TINY_FUSED_ROTARY=1
+ export TINY_FUSED_SWIGLU=1

accelerate launch run_clm.py \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T \
    --dataset_name wikitext \
    --dataset_config_name wikitext-103-raw-v1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --auto_find_batch_size \
    --block_size 2048 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.02 \
    --learning_rate 5e-4 \
    --weight_decay 6.6e-6 \
    --bf16 \
    --torch_dtype bfloat16 \
    --do_train \
    --do_eval \
    --do_predict \
    --num_train_epochs 20 \
    --save_total_limit 3 \
    --save_strategy epoch \
    --evaluation_strategy epoch \
    --load_best_model_at_end True \
    --metric_for_best_model eval_loss \
    --report_to none \
    --output_dir outputs/tiny-llama
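
If you prefer to keep everything in Python, the same switches can presumably be set with os.environ instead of shell exports. The sketch below assumes that tinyllama reads these variables when it is imported, so they are set before the import; exporting them in the shell as shown above is the documented path.

# Alternative to shell exports: set the optimization switches in Python.
# Assumption: tinyllama reads these variables at import time.
import os

os.environ["TINY_FUSED_RMSNORM"] = "1"
os.environ["TINY_FUSED_CROSSENTROPY"] = "1"
os.environ["TINY_FUSED_ROTARY"] = "1"
os.environ["TINY_FUSED_SWIGLU"] = "1"

import tinyllama  # patches transformers with the optimized modules
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T",
    attn_implementation="flash_attention_2",
)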

Now the code should be Blazingly Fast. The acceleration also applies to other Llama models.

Related Projects

  • tinyllama-zh: My demo project that uses this repo to pretrain a Chinese TinyLlama.
  • LCKV: My project that uses this repo to do pretraining on a variant of Llama model.
