Some benchmark results and issues on 2*RTX4090 #1

Open
nanamiwang opened this issue Feb 20, 2024 · 2 comments

@nanamiwang

Hi,
I am pretraining TinyLlama on a Chinese dataset using your code; it has been very helpful for me, thanks.

Benchmark results

| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
|-------|-----|-------------------|--------------------|-----------------------------|------------|------------------|
| tinyllama | 2*RTX4090 | DeepSpeed Zero-2 | 3 | 4 | 21G | 1.8k |
| tinyllama | 2*RTX4090 | DDP | 3 | 4 | 21G | 2.7k |
| tinyllama | 2*RTX4090 | DDP | 3 | 1 | 21G | 1.5k |
| tinyllama | 1*RTX4090 | N/A | 3 | 4 | 21G | 1.8k |

Some issues:

  • The token throughput is much slower than on 8*RTX3090, and DeepSpeed Zero-2 performed worse than DDP and no better than a single RTX4090.
  • I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2.

Environments

deepspeed 0.9.5
transformers 4.37.2
torch 2.0.1+cu118
flash-attn 2.4.2

Any ideas how I can improve the throughput?

@why-in-Shanghaitech
Member

Hi! Thank you for trying the code!

> the token throughput is much slower than on 8*RTX3090

I tested with 2*RTX3090 and it is still much faster than the results from your side.

| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
|-------|-----|-------------------|--------------------|-----------------------------|------------|------------------|
| tinyllama | 2*RTX3090 | DeepSpeed Zero-2 | 4 | 4 | 22.5G | 14.6k |
| tinyllama | 2*RTX3090 | DeepSpeed Zero-2 | 4 | 1 | 21.5G | 11.4k |
| tinyllama | 4*RTX3090 | DeepSpeed Zero-2 | 4 | 1 | 18G | 19k |
| tinyllama | 2*RTX3090 | DDP | 2 | 1 | 16.5G | 7.7k |
| tinyllama | 2*RTX3090 | DDP | 3 | 1 | 20.8G | 9.2k |
| tinyllama | 2*RTX3090 | Accelerate | 3 | 1 | 20.8G | 13k |

I think something might be wrong. Try disabling the environment variables one by one and check whether any of them makes a difference. I'm not sure whether the language affects the speed. What is the vocabulary size? What is the block size?
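
For example, each run can unset or override a single variable without changing anything else; the variable name below is a placeholder, not one defined by this repo:

```bash
# Baseline: launch exactly as documented
python run_clm.py  # plus your usual training arguments

# Same run with one environment variable removed, to see whether it changes throughput
# (SOME_ENV_VAR is a hypothetical placeholder; substitute the variables your setup actually exports)
env -u SOME_ENV_VAR python run_clm.py  # plus your usual training arguments
```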

> DeepSpeed Zero-2 performed worse than DDP and no better than a single RTX4090.

Yes, I have the same observation. I'm currently using Accelerate for DDP in my projects.

> I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2.

I think this is normal since the number of cards is reduced to 2. ZeRO-2 distributes the optimizer states and gradients across the cards, so with fewer cards each one holds a larger share of those states and has less memory left for the batch.
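
For reference, stage 2 is just a field in the DeepSpeed config. Below is a minimal sketch; it is an assumption about the shape of the file, not the repo's actual ds_config.json, and the "auto" values rely on the Hugging Face Trainer's DeepSpeed integration to fill them in:

```bash
# Write a minimal ZeRO-2 config (a sketch, not the repo's actual ds_config.json).
# The "auto" values are resolved by the Hugging Face Trainer's DeepSpeed integration.
cat > ds_config.json <<'EOF'
{
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF
```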

> Any ideas how I can improve the throughput?

Some observations from my recent experiments:

  1. DeepSpeed ZeRO-2 is slow when memory is already sufficient. Unless DeepSpeed lets you double the batch size, don't use it. I'm currently using Accelerate for DDP: just replace `python run_clm.py` with `accelerate launch run_clm.py` and remove `--deepspeed ds_config.json` (sketched right after this list).
  2. It seems that the speed does not always increase with the batch size.
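
A quick sketch of the swap described in point 1. The script and config names are the ones mentioned in this thread; the rest of the training arguments are omitted:

```bash
# DeepSpeed ZeRO-2 launch, as used in the experiments above
python run_clm.py --deepspeed ds_config.json   # plus your usual training arguments

# Plain DDP via Accelerate: same script and arguments, just no --deepspeed flag
accelerate launch run_clm.py                   # plus your usual training arguments
```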

Looking forward to further discussions!

@why-in-Shanghaitech
Member

Hi,
I've released a new repo, tinyllama-zh, with instructions for pretraining TinyLlama on a Chinese corpus. The training script has been updated.
