Some benchmark results and issues on 2*RTX4090 #1

Open
nanamiwang opened this issue Feb 20, 2024 · 2 comments

@nanamiwang

Hi,
I am pretraining TinyLlama on a Chinese dataset using your code; it has been very helpful for me, thanks.

Benchmark results

| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
|-------|-----|-------------------|--------------------|-----------------------------|------------|------------------|
| tinyllama | 2*RTX4090 | DeepSpeed Zero-2 | 3 | 4 | 21G | 1.8k |
| tinyllama | 2*RTX4090 | DDP | 3 | 4 | 21G | 2.7k |
| tinyllama | 2*RTX4090 | DDP | 3 | 1 | 21G | 1.5k |
| tinyllama | 1*RTX4090 | N/A | 3 | 4 | 21G | 1.8k |

Some issues:

  • The token throughput is much slower than on 8*RTX3090, and DeepSpeed Zero-2 performed worse than DDP and no better than a single RTX4090.
  • I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2.

Environments

deepspeed 0.9.5
transformers 4.37.2
torch 2.0.1+cu118
flash-attn 2.4.2

Any ideas how I can improve the throughput?

@why-in-Shanghaitech
Member

Hi! Thank you for trying the code!

> the token throughput is much slower than on 8*RTX3090

I tested with 2*RTX3090 and it is still much faster than the results from your side.

| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
|-------|-----|-------------------|--------------------|-----------------------------|------------|------------------|
| tinyllama | 2*RTX3090 | DeepSpeed Zero-2 | 4 | 4 | 22.5G | 14.6k |
| tinyllama | 2*RTX3090 | DeepSpeed Zero-2 | 4 | 1 | 21.5G | 11.4k |
| tinyllama | 4*RTX3090 | DeepSpeed Zero-2 | 4 | 1 | 18G | 19k |
| tinyllama | 2*RTX3090 | DDP | 2 | 1 | 16.5G | 7.7k |
| tinyllama | 2*RTX3090 | DDP | 3 | 1 | 20.8G | 9.2k |
| tinyllama | 2*RTX3090 | Accelerate | 3 | 1 | 20.8G | 13k |

I think something might be wrong. Try disabling the environment variables one by one and check whether any of them makes a difference. I'm not sure whether the language affects the speed. What is the vocabulary size? What is the block size?
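
For example, each run can unset or override a single variable without changing anything else; the variable name below is a placeholder, not one defined by this repo:

```bash
# Baseline: launch exactly as documented
python run_clm.py  # plus your usual training arguments

# Same run with one environment variable removed, to see whether it changes throughput
# (SOME_ENV_VAR is a hypothetical placeholder; substitute the variables your setup actually exports)
env -u SOME_ENV_VAR python run_clm.py  # plus your usual training arguments
```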

> DeepSpeed Zero-2 performed worse than DDP and no better than a single RTX4090.

Yes, I have the same observation. I'm currently using Accelerate for DDP in my projects.

> I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2.

I think this is normal since the number of cards is reduced to 2. ZeRO-2 distributes the optimizer states and gradients across the cards, so with fewer cards each one holds a larger share of those states and has less memory left for the batch.
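
For reference, stage 2 is just a field in the DeepSpeed config. Below is a minimal sketch; it is an assumption about the shape of the file, not the repo's actual ds_config.json, and the "auto" values rely on the Hugging Face Trainer's DeepSpeed integration to fill them in:

```bash
# Write a minimal ZeRO-2 config (a sketch, not the repo's actual ds_config.json).
# The "auto" values are resolved by the Hugging Face Trainer's DeepSpeed integration.
cat > ds_config.json <<'EOF'
{
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF
```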

> Any ideas how I can improve the throughput?

Some observations from my recent experiments:

  1. DeepSpeed ZeRO-2 is slow when memory is already sufficient. Unless DeepSpeed lets you double the batch size, don't use it. I'm currently using Accelerate for DDP: just replace `python run_clm.py` with `accelerate launch run_clm.py` and remove `--deepspeed ds_config.json` (sketched right after this list).
  2. It seems that the speed does not always increase with the batch size.
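
A quick sketch of the swap described in point 1. The script and config names are the ones mentioned in this thread; the rest of the training arguments are omitted:

```bash
# DeepSpeed ZeRO-2 launch, as used in the experiments above
python run_clm.py --deepspeed ds_config.json   # plus your usual training arguments

# Plain DDP via Accelerate: same script and arguments, just no --deepspeed flag
accelerate launch run_clm.py                   # plus your usual training arguments
```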

Looking forward to further discussions!

@why-in-Shanghaitech
Member

Hi,
I've released a new repo, tinyllama-zh, with instructions for pretraining TinyLlama on a Chinese corpus. The training script has been updated.
