Some benchmark results and issues on 2*RTX4090 #1
Hi! Thank you for trying the code!
I tested with 2×RTX 3090 and it is still much faster than the results on your side.
I think something might be wrong. Try disabling the environment variables one by one and see whether any of them makes a difference. I'm not sure whether the language affects the speed. What is the vocabulary size? What is the block size?
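If it helps, both can be read off the tokenizer and model config. A minimal sketch, assuming the Hugging Face transformers conventions (the paths are placeholders, and the exact field holding the block size can vary by config):

```python
# Hypothetical check, assuming a Hugging Face tokenizer and config;
# the paths below are placeholders, not from this repo.
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")
config = AutoConfig.from_pretrained("path/to/model")

# Chinese tokenizers often have much larger vocabularies, which grows the
# embedding and output layers and can noticeably slow down each step.
print("vocab size:", tokenizer.vocab_size)
print("block size:", config.max_position_embeddings)
```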
Yes, I have the same observation. I'm currently using Accelerate for DDP in my projects.
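For anyone comparing setups, this is roughly what that looks like; a minimal self-contained sketch, where the model, data, and loss are stand-ins rather than anything from this repo:

```python
# Minimal DDP loop with Hugging Face Accelerate; launch with:
#   accelerate launch train_sketch.py
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(512, 512)                     # stand-in for the LM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 512)), batch_size=8)

# prepare() wraps the model in DDP and shards the dataloader across ranks
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for (batch,) in loader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()                 # dummy loss
    accelerator.backward(loss)                        # syncs gradients across GPUs
    optimizer.step()
```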
I think this is normal since the number of cards drops to 2. ZeRO stage 2 partitions the optimizer states and gradients across the cards, reducing the memory cost on each card.
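For reference, enabling it typically looks like this. A sketch with illustrative values only: the field names follow the DeepSpeed docs, but the optimizer settings, batch sizes, and model are placeholders:

```python
# Illustrative ZeRO stage 2 setup; launch with: deepspeed sketch.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,                  # partition optimizer states and gradients
        "overlap_comm": True,        # overlap gradient reduction with backward
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
}

model = torch.nn.Linear(512, 512)    # stand-in for the actual model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

Stage 3 would additionally partition the parameters themselves, saving more memory at the cost of extra communication.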
Some observations from my recent experiments:
Looking forward to further discussions!
Hi,
I am pretraining TinyLlama on a Chinese dataset using your code. It has been very helpful, thanks.
Benchmark results
Some issues:
Environment
deepspeed 0.9.5
transformers 4.37.2
torch 2.0.1+cu118
flash-attn 2.4.2
Any ideas on how I can improve the throughput?