8-bit inference (#512) #513
base: main
Conversation
@glerzing Do you have an example run using 8-bit?
There are a few things to improve; I'm working on them. I'll also add an example.
@glerzing Thank you for the great PR. Do you have any update on this, or anything you need help with?
I added the example. When testing, I ran into two problems.
In both cases, it doesn't look related to trlx. Quantization can introduce bugs because it additionally relies on accelerate and bitsandbytes, which have their own dependencies, so there can be problems with the versions of the different libraries. With the library versions listed in requirements.txt, I run into the 2nd problem. If I take the latest versions, I run into the 1st one.
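For context, a quantized load with transformers and bitsandbytes looks roughly like the sketch below (a minimal, illustrative example with a placeholder model name, assuming a compatible transformers/accelerate/bitsandbytes stack; it is not the code from this PR):

```python
# Minimal 8-bit inference sketch (not part of this PR). Assumes a
# compatible combination of transformers, accelerate and bitsandbytes;
# the model name is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize linear layers to int8 via bitsandbytes
    device_map="auto",   # let accelerate place layers on available devices
)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```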
@PhungVanDuy If you have time, can you help debug this? I think having lower-precision inference and training options will be very useful.
@glerzing Are you able to get quantized model inference working with our package requirements (but without any training)?
No, with version 4.28.1 of the transformers library, as pinned in trlx, I get an error.
Actually, adding the argument
@glerzing @Dahoas I tried to run inference with 8-bit, but I don't think it makes inference any faster. This is also mentioned by the author here:
Let's come up with another idea, like using vLLM; in my experiments, vLLM actually boosts inference speed. I will work in that direction.
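For reference, a bare-bones vLLM generation call looks roughly like the sketch below (illustrative model name and sampling settings; this is not the trlx integration, just a sanity check of the API):

```python
# Minimal vLLM generation sketch (illustrative model and settings;
# not the trlx integration discussed above).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Once upon a time"], params)
for request_output in outputs:
    print(request_output.outputs[0].text)
```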
Thanks for checking this. Were you able to run this experiment with trlX's pinned transformers version, or will we need to update it? On the inference-speedup side, vLLM seems like a good idea. In general, implementing some kind of asynchronous PPO like V-trace seems promising.
I had to update that one; I guess we should also update the transformers version to support LLaMA 2. I am checking vLLM to see how hard it is to integrate. Thank you for your suggestion on asynchronous PPO.
I was wondering if there should be an example of how to train 16-bit models.
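If such an example were added, loading the base model in half precision would presumably look something like the sketch below (illustrative model name; how this would be wired into trlx's config is not shown):

```python
# Minimal fp16 loading sketch (placeholder model name; wiring into
# trlx's config is not shown here).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                     # illustrative model
    torch_dtype=torch.float16,  # load weights in half precision
)
```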
@glerzing Checking in on the state of this PR. Do you have any more features you would like to add? If not, let's get it merged sometime this week.