
WIP: Float16 KV Cache in voicecraft.py #72

Open · wants to merge 1 commit into master

Conversation

Ph0rk0z (Contributor) commented Apr 5, 2024

Didn't appear to do anything bad. Not sure how much it helps. Give it a try. I think there are some missing torch GC calls somewhere because not all memory is always cleared. Are there other places we can use FP16? In inference it shouldn't matter, unlike training.
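Roughly, the idea looks like this (a sketch only; `past_k`/`past_v` and the function names are illustrative, not the exact code in voicecraft.py):

```python
import gc
import torch

# Sketch: `past_k`/`past_v` are illustrative names for the cached
# key/value tensors, not the exact variables in voicecraft.py.
def append_to_kv_cache(past_k, past_v, new_k, new_v):
    # Casting new entries to fp16 roughly halves the cache's VRAM footprint.
    new_k, new_v = new_k.to(torch.float16), new_v.to(torch.float16)
    if past_k is None:
        return new_k, new_v
    # Assumes (batch, heads, seq_len, head_dim) layout; concat along seq_len.
    return torch.cat([past_k, new_k], dim=2), torch.cat([past_v, new_v], dim=2)

# The "missing GC calls": without these, dead tensors can sit in
# PyTorch's caching allocator and VRAM doesn't drop after inference.
def free_cuda_memory():
    gc.collect()
    torch.cuda.empty_cache()
```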

jasonppy (Owner) commented Apr 5, 2024

Thanks!

Do you have an estimate of how much VRAM it uses after making the cache fp16?

With fp32, for the default example in the demo:

- 830M model: ~22GB with kvcache on, ~12GB with kvcache off (i.e. kvcache=0)
- 330M model: ~15GB with kvcache on, ~5GB with kvcache off

In addition, can one run the entire model/all operations in fp16?
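For reference, the two standard PyTorch routes look roughly like this (a sketch; the tiny model below is a stand-in, not VoiceCraft itself):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16).cuda()  # stand-in for the VoiceCraft model
x = torch.randn(1, 16, device="cuda")

# Option 1: cast every weight to half precision (what model.half() does).
# model = model.half(); x = x.half()

# Option 2: keep fp32 weights but run ops in fp16 via autocast,
# usually the numerically safer choice for inference-only use.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)
print(out.dtype)  # torch.float16 under autocast
```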

jasonppy self-assigned this Apr 5, 2024
Ph0rk0z (Contributor, Author) commented Apr 6, 2024

The model loads at about 6GB with whisperX, but usage goes up during inference.

I tried adding model.half() in the model loading code too, but there was no difference. It could be due to the 4 batches; I think it uses less if you set it to 1 batch.
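If the growth is on the whisperX side, its batch size is the usual knob; a sketch following the whisperX README (model choice and audio file name are illustrative):

```python
import whisperx

device = "cuda"
# compute_type="float16" runs the whisper backbone in fp16;
# batch_size=1 trades speed for lower peak VRAM during transcription.
model = whisperx.load_model("base.en", device, compute_type="float16")
audio = whisperx.load_audio("example.wav")  # illustrative file name
result = model.transcribe(audio, batch_size=1)
```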

Ph0rk0z (Contributor, Author) commented Apr 6, 2024

https://files.catbox.moe/azwyj4.mov

Here is what it does on my machine. I also wonder why the CPU usage is so high.
