Curious observation with T5 example and Apple Accelerate #868
Comments
I don't think we have a KV cache in this model, which may well explain this kind of thing. This shouldn't be hard to add, so we will add it soon.
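For readers unfamiliar with the idea, here is a minimal sketch of what a decoder-side KV cache does, written against a candle-style Tensor API. It is illustrative only and not the actual code later added in #873; the struct and method names are made up for the example.

```rust
// Illustrative only: a decoder-side KV cache that stores the keys/values of
// all previously decoded positions, so each new step only computes projections
// for the newest token instead of re-running them over the whole sequence.
use candle_core::{Result, Tensor};

#[derive(Default)]
struct KvCache {
    k: Option<Tensor>, // (batch, heads, seq_so_far, head_dim)
    v: Option<Tensor>,
}

impl KvCache {
    /// `k_new`/`v_new` hold only the current decoding step's keys/values.
    /// Returns the full cached keys/values to attend over.
    fn append(&mut self, k_new: &Tensor, v_new: &Tensor) -> Result<(Tensor, Tensor)> {
        let k = match &self.k {
            // Concatenate along the sequence dimension (dim 2).
            Some(prev) => Tensor::cat(&[prev, k_new], 2)?,
            None => k_new.clone(),
        };
        let v = match &self.v {
            Some(prev) => Tensor::cat(&[prev, v_new], 2)?,
            None => v_new.clone(),
        };
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }
}
```

The downside, which comes up later in this thread, is that repeated concatenation copies the whole cache on every step, so memory traffic grows roughly quadratically with the number of generated tokens.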
That makes sense. Thanks Laurent!
Just merged #873, which adds a KV cache on the decoding side of T5. I haven't tested it much but it seems fine, so hopefully this should speed things up!
Thanks! I will check it out.
The single-threaded native implementation gives 7.60 tokens/s, while the Apple Accelerate-based single-threaded implementation gives 10.40 tokens/s. This seems consistent with what we have seen before. Thank you Laurent. Closing this issue.
Is there any way one could accelerate prompt evaluation in T5? Inference slows down considerably for larger models.
Ah, sorry to hear this. Do you have a good example that reproduces the issue? That would make it easier to investigate.
The first one is a smaller prompt, which is translated into German at a rapid pace. The second one is a slightly larger prompt, which almost grinds to a halt after 3 sentences. I have enabled Apple Accelerate here.
Could you make these copy-and-pastable so that I can run them on my computer?
This one is slow. This one is fast. This is the version I am using
I have to double-check with the Python version.
Thanks Laurent. Can you give a quick tutorial on how to use tracing? I ran `cargo run --release --example t5 -- --prompt "translate to German: A beautiful candle that casts a large halo of yellow light." --decode --model-id google/flan-t5-base --tracing` and it gave me a trace file. How do I visualise it like the image that you have attached?
On Sun, Oct 1, 2023, Laurent Mazare wrote:
Thanks for the repro, I've confirmed that it properly runs with the cache, and below is how the trace looks on one of the slow processing steps (you can generate these with the --tracing flag). 75% of the time is spent in the attention cache doing the tensor concatenation, so copying memory around. I'm not sure how the Python version works, but maybe it only looks at a finite context, which would make things faster, or maybe they have some other way of getting around this. I'll have a look at the t5 code in transformers, but if you have any insights on how t5 is supposed to handle this, that would be very welcome.
[image: trace of a slow decoding step]
<https://user-images.githubusercontent.com/1041292/271813534-4349fcf8-2832-41a8-b4e4-4068f12193f2.png>
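For reference, the --tracing flag in the candle examples is typically wired up roughly like the sketch below; this is an approximation from memory, so check the t5 example source for the exact code. The trace is written as a Chrome-compatible JSON file (e.g. trace-<timestamp>.json) when the guard is dropped at program exit.

```rust
// Rough sketch (names approximate): the tracing-chrome layer records spans
// and writes a Chrome trace JSON file when the flush guard is dropped.
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    let tracing_enabled = true; // stands in for the --tracing CLI flag
    let _guard = if tracing_enabled {
        let (chrome_layer, guard) = ChromeLayerBuilder::new().build();
        tracing_subscriber::registry().with(chrome_layer).init();
        Some(guard)
    } else {
        None
    };
    // ... run the model here; spans created with tracing::span! or
    // #[instrument] end up in the trace file.
}
```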
You can view it with Chrome in the Performance tab of the dev tools; there should be an upload button that lets you select the file (sorry, on mobile now so I cannot send a screenshot).
Got it, thanks. This is super helpful. I will check the T5 transformers implementation and get back.
So we tried this newer kv_cache implementation with candle. Now text generation does not slow down at all, although I still have to check for hallucination. So it was an issue with the concatenating nature of the kv_cache.
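The commenter's implementation is not public, but the underlying idea can be sketched: instead of letting the concatenated cache grow without bound (and copying more and more memory on every step), keep the first few "sink" positions plus a fixed window of the most recent positions. The function below is a hypothetical illustration against a candle-style API, not the code referred to above.

```rust
// Hypothetical sketch of an attention-sink style cache trim: keep the first
// `sink_len` cached positions plus the most recent `window_len` positions so
// the cached sequence length stays bounded as generation proceeds.
use candle_core::{Result, Tensor};

fn trim_cache(cache: &Tensor, sink_len: usize, window_len: usize) -> Result<Tensor> {
    // cache shape: (batch, heads, seq_so_far, head_dim)
    let seq_len = cache.dim(2)?;
    if seq_len <= sink_len + window_len {
        return Ok(cache.clone());
    }
    let sinks = cache.narrow(2, 0, sink_len)?;
    let recent = cache.narrow(2, seq_len - window_len, window_len)?;
    Tensor::cat(&[&sinks, &recent], 2)
}
```

As the next comment points out, the subtle part is handling the position ids (and T5's relative position bias) once positions start sliding, which is why a correct implementation is less trivial than it first appears.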
@okpatil4u Where is that implementation of attention sinks? Is it open source by any chance?
We wrote it internally. Let me check if I can open source it. The implementation is relatively easy though. Uber programmers like you should be able to code it in your sleep.
I'm just genuinely curious to take a look. Olivier took a stab at it in TGI; it's not that trivial (because of the position_ids slide, the current implementation is not exactly correct and simple). Next step: tokenizer-less models.
Actually I think the slowness with the kv-cache was mostly a bug on my side (the giveaway was that this behavior did not happen on the Python side, I should have checked it earlier). The fix in #1054 has been merged, so hopefully inference should be much faster now and return the exact same results.
Thank you Laurent, the T5 model is working exactly as intended with this fix. I am closing this issue. But quantized T5 models are still slower than the base model for larger prompts; I am adding my observations in the other issue that I created.
When I enable Accelerate on my M1 Max 64 GB system, I get the following results with a single rayon thread, over multiple tries.
These are the results without Accelerate.
But as soon as I increase the prompt size, the speed drops for the Accelerate build. The same happens when the number of newly generated tokens is more than 5: it's fast in the beginning and then drops off steeply.
Any idea why?
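For anyone trying to reproduce numbers like these, the runs described above can be approximated with commands along the following lines. The `accelerate` feature flag and `RAYON_NUM_THREADS=1` (to force a single rayon thread) are assumptions based on the usual candle example setup, so adjust them to your checkout.

```
# With Apple Accelerate (assumed feature flag):
RAYON_NUM_THREADS=1 cargo run --release --features accelerate --example t5 -- \
  --model-id google/flan-t5-base --decode \
  --prompt "translate to German: A beautiful candle."

# Without Accelerate, for comparison:
RAYON_NUM_THREADS=1 cargo run --release --example t5 -- \
  --model-id google/flan-t5-base --decode \
  --prompt "translate to German: A beautiful candle."
```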