
Curious observation with T5 example and Apple Accelerate #868

Closed
okpatil4u opened this issue Sep 16, 2023 · 20 comments

Comments

@okpatil4u

When I enable Accelerate on my M1 Max 64 GB system I get the following results with a single rayon thread, over multiple tries.

 cargo run --example t5 -- --model-id "t5-small" --prompt "translate to German: A beautiful candle." --decode
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/examples/t5 --model-id t5-small --prompt 'translate to German: A beautiful candle.' --decode`
Running on CPU, to run on GPU, build this example with `--features cuda`
 Eine schöne Kerze.
9 tokens generated (56.03 token/s)

These are the results without Accelerate:

cargo run --example t5 -- --model-id "t5-small" --prompt "translate to German: A beautiful candle." --decode
    Blocking waiting for file lock on build directory
    Finished dev [unoptimized + debuginfo] target(s) in 2.18s
     Running `target/debug/examples/t5 --model-id t5-small --prompt 'translate to German: A beautiful candle.' --decode`
Running on CPU, to run on GPU, build this example with `--features cuda`
 Eine schöne Kerze.
9 tokens generated (0.82 token/s)

But as soon as I increase the prompt size, the speed drops with Accelerate. The same happens when more than 5 new tokens are generated: it's fast in the beginning and then drops off steeply.

Any idea why?

@LaurentMazare
Collaborator

I don't think we have a KV cache in this model, which may well explain this kind of thing. It shouldn't be hard to add, so we will add it soon.
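For readers following along: a KV cache stores the key/value projections of already-processed tokens so each decoding step only has to compute projections for the newest token rather than re-running the whole prefix. Below is a minimal, hypothetical sketch of such a cache using candle tensors; the struct and method names are illustrative (not candle's actual T5 code) and a (batch, heads, seq_len, head_dim) layout is assumed.

```rust
use candle_core::{Result, Tensor};

/// Illustrative decoder-side KV cache: keys/values are laid out as
/// (batch, heads, seq_len, head_dim) and new steps are concatenated
/// along the seq_len dimension (dim 2).
#[derive(Default)]
struct KvCache {
    k: Option<Tensor>,
    v: Option<Tensor>,
}

impl KvCache {
    /// Append the key/value tensors of the newly generated token(s)
    /// and return the full cached keys/values for attention.
    fn append(&mut self, k_new: &Tensor, v_new: &Tensor) -> Result<(Tensor, Tensor)> {
        let k = match &self.k {
            Some(k_prev) => Tensor::cat(&[k_prev, k_new], 2)?,
            None => k_new.clone(),
        };
        let v = match &self.v {
            Some(v_prev) => Tensor::cat(&[v_prev, v_new], 2)?,
            None => v_new.clone(),
        };
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }
}
```

Without such a cache, every decoding step recomputes keys and values for the entire sequence generated so far, which matches the reported behavior: fast at first, then increasingly slow as the output grows.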

@okpatil4u
Author

That makes sense. Thanks, Laurent!

@LaurentMazare
Collaborator

Just merged #873, which adds a KV cache on the decoding side of T5. I haven't tested it much but it seems fine; hopefully this should speed things up!

@okpatil4u
Author

okpatil4u commented Sep 17, 2023 via email

@okpatil4u
Author

The single-threaded native implementation gives 7.60 tokens/s, whereas the Apple Accelerate based single-threaded implementation gives 10.40 tokens/s.

Seems consistent with what we have seen before. Thank you, Laurent. Closing this issue.

okpatil4u reopened this Oct 1, 2023
@okpatil4u
Author

Is there any way one could accelerate prompt evaluation in T5? Inference slows down considerably for larger models.

@LaurentMazare
Collaborator

Ah, sorry to hear this. Do you have a good example that reproduces the issue? That would make it easier to investigate.

@okpatil4u
Author

The first one is a smaller prompt, which is translated into German at a rapid pace.
https://github.com/huggingface/candle/assets/3041904/42febc08-f721-4117-b4ac-3527147cf3f4

The second one is a slightly (not much) larger prompt, which almost grinds to a halt within 3 sentences.
https://github.com/huggingface/candle/assets/3041904/fa570141-4b92-4149-be6d-951fce59147e

I have enabled apple accelerate here.

@LaurentMazare
Collaborator

Could you make these copy-and-pastable so that I can run them on my computer?
Also, just to check: you're using the latest GitHub version, right? (There were some fixes to the T5 KV cache a week or two ago.)
Finally, do you have any comparison with how this usually goes on the Python side? You wouldn't see such a slowdown there?

@okpatil4u
Author

This one is slow.
cargo run --release --example t5 -- --prompt "translate to German: Multiple sclerosis (MS) is the most common demyelinating disease,[8] in which the insulating covers of nerve cells in the brain and spinal cord are damaged.[3] This damage disrupts the ability of parts of the nervous system to transmit signals, resulting in a range of signs and symptoms, including physical, mental, and sometimes psychiatric problems.[1][9][10] Specific symptoms can include double vision, visual loss, muscle weakness, and trouble with sensation or coordination.[3][11][12] MS takes several forms, with new symptoms either occurring in isolated attacks (relapsing forms) or building up over time (progressive forms).[13][14] In the relapsing forms of MS, between attacks, symptoms may disappear completely, although some permanent neurological problems often remain, especially as the disease advances." --decode --model-id google/flan-t5-base

This one is fast.
cargo run --release --example t5 -- --prompt "translate to German: A beautiful candle that casts a large halo of yellow light." --decode --model-id google/flan-t5-base

This is the version I am using:
`commit 0620733 (HEAD -> main, origin/main, origin/HEAD)
Author: Laurent Mazare [email protected]
Date: Sat Sep 30 16:04:11 2023 +0200

Streaming mode for reporting the generated tokens (#1007)

* Token streaming.

* Use the token output stream.

* Flush the output.

* Ensure that the last characters get reported.`

I have to double check with the Python version.

@LaurentMazare
Collaborator

LaurentMazare commented Oct 1, 2023

Thanks for the repro. I've confirmed that it properly runs with the cache, and below is what the trace looks like on one of the slow processing steps (you can generate these with the --tracing flag).
More than 75% of the time is spent in the attention cache doing the tensor concatenation there, i.e. copying memory around. I'm not sure how the Python version works, but maybe it only looks at a finite context, which would make things faster, or maybe they have some other way of getting around this. I'll have a look at the T5 code in transformers, but if you have any insights on how T5 is supposed to handle this, that would be very welcome.

[trace screenshot]
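For reference, a minimal sketch of the kind of setup that produces such a Chrome-compatible trace (assuming the tracing-chrome and tracing-subscriber crates; the candle examples expose this behind the --tracing flag mentioned above):

```rust
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    // Install a Chrome-trace layer; keep the guard alive so the JSON trace
    // file is flushed to the current directory when the program exits.
    let (chrome_layer, _guard) = ChromeLayerBuilder::new().build();
    tracing_subscriber::registry().with(chrome_layer).init();

    // ... run the model here; spans emitted via the `tracing` crate are
    // recorded to the trace file, which can then be loaded in Chrome's
    // dev tools (as described in the next comment).
}
```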

@okpatil4u
Author

okpatil4u commented Oct 1, 2023 via email

@LaurentMazare
Collaborator

You can view it with Chrome in the Performance tab of the dev tools; there should be an upload button that lets you select the file (sorry, I'm on mobile now so I cannot send a screenshot).

@okpatil4u
Author

okpatil4u commented Oct 1, 2023 via email

@okpatil4u
Author

So we tried this newer kv_cache implementation with candle. Now text generation does not slow down at all, although I still have to check for hallucination.

So it was an issue with the concatenating nature of the kv_cache.
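For anyone curious, here is a rough, hypothetical sketch of the idea (not the internal implementation referred to above): keep a few "sink" tokens from the start of the sequence plus a sliding window of the most recent tokens, so the cache stops growing with sequence length. The struct and names are illustrative, the (batch, heads, seq_len, head_dim) layout is assumed, and the position-id handling mentioned later in the thread is deliberately ignored.

```rust
use candle_core::{Result, Tensor};

/// Illustrative bounded KV cache in the spirit of attention sinks (StreamingLLM):
/// retain `n_sink` positions from the start plus a window of the most recent
/// `window` positions, so per-step attention cost stays roughly constant.
struct SinkCache {
    n_sink: usize,
    window: usize,
    k: Option<Tensor>,
    v: Option<Tensor>,
}

impl SinkCache {
    fn append(&mut self, k_new: &Tensor, v_new: &Tensor) -> Result<(Tensor, Tensor)> {
        // Concatenate the new keys/values onto the cache along seq_len (dim 2).
        let (mut k, mut v) = match (&self.k, &self.v) {
            (Some(k_prev), Some(v_prev)) => (
                Tensor::cat(&[k_prev, k_new], 2)?,
                Tensor::cat(&[v_prev, v_new], 2)?,
            ),
            _ => (k_new.clone(), v_new.clone()),
        };
        let seq_len = k.dim(2)?;
        let max_len = self.n_sink + self.window;
        if seq_len > max_len {
            // Evict the middle: keep the sink tokens and the most recent window.
            let keep = |t: &Tensor| -> Result<Tensor> {
                let sink = t.narrow(2, 0, self.n_sink)?;
                let recent = t.narrow(2, seq_len - self.window, self.window)?;
                Tensor::cat(&[&sink, &recent], 2)
            };
            k = keep(&k)?;
            v = keep(&v)?;
        }
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }
}
```

As the follow-up comments point out, the subtle part in a real implementation is keeping relative positions consistent after eviction (the position_ids slide), so treat this strictly as an outline of the caching side.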

@Narsil
Collaborator

Narsil commented Oct 6, 2023

@okpatil4u Where is that implementation of attention sinks? Is it open source by any chance?

@okpatil4u
Author

We wrote it internally. Let me check if I can open source it.

The implementation is relatively easy though. Uber programmers like you should be able to code it in your sleep.

@Narsil
Collaborator

Narsil commented Oct 6, 2023

I'm just genuinely curious to take a look.
I've seen the paper; it really makes a lot of sense, as does this one: https://huggingface.co/papers/2309.16588

Olivier took a stab at it in TGI; it's not that trivial (because of the position_ids slide, the current implementation is not exactly correct and simple).
huggingface/text-generation-inference#1105

Next step: tokenizer-less models.

@LaurentMazare
Collaborator

Actually, I think the slowness with the KV cache was mostly a bug on my side (the giveaway was that this behavior did not happen on the Python side; I should have checked it earlier). The fix in #1054 has been merged; hopefully inference is now much faster and returns the exact same results.

@okpatil4u
Author

Thank you, Laurent, the T5 model is working exactly as intended with this fix. I am closing this issue.

But quantized T5 models are still slower than the base model for larger prompts. I am adding my observations to the other issue that I created.
