
Curious observation with T5 example and Apple Accelerate #868

Closed
okpatil4u opened this issue Sep 16, 2023 · 20 comments

Comments

@okpatil4u

When I enable Accelerate on my M1 Max 64 GB system I get the following results with a single rayon thread, over multiple tries.

 cargo run --example t5 -- --model-id "t5-small" --prompt "translate to German: A beautiful candle." --decode
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/examples/t5 --model-id t5-small --prompt 'translate to German: A beautiful candle.' --decode`
Running on CPU, to run on GPU, build this example with `--features cuda`
 Eine schöne Kerze.
9 tokens generated (56.03 token/s)

These are the results without Accelerate:

cargo run --example t5 -- --model-id "t5-small" --prompt "translate to German: A beautiful candle." --decode
    Blocking waiting for file lock on build directory
    Finished dev [unoptimized + debuginfo] target(s) in 2.18s
     Running `target/debug/examples/t5 --model-id t5-small --prompt 'translate to German: A beautiful candle.' --decode`
Running on CPU, to run on GPU, build this example with `--features cuda`
 Eine schöne Kerze.
9 tokens generated (0.82 token/s)

But as soon as I increase the prompt size, the speed drops with Accelerate. The same happens when more than 5 new tokens are generated: it's fast in the beginning and then drops off steeply.

Any idea why?

@LaurentMazare
Collaborator

I don't think we have a KV cache in this model, which may well explain this kind of thing. It shouldn't be hard to add, so we will add it soon.
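For readers following along: a KV cache stores the key/value projections of already-processed tokens so each decoding step only has to compute projections for the newest token rather than re-running the whole prefix. Below is a minimal, hypothetical sketch of such a cache using candle tensors; the struct and method names are illustrative (not candle's actual T5 code) and a (batch, heads, seq_len, head_dim) layout is assumed.

```rust
use candle_core::{Result, Tensor};

/// Illustrative decoder-side KV cache: keys/values are laid out as
/// (batch, heads, seq_len, head_dim) and new steps are concatenated
/// along the seq_len dimension (dim 2).
#[derive(Default)]
struct KvCache {
    k: Option<Tensor>,
    v: Option<Tensor>,
}

impl KvCache {
    /// Append the key/value tensors of the newly generated token(s)
    /// and return the full cached keys/values for attention.
    fn append(&mut self, k_new: &Tensor, v_new: &Tensor) -> Result<(Tensor, Tensor)> {
        let k = match &self.k {
            Some(k_prev) => Tensor::cat(&[k_prev, k_new], 2)?,
            None => k_new.clone(),
        };
        let v = match &self.v {
            Some(v_prev) => Tensor::cat(&[v_prev, v_new], 2)?,
            None => v_new.clone(),
        };
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }
}
```

Without such a cache, every decoding step recomputes keys and values for the entire sequence generated so far, which matches the reported behavior: fast at first, then increasingly slow as the output grows.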

@okpatil4u
Author

That makes sense. Thanks, Laurent!

@LaurentMazare
Collaborator

Just merged #873, which adds a KV cache on the decoding side of T5. I haven't tested it much but it seems fine; hopefully this should speed things up!

@okpatil4u
Author

okpatil4u commented Sep 17, 2023 via email

@okpatil4u
Author

The single-threaded native implementation gives 7.60 tokens/s, whereas the Apple Accelerate based single-threaded implementation gives 10.40 tokens/s.

Seems consistent with what we have seen before. Thank you, Laurent. Closing this issue.

okpatil4u reopened this Oct 1, 2023
@okpatil4u
Author

Is there any way one could accelerate prompt evaluation in T5? Inference slows down considerably for larger models.

@LaurentMazare
Collaborator

Ah, sorry to hear this. Do you have a good example that reproduces the issue? That would make it easier to investigate.

@okpatil4u
Author

The first one is a smaller prompt, which is translated into German at a rapid pace.
https://github.com/huggingface/candle/assets/3041904/42febc08-f721-4117-b4ac-3527147cf3f4

The second one is a slightly (not much) larger prompt, which almost grinds to a halt within 3 sentences.
https://github.com/huggingface/candle/assets/3041904/fa570141-4b92-4149-be6d-951fce59147e

I have enabled apple accelerate here.

@LaurentMazare
Collaborator

Could you make these copy-and-pastable so that I can run them on my computer?
Also, just to check: you're using the latest GitHub version, right? (There were some fixes to the T5 KV cache a week or two ago.)
Finally, do you have any comparison with how this usually goes on the Python side? You wouldn't see such a slowdown there?

@okpatil4u
Author

This one is slow.
cargo run --release --example t5 -- --prompt "translate to German: Multiple sclerosis (MS) is the most common demyelinating disease,[8] in which the insulating covers of nerve cells in the brain and spinal cord are damaged.[3] This damage disrupts the ability of parts of the nervous system to transmit signals, resulting in a range of signs and symptoms, including physical, mental, and sometimes psychiatric problems.[1][9][10] Specific symptoms can include double vision, visual loss, muscle weakness, and trouble with sensation or coordination.[3][11][12] MS takes several forms, with new symptoms either occurring in isolated attacks (relapsing forms) or building up over time (progressive forms).[13][14] In the relapsing forms of MS, between attacks, symptoms may disappear completely, although some permanent neurological problems often remain, especially as the disease advances." --decode --model-id google/flan-t5-base

This one is fast.
cargo run --release --example t5 -- --prompt "translate to German: A beautiful candle that casts a large halo of yellow light." --decode --model-id google/flan-t5-base

This is the version I am using:
`commit 0620733 (HEAD -> main, origin/main, origin/HEAD)
Author: Laurent Mazare [email protected]
Date: Sat Sep 30 16:04:11 2023 +0200

Streaming mode for reporting the generated tokens (#1007)

* Token streaming.

* Use the token output stream.

* Flush the output.

* Ensure that the last characters get reported.`

I have to double check with the Python version.

@LaurentMazare
Collaborator

LaurentMazare commented Oct 1, 2023

Thanks for the repro. I've confirmed that it properly runs with the cache, and below is what the trace looks like on one of the slow processing steps (you can generate these with the --tracing flag).
More than 75% of the time is spent in the attention cache doing the tensor concatenation there, i.e. copying memory around. I'm not sure how the Python version works, but maybe it only looks at a finite context, which would make things faster, or maybe they have some other way of getting around this. I'll have a look at the T5 code in transformers, but if you have any insights on how T5 is supposed to handle this, that would be very welcome.

[trace screenshot]
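For reference, a minimal sketch of the kind of setup that produces such a Chrome-compatible trace (assuming the tracing-chrome and tracing-subscriber crates; the candle examples expose this behind the --tracing flag mentioned above):

```rust
use tracing_chrome::ChromeLayerBuilder;
use tracing_subscriber::prelude::*;

fn main() {
    // Install a Chrome-trace layer; keep the guard alive so the JSON trace
    // file is flushed to the current directory when the program exits.
    let (chrome_layer, _guard) = ChromeLayerBuilder::new().build();
    tracing_subscriber::registry().with(chrome_layer).init();

    // ... run the model here; spans emitted via the `tracing` crate are
    // recorded to the trace file, which can then be loaded in Chrome's
    // dev tools (as described in the next comment).
}
```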

@okpatil4u
Author

okpatil4u commented Oct 1, 2023 via email

@LaurentMazare
Collaborator

You can view it with Chrome in the Performance tab of the dev tools; there should be an upload button that lets you select the file (sorry, I'm on mobile now so I cannot send a screenshot).

@okpatil4u
Author

okpatil4u commented Oct 1, 2023 via email

@okpatil4u
Author

So we tried this newer kv_cache implementation with candle. Now text generation does not slow down at all, although I still have to check for hallucination.

So it was an issue with the concatenating nature of the kv_cache.
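For anyone curious, here is a rough, hypothetical sketch of the idea (not the internal implementation referred to above): keep a few "sink" tokens from the start of the sequence plus a sliding window of the most recent tokens, so the cache stops growing with sequence length. The struct and names are illustrative, the (batch, heads, seq_len, head_dim) layout is assumed, and the position-id handling mentioned later in the thread is deliberately ignored.

```rust
use candle_core::{Result, Tensor};

/// Illustrative bounded KV cache in the spirit of attention sinks (StreamingLLM):
/// retain `n_sink` positions from the start plus a window of the most recent
/// `window` positions, so per-step attention cost stays roughly constant.
struct SinkCache {
    n_sink: usize,
    window: usize,
    k: Option<Tensor>,
    v: Option<Tensor>,
}

impl SinkCache {
    fn append(&mut self, k_new: &Tensor, v_new: &Tensor) -> Result<(Tensor, Tensor)> {
        // Concatenate the new keys/values onto the cache along seq_len (dim 2).
        let (mut k, mut v) = match (&self.k, &self.v) {
            (Some(k_prev), Some(v_prev)) => (
                Tensor::cat(&[k_prev, k_new], 2)?,
                Tensor::cat(&[v_prev, v_new], 2)?,
            ),
            _ => (k_new.clone(), v_new.clone()),
        };
        let seq_len = k.dim(2)?;
        let max_len = self.n_sink + self.window;
        if seq_len > max_len {
            // Evict the middle: keep the sink tokens and the most recent window.
            let keep = |t: &Tensor| -> Result<Tensor> {
                let sink = t.narrow(2, 0, self.n_sink)?;
                let recent = t.narrow(2, seq_len - self.window, self.window)?;
                Tensor::cat(&[&sink, &recent], 2)
            };
            k = keep(&k)?;
            v = keep(&v)?;
        }
        self.k = Some(k.clone());
        self.v = Some(v.clone());
        Ok((k, v))
    }
}
```

As the follow-up comments point out, the subtle part in a real implementation is keeping relative positions consistent after eviction (the position_ids slide), so treat this strictly as an outline of the caching side.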

@Narsil
Collaborator

Narsil commented Oct 6, 2023

@okpatil4u Where is that implementation of attention sinks? Is it open source by any chance?

@okpatil4u
Author

We wrote it internally. Let me check if I can open source it.

The implementation is relatively easy though. Uber programmers like you should be able to code it in your sleep.

@Narsil
Collaborator

Narsil commented Oct 6, 2023

I'm just genuinely curious to take a look.
I've seen the paper; it really makes a lot of sense, as does this one: https://huggingface.co/papers/2309.16588

Olivier took a stab at it in TGI; it's not that trivial (because of the position_ids slide, the current implementation is not exactly correct and simple).
huggingface/text-generation-inference#1105

Next step: tokenizer-less models.

@LaurentMazare
Collaborator

Actually, I think the slowness with the KV cache was mostly a bug on my side (the giveaway was that this behavior did not happen on the Python side; I should have checked it earlier). The fix in #1054 has been merged; hopefully inference is now much faster and returns the exact same results.

@okpatil4u
Author

Thank you, Laurent, the T5 model is working exactly as intended with this fix. I am closing this issue.

But quantized T5 models are still slower than the base model for larger prompts. I am adding my observations to the other issue that I created.
