
Speculative #1308

Merged: 38 commits merged into main on Dec 11, 2023

Conversation

@Narsil (Collaborator) commented Dec 4, 2023

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Narsil requested a review from OlivierDehaene on December 4, 2023 at 14:46
proto/generate.proto (review thread resolved)
Comment on lines 10 to 11

    pub use pb::generate::v1::InfoResponse as ShardInfo;
    pub use pb::generate::v1::{

Member suggested change:

    pub use pb::generate::v2::InfoResponse as ShardInfo;
    pub use pb::generate::v2::{

Collaborator (Author): Done.

@@ -515,6 +515,7 @@ fn send_responses(

let mut stopped = false;

tracing::info!("Generation: {:?}", generation);
Member suggested change: remove this logging line.

Collaborator (Author): Done.

text: generation.token_text,
logprob: generation.token_logprob,
special: generation.token_is_special,
let tokens: Vec<Token> = if let Some(tokens_) = generation.tokens {
Member: Maybe we should .expect() here, as it would be a bug if tokens is empty.

Member: We could also .expect() if the vecs are not the same size.

@@ -97,6 +101,8 @@ def get_model(
else:
raise RuntimeError(f"Unknown dtype {dtype}")

SPECULATE = 2
Member: Remove?

Collaborator (Author): Changed to proper handling.

next_input_ids, next_token_logprobs, logprobs = batch.next_token_chooser(
batch.all_input_ids_tensor[:, : batch.max_seqlen], next_token_logits

from text_generation_server.models import SPECULATE
Member: Use a get like we do for the cache manager instead.
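
The suggestion mirrors how the cache manager is accessed: rather than importing a module-level constant (whose value is fixed at import time), expose accessor functions and call the getter at use sites. A minimal sketch of that pattern, assuming hypothetical set_speculate/get_speculate helpers rather than the repository's actual names:

    # Sketch of the "use a get" pattern: keep the speculation depth behind
    # accessors instead of importing a module-level constant directly.
    SPECULATE: int = 0

    def set_speculate(speculate: int) -> None:
        """Record the number of speculative tokens configured at startup."""
        global SPECULATE
        SPECULATE = speculate

    def get_speculate() -> int:
        """Read the configured speculation depth wherever it is needed."""
        return SPECULATE

Call sites would then use get_speculate() instead of the imported value, so a depth chosen at startup is visible everywhere without re-importing.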

Comment on lines 942 to 943

    # next_token_ids,
    # next_token_logprobs,

Member suggested change: remove these commented-out lines.

Comment on lines 961 to 962

    # next_token_id,
    # next_token_logprob,

Member suggested change: remove these commented-out lines.

if not stop:
stopped = False
left = 0
for j, next_token_id in enumerate(_next_token_ids):
Member: Maybe this could happen in the same for loop as above?
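
The reviewer's point is that the second pass over _next_token_ids could be folded into the loop above it, so each speculated id is appended and checked against the stopping criteria in one place. A rough, self-contained sketch of that shape, using stand-in data and a stand-in stopping check rather than the actual batch state in flash_causal_lm.py:

    # Stand-ins so the sketch runs on its own; they are not the real batch state.
    _next_token_ids = [11, 42, 7, 99]   # speculated ids for one request
    all_input_ids = [1, 2, 3]           # running sequence for that request
    EOS_TOKEN_ID = 7                    # pretend end-of-sequence id

    def stopping_criteria(token_id: int) -> bool:
        """Stand-in for the batch's StoppingCriteria: stop on the fake EOS id."""
        return token_id == EOS_TOKEN_ID

    left = 0
    stopped = True
    for j, next_token_id in enumerate(_next_token_ids):
        all_input_ids.append(next_token_id)
        if stopping_criteria(next_token_id):
            # Speculated tokens past the stopping point are discarded.
            left = len(_next_token_ids) - 1 - j
            break
    else:
        stopped = False

    print(all_input_ids, stopped, left)  # [1, 2, 3, 11, 42, 7] True 1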

@OlivierDehaene (Member) commented

We are also missing:

  • new prometheus metrics for the number of accepted speculative ids
  • speculative tokens count in next_batch
  • maybe refactor infer to have the same InferResponse as before.
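
For the first item, a hedged sketch of what such a metric could look like with prometheus_client; the metric name, buckets, and placement are illustrative assumptions, and the actual counter may well live in the Rust router rather than the Python server:

    # Illustrative Prometheus metric for accepted speculative ids; the name and
    # buckets are assumptions, not the ones ultimately added to the project.
    from prometheus_client import Histogram

    ACCEPTED_SPECULATIVE_IDS = Histogram(
        "tgi_accepted_speculative_ids",
        "Number of speculative ids accepted per decoding step",
        buckets=[0, 1, 2, 3, 4, 5, 8, 16],
    )

    def record_acceptance(n_accepted_ids: int) -> None:
        """Observe how many speculative tokens were accepted for one request."""
        ACCEPTED_SPECULATIVE_IDS.observe(n_accepted_ids)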

@Narsil (Collaborator, Author) commented Dec 5, 2023

Did all 3 (the accepted-speculative-ids metric, the speculative token count in `next_batch`, and the InferResponse refactor).

Resolved review threads:
  • integration-tests/models/test_flash_llama.py (outdated)
  • server/text_generation_server/cli.py
  • server/text_generation_server/utils/tokens.py (5 outdated threads)
@Narsil (Collaborator, Author) commented Dec 9, 2023

Sorry you reviewed this; I was making adjustments to reduce the overhead. Cleaned it up.

@Narsil (Collaborator, Author) commented Dec 9, 2023

@OlivierDehaene Good for review this time, I'll run a few benches.

OlivierDehaene previously approved these changes Dec 11, 2023

Resolved review threads:
  • server/text_generation_server/utils/layers.py (outdated)
  • server/text_generation_server/models/flash_causal_lm.py (outdated)
OlivierDehaene merged commit 9ecfa16 into main on Dec 11, 2023
8 checks passed
OlivierDehaene deleted the medusa2 branch on December 11, 2023 at 11:46
@shcho1118 commented Dec 12, 2023

I know this PR has been merged, but I have a question.
Where can I find the part about generating masking for tree-based attention, which is one of Medusa's features?
Or does it just use top-1 from each medusa head (in which case it would be the same as causal)?

@DoubleVII commented

As you say, this PR just uses top-1 from each medusa head.
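
Taking top-1 from each medusa head yields a single linear chain of candidate tokens (one per extra position), so plain causal attention is sufficient and no tree mask is needed. A small illustrative sketch with stand-in heads, not the exact Medusa implementation:

    # Illustrative only: stand-in medusa heads, one linear projection per extra
    # speculated position; argmax over each gives a linear chain, not a tree.
    import torch

    batch_size, hidden_size, vocab_size, n_heads = 2, 16, 100, 3
    hidden_states = torch.randn(batch_size, hidden_size)
    medusa_heads = [torch.nn.Linear(hidden_size, vocab_size) for _ in range(n_heads)]

    speculative_ids = torch.stack(
        [head(hidden_states).argmax(dim=-1) for head in medusa_heads], dim=1
    )
    print(speculative_ids.shape)  # torch.Size([2, 3]): one chain of 3 ids per sequence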

kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024