[TPU] Implement prefix caching for TPUs #10307

WoosukKwon · 2024-11-13T21:40:13Z

This PR implements prefix caching support for the TPU backend.
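For readers unfamiliar with the idea, here is a minimal, backend-agnostic sketch of prefix caching: KV-cache blocks are keyed by a hash of their tokens chained with the preceding blocks, so a new request that shares a prompt prefix can reuse blocks that are already filled. Everything in it (`BLOCK_SIZE`, `block_hashes`, `ToyPrefixCache`) is a hypothetical illustration, not the code in this PR or vLLM's actual block manager.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV-cache block; illustrative value only


def block_hashes(token_ids: List[int]) -> List[str]:
    """Hash every full block, chaining in the hash of the blocks before it."""
    hashes: List[str] = []
    prev = ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class ToyPrefixCache:
    """Hypothetical table mapping a chained block hash to a block id."""

    def __init__(self) -> None:
        self._table: Dict[str, int] = {}
        self._next_block_id = 0

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (block ids, number of prompt tokens served from cache)."""
        block_ids: List[int] = []
        cached_tokens = 0
        for h in block_hashes(token_ids):
            if h in self._table:
                # Same tokens after the same prefix: reuse the block.
                cached_tokens += BLOCK_SIZE
            else:
                self._table[h] = self._next_block_id
                self._next_block_id += 1
            block_ids.append(self._table[h])
        return block_ids, cached_tokens


cache = ToyPrefixCache()
first = cache.allocate(list(range(10)))
second = cache.allocate(list(range(8)) + [99, 100, 101, 102])
print(first, second)
```

The second request shares its first eight tokens with the first one, so only its last block needs fresh prefill work. The sketch ignores eviction and partially filled blocks.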

github-actions · 2024-11-13T21:40:27Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

- Add ready label to the PR
- Enable auto-merge.

🚀

robertgshaw2-neuralmagic · 2024-11-13T22:41:06Z

Nice work!

vanbasten23 · 2024-11-14T19:21:42Z

vllm/attention/backends/pallas.py

+                output = output.permute(0, 2, 1, 3)
+            else:
+                # Prefill with paged KV cache.
+                # TODO(woosuk): Tune the below knobs.


Thanks, Woosuk, for writing the PR.

I'm benchmarking the kernel, so I'll likely have recommended values for num_kv_pages_per_compute_block and num_queries_per_compute_block to share.

Also, the revised paged attention kernel is in the torch_xla nightly. Could you try again? I pulled your PR, and it seems it needs additional work to compute effective_q_lens and plumb it through to the kernel.

cc: @WoosukKwon

WoosukKwon · 2024-11-14

@vanbasten23 Is the fixed kernel available in today's nightly?

WoosukKwon · 2024-11-14

@vanbasten23 After the kernel fix, the model generates correct outputs with prefix caching 🎉

vanbasten23 · 2024-11-14

Awesome. Thanks for confirming!
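As a rough illustration of the plumbing discussed in this thread, the sketch below pads a batch of prompts to a common query length that is a multiple of the compute block size, and keeps the unpadded lengths alongside it. It assumes that effective_q_lens means the per-sequence unpadded query length; that reading, the helper name `pad_queries`, and the default block size of 16 are assumptions for illustration, not the kernel's documented contract.

```python
from typing import List, Tuple


def pad_queries(
    query_lens: List[int],
    num_queries_per_compute_block: int = 16,  # illustrative default
) -> Tuple[int, List[int]]:
    """Return (padded_len, effective_q_lens) for a batch of prompts.

    padded_len is the longest prompt rounded up to a multiple of the
    compute block size. effective_q_lens is assumed here to be the true
    length of each prompt, so padded positions can be ignored.
    """
    if not query_lens or min(query_lens) <= 0:
        raise ValueError("query_lens must be non-empty and positive")
    block = num_queries_per_compute_block
    longest = max(query_lens)
    padded_len = -(-longest // block) * block  # ceiling to a multiple of block
    return padded_len, list(query_lens)


padded_len, effective_q_lens = pad_queries([5, 17, 16])
print(padded_len, effective_q_lens)
```

Rounding up rather than asserting divisibility lets arbitrary prompt lengths through, at the cost of some wasted compute on padded positions.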

vanbasten23 · 2024-11-15T18:36:18Z

examples/offline_inference_tpu.py

 outputs = llm.generate(prompts, sampling_params)
-for output, answer in zip(outputs, answers):
+for output in outputs:


I wonder if you need a test for the prefix caching.

vanbasten23 · 2024-11-15T19:28:19Z

Btw, which command did you use to run examples/offline_inference_tpu.py? I used $ python vllm/examples/offline_inference_tpu.py, but it fails. Do you need to use a model other than "google/gemma-2b"?

robertgshaw2-neuralmagic · 2024-11-16T04:35:58Z

vllm/attention/backends/pallas.py

+                num_kv_pages_per_compute_block = 16
+                num_queries_per_compute_block = 16
+                assert seq_len % num_queries_per_compute_block == 0
+                output = torch.ops.xla.multi_queries_paged_attention(


@vanbasten23 - does this new kernel have the same SMEM requirements as the original paged_attention where the entire block table is stored in SMEM?

E.g., for the decoding run (see below), we split the batch dimension into smaller chunks and run the kernel multiple times.

vanbasten23 · 2024-11-18

Hey @robertgshaw2-neuralmagic, yes, this new kernel has the same SMEM requirements. I am aware of the SMEM OOM issue you mentioned, and we plan to address it.
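The batch-splitting workaround mentioned above can be sketched as follows: estimate how many sequences' block-table rows fit in a given SMEM budget, then walk the batch in chunks of that size and invoke the kernel once per chunk. The budget value, the 4-byte (int32) entry size, and both helper names are assumptions for illustration; the real limits depend on the TPU generation and on the kernel.

```python
from typing import Iterator, Tuple


def max_seqs_per_call(
    smem_budget_bytes: int,   # hypothetical budget for the block table
    pages_per_seq: int,       # block-table entries per sequence
    bytes_per_entry: int = 4,  # assumes int32 page indices
) -> int:
    """Largest number of sequences whose block-table rows fit in the budget."""
    per_seq = pages_per_seq * bytes_per_entry
    if per_seq <= 0:
        raise ValueError("pages_per_seq and bytes_per_entry must be positive")
    n = smem_budget_bytes // per_seq
    if n < 1:
        raise ValueError("a single sequence does not fit in the budget")
    return n


def batch_chunks(batch_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield [start, end) index ranges that cover the batch dimension."""
    for start in range(0, batch_size, chunk_size):
        yield start, min(start + chunk_size, batch_size)


chunk = max_seqs_per_call(smem_budget_bytes=1024, pages_per_seq=64)
for start, end in batch_chunks(batch_size=10, chunk_size=chunk):
    # A real backend would slice the query, lengths, and block table
    # to [start:end] here and call the attention kernel on the slice.
    print(start, end)
```

Each kernel call then sees a block table no larger than the budget, at the cost of extra kernel launches for large batches.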

mergify · 2024-11-17T02:04:12Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

WoosukKwon · 2024-11-18T23:05:38Z

@vanbasten23

> Btw, which command did you use to run examples/offline_inference_tpu.py? I used $ python vllm/examples/offline_inference_tpu.py, but it fails. Do you need to use a model other than "google/gemma-2b"?

This is weird. Which version & TPU are you using?

WoosukKwon · 2024-11-18T23:06:03Z

I will double check, update this PR, and merge it tonight.

vanbasten23 · 2024-11-19T17:20:26Z

> @vanbasten23
>
> > Btw, which command did you use to run examples/offline_inference_tpu.py? I used $ python vllm/examples/offline_inference_tpu.py, but it fails. Do you need to use a model other than "google/gemma-2b"?
>
> This is weird. Which version & TPU are you using?

I'm using TPU v5e, but I'm not sure whether it depends on a specific TPU version.

mergify bot added the ci/build label Nov 13, 2024

WoosukKwon added the tpu Related to Google TPUs label Nov 13, 2024

vanbasten23 reviewed Nov 14, 2024

vanbasten23 reviewed Nov 15, 2024

vanbasten23 approved these changes Nov 15, 2024

robertgshaw2-neuralmagic reviewed Nov 16, 2024

mergify bot added the needs-rebase label Nov 17, 2024

mergify bot removed the needs-rebase label Nov 19, 2024

DCO

f525281

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon force-pushed the tpu-prefix-caching branch from 274af74 to f525281 Compare November 20, 2024 21:42

WoosukKwon marked this pull request as ready for review November 20, 2024 21:43

fix

2613820

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon merged commit 2f77b6c into main Nov 20, 2024
13 of 16 checks passed

WoosukKwon deleted the tpu-prefix-caching branch November 20, 2024 21:54

tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024

[TPU] Implement prefix caching for TPUs (vllm-project#10307)

732fccc

Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]>

ccs96307 pushed a commit to ccs96307/vllm that referenced this pull request Nov 25, 2024

[TPU] Implement prefix caching for TPUs (vllm-project#10307)

c7e87c7

Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Clay <[email protected]>

mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 28, 2024

[TPU] Implement prefix caching for TPUs (vllm-project#10307)

2cfbd36

Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]>

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024

[TPU] Implement prefix caching for TPUs (vllm-project#10307)

6e316fe

Signed-off-by: Woosuk Kwon <[email protected]>
