
[TPU][Quantization] TPU W8A8 #11785

Merged: 73 commits, Jan 8, 2025

Conversation

@robertgshaw2-neuralmagic (Collaborator) commented Jan 7, 2025

SUMMARY:

  • Support TPU for compressed-tensors W8A8 models.
  • To run, just load a W8A8 model:

```python
from vllm import LLM

model = LLM("neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", max_model_len=2048)
model.generate("Hello my name is")
```

TESTING:

  • verified accuracy on TPU for Llama-8B at TP=1 (exact same score as GPU)
  • verified accuracy on TPU for Llama-8B at TP=4 (exact same score as GPU)
  • verified accuracy on TPU for Llama-70B at TP=1 (exact same score as GPU)
  • verified accuracy on TPU for Qwen at TP=1 (exact same score as GPU) --- note: this model has a bias term
  • confirmed all schemes still work on GPU, including:
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Asym
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Static-Per-Tensor-Sym
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Static-Per-Tensor-Asym
  • added Llama TP=1 tests to CI/CD
    • FOLLOW UP: add more than one model once we enable the lm-eval framework on TPU
    • FOLLOW UP: add TP>1 once we enable this machine type in the CI
  • figure out workaround for user warning re: cond
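For background on the schemes exercised above: W8A8 quantizes both weights and activations to int8. The sketch below shows per-tensor symmetric int8 quantization in plain NumPy; it is illustrative only (names like `quantize_per_tensor_sym` are hypothetical) and is not vLLM's actual compressed-tensors implementation.

```python
import numpy as np

def quantize_per_tensor_sym(x: np.ndarray, num_bits: int = 8):
    """Quantize a float tensor to signed int8 using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = float(np.max(np.abs(x))) / qmax   # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 1.27], dtype=np.float32)
q, scale = quantize_per_tensor_sym(x)
x_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

The "Sym" vs "Asym" scheme names in the model list refer to whether this mapping is symmetric around zero (as above) or adds a zero-point offset; "Static" vs "Dynamic" refers to whether activation scales are fixed at calibration time or computed per batch.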

FOLLOW UP

  • [TPU] Mixed precision
  • [TPU] Estimated memory usage is elevated because peak_bytes captures some intermediate tensors; fix this.
  • [Software Quality] Add TritonScaledMMLinear abstraction
  • [Software Quality] Convert Fp8 methods to use Kernel abstraction

@robertgshaw2-neuralmagic added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jan 7, 2025
@robertgshaw2-neuralmagic added the tpu (Related to Google TPUs) label on Jan 7, 2025
@robertgshaw2-neuralmagic (Collaborator, Author) commented:
@mgoin this is ready to go.

@@ -0,0 +1,74 @@
from typing import List, Optional, Type
@robertgshaw2-neuralmagic (Collaborator, Author) commented:
NOTE for reviewer - this file is not changed, it is just moved

@mgoin (Member) left a review:
LGTM, excellent work

@robertgshaw2-neuralmagic enabled auto-merge (squash) on January 8, 2025 18:31
@robertgshaw2-neuralmagic merged commit 56fe4c2 into vllm-project:main on Jan 8, 2025
56 checks passed
Labels: ci/build, documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs)
Projects: None yet

4 participants