[TPU][Quantization] TPU W8A8 #11785

Merged: robertgshaw2-neuralmagic merged 73 commits into vllm-project:main from neuralmagic:tpu-w8a8 on Jan 8, 2025.
Commits (73 total; this view shows changes from the first 60):
- 3b0c8a6 w8a8 working (robertgshaw2-neuralmagic)
- 36fc1db format (robertgshaw2-neuralmagic)
- d83c04c added all kernels (robertgshaw2-neuralmagic)
- af9d0f4 format (robertgshaw2-neuralmagic)
- 0f9fd21 working on cuda (robertgshaw2-neuralmagic)
- 7b3203f added mixed precision directory (robertgshaw2-neuralmagic)
- bf50fa4 formatting (robertgshaw2-neuralmagic)
- 226ef52 cache current state - w8a16 running oom (robertgshaw2-neuralmagic)
- bb7c741 [TPU] Ensure torch._sync(param) is called after param.data.copy_() (WoosukKwon)
- cf842bd yapf (WoosukKwon)
- 67039bc [TPU] Correctly profile peak memory usage (WoosukKwon)
- 0695f77 Upgrade PyTorch XLA (WoosukKwon)
- 11cf82f Merge branch 'main' into tpu-peak-mem (WoosukKwon)
- e016e38 stash (robertgshaw2-neuralmagic)
- 717b859 Merge branch 'main' into compressed-tensors-tpu (robertgshaw2-neuralmagic)
- c848735 proper merge (robertgshaw2-neuralmagic)
- 1539915 add mixed precision (robertgshaw2-neuralmagic)
- f00412a format (robertgshaw2-neuralmagic)
- b0a6b70 stash (robertgshaw2-neuralmagic)
- e812d7e Merge branch 'tpu-peak-mem' into compressed-tensors-tpu (robertgshaw2-neuralmagic)
- 764dda1 stash (robertgshaw2-neuralmagic)
- 87b2ae6 remove name (robertgshaw2-neuralmagic)
- e813ff8 revert woosuk change (robertgshaw2-neuralmagic)
- 8cfaa1b format (robertgshaw2-neuralmagic)
- bbc9741 update (robertgshaw2-neuralmagic)
- eb3f39e fix nit (robertgshaw2-neuralmagic)
- bb2fbe1 update (robertgshaw2-neuralmagic)
- 14ccb90 fix spurious (robertgshaw2-neuralmagic)
- 4092be2 stash branch for brittany (robertgshaw2-neuralmagic)
- 1aaa628 Merge branch 'main' into tpu-w8a8 (robertgshaw2-neuralmagic)
- 48aa54b revert (robertgshaw2-neuralmagic)
- 4efe915 fix (robertgshaw2-neuralmagic)
- e98b79c updated (robertgshaw2-neuralmagic)
- 5a89668 reduce cruft (robertgshaw2-neuralmagic)
- 57cbf5c reduce cruft (robertgshaw2-neuralmagic)
- 3451c4d updated (robertgshaw2-neuralmagic)
- 0c2e62a update comment (robertgshaw2-neuralmagic)
- 172c9ca revert spurious change (robertgshaw2-neuralmagic)
- 938ca81 remove cruft (robertgshaw2-neuralmagic)
- 9e18911 cruft reduction (robertgshaw2-neuralmagic)
- 5f58ec7 update docs (robertgshaw2-neuralmagic)
- af9f298 added integration test (robertgshaw2-neuralmagic)
- 6fe2f62 updated (robertgshaw2-neuralmagic)
- f2c0beb Add bias back (robertgshaw2-neuralmagic)
- 8b29718 add bias support (robertgshaw2-neuralmagic)
- 1e2a373 updated (robertgshaw2-neuralmagic)
- 2a359ef stash (robertgshaw2-neuralmagic)
- f7e8975 Merge branch 'main' into remove-async-stream (robertgshaw2-neuralmagic)
- 0d4c3fd fix (robertgshaw2-neuralmagic)
- 57340d2 update (robertgshaw2-neuralmagic)
- 38291d5 trigger test in CI (robertgshaw2-neuralmagic)
- ead1e94 fix AZP (robertgshaw2-neuralmagic)
- cea5e54 fixed! (robertgshaw2-neuralmagic)
- 940ddde Merge branch 'tpu-w8a8' of https://github.com/neuralmagic/vllm into t… (robertgshaw2-neuralmagic)
- 84a5b29 fix azp adju (robertgshaw2-neuralmagic)
- a1d7b4a make docker command look better on gh (robertgshaw2-neuralmagic)
- 2b4ecfd remove torch warnings (robertgshaw2-neuralmagic)
- 186c108 stash (robertgshaw2-neuralmagic)
- 7e8598a Merge branch 'tpu-w8a8' of https://github.com/neuralmagic/vllm into t… (robertgshaw2-neuralmagic)
- de773cd fix AZP (robertgshaw2-neuralmagic)
- 3a53d7d merged (robertgshaw2-neuralmagic)
- 0be5f69 added (robertgshaw2-neuralmagic)
- cb69ba7 fix formatting (robertgshaw2-neuralmagic)
- 3896f6c remove comment (robertgshaw2-neuralmagic)
- 33e1e13 formatted (robertgshaw2-neuralmagic)
- dde72d6 add llama to ci (robertgshaw2-neuralmagic)
- d7a9c93 Merge branch 'main' into tpu-w8a8 (robertgshaw2-neuralmagic)
- db9f795 Update supported_hardware.md (robertgshaw2-neuralmagic)
- 09ad869 Update supported_hardware.md (robertgshaw2-neuralmagic)
- b74c88a ixed docs build (robertgshaw2-neuralmagic)
- da4369e Merge branch 'tpu-w8a8' of https://github.com/neuralmagic/vllm into t… (robertgshaw2-neuralmagic)
- 5ddcac2 Merge branch 'main' into tpu-w8a8 (robertgshaw2-neuralmagic)
- f353c43 fix CI (robertgshaw2-neuralmagic)
New file (49 additions):
```python
from dataclasses import dataclass

import lm_eval
import pytest

TASK = "gsm8k"
FILTER = "exact_match,strict-match"
RTOL = 0.03


@dataclass
class GSM8KAccuracyTestConfig:
    model_name: str
    excepted_value: float

    def get_model_args(self) -> str:
        return (f"pretrained={self.model_name},"
                "max_model_len=4096,max_num_seqs=128,tensor_parallel_size=4")


# NOTE: Accuracy scores measured on GPUs.
ACCURACY_CONFIGS = [
    GSM8KAccuracyTestConfig(
        model_name="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
        excepted_value=0.76),  # no bias
    # NOTE(rob): We cannot re-initialize VLLM in the same process for TPU,
    # so only one of these tests can run in a single call to pytest. As
    # a follow up, move this into the LM-EVAL section of the CI.
    # GSM8KAccuracyTestConfig(
    #     model_name="neuralmagic/Qwen2-7B-Instruct-quantized.w8a8",
    #     excepted_value=0.66),  # bias in QKV layers
]


@pytest.mark.parametrize("config", ACCURACY_CONFIGS)
def test_gsm8k_correctness(config: GSM8KAccuracyTestConfig):

    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=config.get_model_args(),
        tasks="gsm8k",
        batch_size="auto",
    )

    EXPECTED_VALUE = config.excepted_value
    measured_value = results["results"][TASK][FILTER]
    assert (measured_value - RTOL < EXPECTED_VALUE
            and measured_value + RTOL > EXPECTED_VALUE
            ), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"
```
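The two-sided tolerance check in the test above is equivalent to comparing the absolute difference against RTOL. A minimal standalone sketch of that equivalence (the helper name and sample scores are hypothetical, not from the PR):

```python
RTOL = 0.03  # same tolerance the test uses


def within_tolerance(measured: float, expected: float, rtol: float = RTOL) -> bool:
    # Equivalent to: (measured - rtol < expected) and (measured + rtol > expected)
    return abs(measured - expected) < rtol


print(within_tolerance(0.77, 0.76))  # 0.01 off the expected score: passes
print(within_tolerance(0.70, 0.76))  # 0.06 off: outside the 0.03 band, fails
```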
File renamed without changes.

Renamed with 8 changes (4 additions, 4 deletions):
...r/layers/quantization/kernels/__init__.py → ...ation/kernels/mixed_precision/__init__.py

File renamed without changes.
File renamed without changes.
File renamed without changes.
New file (64 additions): vllm/model_executor/layers/quantization/kernels/scaled_mm/ScaledMMLinearKernel.py
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Tuple

import torch


@dataclass
class ScaledMMLinearLayerConfig:
    is_channelwise: bool
    is_static_input_scheme: bool
    input_symmetric: bool


class ScaledMMLinearKernel(ABC):

    @classmethod
    @abstractmethod
    def get_min_capability(cls) -> int:
        raise NotImplementedError

    @classmethod
    @abstractmethod
    def can_implement(
            cls, c: ScaledMMLinearLayerConfig) -> Tuple[bool, Optional[str]]:
        raise NotImplementedError

    def __init__(self, c: ScaledMMLinearLayerConfig, w_q_param_name: str,
                 w_s_param_name: str, i_s_param_name: str,
                 i_zp_param_name: str, azp_adj_param_name: str) -> None:
        assert self.can_implement(c)
        self.config = c
        self.w_q_name = w_q_param_name
        self.w_s_name = w_s_param_name
        self.i_s_name = i_s_param_name
        self.i_zp_name = i_zp_param_name
        self.azp_adj_name = azp_adj_param_name

    @abstractmethod
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        raise NotImplementedError

    @abstractmethod
    def apply_weights(self,
                      layer: torch.nn.Module,
                      x: torch.Tensor,
                      bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        raise NotImplementedError

    def _get_weight_params(
        self, layer: torch.nn.Module
    ) -> Tuple[torch.Tensor,  # weight
               torch.Tensor,  # weight_scale
               Optional[torch.Tensor],  # input_scale
               Optional[torch.Tensor],  # input_zp
               Optional[torch.Tensor],  # azp_adj
               ]:
        return (
            getattr(layer, self.w_q_name),
            getattr(layer, self.w_s_name),
            getattr(layer, self.i_s_name),
            getattr(layer, self.i_zp_name),
            getattr(layer, self.azp_adj_name),
        )
```
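The abstract base class above lets each backend advertise support for a quantization config via can_implement, so a dispatcher can pick the first compatible kernel per platform. A minimal, torch-free sketch of that selection pattern (all class and function names here are hypothetical illustrations, not the PR's actual dispatch code):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Type


@dataclass
class Config:
    # Stand-in for ScaledMMLinearLayerConfig.
    is_channelwise: bool
    is_static_input_scheme: bool
    input_symmetric: bool


class Kernel:
    @classmethod
    def can_implement(cls, c: Config) -> Tuple[bool, Optional[str]]:
        raise NotImplementedError


class SymmetricOnlyKernel(Kernel):
    # Hypothetical backend that cannot handle asymmetric input quantization.
    @classmethod
    def can_implement(cls, c: Config) -> Tuple[bool, Optional[str]]:
        if not c.input_symmetric:
            return False, "asymmetric input quantization not supported"
        return True, None


class FallbackKernel(Kernel):
    # Hypothetical backend that accepts any config.
    @classmethod
    def can_implement(cls, c: Config) -> Tuple[bool, Optional[str]]:
        return True, None


def choose_kernel(candidates: List[Type[Kernel]], c: Config) -> Type[Kernel]:
    # Return the first kernel class whose can_implement() accepts the config.
    for kernel in candidates:
        ok, _reason = kernel.can_implement(c)
        if ok:
            return kernel
    raise ValueError("no compatible kernel for this config")


cfg = Config(is_channelwise=True, is_static_input_scheme=False,
             input_symmetric=False)
print(choose_kernel([SymmetricOnlyKernel, FallbackKernel], cfg).__name__)
# prints "FallbackKernel": the symmetric-only backend declined the config
```

This mirrors the intent of can_implement returning (bool, reason): the reason string lets the dispatcher report why a preferred kernel was skipped.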
NOTE for reviewer: this file is not changed, it is just moved.