
[TPU][Quantization] TPU W8A8 #11785

Merged: 73 commits, Jan 8, 2025

Conversation

@robertgshaw2-neuralmagic (Collaborator) commented Jan 7, 2025

SUMMARY:

  • Support TPU for compressed-tensors W8A8 models.
  • To run, just load a W8A8 model:

```python
from vllm import LLM

model = LLM("neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8", max_model_len=2048)
model.generate("Hello my name is")
```

TESTING:

  • verified accuracy on TPU for Llama-8B at TP=1 (exact same score as GPU)
  • verified accuracy on TPU for Llama-8B at TP=4 (exact same score as GPU)
  • verified accuracy on TPU for Llama-70B at TP=1 (exact same score as GPU)
  • verified accuracy on TPU for Qwen at TP=1 (exact same score as GPU) --- note: this model has a bias term
  • confirmed all schemes still work on GPU, including:
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Asym
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Static-Per-Tensor-Sym
    • nm-testing/Meta-Llama-3-8B-Instruct-W8A8-Static-Per-Tensor-Asym
  • added Llama TP=1 tests to CI/CD
    • FOLLOW UP: add more than one model once we enable the lm-eval framework on TPU
    • FOLLOW UP: add TP>1 once we enable this machine type in the CI
  • figure out workaround for user warning re: cond
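For background on the schemes exercised above: W8A8 quantizes both weights and activations to int8. The sketch below shows per-tensor symmetric int8 quantization in plain NumPy; it is illustrative only (names like `quantize_per_tensor_sym` are hypothetical) and is not vLLM's actual compressed-tensors implementation.

```python
import numpy as np

def quantize_per_tensor_sym(x: np.ndarray, num_bits: int = 8):
    """Quantize a float tensor to signed int8 using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = float(np.max(np.abs(x))) / qmax   # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 1.27], dtype=np.float32)
q, scale = quantize_per_tensor_sym(x)
x_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

The "Sym" vs "Asym" scheme names in the model list refer to whether this mapping is symmetric around zero (as above) or adds a zero-point offset; "Static" vs "Dynamic" refers to whether activation scales are fixed at calibration time or computed per batch.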

FOLLOW UP

  • [TPU] Mixed precision
  • [TPU] Estimated memory usage is elevated because peak_bytes captures some intermediate tensors; fix this.
  • [Software Quality] Add TritonScaledMMLinear abstraction
  • [Software Quality] Convert Fp8 methods to use Kernel abstraction

@robertgshaw2-neuralmagic added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jan 7, 2025
@robertgshaw2-neuralmagic added the tpu (Related to Google TPUs) label on Jan 7, 2025
@robertgshaw2-neuralmagic (Collaborator, Author) commented:
@mgoin this is ready to go.

@@ -0,0 +1,74 @@
from typing import List, Optional, Type
@robertgshaw2-neuralmagic (Collaborator, Author) commented:
NOTE for reviewer - this file is not changed, it is just moved

@mgoin (Member) left a review:
LGTM, excellent work

@robertgshaw2-neuralmagic enabled auto-merge (squash) on January 8, 2025 18:31
@robertgshaw2-neuralmagic merged commit 56fe4c2 into vllm-project:main on Jan 8, 2025
56 checks passed
Labels: ci/build, documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs)
Projects: None yet

4 participants