Add support for Marlin 2:4 sparsity #2102
Conversation
@@ -19,6 +19,23 @@ def gptq_marlin_gemm(
    """
    ...

def gptq_marlin_24_gemm(
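For context, the new stub most likely mirrors vLLM's signature for this op; the parameter names below are a sketch taken from vLLM's Marlin 2:4 kernel, not verified against this diff:

```python
import torch

def gptq_marlin_24_gemm(
    a: torch.Tensor,           # activations, shape (size_m, size_k)
    b_q_weight: torch.Tensor,  # packed quantized weights (4- or 8-bit)
    b_meta: torch.Tensor,      # 2:4 sparsity metadata
    b_scales: torch.Tensor,    # dequantization scales
    workspace: torch.Tensor,   # scratch buffer used by the kernel
    num_bits: int,
    size_m: int,
    size_n: int,
    size_k: int,
) -> torch.Tensor:
    """2:4 sparse Marlin GEMM; returns a (size_m, size_n) tensor."""
    ...
```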
Are we running a fork of Marlin?
Shouldn't we use the classic makefile + commit approach instead (or already-made releases, if possible)?
These come from vLLM. We could also import them directly from vLLM if we prefer.
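If we went the import route, it might look roughly like this (a sketch assuming vLLM is installed; `_custom_ops` is a private vLLM module, and its path and signature can change between versions):

```python
from vllm import _custom_ops as ops

# Delegate to vLLM's compiled kernel instead of vendoring the sources.
out = ops.gptq_marlin_24_gemm(
    a, b_q_weight, b_meta, b_scales, workspace,
    num_bits, size_m, size_n, size_k,
)
```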
LGTM
GPTQ_MARLIN_24_MIN_THREAD_N = 128
GPTQ_MARLIN_24_MIN_THREAD_K = 128
GPTQ_MARLIN_24_MAX_PARALLEL = 64
GPTQ_MARLIN_24_SUPPORTED_NUM_BITS = [4, 8]
GPTQ_MARLIN_24_SUPPORTED_GROUP_SIZES = [-1, 128]
This is coming from Marlin directly, I guess?
Yes, they come from upstream.
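For illustration, a minimal sketch of how such constants are typically used to gate kernel selection (`can_use_marlin_24` is a hypothetical helper, not part of this PR):

```python
def can_use_marlin_24(num_bits: int, group_size: int,
                      in_features: int, out_features: int) -> bool:
    """Return True if the 2:4 sparse Marlin kernel supports this layer."""
    return (
        num_bits in GPTQ_MARLIN_24_SUPPORTED_NUM_BITS
        and group_size in GPTQ_MARLIN_24_SUPPORTED_GROUP_SIZES
        # Tile constraints: layer dims must align to the minimum thread tiles.
        and out_features % GPTQ_MARLIN_24_MIN_THREAD_N == 0
        and in_features % GPTQ_MARLIN_24_MIN_THREAD_K == 0
    )
```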
What does this PR do?
This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when (as sketched below):

* the quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
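A minimal sketch of that dispatch (names are illustrative, not the actual text-generation-inference code):

```python
def select_marlin_gemm(quantize: str, checkpoint_format: str):
    # Use the 2:4 sparse kernel only for marlin quantization with a
    # marlin_24 checkpoint; otherwise fall back to the dense Marlin kernel.
    if quantize == "marlin" and checkpoint_format == "marlin_24":
        return gptq_marlin_24_gemm
    return gptq_marlin_gemm
```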
Before submitting

- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.