About lora finetuning of 2:4 sparse and sparse quant models #952

Open
arunpatala opened this issue Dec 4, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@arunpatala

I would like to thank you for a great repo.

I have been testing the newly released sparse quant models and was amazed by the speedup in both latency and throughput.
I just have some questions regarding fine-tuning of 2:4 sparse models.

From what I understood, the model is first sparsified and then fully trained on some data to create the sparse Llama base model.
Since this base model is not instruction-tuned, we do another fine-tuning pass on instruction data (which is much smaller). This still takes as much memory, though less time.

The recipe provided in the examples starts with a dense model and sparsifies it based on calibration data. Fine-tuning is then applied to the sparse model to regain accuracy.

I would like to know if we can start with a sparse base model (like Sparse Llama 3.1 8B) and create a LoRA adapter using a custom dataset. Could there also be a sparsity speedup when training LoRA adapters? This would take a lot less memory than the fine-tuning step after sparsification.
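
As a rough sketch of why a LoRA adapter needs far fewer trainable parameters than full fine-tuning, the count below compares the two for a single linear layer. The 4096 x 4096 shape and rank 16 are illustrative assumptions, not values from any recipe in this repo. Gradient and optimizer state scale with the trainable parameter count, so this is where most of the memory difference would come from.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative shape and rank (assumptions, not taken from the issue).
d_out, d_in, rank = 4096, 4096, 16
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
ratio = lora / full
print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

The frozen base weights still have to sit in memory, which is why storing them in a compressed form matters for the total footprint.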

Does this make sense, assuming vLLM supports serving sparse models with LoRA?
Can all of this also be applied to sparse + w4a16 models to get QLoRA + sparsity for training and inference?

I would like to contribute if anyone can point me in the right direction.

Thanks
Arun

@arunpatala arunpatala added the enhancement New feature or request label Dec 4, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Dec 6, 2024

Hey @arunpatala

Your understanding is correct

LoRA

  • Training LoRA adapters on the 2:4 sparse base model is 100% something that we want to support and intend to support (and we have designed the system in such a way that we could support it). However, we do not currently have the bandwidth to iron out all the user stories and examples in the short term. We would 100% welcome an implementation if this feature is something you are interested in contributing.

In terms of what the feature will enable:

  • You should see weight compression during LoRA fine-tuning from 2:4 sparsity + quantization. However, since the CUDA kernels that accelerate 2:4 sparsity exist only in vLLM and have not been added to compressed-tensors, you will not see a kernel-level speedup. The weight compression will still increase the amount of batching you can do, which may help end-to-end speed.
  • This feature is still valuable: if you train a LoRA adapter on top of the 2:4 sparse model and deploy the model with the unmerged adapter, you will get the benefit of faster deployment in vLLM! This is a great user story!

The key item here will be working on our integration of compressed-tensors with HFQuantizer (https://huggingface.co/docs/transformers/quantization/compressed_tensors) and making sure it is compatible with LoRA training with HF peft. If you're interested in taking this on, we can connect over Slack or live to discuss scoping!

Sneak preview:

We are launching support for 2:4 + fp8 in vLLM next week, so this feature could be very valuable for deploying on H100s.

Performance snapshot :)
image

@arunpatala
Author

Hi,

I’d be interested in contributing to the implementation of this feature. Please share the necessary details and pointers to help me get started. I would also appreciate it if you could verify my understanding:

  1. Current Integration with Compressed-Tensors:

    • As I understand, the base sparse model is not yet utilizing compressed-tensors. To make it compatible with Hugging Face (HF), we need to integrate HFQuantizer to load and save sparse models in a compressed format.
    • Sparse + GPTQ models are already using compressed-tensors. Are these tensors loaded into GPU memory in a compressed format, or are they decompressed before being loaded?
    • Does this provide enough memory savings to allow a larger batch size?
  2. Inference and Training Acceleration:

    • When you mention that the acceleration for 2:4 sparsity in inference is not yet added to compressed-tensors, does this also apply to training (e.g., QLoRA fine-tuning)?
    • Does llm-compressor already have these necessary kernels?
  3. LoRA Fine-Tuning with Sparsity:

    • If we use a base sparse model, QLoRA fine-tuning should be straightforward, though without speed benefits.
    • However, merging the LoRA adapter currently might result in the loss of sparsity. One potential solution is to mask the LoRA weights with the sparse weight mask during training. For example:
      output = SparseLinear(input) + mask ( Lora(input))
    • This approach could enable LoRA fine-tuning of sparse models without sacrificing sparsity.
    • Even if there is no speedup during LoRA training of sparse models, the merged model would retain sparsity, leading to faster inference when tuned on custom datasets.
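
One way to read the masking proposal above is to apply the mask to the LoRA weight delta `B @ A` rather than to the adapter's output, so that the merged weight keeps the base model's 2:4 pattern. The plain-Python sketch below illustrates that reading on a tiny matrix. All function names are hypothetical, and this is not an existing peft or compressed-tensors feature.

```python
def mask_2of4(row):
    """Return a 0/1 mask keeping the 2 largest-magnitude entries in each group of 4."""
    assert len(row) % 4 == 0
    mask = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_masked_lora(weight, mask, lora_b, lora_a):
    """Merge W + mask * (B @ A); positions with mask 0 receive no update."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + m * d for w, m, d in zip(w_row, m_row, d_row)]
        for w_row, m_row, d_row in zip(weight, mask, delta)
    ]


def is_2of4(matrix):
    """True if every group of 4 consecutive entries in each row has at most 2 non-zeros."""
    return all(
        sum(1 for v in row[s:s + 4] if v != 0) <= 2
        for row in matrix
        for s in range(0, len(row), 4)
    )


# Toy dense weight (2 x 8), pruned to 2:4 by magnitude.
dense = [[0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.0, 0.3],
         [0.1, 0.8, -0.6, 0.2, 0.3, -0.9, 0.5, 0.1]]
mask = [mask_2of4(row) for row in dense]
sparse_w = [[w * m for w, m in zip(row, m_row)] for row, m_row in zip(dense, mask)]

# Rank-1 adapter: B is (2 x 1), A is (1 x 8).
lora_b = [[0.5], [-0.25]]
lora_a = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]]

all_ones = [[1] * 8 for _ in range(2)]
unmasked = merge_masked_lora(sparse_w, all_ones, lora_b, lora_a)
masked = merge_masked_lora(sparse_w, mask, lora_b, lora_a)
print(is_2of4(unmasked), is_2of4(masked))
```

In this toy case the unmasked merge fills in the pruned positions, while the masked merge leaves them at zero. For training to match the merged model, the same masked delta would presumably need to be used in the forward pass.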

Please point me to what is missing in the current implementation and where I can find the related code.

Thanks
Arun

I have found the following related links:

HFQuantize
compressed_tensors
Quantization

compressed-tensors

marlin_24

@dsikka
Collaborator

dsikka commented Dec 10, 2024

Hi @arunpatala:

  1. Our sparse models are supported in compressed-tensors, and we are currently in the process of enabling loading them through HFQuantizer in this PR: "Run model as compressed/uncompressed mode", huggingface/transformers#34719 (comment)
  2. Compressed models are loaded in their compressed format and then each layer is decompressed before its forward pass. This is the case when run_compressed is set to True. When it is False, we decompress the entire model after loading.
    The general lifecycle of how quantized parameters are updated can be seen through here: https://github.com/neuralmagic/compressed-tensors/blob/2dcbc9d1dd3f4dc29c280efab481b9f0cfde0a27/src/compressed_tensors/quantization/lifecycle/apply.py#L105
  3. Generally speaking, neither llm-compressor nor compressed-tensors has CUDA kernels for acceleration. These exist only in vLLM. The focus of decompression in compressed-tensors is primarily accuracy testing.
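
To make the per-layer lifecycle in point 2 concrete, here is a toy sketch of a layer that keeps its 2:4 weight packed and rebuilds the dense matrix only inside the forward pass. The packed layout (two kept values plus their positions per group of four) and the `ToyCompressedLinear` class are assumptions for illustration. They are not the format or the classes that compressed-tensors actually uses.

```python
def compress_2of4(row):
    """Pack a 2:4 sparse row as (values, positions): 2 slots per group of 4."""
    values, positions = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        assert len(kept) <= 2, "row is not 2:4 sparse"
        # Pad with unused positions so every group stores exactly 2 slots.
        kept += [i for i in range(4) if i not in kept][: 2 - len(kept)]
        kept.sort()
        values.extend(group[i] for i in kept)
        positions.extend(kept)
    return values, positions


def decompress_2of4(values, positions, length):
    """Rebuild the dense row from packed values and in-group positions."""
    row = [0.0] * length
    for slot, (v, i) in enumerate(zip(values, positions)):
        row[(slot // 2) * 4 + i] = v
    return row


class ToyCompressedLinear:
    """Holds packed rows; the dense weight exists only during forward()."""

    def __init__(self, weight):
        self.in_features = len(weight[0])
        self.packed = [compress_2of4(row) for row in weight]

    def forward(self, x):
        weight = [decompress_2of4(v, p, self.in_features) for v, p in self.packed]
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


weight = [[0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, 0.3],
          [0.0, 0.8, -0.6, 0.0, 0.0, 0.0, 0.5, 0.0]]
layer = ToyCompressedLinear(weight)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
dense_out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
print(layer.forward(x) == dense_out)

# Storage per group of 4, assuming 16-bit values and 2-bit positions.
dense_bits = 4 * 16
packed_bits = 2 * 16 + 2 * 2
print(packed_bits / dense_bits)
```

Under those assumed bit widths, the packed weight takes a little over half the space of the dense one. That saving is what frees memory for larger batches, even though the matmul itself runs on the decompressed weight without any sparse-kernel speedup.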
