From ed42966af65b5dd8318111e2031140d09f400f3f Mon Sep 17 00:00:00 2001 From: Steven Liu Date: Fri, 20 Dec 2024 16:04:26 -0800 Subject: [PATCH] quant pt 3 --- docs/source/en/quantization/aqlm.md | 27 ++-- docs/source/en/quantization/awq.md | 132 ++++++++------------ docs/source/en/quantization/bitnet.md | 51 ++------ docs/source/en/quantization/bitsandbytes.md | 111 ++++++---------- docs/source/en/quantization/gptq.md | 88 +++++++------ 5 files changed, 163 insertions(+), 246 deletions(-) diff --git a/docs/source/en/quantization/aqlm.md b/docs/source/en/quantization/aqlm.md index 2e00d94cfcfff3..9c9b6ac0715d65 100644 --- a/docs/source/en/quantization/aqlm.md +++ b/docs/source/en/quantization/aqlm.md @@ -16,19 +16,17 @@ rendered properly in your Markdown viewer. # AQLM -> [!TIP] -> Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)! +Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. -Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. +AQLM also supports fine-tuning with [LoRA](https://huggingface.co/docs/peft/package_reference/lora) with the [PEFT](https://huggingface.co/docs/peft) library, and is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training. + +Run the command below to install the AQLM library with kernel support for both GPU and CPU inference and training. AQLM only works with Python 3.10+. -Inference support for AQLM is realised in the `aqlm` library. Make sure to install it to run the models (note aqlm works only with python>=3.10): ```bash pip install aqlm[gpu,cpu] ``` -The library provides efficient kernels for both GPU and CPU inference and training. - -The instructions on how to quantize models yourself, as well as all the relevant code can be found in the corresponding GitHub [repository](https://github.com/Vahe1994/AQLM). To run AQLM models simply load a model that has been quantized with AQLM: +Load an AQLM-quantized model with [`~PreTrainedModel.from_pretrained`]. ```python from transformers import AutoTokenizer, AutoModelForCausalLM @@ -38,20 +36,21 @@ quantized_model = AutoModelForCausalLM.from_pretrained( torch_dtype="auto", device_map="auto" ) -tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf") ``` -## PEFT - -Starting with version `aqlm 1.0.2`, AQLM supports Parameter-Efficient Fine-Tuning in a form of [LoRA](https://huggingface.co/docs/peft/package_reference/lora) integrated into the [PEFT](https://huggingface.co/blog/peft) library. +## Configurations -## AQLM configurations +AQLM quantization setups vary mainly in the number of codebooks used, as well as codebook sizes in bits. The most popular setups and supported inference kernels are shown below. -AQLM quantization setups vary mainly on the number of codebooks used as well as codebook sizes in bits. 
The most popular setups, as well as inference kernels they support are: - | Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference | |---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------| | Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ | | CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ | | CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ | | Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ | + +## Resources + +Run the AQLM demo [notebook](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing) for more examples of how to quantize a model, push a quantized model to the Hub, and more. + +For more example demo notebooks, visit the AQLM [repository](https://github.com/Vahe1994/AQLM). diff --git a/docs/source/en/quantization/awq.md b/docs/source/en/quantization/awq.md index ca26844edd0294..19361bc71d18b2 100644 --- a/docs/source/en/quantization/awq.md +++ b/docs/source/en/quantization/awq.md @@ -16,23 +16,17 @@ rendered properly in your Markdown viewer. # AWQ - - -Try AWQ quantization with this [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)! - - - -[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss such that you can run models in 4-bit precision without experiencing any performance degradation. +[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) preserves a small fraction of the weights that are important for LLM performance to compress a model to 4-bits with minimal performance degradation. There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models. -Make sure you have autoawq installed: +Run the command below to install autoawq ```bash pip install autoawq ``` -AWQ-quantized models can be identified by checking the `quantization_config` attribute in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file: +Identify an AWQ-quantized model by checking the `quant_method` key in the models [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file. ```json { @@ -53,62 +47,59 @@ AWQ-quantized models can be identified by checking the `quantization_config` att } ``` -A quantized model is loaded with the [`~PreTrainedModel.from_pretrained`] method. If you loaded your model on the CPU, make sure to move it to a GPU device first. Use the `device_map` parameter to specify where to place the model: +Load the AWQ-quantized model with [`~PreTrainedModel.from_pretrained`]. This automatically sets the other weights to fp16 by default for performance reasons. Use the `torch_dtype` parameter to load these other weights in a different format. 
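If you want to verify the `quant_method` shown above programmatically instead of opening config.json by hand, the checkpoint configuration can be inspected before any weights are downloaded. The snippet below is a small sketch that assumes the quantization settings are exposed through the standard `quantization_config` entry of the model config.

```py
from transformers import AutoConfig

# Download only the configuration and look at the quantization metadata
config = AutoConfig.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
quantization_config = getattr(config, "quantization_config", None)

if quantization_config is None:
    print("no quantization metadata found")
elif isinstance(quantization_config, dict):
    print(quantization_config.get("quant_method"))  # expected: "awq"
else:
    print(quantization_config.quant_method)  # some configs expose it as an attribute
```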
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_id = "TheBloke/zephyr-7B-alpha-AWQ"
-model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
-```
-
-Loading an AWQ-quantized model automatically sets other weights to fp16 by default for performance reasons. If you want to load these other weights in a different format, use the `torch_dtype` parameter:
+If the model is loaded on the CPU, use the `device_map` parameter to move it to a GPU.

```py
+import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

-model_id = "TheBloke/zephyr-7B-alpha-AWQ"
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
+model = AutoModelForCausalLM.from_pretrained(
+    "TheBloke/zephyr-7B-alpha-AWQ",
+    torch_dtype=torch.float32,
+    device_map="cuda:0"
+)
```

-AWQ quantization can also be combined with [FlashAttention-2](../perf_infer_gpu_one#flashattention-2) to further accelerate inference:
+Use `attn_implementation` to enable [FlashAttention2](../perf_infer_gpu_one#flashattention-2) to further accelerate inference.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

-model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")
+model = AutoModelForCausalLM.from_pretrained(
+    "TheBloke/zephyr-7B-alpha-AWQ",
+    attn_implementation="flash_attention_2",
+    device_map="cuda:0"
+)
```

## Fused modules

-Fused modules offers improved accuracy and performance and it is supported out-of-the-box for AWQ modules for [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.
-
-<Tip warning={true}>
-
-Fused modules cannot be combined with other optimization techniques such as FlashAttention-2.
+Fused modules offer improved accuracy and performance. They are supported out-of-the-box for AWQ modules for [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.

-</Tip>
+> [!WARNING]
+> Fused modules cannot be combined with other optimization techniques such as FlashAttention2.

-To enable fused modules for supported architectures, create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True`. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. You can set it to a larger value to be safe.
+Create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True` to enable fused modules. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. Set it to a larger value to be safe.

-For example, to fuse the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model.
+The example below fuses the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model.

```python import torch from transformers import AwqConfig, AutoModelForCausalLM -model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ" - quantization_config = AwqConfig( bits=4, fuse_max_seq_len=512, do_fuse=True, ) - -model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0) +model = AutoModelForCausalLM.from_pretrained( + "TheBloke/Mistral-7B-OpenOrca-AWQ", + quantization_config=quantization_config +).to(0) ``` The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules. @@ -153,14 +144,14 @@ The speed and throughput of fused and unfused modules were also tested with the -For architectures that don't support fused modules yet, you need to create a custom fusing mapping to define which modules need to be fused with the `modules_to_fuse` parameter. For example, to fuse the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model. +For architectures that don't support fused modules, create an [`AwqConfig`] and define a custom fusing mapping in `modules_to_fuse` to determine which modules need to be fused. + +The example below fuses the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model. ```python import torch from transformers import AwqConfig, AutoModelForCausalLM -model_id = "TheBloke/Yi-34B-AWQ" - quantization_config = AwqConfig( bits=4, fuse_max_seq_len=512, @@ -175,35 +166,46 @@ quantization_config = AwqConfig( } ) -model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0) +model = AutoModelForCausalLM.from_pretrained( + "TheBloke/Yi-34B-AWQ", + quantization_config=quantization_config +).to(0) ``` -The parameter `modules_to_fuse` should include: +The parameter `modules_to_fuse` should include the following keys. - `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list. - `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list. - `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer in the order: (gate (dense, layer, post-attention) / up / down layers). - `"use_alibi"`: If your model uses ALiBi positional embedding. - `"num_attention_heads"`: The number of attention heads. -- `"num_key_value_heads"`: The number of key value heads that should be used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. +- `"num_key_value_heads"`: The number of key value heads that should be used to implement Grouped Query Attention (GQA). + + | parameter value | attention | + |---|---| + | `num_key_value_heads=num_attention_heads` | Multi-Head Attention | + | `num_key_value_heads=1` | Multi-Query Attention | + | `num_key_value_heads=...` | Grouped Query Attention | + - `"hidden_size"`: The dimension of the hidden representations. +## ExLlamaV2 - -## ExLlama-v2 support - -Recent versions of `autoawq` supports ExLlama-v2 kernels for faster prefill and decoding. 
To get started, first install the latest version of `autoawq` by running:
+[ExLlamaV2](https://github.com/turboderp/exllamav2) kernels support faster prefill and decoding. Run the command below to install the latest version of autoawq with ExLlamaV2 support.

```bash
pip install git+https://github.com/casper-hansen/AutoAWQ.git
```

-Get started by passing an `AwqConfig()` with `version="exllama"`.
+Set `version="exllama"` in [`AwqConfig`] to enable ExLlamaV2 kernels.

-```python
+> [!TIP]
+> ExLlamaV2 is supported on AMD GPUs.
+
+```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version="exllama")
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)
-
-input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
-output = model(input_ids)
-print(output.logits)
-
-tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
-input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
-output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=50256)
-print(tokenizer.decode(output[0], skip_special_tokens=True))
```

-<Tip warning={true}>
-
-Note this feature is supported on AMD GPUs.
-
-</Tip>
-
+## CPU

-## CPU support
-
-Recent versions of `autoawq` supports CPU with ipex op optimizations. To get started, first install the latest version of `autoawq` by running:
+[Intel Extension for PyTorch (IPEX)](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/) is designed to enable performance optimizations on Intel hardware. Run the command below to install the latest version of autoawq with IPEX support.

```bash
pip install intel-extension-for-pytorch
pip install git+https://github.com/casper-hansen/AutoAWQ.git
```

-Get started by passing an `AwqConfig()` with `version="ipex"`.
+Set `version="ipex"` in [`AwqConfig`] to enable the IPEX kernels.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version="ipex")
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",
    quantization_config=quantization_config,
    device_map="cpu",
)
-
-input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cpu")
-output = model(input_ids)
-print(output.logits)
-
-tokenizer = AutoTokenizer.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ")
-input_ids = tokenizer.encode("How to make a cake", return_tensors="pt")
-pad_token_id = tokenizer.eos_token_id
-output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=pad_token_id)
-print(tokenizer.decode(output[0], skip_special_tokens=True))
```

-<Tip warning={true}>
-
-Note this feature is supported on Intel CPUs.
+## Resources

-</Tip>
\ No newline at end of file
+Run the AWQ demo [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY#scrollTo=Wwsg6nCwoThm) for more examples of how to quantize a model, push a quantized model to the Hub, and more.
diff --git a/docs/source/en/quantization/bitnet.md b/docs/source/en/quantization/bitnet.md
index 6bd65e8b53a476..5f713a20d3fd23 100644
--- a/docs/source/en/quantization/bitnet.md
+++ b/docs/source/en/quantization/bitnet.md
@@ -16,60 +16,33 @@ rendered properly in your Markdown viewer.

# BitNet

-[BitNet](https://arxiv.org/abs/2402.17764) replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinear with ternary (or binary in the older version) precision. 
The BitLinear layers introduced here quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision. - +[BitNet](https://arxiv.org/abs/2402.17764) replaces traditional linear layers in Multi-Head Attention and feed-forward networks with specialized BitLinear layers. The BitLinear layers quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision.
The BitNet architecture with BitLinear layers" /> 
The architecture of BitNet with BitLinear layers
+
The architecture of BitNet with BitLinear layers.
-During training, we start by quantizing the weights into ternary values, using symmetric per tensor quantization. First, we compute the average of the absolute values of the weight matrix and use this as a scale. We then divide the weights by the scale, round the values, constrain them between -1 and 1, and finally rescale them to continue in full precision. - -$$ -scale_w = \frac{1}{\frac{1}{nm} \sum_{ij} |W_{ij}|} -$$ - -$$ -W_q = \text{clamp}_{[-1,1]}(\text{round}(W*scale)) -$$ - -$$ -W_{dequantized} = W_q*scale_w -$$ - -Activations are then quantized to a specified bit-width (e.g., 8-bit) using [absmax](https://arxiv.org/pdf/2208.07339) quantization (symmetric per channel quantization). This involves scaling the activations into a range [−128,127[. The quantization formula is: +BitNet models can't be quantized on the fly. They need to be quantized during pretraining or fine-tuning because it is a Quantization-Aware Training (QAT) technique. During training, the weights are quantized to ternary values with symmetric per tensor quantization. -$$ -scale_x = \frac{127}{|X|_{\text{max}, \, \text{dim}=-1}} -$$ +1. Compute the average of the absolute values of the weight matrix and use as a scale. +2. Divide the weights by the scale, round the values, constrain them between -1 and 1, and rescale them to continue in full precision. +3. Activations are quantized to a specified bit-width (8-bit) using [absmax](https://arxiv.org/pdf/2208.07339) quantization (symmetric per channel quantization). This involves scaling the activations into a range of [−128,127]. -$$ -X_q = \text{clamp}_{[-128,127]}(\text{round}(X*scale)) -$$ +Refer to this [PR](https://github.com/huggingface/nanotron/pull/180) to pretrain or fine-tune a 1.58-bit model with [Nanotron](https://github.com/huggingface/nanotron). For fine-tuning, convert a model from the Hugging Face to Nanotron format. Find the conversion steps in this [PR](https://github.com/huggingface/nanotron/pull/174). -$$ -X_{dequantized} = X_q * scale_x -$$ - -To learn more about how we trained, and fine-tuned bitnet models checkout the blogpost [here](https://huggingface.co/blog/1_58_llm_extreme_quantization) - -## Load a BitNet Model from the Hub -BitNet models can't be quantized on the fly—they need to be pre-trained or fine-tuned with the quantization applied (it's a Quantization aware training technique). Once trained, these models are already quantized and available as packed versions on the hub. - -A quantized model can be load : +Load a BitNet quantized model with [`~PreTrainedModel.from_pretrained`]. ```py from transformers import AutoModelForCausalLM path = "/path/to/model" model = AutoModelForCausalLM.from_pretrained(path, device_map="auto") ``` -## Pre-training / Fine-tuning a BitNet Model -If you're looking to pre-train or fine-tune your own 1.58-bit model using Nanotron, check out this [PR](https://github.com/huggingface/nanotron/pull/180), all you need to get started is there ! +## Kernels -For fine-tuning, you'll need to convert the model from Hugging Face format to Nanotron format (which has some differences). You can find the conversion steps in this [PR](https://github.com/huggingface/nanotron/pull/174). +`@torch.compile` is used to unpack the weights and perform the forward pass. It’s very straightforward to implement and delivers significant speed improvements. Additional optimized kernels will be integrated in future versions. 
-## Kernels
+## Resources

-In our initial version, we chose to use `@torch.compile` to unpack the weights and perform the forward pass. It’s very straightforward to implement and delivers significant speed improvements. We plan to integrate additional optimized kernels in future versions.
\ No newline at end of file
+Read [Fine-tuning LLMs to 1.58bit: extreme quantization made easy](https://huggingface.co/blog/1_58_llm_extreme_quantization) to learn more about how BitNet models are trained and fine-tuned.
diff --git a/docs/source/en/quantization/bitsandbytes.md b/docs/source/en/quantization/bitsandbytes.md
index 6c6b92d0a6e594..d00fdc1691b865 100644
--- a/docs/source/en/quantization/bitsandbytes.md
+++ b/docs/source/en/quantization/bitsandbytes.md
@@ -16,42 +16,24 @@ rendered properly in your Markdown viewer.

# bitsandbytes

-[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance. 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
+[bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) features LLM.int8 and QLoRA quantization to enable accessible large language model inference and training.

-To use bitsandbytes, make sure you have the following libraries installed:
+LLM.int8 matrix multiplication, or 8-bit quantization, uses vector-wise quantization to quantize most of the weights to 8-bit and treats outliers with 16-bit matrix multiplication to reduce their degradative effect on model accuracy.

-
-
+QLoRA, or 4-bit quantization, compresses a model even further to 4-bits and inserts a small set of trainable low-rank adaptation (LoRA) weights to allow training.

-```bash
-pip install transformers accelerate bitsandbytes>0.37.0
-```
-
-
-
+Run the command below to install bitsandbytes.

```bash
-pip install bitsandbytes>=0.39.0
-pip install --upgrade accelerate transformers
+pip install --upgrade transformers accelerate bitsandbytes
```

-
-
-
-
-
-bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend).
-
-We value your feedback to help identify bugs before the full release! Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
-
-
-Now you can quantize a model by passing a `BitsAndBytesConfig` to [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers.
+Quantize a model by passing a [`BitsAndBytesConfig`] to [`~PreTrainedModel.from_pretrained`]. This works for any model in any modality, as long as it supports [Accelerate](https://huggingface.co/docs/accelerate/index) and contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.
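As a rough illustration of the LLM.int8 idea described above, the sketch below quantizes the weight of a single `torch.nn.Linear`-style matrix row-wise to int8 and keeps the hidden dimensions hit by activation outliers in higher precision. It is a toy decomposition for intuition only; the real implementation lives in the fused CUDA kernels inside bitsandbytes.

```py
import torch

def absmax_int8(t: torch.Tensor):
    # symmetric per-row absmax quantization to int8
    scale = t.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    return (t / scale).round().to(torch.int8), scale

threshold = 6.0
x = torch.randn(4, 512)       # activations
x[:, :2] *= 10                # inject a couple of outlier hidden dimensions
w = torch.randn(256, 512)     # weight of a Linear(512, 256)

# hidden dimensions where any activation exceeds the threshold are outliers
outliers = (x.abs() > threshold).any(dim=0)

# the non-outlier part goes through int8, the outlier part stays in high precision
x_q, x_scale = absmax_int8(x[:, ~outliers])
w_q, w_scale = absmax_int8(w[:, ~outliers])

y_int8 = (x_q.float() @ w_q.float().t()) * x_scale * w_scale.t()  # emulated int8 matmul
y_fp = x[:, outliers] @ w[:, outliers].t()                        # 16-bit path in the real kernels
y = y_int8 + y_fp   # ≈ x @ w.t()
```

[`BitsAndBytesConfig`] exposes this outlier threshold as `llm_int8_threshold`, which is covered later in this guide.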
-Quantizing a model in 8-bit halves the memory-usage, and for large models, set `device_map="auto"` to efficiently use the GPUs available: +Quantizing a model in 8-bit halves the memory-usage, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs. ```py from transformers import AutoModelForCausalLM, BitsAndBytesConfig @@ -64,7 +46,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained( ) ``` -By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file. +By default, all other modules such as [torch.nn.LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file. ```py import torch @@ -80,7 +62,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained( model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype ``` -Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with the [`~PreTrainedModel.push_to_hub`] method. The quantization config.json file is pushed first, followed by the quantized model weights. +Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with [`~PreTrainedModel.push_to_hub`]. The quantization config.json file is pushed first, followed by the quantized model weights. ```py from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig @@ -99,7 +81,7 @@ model.push_to_hub("bloom-560m-8bit") -Quantizing a model in 4-bit reduces your memory-usage by 4x, and for large models, set `device_map="auto"` to efficiently use the GPUs available: +Quantizing a model in 4-bit reduces your memory-usage by 4x, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs. ```py from transformers import AutoModelForCausalLM, BitsAndBytesConfig @@ -112,7 +94,7 @@ model_4bit = AutoModelForCausalLM.from_pretrained( ) ``` -By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file. +By default, all other modules such as [torch.nn.LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.. Setting `torch_dtype="auto"` loads the model in the data type defined in a model's `config.json` file. ```py import torch @@ -128,24 +110,21 @@ model_4bit = AutoModelForCausalLM.from_pretrained( model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype ``` -If you have `bitsandbytes>=0.41.3`, you can serialize 4-bit models and push them on Hugging Face Hub. 
Simply call `model.push_to_hub()` after loading it in 4-bit precision. You can also save the serialized 4-bit models locally with `model.save_pretrained()` command. +Make sure you have the latest bitsandbytes version so you can serialize 4-bit models and push them to the Hub with [`~PreTrainedModel.push_to_hub`]. Use [`~PreTrainedModel.save_pretrained`] to save the 4-bit model locally. - +> [!WARNING] +> 8 and 4-bit training is only supported for training *extra* parameters. -Training with 8-bit and 4-bit weights are only supported for training *extra* parameters. - - - -You can check your memory footprint with the `get_memory_footprint` method: +Check your memory footprint with `get_memory_footprint`. ```py print(model.get_memory_footprint()) ``` -Quantized models can be loaded from the [`~PreTrainedModel.from_pretrained`] method without needing to specify the `load_in_8bit` or `load_in_4bit` parameters: +Load quantized models with [`~PreTrainedModel.from_pretrained`] without a `quantization_config`. ```py from transformers import AutoModelForCausalLM, AutoTokenizer @@ -153,19 +132,13 @@ from transformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto") ``` -## 8-bit (LLM.int8() algorithm) - - - -Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)! +## LLM.int8 - - -This section explores some of the specific features of 8-bit models, such as offloading, outlier thresholds, skipping module conversion, and finetuning. +This section explores some of the specific features of 8-bit quantization, such as offloading, outlier thresholds, skipping module conversion, and finetuning. ### Offloading -8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in **float32**, and aren't converted to 8-bit. For example, to enable offloading for the [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) model, start by creating a [`BitsAndBytesConfig`]: +8-bit models can offload weights between the CPU and GPU to fit very large models into memory. The weights dispatched to the CPU are stored in **float32** and aren't converted to 8-bit. For example, enable offloading for [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) through [`BitsAndBytesConfig`]. ```py from transformers import AutoModelForCausalLM, BitsAndBytesConfig @@ -173,7 +146,7 @@ from transformers import AutoModelForCausalLM, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True) ``` -Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU: +Design a custom device map to fit everything on your GPU except for the `lm_head`, which is dispatched to the CPU. ```py device_map = { @@ -185,7 +158,7 @@ device_map = { } ``` -Now load your model with the custom `device_map` and `quantization_config`: +Now load your model with the custom `device_map` and `quantization_config`. ```py model_8bit = AutoModelForCausalLM.from_pretrained( @@ -200,7 +173,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained( An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 
8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning). -To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]: +To find the best threshold for your model, experiment with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]. ```py from transformers import AutoModelForCausalLM, BitsAndBytesConfig @@ -221,7 +194,7 @@ model_8bit = AutoModelForCausalLM.from_pretrained( ### Skip module conversion -For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit which can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]: +For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit because it can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]. ```py from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig @@ -242,22 +215,15 @@ model_8bit = AutoModelForCausalLM.from_pretrained( ### Finetuning -With the [PEFT](https://github.com/huggingface/peft) library, you can finetune large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it'll automatically load your model on a GPU. However, you can still customize the device map with the `device_map` parameter if you want to (`device_map="auto"` should only be used for inference). - -## 4-bit (QLoRA algorithm) - - +The [PEFT](https://github.com/huggingface/peft) library supports fine-tuning large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it automatically loads your model on a GPU. However, you can still customize the device map with the `device_map` parameter (`device_map="auto"` should only be used for inference). -Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about it's details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes). - - - -This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization. +## QLoRA +This section explores some of the specific features of 4-bit quantization, such as changing the compute data type, the Normal Float 4 (NF4) data type, and nested quantization. ### Compute data type -To speedup computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]: +Change the data type from float32 (the default value) to bf16 in [`BitsAndBytesConfig`] to speedup computation. 
```py import torch @@ -268,7 +234,7 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dty ### Normal Float 4 (NF4) -NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]: +NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. ```py from transformers import BitsAndBytesConfig @@ -285,7 +251,7 @@ For inference, the `bnb_4bit_quant_type` does not have a huge impact on performa ### Nested quantization -Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps. +Nested quantization can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enable gradient accumulation with 4 steps. ```py from transformers import BitsAndBytesConfig @@ -298,22 +264,19 @@ double_quant_config = BitsAndBytesConfig( model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", torch_dtype="auto", quantization_config=double_quant_config) ``` -## Dequantizing `bitsandbytes` models +## Dequantizing bitsandbytes models -Once quantized, you can dequantize the model to the original precision but this might result in a small quality loss of the model. Make sure you have enough GPU RAM to fit the dequantized model. +Once quantized, you can [`~PreTrainedModel.dequantize`] a model to the original precision but this may result in some quality loss. Make sure you have enough GPU memory to fit the dequantized model. ```python from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer -model_id = "facebook/opt-125m" - -model = AutoModelForCausalLM.from_pretrained(model_id, BitsAndBytesConfig(load_in_4bit=True)) -tokenizer = AutoTokenizer.from_pretrained(model_id) - +model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", BitsAndBytesConfig(load_in_4bit=True)) model.dequantize() +``` + +## Resources -text = tokenizer("Hello my name is", return_tensors="pt").to(0) +Learn more about the details of 8-bit quantization in [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration). 
-out = model.generate(**text)
-print(tokenizer.decode(out[0]))
-```
\ No newline at end of file
+Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about its details in [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
diff --git a/docs/source/en/quantization/gptq.md b/docs/source/en/quantization/gptq.md
index 5713ef4132a9a8..7c79c360b328d0 100644
--- a/docs/source/en/quantization/gptq.md
+++ b/docs/source/en/quantization/gptq.md
@@ -16,64 +16,60 @@ rendered properly in your Markdown viewer.

# GPTQ

-<Tip>
+[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can save your memory-usage by 4x because the int4 weights are dequantized in a fused kernel rather than a GPU's global memory. Inference is also faster because a lower bitwidth takes less time to communicate.

-Try GPTQ quantization with PEFT in this [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) and learn more about it's details in this [blog post](https://huggingface.co/blog/gptq-integration)!
-
-</Tip>
-
-The [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can save your memory-usage by 4x because the int4 weights are dequantized in a fused kernel rather than a GPU's global memory, and you can also expect a speedup in inference because using a lower bitwidth takes less time to communicate.
-
-Before you begin, make sure the following libraries are installed:
+Run the commands below to install AutoGPTQ.

```bash
pip install auto-gptq
pip install --upgrade accelerate optimum transformers
```

-To quantize a model (currently only supported for text models), you need to create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
+Create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

-model_id = "facebook/opt-125m"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
+tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
```

-You could also pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper.
+> [!TIP]
+> You can pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper. 
+> +> ```py +> dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."] +> gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer) +> ``` -```py -dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."] -gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer) -``` - -Load a model to quantize and pass the `gptq_config` to the [`~AutoModelForCausalLM.from_pretrained`] method. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization. +Load a model to quantize and pass [`GPTQConfig`] to [`~AutoModelForCausalLM.from_pretrained`]. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization. ```py -quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config) +quantized_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto", quantization_config=gptq_config) ``` -If you're running out of memory because a dataset is too large, disk offloading is not supported. If this is the case, try passing the `max_memory` parameter to allocate the amount of memory to use on your device (GPU and CPU): +If you're running out of memory because a dataset is too large (disk offloading is not supported), try passing the `max_memory` parameter to allocate the amount of memory to use on your device (GPU and CPU). ```py -quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config) +quantized_model = AutoModelForCausalLM.from_pretrained( + "facebook/opt-125m", + device_map="auto", + max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, + quantization_config=gptq_config +) ``` - - -Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists. +> [!WARNING] +> Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists. - - -Once your model is quantized, you can push the model and tokenizer to the Hub where it can be easily shared and accessed. Use the [`~PreTrainedModel.push_to_hub`] method to save the [`GPTQConfig`]: +Once a model is quantized, you can use [`~PreTrainedModel.push_to_hub`] to push the model and tokenizer to the Hub where it can be easily shared and accessed. This saves the [`GPTQConfig`]. ```py quantized_model.push_to_hub("opt-125m-gptq") tokenizer.push_to_hub("opt-125m-gptq") ``` -You could also save your quantized model locally with the [`~PreTrainedModel.save_pretrained`] method. 
If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. For example, to save the model on a CPU: +[`~PreTrainedModel.save_pretrained`] saves a quantized model locally. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. The example below saves the model on a CPU. ```py quantized_model.save_pretrained("opt-125m-gptq") @@ -84,7 +80,7 @@ quantized_model.to("cpu") quantized_model.save_pretrained("opt-125m-gptq") ``` -Reload a quantized model with the [`~PreTrainedModel.from_pretrained`] method, and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed. +Reload a quantized model with [`~PreTrainedModel.from_pretrained`], and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed. ```py from transformers import AutoModelForCausalLM @@ -94,27 +90,39 @@ model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", de ## ExLlama -[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter: +> [!WARNING] +> Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT. + +[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. + +To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter in [`GPTQConfig`]. ```py import torch from transformers import AutoModelForCausalLM, GPTQConfig gptq_config = GPTQConfig(bits=4, exllama_config={"version":2}) -model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config) +model = AutoModelForCausalLM.from_pretrained( + "{your_username}/opt-125m-gptq", + device_map="auto", + quantization_config=gptq_config +) ``` - - -Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT. - - - -The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel. This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file. +The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ 0.4.2+, disable the ExLlama kernel in [`GPTQConfig`]. 
This overwrites the attributes related to the ExLlama kernels in the quantization config of the `config.json` file. ```py import torch from transformers import AutoModelForCausalLM, GPTQConfig + gptq_config = GPTQConfig(bits=4, use_exllama=False) -model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config) -``` \ No newline at end of file +model = AutoModelForCausalLM.from_pretrained( + "{your_username}/opt-125m-gptq", + device_map="cpu", + quantization_config=gptq_config +) +``` + +## Resources + +Run the GPTQ quantization with PEFT [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) for a hands-on experience, and read [Making LLMs lighter with AutoGPTQ and transformers](https://huggingface.co/blog/gptq-integration) to learn more about the AutoGPTQ integration.