
serialization
stevhliu committed Dec 31, 2024
1 parent 26ab4ec commit 01a9c91
Showing 5 changed files with 90 additions and 181 deletions.
8 changes: 5 additions & 3 deletions docs/source/en/_toctree.yml
@@ -174,6 +174,8 @@
title: GGUF
- local: quantization/gptq
title: GPTQ
- local: quantization/higgs
title: HIGGS
- local: quantization/hqq
title: HQQ
- local: quantization/optimum
@@ -190,11 +192,11 @@
isExpanded: False
sections:
- local: serialization
title: Export to ONNX
title: ONNX
- local: tflite
title: Export to TFLite
title: LiteRT
- local: torchscript
title: Export to TorchScript
title: TorchScript
- title: Resources
isExpanded: False
sections:
62 changes: 38 additions & 24 deletions docs/source/en/quantization/higgs.md
@@ -16,11 +16,30 @@ rendered properly in your Markdown viewer.

# HIGGS

HIGGS is a 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper [arxiv.org/abs/2411.17525](https://arxiv.org/abs/2411.17525).
[HIGGS](https://arxiv.org/abs/2411.17525) is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and state-of-the-art performance.

Runtime support for HIGGS is implemented through [FLUTE](https://arxiv.org/abs/2407.10960), and its [library](https://github.com/HanGuo97/flute).
Runtime support for HIGGS is implemented through the [FLUTE](https://github.com/HanGuo97/flute) library. Only the 8B, 70B, and 405B variants of Llama 3.1 and Llama 3.0, and the 9B and 27B variants of Gemma 2 are currently supported. HIGGS also doesn't currently support quantized training (or backward passes in general).

## Quantization Example
Run the command below for your CUDA version to install FLUTE.

<hfoptions id="install">
<hfoption id="CUDA 12.1">

```bash
pip install flute-kernel
```

</hfoption>
<hfoption id="CUDA 11.8">

```bash
pip install flute-kernel -i https://flute-ai.github.io/whl/cu118
```

</hfoption>
</hfoptions>
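
If you aren't sure which build to pick, the quick check below (an addition, not part of the original install instructions) prints the CUDA version your PyTorch installation was built against.

```python
import torch

# Prints the CUDA version PyTorch was built against, for example "12.1" or "11.8".
# Pick the FLUTE wheel that matches this version.
print(torch.version.cuda)
```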

Create a [`HiggsConfig`] with the number of bits to quantize a model to.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
@@ -30,37 +49,32 @@ model = AutoModelForCausalLM.from_pretrained(
quantization_config=HiggsConfig(bits=4),
device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

tokenizer.decode(model.generate(
**tokenizer("Hi,", return_tensors="pt").to(model.device),
temperature=0.5,
top_p=0.80,
)[0])
```
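
As a quick sanity check, the sketch below (an addition, not part of the original example) reports the quantized model's memory usage with the standard [`~PreTrainedModel.get_memory_footprint`] method; compare it against the same checkpoint loaded in bf16.

```python
from transformers import AutoModelForCausalLM, HiggsConfig

# Reload the checkpoint with 4-bit HIGGS and report how much memory it occupies.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```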

## Pre-quantized models
> [!TIP]
> Find models pre-quantized with HIGGS in the official ISTA-DASLab [collection](https://huggingface.co/collections/ISTA-DASLab/higgs-675308e432fd56b7f6dab94e).
Some pre-quantized models can be found in the [official collection](https://huggingface.co/collections/ISTA-DASLab/higgs-675308e432fd56b7f6dab94e) on Hugging Face Hub.
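
A pre-quantized checkpoint loads like any other Transformers model, since the quantization settings travel with the checkpoint's config; the repo id below is a placeholder, so substitute an actual model from the collection.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- replace it with an actual checkpoint from the HIGGS collection.
repo_id = "ISTA-DASLab/<higgs-quantized-model>"

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```
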
## torch.compile

## Current Limitations
HIGGS is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).

**Architectures**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

Currently, FLUTE, and HIGGS by extension, **only support Llama 3 and 3.0 of 8B, 70B and 405B parameters, as well as Gemma-2 9B and 27B**. We're working on supporting more diverse models, as well as arbitrary models, by modifying the FLUTE compilation procedure.
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2-9b-it",
quantization_config=HiggsConfig(bits=4),
device_map="auto",
)

**torch.compile**
model = torch.compile(model)
```
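
For a rough throughput number comparable to the table below, you can time compiled forward passes directly. This sketch is only an illustration (the prompt and iteration counts are arbitrary) and assumes the `model` from the snippet above plus a CUDA GPU.

```python
import time

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)

with torch.no_grad():
    # Warmup so torch.compile finishes compiling before timing starts.
    for _ in range(10):
        model(**inputs)
    torch.cuda.synchronize()

    n = 50
    start = time.perf_counter()
    for _ in range(n):
        model(**inputs)
    torch.cuda.synchronize()

print(f"{n / (time.perf_counter() - start):.1f} forward passes/sec")
```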

HIGGS is fully compatible with `torch.compile`. Compiling `model.forward`, as described [here](../perf_torch_compile.md), gives the following speedups on an RTX 4090 for `Llama-3.1-8B-Instruct` (forward passes/sec):
Refer to the table below for a benchmark of forward passes/sec for Llama-3.1-8B-Instruct on an RTX 4090.

| Batch Size | BF16 (With `torch.compile`) | HIGGS 4bit (No `torch.compile`) | HIGGS 4bit (With `torch.compile`) |
| Batch Size | BF16 (with `torch.compile`) | HIGGS 4bit (without `torch.compile`) | HIGGS 4bit (with `torch.compile`) |
|------------|-----------------------------|----------------------------------|-----------------------------------|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |


**Quantized training**

Currently, HIGGS doesn't support quantized training (and backward passes in general). We're working on adding support for it.
153 changes: 21 additions & 132 deletions docs/source/en/serialization.md
@@ -14,69 +14,41 @@ rendered properly in your Markdown viewer.
-->

# Export to ONNX
# ONNX

Deploying 🤗 Transformers models in production environments often requires, or can benefit from exporting the models into
a serialized format that can be loaded and executed on specialized runtimes and hardware.
[ONNX](http://onnx.ai) is an open standard that defines a common set of operators and a file format to represent deep learning models in different frameworks, including PyTorch and TensorFlow. When a model is exported to ONNX, the operators construct a computational graph (or *intermediate representation*) which represents the flow of data through the model. Standardized operators and data types make it easy to switch between frameworks.

🤗 Optimum is an extension of Transformers that enables exporting models from PyTorch or TensorFlow to serialized formats
such as ONNX and TFLite through its `exporters` module. 🤗 Optimum also provides a set of performance optimization tools to train
and run models on targeted hardware with maximum efficiency.
The [Optimum](https://huggingface.co/docs/optimum/index) library exports a model to ONNX with configuration objects, which are supported for [many architectures](https://huggingface.co/docs/optimum/exporters/onnx/overview) and can be easily extended. If a model isn't supported, feel free to make a [contribution](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) to Optimum.

This guide demonstrates how you can export 🤗 Transformers models to ONNX with 🤗 Optimum, for the guide on exporting models to TFLite,
please refer to the [Export to TFLite page](tflite).
The benefits of exporting to ONNX include the following.

## Export to ONNX
- [Graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization) for improving inference.
- Use the [`~optimum.onnxruntime.ORTModel`] API to run a model with [ONNX Runtime](https://onnxruntime.ai/).
- Use [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines) for ONNX models, as sketched after this list.
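
For example, a minimal sketch of an ONNX Runtime-backed pipeline; the `accelerator="ort"` argument and the model id are assumptions based on the Optimum pipelines guide linked above.

```python
from optimum.pipelines import pipeline

# Build a question-answering pipeline that runs on ONNX Runtime.
onnx_qa = pipeline(
    "question-answering",
    model="distilbert/distilbert-base-uncased-distilled-squad",
    accelerator="ort",
)

print(onnx_qa(question="What am I using?", context="Using DistilBERT with ONNX Runtime!"))
```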

[ONNX (Open Neural Network eXchange)](http://onnx.ai) is an open standard that defines a common set of operators and a
common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and
TensorFlow. When a model is exported to the ONNX format, these operators are used to
construct a computational graph (often called an _intermediate representation_) which
represents the flow of data through the neural network.
Export a Transformers model to ONNX with the Optimum CLI or the `optimum.onnxruntime` module.

By exposing a graph with standardized operators and data types, ONNX makes it easy to
switch between frameworks. For example, a model trained in PyTorch can be exported to
ONNX format and then imported in TensorFlow (and vice versa).
## Optimum CLI

Once exported to ONNX format, a model can be:
- optimized for inference via techniques such as [graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization).
- run with ONNX Runtime via [`ORTModelForXXX` classes](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort),
which follow the same `AutoModel` API as the one you are used to in 🤗 Transformers.
- run with [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines),
which has the same API as the [`pipeline`] function in 🤗 Transformers.

🤗 Optimum provides support for the ONNX export by leveraging configuration objects. These configuration objects come
ready-made for a number of model architectures, and are designed to be easily extendable to other architectures.

For the list of ready-made configurations, please refer to [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/overview).

There are two ways to export a 🤗 Transformers model to ONNX, here we show both:

- export with 🤗 Optimum via CLI.
- export with 🤗 Optimum with `optimum.onnxruntime`.

### Exporting a 🤗 Transformers model to ONNX with CLI

To export a 🤗 Transformers model to ONNX, first install an extra dependency:
Run the command below to install Optimum and the [exporters](https://huggingface.co/docs/optimum/exporters/overview) module.

```bash
pip install optimum[exporters]
```

To check out all available arguments, refer to the [🤗 Optimum docs](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli),
or view help in command line:
> [!TIP]
> Refer to the [Export a model to ONNX with optimum.exporters.onnx](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) guide for all available arguments, or view them with the command below.
> ```bash
> optimum-cli export onnx --help
> ```
```bash
optimum-cli export onnx --help
```

To export a model's checkpoint from the 🤗 Hub, for example, `distilbert/distilbert-base-uncased-distilled-squad`, run the following command:
Set the `--model` argument to export a PyTorch or TensorFlow model from the Hub.
```bash
optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/
```
You should see the logs indicating progress and showing where the resulting `model.onnx` is saved, like this:
You should see logs indicating the progress and showing where the resulting `model.onnx` is saved.

```bash
Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
@@ -90,20 +62,13 @@ Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx
```

The example above illustrates exporting a checkpoint from 🤗 Hub. When exporting a local model, first make sure that you
saved both the model's weights and tokenizer files in the same directory (`local_path`). When using CLI, pass the
`local_path` to the `model` argument instead of the checkpoint name on 🤗 Hub and provide the `--task` argument.
You can review the list of supported tasks in the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/task_manager).
If `task` argument is not provided, it will default to the model architecture without any task specific head.
For local models, make sure the model weights and tokenizer files are saved in the same directory, for example `local_path`. Pass the directory to the `--model` argument and use `--task` to indicate the [task](https://huggingface.co/docs/optimum/exporters/task_manager) a model can perform. If `--task` isn't provided, the model architecture without a task-specific head is used.

```bash
optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/
```
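
If you don't yet have such a directory, the sketch below is one way to create it (the checkpoint is the example used earlier in this guide, and `local_path` matches the command above).

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Save the model weights and tokenizer files into the same local directory.
model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")

model.save_pretrained("local_path")
tokenizer.save_pretrained("local_path")
```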

The resulting `model.onnx` file can then be run on one of the [many
accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the ONNX
standard. For example, we can load and run the model with [ONNX
Runtime](https://onnxruntime.ai/) as follows:
The `model.onnx` file can be deployed with any [accelerator](https://onnx.ai/supported-tools.html#deployModel) that supports ONNX. The example below demonstrates loading and running a model with ONNX Runtime.

```python
>>> from transformers import AutoTokenizer
@@ -115,16 +80,9 @@ Runtime](https://onnxruntime.ai/) as follows:
>>> outputs = model(**inputs)
```
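
A fuller, self-contained sketch of the same idea, loading the exported model through Optimum's [`~optimum.onnxruntime.ORTModelForQuestionAnswering`] class; the class name and directory mirror the export above and should be treated as assumptions.

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering

# Load the exported ONNX model and run a question-answering forward pass.
tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")

inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.start_logits.shape, outputs.end_logits.shape)
```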

The process is identical for TensorFlow checkpoints on the Hub. For instance, here's how you would
export a pure TensorFlow checkpoint from the [Keras organization](https://huggingface.co/keras-io):

```bash
optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/
```
## optimum.onnxruntime

### Exporting a 🤗 Transformers model to ONNX with `optimum.onnxruntime`

Alternative to CLI, you can export a 🤗 Transformers model to ONNX programmatically like so:
The `optimum.onnxruntime` module supports programmatically exporting a Transformers model. Instantiate a [`~optimum.onnxruntime.ORTModel`] for a task and set `export=True`. Use [`~OptimizedModel.save_pretrained`] to save the ONNX model.

```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
@@ -133,78 +91,9 @@ Alternative to CLI, you can export a 🤗 Transformers model to ONNX programmati
>>> model_checkpoint = "distilbert_base_uncased_squad"
>>> save_directory = "onnx/"

>>> # Load a model from transformers and export it to ONNX
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>>> # Save the onnx model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)
```
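
The saved directory can then be reloaded for inference like any other ORTModel; a short follow-up sketch (the input text is arbitrary).

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Reload the exported ONNX model and tokenizer from the save directory.
model = ORTModelForSequenceClassification.from_pretrained("onnx/")
tokenizer = AutoTokenizer.from_pretrained("onnx/")

inputs = tokenizer("ONNX export complete!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)
```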

### Exporting a model for an unsupported architecture

If you wish to contribute by adding support for a model that cannot be currently exported, you should first check if it is
supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview),
and if it is not, [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute)
directly.

### Exporting a model with `transformers.onnx`

<Tip warning={true}>

`transformers.onnx` is no longer maintained, please export models with 🤗 Optimum as described above. This section will be removed in the future versions.

</Tip>

To export a 🤗 Transformers model to ONNX with `transformers.onnx`, install extra dependencies:

```bash
pip install transformers[onnx]
```

Use `transformers.onnx` package as a Python module to export a checkpoint using a ready-made configuration:

```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```

This exports an ONNX graph of the checkpoint defined by the `--model` argument. Pass any checkpoint on the 🤗 Hub or one that's stored locally.
The resulting `model.onnx` file can then be run on one of the many accelerators that support the ONNX standard. For example,
load and run the model with ONNX Runtime as follows:

```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```

The required output names (like `["last_hidden_state"]`) can be obtained by taking a look at the ONNX configuration of
each model. For example, for DistilBERT we have:

```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig

>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```

The process is identical for TensorFlow checkpoints on the Hub. For example, export a pure TensorFlow checkpoint like so:

```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```

To export a model that's stored locally, save the model's weights and tokenizer files in the same directory (e.g. `local-pt-checkpoint`),
then export it to ONNX by pointing the `--model` argument of the `transformers.onnx` package to the desired directory:

```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```
