
serialization
stevhliu committed Dec 31, 2024
1 parent 26ab4ec commit 01a9c91
Showing 5 changed files with 90 additions and 181 deletions.
8 changes: 5 additions & 3 deletions docs/source/en/_toctree.yml
@@ -174,6 +174,8 @@
title: GGUF
- local: quantization/gptq
title: GPTQ
- local: quantization/higgs
title: HIGGS
- local: quantization/hqq
title: HQQ
- local: quantization/optimum
@@ -190,11 +192,11 @@
isExpanded: False
sections:
- local: serialization
title: Export to ONNX
title: ONNX
- local: tflite
title: Export to TFLite
title: LiteRT
- local: torchscript
title: Export to TorchScript
title: TorchScript
- title: Resources
isExpanded: False
sections:
62 changes: 38 additions & 24 deletions docs/source/en/quantization/higgs.md
@@ -16,11 +16,30 @@ rendered properly in your Markdown viewer.

# HIGGS

HIGGS is a 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper [arxiv.org/abs/2411.17525](https://arxiv.org/abs/2411.17525).
[HIGGS](https://arxiv.org/abs/2411.17525) is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and state-of-the-art performance.

Runtime support for HIGGS is implemented through [FLUTE](https://arxiv.org/abs/2407.10960), and its [library](https://github.com/HanGuo97/flute).
Runtime support for HIGGS is implemented through the [FLUTE](https://github.com/HanGuo97/flute) library. Only the 8B, 70B, and 405B variants of Llama 3.1 and Llama 3.0, and the 9B and 27B variants of Gemma 2 are currently supported. HIGGS also doesn't currently support quantized training (or backward passes in general).

## Quantization Example
Run the command below for your CUDA version to install FLUTE.

<hfoptions id="install">
<hfoption id="CUDA 12.1">

```bash
pip install flute-kernel
```

</hfoption>
<hfoption id="CUDA 11.8">

```bash
pip install flute-kernel -i https://flute-ai.github.io/whl/cu118
```

</hfoption>
</hfoptions>
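
If you aren't sure which build to pick, the quick check below (an addition, not part of the original install instructions) prints the CUDA version your PyTorch installation was built against.

```python
import torch

# Prints the CUDA version PyTorch was built against, for example "12.1" or "11.8".
# Pick the FLUTE wheel that matches this version.
print(torch.version.cuda)
```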

Create a [`HiggsConfig`] with the number of bits to quantize a model to.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
@@ -30,37 +49,32 @@ model = AutoModelForCausalLM.from_pretrained(
quantization_config=HiggsConfig(bits=4),
device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

tokenizer.decode(model.generate(
**tokenizer("Hi,", return_tensors="pt").to(model.device),
temperature=0.5,
top_p=0.80,
)[0])
```
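
As a quick sanity check, the sketch below (an addition, not part of the original example) reports the quantized model's memory usage with the standard [`~PreTrainedModel.get_memory_footprint`] method; compare it against the same checkpoint loaded in bf16.

```python
from transformers import AutoModelForCausalLM, HiggsConfig

# Reload the checkpoint with 4-bit HIGGS and report how much memory it occupies.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```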

## Pre-quantized models
> [!TIP]
> Find models pre-quantized with HIGGS in the official ISTA-DASLab [collection](https://huggingface.co/collections/ISTA-DASLab/higgs-675308e432fd56b7f6dab94e).
Some pre-quantized models can be found in the [official collection](https://huggingface.co/collections/ISTA-DASLab/higgs-675308e432fd56b7f6dab94e) on Hugging Face Hub.
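
A pre-quantized checkpoint loads like any other Transformers model, since the quantization settings travel with the checkpoint's config; the repo id below is a placeholder, so substitute an actual model from the collection.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- replace it with an actual checkpoint from the HIGGS collection.
repo_id = "ISTA-DASLab/<higgs-quantized-model>"

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```
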
## torch.compile

## Current Limitations
HIGGS is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).

**Architectures**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

Currently, FLUTE, and HIGGS by extension, **only support Llama 3 and 3.0 of 8B, 70B and 405B parameters, as well as Gemma-2 9B and 27B**. We're working on supporting more diverse models, as well as arbitrary models, by modifying the FLUTE compilation procedure.
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2-9b-it",
quantization_config=HiggsConfig(bits=4),
device_map="auto",
)

**torch.compile**
model = torch.compile(model)
```
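
For a rough throughput number comparable to the table below, you can time compiled forward passes directly. This sketch is only an illustration (the prompt and iteration counts are arbitrary) and assumes the `model` from the snippet above plus a CUDA GPU.

```python
import time

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)

with torch.no_grad():
    # Warmup so torch.compile finishes compiling before timing starts.
    for _ in range(10):
        model(**inputs)
    torch.cuda.synchronize()

    n = 50
    start = time.perf_counter()
    for _ in range(n):
        model(**inputs)
    torch.cuda.synchronize()

print(f"{n / (time.perf_counter() - start):.1f} forward passes/sec")
```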

HIGGS is fully compatible with `torch.compile`. Compiling `model.forward`, as described [here](../perf_torch_compile.md), gives the following speedups on an RTX 4090 for `Llama-3.1-8B-Instruct` (forward passes/sec):
Refer to the table below for a benchmark of forward passes/sec for Llama-3.1-8B-Instruct on an RTX 4090.

| Batch Size | BF16 (With `torch.compile`) | HIGGS 4bit (No `torch.compile`) | HIGGS 4bit (With `torch.compile`) |
| Batch Size | BF16 (with `torch.compile`) | HIGGS 4bit (without `torch.compile`) | HIGGS 4bit (with `torch.compile`) |
|------------|-----------------------------|----------------------------------|-----------------------------------|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |


**Quantized training**

Currently, HIGGS doesn't support quantized training (and backward passes in general). We're working on adding support for it.
153 changes: 21 additions & 132 deletions docs/source/en/serialization.md
@@ -14,69 +14,41 @@ rendered properly in your Markdown viewer.
-->

# Export to ONNX
# ONNX

Deploying 🤗 Transformers models in production environments often requires, or can benefit from exporting the models into
a serialized format that can be loaded and executed on specialized runtimes and hardware.
[ONNX](http://onnx.ai) is an open standard that defines a common set of operators and a file format to represent deep learning models in different frameworks, including PyTorch and TensorFlow. When a model is exported to ONNX, the operators construct a computational graph (or *intermediate representation*) which represents the flow of data through the model. Standardized operators and data types make it easy to switch between frameworks.

🤗 Optimum is an extension of Transformers that enables exporting models from PyTorch or TensorFlow to serialized formats
such as ONNX and TFLite through its `exporters` module. 🤗 Optimum also provides a set of performance optimization tools to train
and run models on targeted hardware with maximum efficiency.
The [Optimum](https://huggingface.co/docs/optimum/index) library exports a model to ONNX with configuration objects, which are supported for [many architectures](https://huggingface.co/docs/optimum/exporters/onnx/overview) and can be easily extended. If a model isn't supported, feel free to make a [contribution](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) to Optimum.

This guide demonstrates how you can export 🤗 Transformers models to ONNX with 🤗 Optimum, for the guide on exporting models to TFLite,
please refer to the [Export to TFLite page](tflite).
The benefits of exporting to ONNX include the following.

## Export to ONNX
- [Graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization) for improving inference.
- Use the [`~optimum.onnxruntime.ORTModel`] API to run a model with [ONNX Runtime](https://onnxruntime.ai/).
- Use [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines) for ONNX models, as sketched after this list.
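
For example, a minimal sketch of an ONNX Runtime-backed pipeline; the `accelerator="ort"` argument and the model id are assumptions based on the Optimum pipelines guide linked above.

```python
from optimum.pipelines import pipeline

# Build a question-answering pipeline that runs on ONNX Runtime.
onnx_qa = pipeline(
    "question-answering",
    model="distilbert/distilbert-base-uncased-distilled-squad",
    accelerator="ort",
)

print(onnx_qa(question="What am I using?", context="Using DistilBERT with ONNX Runtime!"))
```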

[ONNX (Open Neural Network eXchange)](http://onnx.ai) is an open standard that defines a common set of operators and a
common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and
TensorFlow. When a model is exported to the ONNX format, these operators are used to
construct a computational graph (often called an _intermediate representation_) which
represents the flow of data through the neural network.
Export a Transformers model to ONNX with the Optimum CLI or the `optimum.onnxruntime` module.

By exposing a graph with standardized operators and data types, ONNX makes it easy to
switch between frameworks. For example, a model trained in PyTorch can be exported to
ONNX format and then imported in TensorFlow (and vice versa).
## Optimum CLI

Once exported to ONNX format, a model can be:
- optimized for inference via techniques such as [graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization).
- run with ONNX Runtime via [`ORTModelForXXX` classes](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort),
which follow the same `AutoModel` API as the one you are used to in 🤗 Transformers.
- run with [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines),
which has the same API as the [`pipeline`] function in 🤗 Transformers.

🤗 Optimum provides support for the ONNX export by leveraging configuration objects. These configuration objects come
ready-made for a number of model architectures, and are designed to be easily extendable to other architectures.

For the list of ready-made configurations, please refer to [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/overview).

There are two ways to export a 🤗 Transformers model to ONNX, here we show both:

- export with 🤗 Optimum via CLI.
- export with 🤗 Optimum with `optimum.onnxruntime`.

### Exporting a 🤗 Transformers model to ONNX with CLI

To export a 🤗 Transformers model to ONNX, first install an extra dependency:
Run the command below to install Optimum and the [exporters](https://huggingface.co/docs/optimum/exporters/overview) module.

```bash
pip install optimum[exporters]
```

To check out all available arguments, refer to the [🤗 Optimum docs](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli),
or view help in command line:
> [!TIP]
> Refer to the [Export a model to ONNX with optimum.exporters.onnx](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) guide for all available arguments, or view them with the command below.
> ```bash
> optimum-cli export onnx --help
> ```
```bash
optimum-cli export onnx --help
```

To export a model's checkpoint from the 🤗 Hub, for example, `distilbert/distilbert-base-uncased-distilled-squad`, run the following command:
Set the `--model` argument to export a PyTorch or TensorFlow model from the Hub.
```bash
optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/
```
You should see the logs indicating progress and showing where the resulting `model.onnx` is saved, like this:
You should see logs indicating the progress and showing where the resulting `model.onnx` is saved.

```bash
Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
@@ -90,20 +62,13 @@ Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx
```

The example above illustrates exporting a checkpoint from 🤗 Hub. When exporting a local model, first make sure that you
saved both the model's weights and tokenizer files in the same directory (`local_path`). When using CLI, pass the
`local_path` to the `model` argument instead of the checkpoint name on 🤗 Hub and provide the `--task` argument.
You can review the list of supported tasks in the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/task_manager).
If `task` argument is not provided, it will default to the model architecture without any task specific head.
For local models, make sure the model weights and tokenizer files are saved in the same directory, for example `local_path`. Pass the directory to the `--model` argument and use `--task` to indicate the [task](https://huggingface.co/docs/optimum/exporters/task_manager) a model can perform. If `--task` isn't provided, the model architecture without a task-specific head is used.

```bash
optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/
```
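
If you don't yet have such a directory, the sketch below is one way to create it (the checkpoint is the example used earlier in this guide, and `local_path` matches the command above).

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Save the model weights and tokenizer files into the same local directory.
model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-distilled-squad")

model.save_pretrained("local_path")
tokenizer.save_pretrained("local_path")
```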

The resulting `model.onnx` file can then be run on one of the [many
accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the ONNX
standard. For example, we can load and run the model with [ONNX
Runtime](https://onnxruntime.ai/) as follows:
The `model.onnx` file can be deployed with any [accelerator](https://onnx.ai/supported-tools.html#deployModel) that supports ONNX. The example below demonstrates loading and running a model with ONNX Runtime.

```python
>>> from transformers import AutoTokenizer
@@ -115,16 +80,9 @@ Runtime](https://onnxruntime.ai/) as follows:
>>> outputs = model(**inputs)
```
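
A fuller, self-contained sketch of the same idea, loading the exported model through Optimum's [`~optimum.onnxruntime.ORTModelForQuestionAnswering`] class; the class name and directory mirror the export above and should be treated as assumptions.

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering

# Load the exported ONNX model and run a question-answering forward pass.
tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")

inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.start_logits.shape, outputs.end_logits.shape)
```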

The process is identical for TensorFlow checkpoints on the Hub. For instance, here's how you would
export a pure TensorFlow checkpoint from the [Keras organization](https://huggingface.co/keras-io):

```bash
optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/
```
## optimum.onnxruntime

### Exporting a 🤗 Transformers model to ONNX with `optimum.onnxruntime`

Alternative to CLI, you can export a 🤗 Transformers model to ONNX programmatically like so:
The `optimum.onnxruntime` module supports programmatically exporting a Transformers model. Instantiate a [`~optimum.onnxruntime.ORTModel`] for a task and set `export=True`. Use [`~OptimizedModel.save_pretrained`] to save the ONNX model.

```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
@@ -133,78 +91,9 @@ Alternative to CLI, you can export a 🤗 Transformers model to ONNX programmati
>>> model_checkpoint = "distilbert_base_uncased_squad"
>>> save_directory = "onnx/"

>>> # Load a model from transformers and export it to ONNX
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>>> # Save the onnx model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)
```
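
The saved directory can then be reloaded for inference like any other ORTModel; a short follow-up sketch (the input text is arbitrary).

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Reload the exported ONNX model and tokenizer from the save directory.
model = ORTModelForSequenceClassification.from_pretrained("onnx/")
tokenizer = AutoTokenizer.from_pretrained("onnx/")

inputs = tokenizer("ONNX export complete!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)
```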

### Exporting a model for an unsupported architecture

If you wish to contribute by adding support for a model that cannot be currently exported, you should first check if it is
supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview),
and if it is not, [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute)
directly.

### Exporting a model with `transformers.onnx`

<Tip warning={true}>

`transformers.onnx` is no longer maintained, please export models with 🤗 Optimum as described above. This section will be removed in the future versions.

</Tip>

To export a 🤗 Transformers model to ONNX with `transformers.onnx`, install extra dependencies:

```bash
pip install transformers[onnx]
```

Use `transformers.onnx` package as a Python module to export a checkpoint using a ready-made configuration:

```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```

This exports an ONNX graph of the checkpoint defined by the `--model` argument. Pass any checkpoint on the 🤗 Hub or one that's stored locally.
The resulting `model.onnx` file can then be run on one of the many accelerators that support the ONNX standard. For example,
load and run the model with ONNX Runtime as follows:

```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```

The required output names (like `["last_hidden_state"]`) can be obtained by taking a look at the ONNX configuration of
each model. For example, for DistilBERT we have:

```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig

>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```

The process is identical for TensorFlow checkpoints on the Hub. For example, export a pure TensorFlow checkpoint like so:

```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```

To export a model that's stored locally, save the model's weights and tokenizer files in the same directory (e.g. `local-pt-checkpoint`),
then export it to ONNX by pointing the `--model` argument of the `transformers.onnx` package to the desired directory:

```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```
