Merge branch 'main' into del-convert-tokenizer-flag
apaniukov authored Mar 13, 2024
2 parents 40f227d + 358f389 commit d9e3a0f
Showing 38 changed files with 1,297 additions and 272 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test_openvino.yml
@@ -32,7 +32,7 @@ jobs:
python -m pip install --upgrade pip
# install PyTorch CPU version to avoid installing CUDA packages on GitHub runner without GPU
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-pip install .[openvino,openvino-tokenizers,nncf,tests,diffusers]
+pip install .[openvino,openvino-tokenizers,tests,diffusers] onnxruntime
- name: Test with Pytest
run: |
pytest tests/openvino/ --ignore test_modeling_basic
8 changes: 4 additions & 4 deletions README.md
@@ -10,7 +10,7 @@

Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the use of the most popular compression techniques, such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies so that users can easily generate quantized models. Users can apply static, dynamic and quantization-aware training approaches while specifying an expected accuracy criterion. It also supports different weight pruning techniques, enabling the creation of pruned models that meet a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.


## Installation
@@ -20,7 +20,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
| [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
-| [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [OpenVINO](https://docs.openvino.ai) | `pip install --upgrade-strategy eager "optimum[openvino]"` |
| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
@@ -68,11 +68,11 @@ For more details on the supported compression techniques, please refer to the [d

## OpenVINO

-Below are the examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/latest/tmo_introduction.html) framework to accelerate inference.
+Below are examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/compressing-models-during-training.html) framework to accelerate inference.

#### Export:

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI:
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI:

```plain
optimum-cli export openvino --model gpt2 ov_model
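
# A hedged variant (the --weight-format flag is an assumption from the
# optimum-cli export help, not shown in this diff): the same export with
# 8-bit weight compression.
optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
```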
4 changes: 2 additions & 2 deletions docs/source/index.mdx
@@ -21,7 +21,7 @@ limitations under the License.

[Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the use of the most popular compression techniques, such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies so that users can easily generate quantized models. Users can apply static, dynamic and quantization-aware training approaches while specifying an expected accuracy criterion. It also supports different weight pruning techniques, enabling the creation of pruned models that meet a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.

<div class="mt-10">
<div class="w-full flex flex-col space-x-4 md:grid md:grid-cols-2 md:gap-x-5">
@@ -34,4 +34,4 @@ limitations under the License.
<p class="text-gray-700">Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.</p>
</a>
</div>
-</div>
+</div>
18 changes: 11 additions & 7 deletions docs/source/inference.mdx
@@ -13,7 +13,8 @@ Optimum Intel can be used to load optimized models from the [Hugging Face Hub](h

## Transformers models

-You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices).
+You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors
+([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices).
For that, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.
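
A minimal sketch of that swap (the checkpoint is illustrative, not taken from this page):

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
# export=True converts the PyTorch weights to OpenVINO IR on the fly
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("OpenVINO makes inference faster."))
```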

As shown in the table below, each task is associated with a class that automatically loads your model.
@@ -33,7 +34,7 @@ As shown in the table below, each task is associated with a class enabling to au

### Export

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI:
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI:

```bash
optimum-cli export openvino --model gpt2 ov_model
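
# A hedged follow-up (the --weight-format flag is an assumption from the
# optimum-cli export help, not shown in this hunk): export with 8-bit
# weight compression.
optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
```
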
@@ -110,7 +111,7 @@ By default the quantization scheme will be [asymmetric](https://github.com/open

For INT4 quantization you can also specify the following arguments:
* The `--group-size` parameter defines the group size to use for quantization; `-1` results in per-column quantization.
-* The `--ratio` CLI parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
+* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.

Smaller `group_size` and `ratio` values usually improve accuracy at the cost of model size and inference latency; the sketch below shows how these options map onto the Python API.
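
A hedged sketch of the equivalent Python API call (the parameter names follow `OVWeightQuantizationConfig`; the model id is illustrative):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# 4-bit weights, groups of 128 columns, 90% of layers in int4 and 10% in int8
model = OVModelForCausalLM.from_pretrained(
    "gpt2",
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4, group_size=128, ratio=0.9),
)
```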

Expand All @@ -122,8 +123,11 @@ from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
```
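
A natural follow-up (using the standard `save_pretrained` API) is to persist the compressed model so the conversion does not have to be repeated:

```python
model.save_pretrained("ov_model_int8")  # writes the OpenVINO IR files and config
```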

-> **NOTE:** `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+<Tip warning={true}>
+
+`load_in_8bit` is enabled by default for models larger than 1 billion parameters.
+
+</Tip>

To apply quantization on both weights and activations, you can use the `OVQuantizer`; more information can be found in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization).

Expand Down Expand Up @@ -179,7 +183,7 @@ model.reshape(1,128)
model.compile()
```

-To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/nightly/openvino_docs_install_guides_configurations_for_intel_gpu.html) about installing drivers for GPU inference).
+To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).

```python
# Static shapes speed up inference
@@ -468,15 +472,15 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
```


-## Latent Consistency Models
+### Latent Consistency Models


| Task | Auto Class |
|--------------------------------------|--------------------------------------|
| `text-to-image` | `OVLatentConsistencyModelPipeline` |


-### Text-to-Image
+#### Text-to-Image

Here is an example of how you can load a Latent Consistency Model (LCM) from [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) and run inference using OpenVINO:
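
A minimal sketch of such a pipeline (the prompt and generation settings are illustrative):

```python
from optimum.intel import OVLatentConsistencyModelPipeline

pipeline = OVLatentConsistencyModelPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", export=True
)
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt, num_inference_steps=4, guidance_scale=8.0).images[0]
image.save("ship.png")
```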

4 changes: 2 additions & 2 deletions docs/source/installation.mdx
@@ -21,7 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:-----------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------|
| [Intel Neural Compressor (INC)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"`|
-| [Intel OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [Intel OpenVINO](https://docs.openvino.ai) | `pip install --upgrade-strategy eager "optimum[openvino]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -42,4 +42,4 @@ or to install from source including dependencies:
python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
```

-where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
+where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
38 changes: 33 additions & 5 deletions docs/source/optimization_ov.mdx
@@ -38,8 +38,6 @@ save_dir = "ptq_model"
def preprocess_function(examples, tokenizer):
return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

-# Load the default quantization configuration detailing the quantization we wish to apply
-quantization_config = OVConfig()
# Instantiate our OVQuantizer using the desired configuration
quantizer = OVQuantizer.from_pretrained(model)
# Create the calibration dataset used to perform static quantization
@@ -52,7 +50,6 @@ calibration_dataset = quantizer.get_calibration_dataset(
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
-quantization_config=quantization_config,
calibration_dataset=calibration_dataset,
save_directory=save_dir,
)
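
# Hedged follow-up (the task-specific class is an assumption based on the
# tokenized "sentence" field above): the quantized model saved in `save_dir`
# can be reloaded like any other OpenVINO model.
from optimum.intel import OVModelForSequenceClassification
loaded_model = OVModelForSequenceClassification.from_pretrained(save_dir)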
@@ -72,7 +69,28 @@ from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
```

-> **NOTE:** `load_in_8bit` is enabled by default for models larger than 1 billion parameters.
## Hybrid quantization

Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of activations is comparable to weights.
The UNet model takes up most of the overall execution time of the pipeline. Thus, optimizing just one model brings substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial degradation of accuracy.
Therefore, the proposal is to apply quantization in *hybrid mode* for the UNet model and weight-only quantization for the rest of the pipeline components. The hybrid mode involves the quantization of weights in MatMul and Embedding layers, and activations of other layers, facilitating accuracy preservation post-optimization while reducing the model size.
The `quantization_config` defines the optimization parameters for the Stable Diffusion pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. Otherwise, weight-only quantization to a specified data type (8 or 4 bits) is applied to the UNet model.

```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

model = OVStableDiffusionPipeline.from_pretrained(
model_id,
export=True,
quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
)
```

<Tip warning={true}>

`load_in_8bit` is enabled by default for models larger than 1 billion parameters.

</Tip>

For 4-bit weight quantization you can use the `quantization_config` to specify the optimization parameters, for example:

@@ -81,7 +99,17 @@ from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
model_id,
export=True,
quantization_config=OVWeightQuantizationConfig(bits=4),
)
```

You can tune the quantization parameters to achieve a better performance/accuracy trade-off as follows:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
model_id,
quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
)
```
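
Here `sym=False` selects asymmetric quantization, `ratio=0.8` quantizes 80% of the layers to 4 bits and keeps the remaining 20% in 8-bit precision (following the `--ratio` semantics described above), and `dataset="ptb"` supplies calibration samples for data-aware weight compression.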
18 changes: 8 additions & 10 deletions examples/openvino/image-classification/run_image_classification.py
@@ -151,12 +151,12 @@ class ModelArguments:
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
feature_extractor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."})
-use_auth_token: bool = field(
-default=False,
+token: str = field(
+default=None,
metadata={
"help": (
"Will use the token generated when running `huggingface-cli login` (necessary to use this script "
"with private models)."
"The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
"generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
)
},
)
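
For context, a hedged sketch of how such a dataclass field surfaces as a command-line flag through `HfArgumentParser` (the help text is abbreviated from the snippet above):

```python
from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    token: str = field(
        default=None,
        metadata={"help": "HTTP bearer token for remote files."},
    )

parser = HfArgumentParser(ModelArguments)
(model_args,) = parser.parse_args_into_dataclasses()  # exposes a --token flag
print(model_args.token)
```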
@@ -239,8 +239,7 @@ def main():
data_args.dataset_name,
data_args.dataset_config_name,
cache_dir=model_args.cache_dir,
task="image-classification",
use_auth_token=True if model_args.use_auth_token else None,
token=model_args.token,
)
else:
data_files = {}
@@ -252,7 +251,6 @@
"imagefolder",
data_files=data_files,
cache_dir=model_args.cache_dir,
task="image-classification",
)

# If we don't have a validation split, split off a percentage of train as validation.
@@ -287,15 +285,15 @@ def compute_metrics(p):
finetuning_task="image-classification",
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
-use_auth_token=True if model_args.use_auth_token else None,
+token=model_args.token,
)
model = AutoModelForImageClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
-use_auth_token=True if model_args.use_auth_token else None,
+token=model_args.token,
ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
)

@@ -311,7 +309,7 @@ def compute_metrics(p):
model_args.feature_extractor_name or model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
-use_auth_token=True if model_args.use_auth_token else None,
+token=model_args.token,
)

# Define torchvision transforms to be applied to each image.