Fix documentation (#583)
* Fix documentation

* fix
echarlaix authored Mar 4, 2024
1 parent 77365f4 commit 5e319aa
Showing 2 changed files with 10 additions and 3 deletions.
7 changes: 5 additions & 2 deletions docs/source/inference.mdx
@@ -110,7 +110,7 @@ By default the quantization scheme will be [asymmetric](https://github.com/open

For INT4 quantization you can also specify the following arguments:
* The `--group-size` parameter will define the group size to use for quantization, `-1` will result in per-column quantization.
* The `--ratio` CLI parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.

Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of model size and inference latency.
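
As a rough sketch of how these two knobs map onto the Python API (a minimal example, assuming the `OVWeightQuantizationConfig` class and a hypothetical model ID; the exact signature may differ between releases):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Hypothetical model ID, used purely for illustration
model_id = "meta-llama/Llama-2-7b-chat-hf"

# bits=4 with groups of 128 weights sharing a quantization scale; ratio=0.9 keeps
# 90% of the layers in int4 and the remaining 10% in int8. Smaller group_size and
# ratio values trade model size and latency for accuracy, as noted above.
# With group_size=-1 the scales would instead be computed per column.
quantization_config = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=0.9)

model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,  # convert the checkpoint to OpenVINO IR on the fly
    quantization_config=quantization_config,
)
```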

@@ -122,8 +122,11 @@ from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
```

> **NOTE:** `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
<Tip warning={true}>

`load_in_8bit` is enabled by default for models larger than 1 billion parameters.

</Tip>

To apply quantization on both weights and activations, you can use the `OVQuantizer`; more information can be found in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization).
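
A minimal sketch of that flow, assuming a hypothetical sequence-classification checkpoint, hypothetical dataset choices, and the `get_calibration_dataset` helper; refer to the linked documentation for the exact API:

```python
from functools import partial

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVQuantizer

# Hypothetical model ID, used only to illustrate the flow
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

quantizer = OVQuantizer.from_pretrained(model)
# Static quantization needs a small calibration set to collect activation statistics
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
# Quantizes both weights and activations and saves the resulting OpenVINO model
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="ov_model_int8")
```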

6 changes: 5 additions & 1 deletion docs/source/optimization_ov.mdx
@@ -69,7 +69,11 @@ from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
```

> **NOTE:** `load_in_8bit` is enabled by default for models larger than 1 billion parameters.
<Tip warning={true}>

`load_in_8bit` is enabled by default for models larger than 1 billion parameters.

</Tip>

For 4-bit weight quantization, you can use the `quantization_config` argument to specify the optimization parameters, for example:
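
A hedged sketch of what such a configuration could look like (again assuming the `OVWeightQuantizationConfig` class, a hypothetical model ID, and a hypothetical output directory; the original example is cut off by the diff):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# sym=False requests asymmetric int4 weights; group_size and ratio control the
# accuracy / size trade-off described in the inference documentation
quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, group_size=64, ratio=0.8)

model = OVModelForCausalLM.from_pretrained(
    "gpt2",  # hypothetical model ID
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("ov_model_int4")
```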

