diff --git a/docs/source/inference.mdx b/docs/source/inference.mdx
index a9ee5529da..905e0aa4dd 100644
--- a/docs/source/inference.mdx
+++ b/docs/source/inference.mdx
@@ -110,7 +110,7 @@ By default the quantization scheme will be [assymmetric](https://github.com/open
 
 For INT4 quantization you can also specify the following arguments :
 * The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization.
-* The `--ratio` CLI parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
+* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
 
 Smaller `group_size` and `ratio` of usually improve accuracy at the sacrifice of the model size and inference latency.
 
@@ -122,8 +122,11 @@ from optimum.intel import OVModelForCausalLM
 
 model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
 ```
 
-> **NOTE:** `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+
+`load_in_8bit` is enabled by default for models larger than 1 billion parameters.
+
+
 
 To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization).
 
diff --git a/docs/source/optimization_ov.mdx b/docs/source/optimization_ov.mdx
index 5686af4bf3..51067b0b64 100644
--- a/docs/source/optimization_ov.mdx
+++ b/docs/source/optimization_ov.mdx
@@ -69,7 +69,11 @@ from optimum.intel import OVModelForCausalLM
 model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
 ```
 
-> **NOTE:** `load_in_8bit` is enabled by default for models larger than 1 billion parameters.
+
+
+`load_in_8bit` is enabled by default for models larger than 1 billion parameters.
+
+
 
 For the 4-bit weight quantization you can use the `quantization_config` to specify the optimization parameters, for example:
 
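As a rough illustration of how the documented weight-only quantization options combine in practice, here is a minimal Python sketch. It assumes the `OVWeightQuantizationConfig` class and the `quantization_config` keyword referred to in the docs touched by this diff, so treat the exact names, arguments, and defaults as assumptions to verify against the released `optimum-intel` API rather than as part of the change itself.

```python
# Illustrative sketch only: OVWeightQuantizationConfig and its bits/group_size/ratio
# arguments are assumptions based on the surrounding docs, not part of this diff.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "gpt2"  # placeholder checkpoint

# 8-bit weight-only quantization; per the note above, this is also the default
# behaviour for models larger than 1 billion parameters.
model_int8 = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)

# 4-bit weight-only quantization, mirroring the CLI's --group-size / --ratio flags:
# roughly 80% of the layers end up in int4 and the rest in int8, with groups of 128.
q_config = OVWeightQuantizationConfig(bits=4, group_size=128, ratio=0.8)
model_int4 = OVModelForCausalLM.from_pretrained(model_id, quantization_config=q_config)
```

The `group_size` and `ratio` arguments here are intended to play the same role as the `--group-size` and `--ratio` CLI parameters described in the inference docs above.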