This document gives step-by-step instructions for running large language models (LLMs) on 4th Gen Intel® Xeon® Scalable Processors (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.
The scripts run_clm.py, run_mlm.py and run_plm.py each provide three quantization approaches (PostTrainingDynamic, PostTrainingStatic, QuantAwareTraining) based on Intel® Neural Compressor and report last-token prediction accuracy through the trainer.
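To make the difference between the post-training approaches concrete, here is a minimal pure-Python sketch of 8-bit affine quantization. It is an illustration under stated assumptions, not the Intel® Neural Compressor implementation: the helper names (`affine_params`, `quantize`, `dequantize`, `dynamic_quantize`) and the toy data are hypothetical. The static path fixes the value range once from calibration data, while the dynamic path recomputes it from each incoming tensor. QuantAwareTraining is not covered here because it needs a training loop.

```python
def affine_params(lo, hi, bits=8):
    """Return (scale, zero_point) mapping [lo, hi] onto unsigned integers."""
    qmax = 2 ** bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # the range must contain zero
    scale = (hi - lo) / qmax or 1.0          # guard against an all-zero range
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, bits=8):
    """Round to the integer grid and clip to the representable range."""
    qmax = 2 ** bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]


def dequantize(q, scale, zero_point):
    """Map integers back to approximate floating-point values."""
    return [(x - zero_point) * scale for x in q]


# Static: the range is fixed once from calibration data and then reused.
calibration = [-1.0, 0.5, 2.0]
static_params = affine_params(min(calibration), max(calibration))


def dynamic_quantize(values):
    """Dynamic: the range is recomputed from the tensor seen at inference."""
    scale, zero_point = affine_params(min(values), max(values))
    return quantize(values, scale, zero_point), scale, zero_point


activations = [-0.2, 0.1, 4.0]               # 4.0 lies outside the calibrated range
static_q = quantize(activations, *static_params)
dynamic_q, d_scale, d_zp = dynamic_quantize(activations)

static_back = dequantize(static_q, *static_params)    # last value saturates near 2.0
dynamic_back = dequantize(dynamic_q, d_scale, d_zp)   # last value stays near 4.0
```

In this toy case the static path clips the out-of-range activation, which is why the choice of calibration data matters for PostTrainingStatic, while the dynamic path pays for its flexibility by computing the range at run time.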
The script run_clm_no_trainer.py supports quantization of GPTJ, OPT, LLaMA, BLOOM, MPT and Falcon and validates last-word prediction accuracy with lm_eval. Support for more models is being added.
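As a rough illustration of what a last-word prediction accuracy measures, the sketch below scores a list of predicted final words against references. It is a simplified stand-in and not the lm_eval implementation; the function name `last_word_accuracy` and the sample words are hypothetical.

```python
def last_word_accuracy(predictions, references):
    """Fraction of examples whose predicted final word matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Toy data: two of the three predicted last words are correct.
predicted = ["door", "smile", "river"]
expected = ["door", "laugh", "river"]
accuracy = last_word_accuracy(predicted, expected)
```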
# Installation
git clone https://github.com/intel/intel-extension-for-transformers.git itrex
cd itrex
pip install -v .
cd examples/huggingface/pytorch/language-modeling/quantization
pip install -r requirements.txt
Here is how to run the scripts:
Causal Language Modeling (CLM)
run_clm_no_trainer.py quantizes large language models using the NeelNanda/pile-10k dataset for calibration and validates accuracy on lambada_openai, piqa, winogrande, hellaswag and other datasets provided by lm_eval. An example command is as follows.
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
# "--peft_model_id" is used to load PEFT weights from peft_model_id
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--quantize \
--sq \
--alpha 1.0 \
--output_dir "saved_results" \
--ipex \
--peft_model_id "peft_model_id"
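The role of `--alpha` can be sketched with the per-channel scaling rule from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The pure-Python example below is only an illustration of that rule on toy matrices, not the code path the script runs; the helper names (`smooth_scales`, `matmul`) and the numbers are hypothetical, and it assumes no channel has an all-zero activation or weight.

```python
def smooth_scales(act_absmax, weight_absmax, alpha):
    """Per-input-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [
        (a ** alpha) / (w ** (1.0 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]


def matmul(x, w):
    """Multiply x (tokens x in_channels) by w (in_channels x out_channels)."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w)] for row in x]


# Toy activations with an outlier in channel 1, and a small weight matrix.
x = [[0.5, 40.0], [-1.0, -60.0]]
w = [[0.2, -0.4], [0.1, 0.3]]

act_absmax = [max(abs(row[j]) for row in x) for j in range(2)]   # per activation column
weight_absmax = [max(abs(v) for v in w[j]) for j in range(2)]    # per weight row

s = smooth_scales(act_absmax, weight_absmax, alpha=0.5)
x_smooth = [[v / s[j] for j, v in enumerate(row)] for row in x]  # X * diag(s)^-1
w_smooth = [[v * s[j] for v in w[j]] for j in range(2)]          # diag(s) * W

reference = matmul(x, w)                 # original product
smoothed = matmul(x_smooth, w_smooth)    # same product, easier-to-quantize operands
```

The product is unchanged, but the activation outlier shrinks from 60 to roughly 4.2. With alpha 1.0, as in the GPT-J command above, the scale equals the activation maximum, so the whole quantization difficulty moves to the weights; with alpha 0.5 it is split evenly between activations and weights.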
# "--approach weight_only" is used to enable weight only quantization.
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--quantize \
--approach weight_only \
--woq_bits 4 \
--woq_group_size 128 \
--woq_scheme asym \
--woq_algo RTN \
--woq_enable_mse_search \
--output_dir "saved_results"
Note: Weight-only quantization based on fake quantization is supported as a preview feature and covers the RTN, GPTQ[1], AWQ[2] and TEQ algorithms. For more details, please refer to link
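As a minimal sketch of what the RTN options in the command above mean (`--woq_bits 4`, `--woq_group_size 128`, `--woq_scheme asym`), the pure-Python example below applies asymmetric round-to-nearest fake quantization to one weight row, group by group. It is an assumption-labeled illustration rather than the Intel® Neural Compressor implementation; the function names and the synthetic weights are hypothetical, and the MSE search enabled by `--woq_enable_mse_search` is not modeled.

```python
def rtn_quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0          # guard against a constant group
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def rtn_fake_quantize(weights, bits=4, group_size=128):
    """Quantize then dequantize a flat weight row, one group at a time."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        q, scale, zero_point = rtn_quantize_group(group, bits)
        out.extend((x - zero_point) * scale for x in q)
    return out


# Synthetic weight row with values in [-0.5, 0.5], split into two groups of 128.
row = [((i * 37) % 101 - 50) / 100 for i in range(256)]
fake_quant = rtn_fake_quantize(row, bits=4, group_size=128)
max_error = max(abs(a - b) for a, b in zip(row, fake_quant))
```

Each group keeps its own scale and zero point, so a smaller group size generally lowers the rounding error at the cost of storing more quantization parameters. The weights stay in floating point after the round trip, which is what "fake quantization" refers to.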
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--woq_algo GPTQ \
--woq_bits 4 \
--quantize \
--pad_max_length 2048 \
--gptq_pad_max_length 2048 \
--gptq_use_max_length \
--approach weight_only \
--output_dir "test_models"
# INT8 accuracy
python run_clm_no_trainer.py \
--model EleutherAI/gpt-j-6B \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" "lambada_standard" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
# "--peft_model_id" is used to load PEFT weights from peft_model_id
python run_clm_no_trainer.py \
--model facebook/opt-2.7b \
--quantize \
--sq \
--alpha 0.5 \
--ipex \
--output_dir "saved_results" \
--int8_bf16_mixed \
--peft_model_id "peft_model_id"
python run_clm_no_trainer.py \
--model facebook/opt-2.7b \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" "lambada_standard" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
Note: LLaMA requires IPEX >= 2.1 for better accuracy; please build and install it from the intel_extension_for_pytorch source.
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
# "--peft_model_id" is used to load PEFT weights from peft_model_id
python run_clm_no_trainer.py \
--model decapoda-research/llama-7b-hf \
--quantize \
--sq \
--alpha 0.8 \
--ipex \
--output_dir "saved_results" \
--int8_bf16_mixed \
--peft_model_id "peft_model_id"
python run_clm_no_trainer.py \
--model decapoda-research/llama-7b-hf \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" "lambada_standard" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
mosaicml/mpt-7b has been updated frequently and has not yet been integrated into transformers, so we pinned commit 68e1a8e0ebb9b30f3c45c1ef6195980f29063ae2 as a local folder to enable it.
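For readers who want a local copy of exactly that revision, one possible approach is sketched below using `snapshot_download` from the `huggingface_hub` package. This is an assumption about how a pinned copy could be obtained, not a description of how this repository vendors the model; the helper name `fetch_pinned_snapshot` and the target folder are hypothetical, and the call downloads the full repository contents, so it is shown only as a commented usage line.

```python
# Repository id and commit hash as quoted in the text above.
MPT_REPO = "mosaicml/mpt-7b"
MPT_COMMIT = "68e1a8e0ebb9b30f3c45c1ef6195980f29063ae2"


def fetch_pinned_snapshot(repo_id, revision, local_dir):
    """Download one fixed revision of a Hub repository into local_dir."""
    # Imported lazily so this sketch can be read or imported without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision=revision, local_dir=local_dir)


# Hypothetical usage (requires network access and enough disk space):
# path = fetch_pinned_snapshot(MPT_REPO, MPT_COMMIT, "./mpt-7b-pinned")
```

Pinning by commit hash rather than by branch name keeps the remote modeling code from changing between runs.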
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
--model mosaicml/mpt-7b-chat \
--quantize \
--sq \
--alpha 0.85 \
--ipex \
--output_dir "saved_results"
python run_clm_no_trainer.py \
--model mosaicml/mpt-7b-chat \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
tiiuae/falcon-7b-instruct has been updated frequently and has not yet been integrated into transformers, so we pinned commit c7f670a03d987254220f343c6b026ea0c5147185 as a local folder to enable it.
# "--sq" is used to enable smooth quant
# "--int8_bf16_mixed" is used to enable int8-bf16 mixed mode for platforms that natively support bf16
python run_clm_no_trainer.py \
--model tiiuae/falcon-7b-instruct \
--quantize \
--sq \
--alpha 0.7 \
--output_dir "saved_results"
python run_clm_no_trainer.py \
--model tiiuae/falcon-7b-instruct \
--accuracy \
--batch_size 112 \
--tasks "lambada_openai" \
--int8 \
--ipex \
--output_dir "saved_results" # load int8 model
# to validate FP32 model, please remove "--int8" and "--output_dir".
To quantize a model with the transformers-based language-modeling example run_clm.py, please use the following command.
Causal Language Modeling (CLM)
python run_clm.py \
--model_name_or_path EleutherAI/gpt-neo-125M \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--tune \
--quantization_approach PostTrainingStatic \
--do_train \
--do_eval \
--output_dir ./tmp/clm_output \
--overwrite_output_dir
Masked Language Modeling (MLM)
python run_mlm.py \
--model_name_or_path bert-base-uncased \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--tune \
--quantization_approach PostTrainingStatic \
--do_train \
--do_eval \
--output_dir ./tmp/mlm_output \
--overwrite_output_dir
Permutation Language Modeling (PLM)
python run_plm.py \
--model_name_or_path xlnet-base-cased \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--tune \
--quantization_approach PostTrainingStatic \
--do_train \
--do_eval \
--output_dir ./tmp/plm_output \
--overwrite_output_dir
[1]. Frantar, Elias, et al. "GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers." arXiv preprint arXiv:2210.17323 (2023).
[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).