This document explains how to build the GPT-NeoX model using TensorRT-LLM and run it on a single GPU or on a single node with multiple GPUs.

The TensorRT-LLM GPT-NeoX implementation can be found in tensorrt_llm/models/gptneox/model.py. The TensorRT-LLM GPT-NeoX example code is located in examples/gptneox. There is one main file:

- convert_checkpoint.py to convert a checkpoint from the HuggingFace (HF) Transformers format to the TensorRT-LLM format.

In addition, there are two shared files in the parent folder examples for inference and evaluation:

- ../run.py to run inference on an input text (see the sketch after the feature list below);
- ../summarize.py to summarize the articles in the cnn_dailymail dataset.
The GPT-NeoX example supports the following features:

- FP16
- INT8 Weight-Only
- INT4 GPTQ
- Tensor Parallel
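Once an engine has been built, the shared ../run.py script can be used for a quick generation test from a prompt. A minimal sketch, assuming the FP16 single-GPU engine built later in this document (the prompt is arbitrary; consult python3 ../run.py --help if the flags differ in your version):

# Generate 50 tokens from a short prompt with the FP16 single-GPU engine
python3 ../run.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
                  --tokenizer_dir gptneox_model \
                  --input_text "GPT-NeoX is a" \
                  --max_output_len 50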
The TensorRT-LLM GPT-NeoX example code is located in examples/gptneox. It takes HF weights as input and builds the corresponding TensorRT engines. The number of TensorRT engines depends on the number of GPUs used to run inference.
# Weights & config
git clone https://huggingface.co/EleutherAI/gpt-neox-20b gptneox_model
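Note that the clone above relies on git-lfs to fetch the large weight files; if only small pointer files show up under gptneox_model, pulling them explicitly is a reasonable extra step (this step is an assumption and not part of the original instructions):

# Fetch the weight files tracked by git-lfs (assumes git-lfs is installed)
cd gptneox_model && git lfs pull && cd ..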
To use INT8 weight-only quantization, simply add the --use_weight_only flag.
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
--dtype float16 \
--output_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/
# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
--dtype float16 \
--tp_size 2 \
--workers 2 \
--output_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/
# Single GPU with int8 weight only
python3 convert_checkpoint.py --model_dir ./gptneox_model \
--dtype float16 \
--use_weight_only \
--output_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/
# With 2-way Tensor Parallel with int8 weight only
python3 convert_checkpoint.py --model_dir ./gptneox_model \
--dtype float16 \
--use_weight_only \
--tp_size 2 \
--workers 2 \
--output_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/
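Each conversion writes a TensorRT-LLM checkpoint directory that should contain a config.json plus one rankN.safetensors file per GPU (the exact file names are an assumption based on the standard TensorRT-LLM checkpoint layout), so a quick listing confirms the conversion succeeded before the engines are built with trtllm-build:

# Expect config.json, rank0.safetensors and rank1.safetensors for the 2-GPU checkpoint
ls ./gptneox/20B/trt_ckpt/fp16/2-gpu/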
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/1-gpu/ \
--gemm_plugin float16 \
--paged_kv_cache disable \
--max_batch_size 8 \
--max_input_len 924 \
--max_output_len 100 \
--output_dir ./gptneox/20B/trt_engines/fp16/1-gpu/
# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/fp16/2-gpu/ \
--gemm_plugin float16 \
--paged_kv_cache disable \
--max_batch_size 8 \
--max_input_len 924 \
--max_output_len 100 \
--workers 2 \
--output_dir ./gptneox/20B/trt_engines/fp16/2-gpu/
# Single GPU with int8 weight only
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/1-gpu/ \
--gemm_plugin float16 \
--paged_kv_cache disable \
--max_batch_size 8 \
--max_input_len 924 \
--max_output_len 100 \
--output_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/
# With 2-way Tensor Parallel with int8 weight only
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int8_wo/2-gpu/ \
--gemm_plugin float16 \
--paged_kv_cache disable \
--max_batch_size 8 \
--max_input_len 924 \
--max_output_len 100 \
--workers 2 \
--output_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/
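Before running the summarization evaluation described next, an engine can be exercised directly with ../run.py; Tensor Parallel engines are launched under mpirun, mirroring the summarization commands below. A minimal sketch for the 2-GPU FP16 engine (arbitrary prompt, flags as in the shared example scripts):

# Quick generation test on the 2-way Tensor Parallel FP16 engine
mpirun -np 2 --oversubscribe --allow-run-as-root \
    python3 ../run.py --engine_dir ./gptneox/20B/trt_engines/fp16/2-gpu/ \
                      --tokenizer_dir gptneox_model \
                      --input_text "GPT-NeoX is a" \
                      --max_output_len 50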
The following section describes how to run a TensorRT-LLM GPT-NeoX model to summarize the articles from the
cnn_dailymail dataset. For each summary, the script can compute the
ROUGE scores and use the ROUGE-1
score to validate the implementation.
The script can also perform the same summarization using the HF GPT-NeoX model.
# Single GPU
python3 ../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/1-gpu/ \
--test_trt_llm \
--hf_model_dir gptneox_model \
--data_type fp16
# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
python3 ../summarize.py --engine_dir ./gptneox/20B/trt_engines/fp16/2-gpu/ \
--test_trt_llm \
--hf_model_dir gptneox_model \
--data_type fp16
# Single GPU with int8 weight only
python3 ../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/1-gpu/ \
--test_trt_llm \
--hf_model_dir gptneox_model \
--data_type fp16
# With 2-way Tensor Parallel with int8 weight only
mpirun -np 2 --oversubscribe --allow-run-as-root \
python3 ../summarize.py --engine_dir ./gptneox/20B/trt_engines/int8_wo/2-gpu/ \
--test_trt_llm \
--hf_model_dir gptneox_model \
--data_type fp16
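To produce the HF baseline mentioned above, the same script can be pointed at the HF checkpoint via its --test_hf switch; a minimal sketch (depending on the script version, an --engine_dir may still need to be supplied):

# Summarize with the HF GPT-NeoX model for comparison against the TensorRT-LLM results
python3 ../summarize.py --test_hf \
                        --hf_model_dir gptneox_model \
                        --data_type fp16

The remaining steps demonstrate the INT4 GPTQ flow, starting from downloading the weights with get_weights.sh.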
# Weights & config
sh get_weights.sh
In this example, the weights are quantized using GPTQ-for-LLaMa. Note that the --act-order parameter, which controls whether to apply the activation-order GPTQ heuristic, is not supported by TensorRT-LLM.
sh gptq_convert.sh
To apply groupwise GPTQ quantization, additional command-line flags need to be passed to convert_checkpoint.py. Here, the --ammo_quant_ckpt_path flag specifies the safetensors file produced by the gptq_convert.sh script:
# Single GPU
python3 convert_checkpoint.py --model_dir ./gptneox_model \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4_gptq \
--ammo_quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
--output_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/
# With 2-way Tensor Parallel
python3 convert_checkpoint.py --model_dir ./gptneox_model \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4_gptq \
--tp_size 2 \
--workers 2 \
--ammo_quant_ckpt_path ./gptneox_model/gptneox-20b-4bit-gs128.safetensors \
--output_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/
The commands to build TensorRT engines from the GPTQ checkpoints are almost unchanged:
# Single GPU
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/1-gpu/ \
--gemm_plugin float16 \
--paged_kv_cache disable \
--max_batch_size 8 \
--max_input_len 924 \
--max_output_len 100 \
--output_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/
# With 2-way Tensor Parallel
trtllm-build --checkpoint_dir ./gptneox/20B/trt_ckpt/int4_gptq/2-gpu/ \
--gemm_plugin float16 \
--paged_kv_cache disable \
--max_batch_size 8 \
--max_input_len 924 \
--max_output_len 100 \
--workers 2 \
--output_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/
The commands to run summarization with the GPTQ-quantized model are likewise unchanged:
# Single GPU
python3 ../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/1-gpu/ \
--test_trt_llm \
--hf_model_dir gptneox_model \
--data_type fp16
# With 2-way Tensor Parallel
mpirun -np 2 --oversubscribe --allow-run-as-root \
python3 ../summarize.py --engine_dir ./gptneox/20B/trt_engines/int4_gptq/2-gpu/ \
--test_trt_llm \
--hf_model_dir gptneox_model \
--data_type fp16