From 9d487ff8c511abc8ab6aa4965e5c70f406b1f2d6 Mon Sep 17 00:00:00 2001
From: "Sun, Jiwei1"
Date: Mon, 18 Nov 2024 02:03:23 +0000
Subject: [PATCH] Update Llama-3.1-8B benchmarks for Intel GPU

---
 torchao/quantization/README.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/torchao/quantization/README.md b/torchao/quantization/README.md
index f4699c0a21..ae2f4772b8 100644
--- a/torchao/quantization/README.md
+++ b/torchao/quantization/README.md
@@ -27,8 +27,10 @@ Benchmarks and evaluation are run on a machine with a single NVIDIA-A100-80GB GP
 |              | int8dq                 | 12.262              |  9.87         |   65.35                 | 14.60            |  6.62           |
 |              | int8wo                 | 12.204              | 66.24         |  438.61                 | 14.60            |  6.62
+
 Benchmarks and evaluation for model meta-llama/Meta-Llama-3.1-8B are run on a machine with a single NVIDIA-H100 GPU using the scripts for [generation](../_models/llama/generate.py) and [eval](../_models/llama/eval.py). Evaluation was done using the lm_eval library for tasks/data.
+### CUDA backend
 | Model        | Technique              | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
 | ------------ | ---------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
 | Llama-3.1-8B | Base (bfloat16)        | 7.54                | 126.90        | 1904.75                 | 16.75            | 15.01           |
@@ -37,7 +39,12 @@ Benchmarks and evaluation for model meta-llama/Meta-Llama-3.1-8B are run on a ma
 |              | float8wo               | 7.60                | 178.46        | 1339.93                 | 12.09            |  7.51           |
 |              | float8dq (PerTensor)   | 7.62                | 116.40        |  873.58                 | 11.14            |  7.51           |
 |              | float8dq (Per Row)     | 7.61                | 154.63        | 1161.47                 | 11.14            |  7.51           |
-
+### XPU backend (Intel MAX 1100)
+| Model        | Technique              | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
+| ------------ | ---------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
+| Llama-3.1-8B | Base (bfloat16)        | 7.441               | 40.36         |  605.77                 | 16.35            | 15.01           |
+|              | int8dq                 | 7.581               | 13.60         |  102.28                 | 18.69            |  7.52           |
+|              | int8wo                 | 7.447               | 59.49         |  447.27                 | 18.60            |  7.52           |
 note: Int8 dynamic quantization works best on compute bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast) whereas Llama with batchsize=1 tends to be memory bound, thus the rather low performance.
 
 For int4 we make heavy use of [tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) of `torch.ops.aten._weight_int4pack_mm` to bitpack into a layout optimized for tensor cores
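
For context on the techniques named in these tables, below is a minimal sketch of how they might be applied through torchao's `quantize_` API. The toy model, the device-selection logic, and the `torch.compile` call are illustrative assumptions; the numbers in the tables come from the actual benchmark scripts (`../_models/llama/generate.py` and `../_models/llama/eval.py`), not from this snippet.

```python
# Minimal sketch, assuming torchao's quantize_ API; not the benchmark script itself.
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    int8_weight_only,                     # the "int8wo" rows above
    int8_dynamic_activation_int8_weight,  # the "int8dq" rows above
)

# Toy stand-in for meta-llama/Meta-Llama-3.1-8B, which the tables benchmark in bfloat16.
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)
).to(torch.bfloat16)

# "xpu" corresponds to the Intel MAX 1100 rows, "cuda" to the H100 rows.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    device = "xpu"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
model = model.to(device)

# int8 weight-only quantization ("int8wo" in the tables); swap in
# int8_dynamic_activation_int8_weight() to match the "int8dq" rows instead.
quantize_(model, int8_weight_only())

# Benchmark runs typically compile the model before measuring tokens/sec.
model = torch.compile(model)
```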