Update Llama-3.1-8B for Intel GPU
sunjiweiswift committed Nov 18, 2024
1 parent cc43897 commit 9d487ff
Showing 1 changed file with 8 additions and 1 deletion.
torchao/quantization/README.md (8 additions, 1 deletion)
@@ -27,8 +27,10 @@ Benchmarks and evaluation are run on a machine with a single NVIDIA-A100-80GB GPU
| | int8dq | 12.262 | 9.87 | 65.35 | 14.60 | 6.62 |
| | int8wo | 12.204 | 66.24 | 438.61 | 14.60 | 6.62 |


Benchmarks and evaluation for model meta-llama/Meta-Llama-3.1-8B are run on a machine with a single NVIDIA-H100 GPU using the scripts for [generation](../_models/llama/generate.py) and [eval](../_models/llama/eval.py). Evaluation was done using the lm_eval library for tasks/data.
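As a rough sketch of where these techniques plug in, a minimal example using torchao's `quantize_` API with a Hugging Face checkpoint (the benchmark scripts use the repo's own Llama loader, and exact API names may vary across torchao versions):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only
from transformers import AutoModelForCausalLM  # assumption: HF checkpoint, not the repo's loader

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16
).to("cuda")

# Swap each nn.Linear's weight for an int8 weight-only quantized tensor, in place.
quantize_(model, int8_weight_only())

# The throughput numbers in these tables are measured with torch.compile enabled.
model = torch.compile(model, mode="max-autotune")
```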

### CUDA backend
| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ----------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
| Llama-3.1-8B | Base (bfloat16) | 7.54 | 126.90 | 1904.75 | 16.75 | 15.01 |
@@ -37,7 +39,12 @@ Benchmarks and evaluation for model meta-llama/Meta-Llama-3.1-8B are run on a machine
| | float8wo | 7.60 | 178.46 | 1339.93 | 12.09 | 7.51 |
| | float8dq (PerTensor) | 7.62 | 116.40 | 873.58 | 11.14 | 7.51 |
| | float8dq (Per Row) | 7.61 | 154.63 | 1161.47 | 11.14 | 7.51 |

### XPU backend (Intel MAX 1100)
| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ----------------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
| Llama-3.1-8B | Base (bfloat16) | 7.441 | 40.36 | 605.77 | 16.35 | 15.01 |
| | int8dq | 7.581 | 13.60 | 102.28 | 18.69 | 7.52 |
| | int8wo | 7.447 | 59.49 | 447.27 | 18.60 | 7.52 |

Note: int8 dynamic quantization works best on compute-bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast), whereas Llama at batch size 1 tends to be memory bound, hence the rather low performance.
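For contrast with the weight-only example above, a minimal sketch of applying int8 dynamic quantization through the same `quantize_` API (the config name is taken from torchao's quantization module):

```python
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# Dynamic quantization quantizes activations on the fly at runtime; that
# per-batch overhead only pays off when compute, not memory bandwidth,
# is the bottleneck.
quantize_(model, int8_dynamic_activation_int8_weight())
```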

For int4 we make heavy use of [tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) via `torch.ops.aten._weight_int4pack_mm`, which bit-packs weights into a layout optimized for tensor cores.
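A sketch of triggering that path through the same high-level API (assuming `int4_weight_only` and its `group_size` parameter as exposed by torchao; the default group size may differ across versions):

```python
from torchao.quantization import quantize_, int4_weight_only

# Weights are bit-packed into the tensor-core-friendly layout consumed by
# torch.ops.aten._weight_int4pack_mm; group_size trades accuracy for speed
# (smaller groups mean finer-grained scales). Expects a bfloat16 model.
quantize_(model, int4_weight_only(group_size=64))
```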