
afshinebtia/Forked_GPU-Benchmarks-on-LLM-Inference

 
 


GPU-Benchmarks-on-LLM-Inference

Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference? 🧐

Description

Use llama.cpp to test the inference speed of LLaMA models on different GPUs on RunPod, and on a 16-inch M1 Max MacBook Pro, an M2 Ultra Mac Studio, a 14-inch M3 MacBook Pro, and a 16-inch M3 Max MacBook Pro.

Overview

Average eval speed (tokens/s) by GPUs on LLaMA 2. A higher eval speed is better.

| GPU | 7B_q4_0 | 7B_f16 | 70B_q4_0 | 70B_f16 |
|---|---|---|---|---|
| 4080 16GB | 118.01 | 44.2 | OOM | OOM |
| RTX 4000 Ada 20GB | 64.53 | 22.96 | OOM | OOM |
| 3090 24GB | 120.6 | 51.85 | OOM | OOM |
| 4090 24GB | 149.37 | 60.78 | OOM | OOM |
| RTX 5000 Ada 32GB | 99.19 | 36.31 | OOM | OOM |
| RTX 4000 Ada 20GB * 2 | 60.68 | 31.83 | OOM | OOM |
| 3090 24GB * 2 | 53.25 | 38.73 | 14.28 | OOM |
| 4090 24GB * 2 | 66.26 | 49.2 | 18.5 | OOM |
| RTX A6000 48GB | 111.57 | 44.01 | 15.18 | OOM |
| RTX 6000 Ada 48GB | 127.45 | 50.56 | 17.02 | OOM |
| 3090 24GB * 3 | 46.14 | 35.83 | 13.64 | OOM |
| 4090 24GB * 3 | 62.14 | 50.39 | 14.92 | OOM |
| RTX 4000 Ada 20GB * 4 | 55.41 | 40.52 | 15.48 | OOM |
| A100 80GB | 136.24 | 73.83 | 25.68 | OOM |
| H100 PCIe 80GB | 133.77 | 83.32 | 25.88 | OOM |
| RTX A6000 48GB * 2 | 74.43 | 47.08 | 17.31 | OOM |
| 3090 24GB * 6 | 27.71 | 27.37 | 10.15 | 7.54 |
| 4090 24GB * 6 | 38.19 | 37.5 | 15.59 | 10.24 |
| M1 Max 24-Core GPU 32GB | 48.65 | 13.83 | OOM | OOM |
| M2 Ultra 76-Core GPU 192GB | 91.89 | 40.68 | 14.37 | 4.81 |
| M3 10-Core GPU 16GB | 20.79 | 3.61 (CPU) | OOM | OOM |
| M3 Max 40-Core GPU 48GB | 63.11 | 24.76 | 3.11 (CPU) | OOM |

OOM means out of memory. Values marked (CPU) were measured with CPU-only inference, as noted in the Benchmarks section.

Average prompt eval speed (tokens/s) by GPUs on LLaMA 2.

| GPU | 7B_q4_0 | 7B_f16 | 70B_q4_0 | 70B_f16 |
|---|---|---|---|---|
| 4080 16GB | 3774.79 | 5158.19 | OOM | OOM |
| RTX 4000 Ada 20GB | 1819.86 | 2369.09 | OOM | OOM |
| 3090 24GB | 2601.81 | 3075.37 | OOM | OOM |
| 4090 24GB | 5531.19 | 7348.69 | OOM | OOM |
| RTX 5000 Ada 32GB | 3525.7 | 4936.59 | OOM | OOM |
| RTX 4000 Ada 20GB * 2 | 522.31 | 499.6 | OOM | OOM |
| 3090 24GB * 2 | 252.15 | 247.31 | 48.46 | OOM |
| 4090 24GB * 2 | 1711.03 | 1729.35 | 256.97 | OOM |
| RTX A6000 48GB | 2979.12 | 3707.88 | 405.97 | OOM |
| RTX 6000 Ada 48GB | 3911.18 | 5403.6 | 480.93 | OOM |
| 3090 24GB * 3 | 236.49 | 231.55 | 41.27 | OOM |
| 4090 24GB * 3 | 996.25 | 1006.19 | 178.36 | OOM |
| RTX 4000 Ada 20GB * 4 | 280.49 | 293.29 | 46.76 | OOM |
| A100 80GB | 3443.98 | 5105.77 | 612.41 | OOM |
| H100 PCIe 80GB | 4868.59 | 7367 | 855.73 | OOM |
| RTX A6000 48GB * 2 | 1309.83 | 1145.15 | 198.24 | OOM |
| 3090 24GB * 6 | 89.74 | 87.62 | 14.8 | 14.8 |
| 4090 24GB * 6 | 437.77 | 425.87 | 88.47 | 85.1 |
| M1 Max 24-Core GPU 32GB | 199.65 | 213.14 | OOM | OOM |
| M2 Ultra 76-Core GPU 192GB | 1217.03 | 1379.54 | 133.18 | 150.98 |
| M3 10-Core GPU 16GB | 184.51 | 25.21 (CPU) | OOM | OOM |
| M3 Max 40-Core GPU 48GB | 740.97 | 761.4 | 8.68 (CPU) | OOM |

OOM means out of memory. Values marked (CPU) were measured with CPU-only inference, as noted in the Benchmarks section.

Model

Thanks to shawwn for LLaMA model weights (7B, 13B, 30B, 65B): llama-dl. Access LLaMA 2 from Meta AI.

Usage

  • For NVIDIA GPUs:

    Building with cuBLAS provides BLAS acceleration using the CUDA cores of your NVIDIA GPU. Multiple GPUs work fine with no CPU bottleneck. Pass -ngl 10000 to make sure all layers are offloaded to the GPU. (Thanks to: ggerganov/llama.cpp#1827)

    make clean && LLAMA_CUBLAS=1 make -j

    A longer prompt yields a higher prompt processing speed (tokens/s). The test prompt here is around 500 tokens. Test arguments:

    !./main --color --no-mmap -ngl 10000 --temp 1.1 --repeat_penalty 1.1 -n 1024 --ignore-eos -m ./models/7B/ggml-model-q4_0.gguf  -p "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. <0x0A>\
    There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever. <0x0A>\
    It was the year of Our Lord one thousand seven hundred and seventy-five. Spiritual revelations were conceded to England at that favoured period, as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, of whom a prophetic private in the Life Guards had heralded the sublime appearance by announcing that arrangements were made for the swallowing up of London and Westminster. Even the Cock-lane ghost had been laid only a round dozen of years, after rapping out its messages, as the spirits of this very year last past (supernaturally deficient in originality) rapped out theirs. Mere messages in the earthly order of events had lately come to the English Crown and People, from a congress of British subjects in America: which, strange to relate, have proved more important to the human race than any communications yet received through any of the chickens of the Cock-lane brood. <0x0A>\
    France, less favoured on the whole as to matters spiritual than her sister of the shield and trident, rolled with exceeding smoothness down hill, making paper money and spending it."
  • For Apple Silicon:

    Using Metal allows the computation to be executed on the GPU for Apple devices:

    make clean && LLAMA_METAL=1 make -j

    The arguments are the same as for NVIDIA GPUs, except that -ngl 1 is enough to ensure all layers are offloaded to the GPU. Test arguments:

    !./main --color --no-mmap -ngl 1 --temp 1.1 --repeat_penalty 1.1 -n 1024 --ignore-eos -m ./models/7B/ggml-model-q4_0.gguf  -p "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. <0x0A>\
    There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever. <0x0A>\
    It was the year of Our Lord one thousand seven hundred and seventy-five. Spiritual revelations were conceded to England at that favoured period, as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, of whom a prophetic private in the Life Guards had heralded the sublime appearance by announcing that arrangements were made for the swallowing up of London and Westminster. Even the Cock-lane ghost had been laid only a round dozen of years, after rapping out its messages, as the spirits of this very year last past (supernaturally deficient in originality) rapped out theirs. Mere messages in the earthly order of events had lately come to the English Crown and People, from a congress of British subjects in America: which, strange to relate, have proved more important to the human race than any communications yet received through any of the chickens of the Cock-lane brood. <0x0A>\
    France, less favoured on the whole as to matters spiritual than her sister of the shield and trident, rolled with exceeding smoothness down hill, making paper money and spending it."

    Check recommendedMaxWorkingSetSize in the output to see how much memory can be allocated to the GPU without losing performance. On the 32GB M1 Max, only 65% of unified memory can be allocated to the GPU; on machines with more memory, we expect about 75% to be usable by the GPU. (Source: https://developer.apple.com/videos/play/tech-talks/10580/?time=346) To use the whole memory, pass -ngl 0 to run inference on the CPU only. (Thanks to: ggerganov/llama.cpp#1826)
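As a rough illustration of the memory rule above, the sketch below estimates the GPU budget from the unified memory size and checks whether a model's weights fit. The 65% and 75% fractions come from the text. Applying 65% up to 32 GB and 75% above that is an assumption, as are the function names, and the check ignores the KV cache and other runtime overhead. It is not how llama.cpp itself decides anything.

```python
# Sketch only. Fractions are from the text above; the 32 GB cut-off between
# them and the function names are assumptions made for this example.

def gpu_budget_gb(unified_gb: float) -> float:
    """Estimate how much unified memory the GPU can use."""
    fraction = 0.75 if unified_gb > 32 else 0.65
    return unified_gb * fraction


def fits_on_gpu(model_gb: float, unified_gb: float) -> bool:
    """Compare raw model size to the budget; ignores KV cache and overhead."""
    return model_gb <= gpu_budget_gb(unified_gb)


# Model sizes are taken from the Apple Silicon table in "Total VRAM Requirements".
print(gpu_budget_gb(32))        # budget on the M1 Max 32GB
print(fits_on_gpu(24.71, 32))   # 13B f16 on the M1 Max 32GB
print(fits_on_gpu(36.51, 48))   # 70B q4_0 on the M3 Max 48GB
```

Both checks come out false, which is consistent with the LLaMA 2 benchmark tables, where 13B_f16 on the M1 Max and 70B_q4_0 on the M3 Max are marked (CPU).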

Total VRAM Requirements

NVIDIA GPUs (snapshots in Dec 2023)

| Model | Quantized size (4-bit) | Original size (f16) |
|---|---|---|
| 7B | 3.82 GB | 12.63 GB |
| 13B | 7.24 GB | 24.41 GB |
| 30B | 17.84 GB | 63.7 GB |
| 65B | 35.5 GB | 122.48 GB |
| 70B | 36.37 GB | 128.3 GB |
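To relate these sizes to the multi-GPU configurations in the benchmarks, here is a minimal sketch that computes a lower bound on the number of cards needed to hold a model's weights. The sizes are copied from the NVIDIA table above. The dictionary, the function name, and the idea of dividing by per-card VRAM are assumptions for illustration. Real runs also need room for the KV cache and runtime overhead, so the true requirement can be higher.

```python
import math

# Model sizes in GB, copied from the NVIDIA table above.
MODEL_SIZE_GB = {
    "7B_q4_0": 3.82, "7B_f16": 12.63,
    "13B_q4_0": 7.24, "13B_f16": 24.41,
    "30B_q4_0": 17.84, "30B_f16": 63.7,
    "65B_q4_0": 35.5, "65B_f16": 122.48,
    "70B_q4_0": 36.37, "70B_f16": 128.3,
}


def min_gpus(model: str, vram_per_gpu_gb: float) -> int:
    """Lower bound on GPU count: weights only, no KV cache or overhead."""
    return math.ceil(MODEL_SIZE_GB[model] / vram_per_gpu_gb)


print(min_gpus("70B_q4_0", 24))
print(min_gpus("70B_f16", 24))
```

For 24 GB cards this gives 2 for 70B_q4_0 and 6 for 70B_f16, matching the smallest 3090 and 4090 configurations that ran those models in the overview. It is only a lower bound: two RTX 4000 Ada 20GB cards (40 GB in total) still ran out of memory on 70B_q4_0.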

Apple Silicon (snapshots in Dec 2023)

| Model | Quantized size (4-bit) | Original size (f16) |
|---|---|---|
| 7B | 3.89 GB | 12.88 GB |
| 13B | 7.33 GB | 24.71 GB |
| 30B | 17.96 GB | 61.45 GB |
| 65B | 35.64 GB | 123.47 GB |
| 70B | 36.51 GB | 129.27 GB |

Benchmarks

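Each model was run three times. PP means "prompt processing" and TG means "text generation". In each row of the tables below, the three TG runs come first, then the three PP runs, then the mean TG and the mean PP. OOM means out of memory.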
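Each benchmark row below carries eight numbers: three TG runs, three PP runs, and the two reported means. The following sketch splits one such row and recomputes the means. The function name and the returned dictionary keys are hypothetical, and the example values are the LLaMA 2, 4080 16GB, 7B_q4_0 row from the NVIDIA table further down.

```python
from statistics import mean


def parse_row(values: list[float]) -> dict:
    """Split the 8 numbers of a benchmark row into runs and reported means."""
    tg_runs, pp_runs = values[0:3], values[3:6]
    return {
        "tg_runs": tg_runs,
        "pp_runs": pp_runs,
        "reported_mean_tg": values[6],
        "reported_mean_pp": values[7],
        # Recompute the means, rounded to two decimals like the tables.
        "computed_mean_tg": round(mean(tg_runs), 2),
        "computed_mean_pp": round(mean(pp_runs), 2),
    }


# LLaMA 2, 4080 16GB, 7B_q4_0 (values copied from the benchmark table).
row = parse_row([118.05, 117.98, 118.01, 3763.15, 3777.83, 3783.38, 118.01, 3774.79])
print(row["computed_mean_tg"], row["computed_mean_pp"])
```

For this row the recomputed means agree with the reported ones.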

LLaMA 🦙:

NVIDIA GPUs (CPU: AMD EPYC, OS: Ubuntu 22.04.2 LTS, PyTorch: 2.1.1, Python: 3.10, CUDA: 12.1.1 or 11.8.0, on RunPod) (snapshots in Dec 2023)

GPU Model TG [t/s] (run 1, run 2, run 3) PP [t/s] (run 1, run 2, run 3) mean TG [t/s] mean PP [t/s]
4080 16GB 7B_q4_0 118.26 117.91 117.98 3730.53 3765.79 3781.64 118.05 3759.32
7B_f16 44.22 44.22 44.21 5127.89 5299.86 5216.58 44.22 5214.78
13B_q4_0 67.85 67.91 67.99 2230.36 2271.01 2236.63 67.92 2246
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
30B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 4000 Ada 20GB 7B_q4_0 64.83 64.56 64.7 1485.86 1819.13 1814.07 64.7 1706.35
7B_f16 22.98 22.97 22.96 2365.84 2355.97 2357.47 22.97 2359.76
13B_q4_0 36.48 36.38 36.39 1044.6 1038.25 1039.71 36.42 1040.85
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
30B_q4_0 15.54 15.55 15.54 466.82 463.14 463.11 15.54 464.36
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB 7B_q4_0 120.84 121.12 121.55 2497.43 2625.36 2623.24 121.17 2582.01
7B_f16 51.82 51.86 51.76 3141.79 3224.82 3140.9 51.81 3169.17
13B_q4_0 75.09 74.99 74.95 1679.16 1596.61 1628.07 75.01 1634.61
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
30B_q4_0 33.71 33.7 33.64 769.23 767.21 784.02 33.68 773.49
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
4090 24GB 7B_q4_0 149.67 149.21 149.35 4113.14 5427.62 2343.25 149.41 3961.34
7B_f16 60.86 61.04 60.87 7178.04 7268.87 7286.89 60.92 7244.6
13B_q4_0 88.77 88.84 89.02 3141.23 3143.35 3140.72 88.88 3141.77
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
30B_q4_0 40.13 40.04 40.1 1415.81 1414.77 1412.93 40.09 1414.5
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 5000 Ada 32GB 7B_q4_0 100.01 99.77 99.93 3812.96 3792.65 3714.7 99.9 3773.44
7B_f16 36.24 36.38 36.39 4770.23 5191.42 5201.48 36.34 5054.38
13B_q4_0 56.68 55.96 56.61 2202.27 2048.25 2175.12 56.42 2141.88
13B_f16 19.21 19.23 19.23 3131.34 3117.55 3130.19 19.22 3126.36
30B_q4_0 24.45 24.46 24.46 978.34 971.96 973.47 24.46 974.59
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 4000 Ada 20GB * 2 7B_q4_0 60.53 60.47 57.69 521.68 502.74 497.34 59.56 507.25
7B_f16 31.79 31.03 31.9 499.21 489.35 504.63 31.57 497.73
13B_q4_0 39.52 39.7 39.43 325.2 323.59 325.5 39.55 324.76
13B_f16 18.37 18.34 18.34 308.01 307.61 306.49 18.35 307.37
30B_q4_0 19.85 19.86 19.83 158.89 159.31 158.77 19.85 158.99
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB * 2 7B_q4_0 53.58 53.14 53.18 252.66 252.07 251.92 53.3 252.22
7B_f16 38.91 38.73 38.8 247.9 247.42 248.03 38.81 247.78
13B_q4_0 38.66 38.76 38.51 173.66 173.38 173.26 38.64 173.43
13B_f16 26.21 25.77 25.78 169.89 169.88 169.93 25.92 169.9
30B_q4_0 21.98 22.51 22.18 89.65 89.54 89.49 22.22 89.56
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 14.39 14.44 14.41 54.41 54.38 54.45 14.41 54.41
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
4090 24GB * 2 7B_q4_0 64.36 64.25 64.49 1504.06 1488.26 1504.63 64.37 1498.98
7B_f16 48.49 48.49 48.41 1530.1 1532.7 1523.79 48.46 1528.86
13B_q4_0 47.46 47.62 47.6 936.21 951.01 953.27 47.56 946.83
13B_f16 31.73 31.72 31.75 919.07 906.36 917.53 31.73 914.32
30B_q4_0 27.38 27.33 27.32 423.95 452.05 455.48 27.34 443.83
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 17.38 17.28 17.97 269.39 271.29 266.59 17.54 269.09
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX A6000 48GB 7B_q4_0 111.68 111.39 111.24 2985.87 2987.86 3002.24 111.44 2991.99
7B_f16 44.02 43.94 43.91 3743.16 3694.23 3712.89 43.96 3716.76
13B_q4_0 67.14 67.03 66.98 1790.1 1785.51 1777.19 67.05 1784.27
13B_f16 24.65 24.66 24.65 2375.49 2390.33 2376.3 24.65 2380.71
30B_q4_0 29.72 29.7 29.69 797.64 794.67 793.23 29.7 795.18
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 15.7 15.68 15.68 424.77 421.5 418.24 15.69 421.5
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 6000 Ada 48GB 7B_q4_0 127.93 128.23 128.09 4011.08 4213.76 4329.84 128.08 4184.89
7B_f16 50.54 50.54 50.6 5278.34 5004.36 5010.6 50.56 5097.77
13B_q4_0 74.39 73.58 74.23 2269.22 2298.02 2345.37 74.07 2304.2
13B_f16 26.88 26.88 26.87 2982.1 2913.13 2943.37 26.88 2946.2
30B_q4_0 32.68 32.64 32.64 931.86 940.87 943.7 32.65 938.81
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 17.42 17.37 17.39 487.49 472.7 470.28 17.39 476.82
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB * 3 7B_q4_0 44.68 45.84 44.94 236.87 236.61 237.62 45.15 237.03
7B_f16 35.74 35.31 35.2 232.51 230.31 232.83 35.42 231.88
13B_q4_0 34.09 34.36 34.14 147.56 147.53 147.41 34.2 147.5
13B_f16 25.72 25.72 25.46 145.04 145.04 144.45 25.63 144.84
30B_q4_0 20.36 20.31 20.33 74.76 74.64 74.51 20.33 74.64
30B_f16 14 14.19 14.02 72.09 71.95 71.96 14.07 72
65B_q4_0 13.62 13.59 13.85 47.22 47.1 47.22 13.69 47.18
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
4090 24GB * 3 7B_q4_0 62.14 61.79 62.03 1008.36 694.59 1008.21 61.99 903.72
7B_f16 50.7 50.73 50.56 1002.56 961.68 1001.81 50.66 988.68
13B_q4_0 47.05 46.99 47.18 612.06 610.24 609.81 47.07 610.7
13B_f16 35 35.05 35.11 596.67 597.92 594.88 35.05 596.49
30B_q4_0 28.28 28.31 28.27 316.99 317.8 310.94 28.29 315.24
30B_f16 18.87 18.86 18.86 299.08 299.93 299.45 18.86 299.49
65B_q4_0 19.08 19.07 19.06 192.37 192.98 193.45 19.07 192.93
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 4000 Ada 20GB * 4 7B_q4_0 54.17 54.86 53.5 227.07 230.08 233.99 54.18 230.38
7B_f16 40.25 40.19 40.25 236.98 239.86 243.33 40.23 240.06
13B_q4_0 41.72 41.72 41.54 145.14 142.7 147.7 41.66 145.18
13B_f16 24.37 24.3 24.3 141.78 140.48 142.59 24.32 141.62
30B_q4_0 24.41 24.39 24.2 91.87 92.32 74.44 24.33 86.21
30B_f16 12.08 12.06 12.08 73.13 70.13 73 12.07 72.09
65B_q4_0 15.24 15.22 15.24 47.54 45.29 46.61 15.23 46.48
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
A100 80GB 7B_q4_0 136.66 136.71 136.73 3779.5 3981.72 4109.75 136.7 3956.99
7B_f16 73.96 73.91 73.86 5099.41 5285.4 5555.24 73.91 5313.35
13B_q4_0 92.21 92.27 92.19 2506.69 2507.4 2474.8 92.22 2496.3
13B_f16 45 45.01 45.03 3600.85 3584.78 3691.64 45.01 3625.76
30B_q4_0 46.83 46.82 46.82 1204.77 1219.03 1207.09 46.82 1210.3
30B_f16 20.18 20.21 20.19 1661.54 1893.04 1897.04 20.19 1817.21
65B_q4_0 26.61 26.62 26.59 637.52 637.87 628.66 26.61 634.68
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
H100 PCIe 80GB 7B_q4_0 131.53 135.76 130.86 4919.98 4993.08 4987.58 132.72 4966.88
7B_f16 83.16 82.09 81.95 7146.27 7148.23 7361.38 82.4 7218.63
13B_q4_0 90.46 90.55 92.56 3238.79 3200.25 2700.34 91.19 3046.46
13B_f16 50.88 51.07 50.79 5045.45 5114.67 5100.83 50.91 5086.98
30B_q4_0 48.92 48.89 48.95 1578.69 1570.08 1572.97 48.92 1573.91
30B_f16 23.15 23.06 23.1 2498.44 2503.4 2497.84 23.1 2499.89
65B_q4_0 26.81 27.16 26.74 876.01 879.23 872.77 26.9 876
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX A6000 48GB * 2 7B_q4_0 72.83 72.18 72.61 1139 1112.94 1115.61 72.54 1122.52
7B_f16 47.44 47.01 47 999.03 1091.13 1085.11 47.15 1058.42
13B_q4_0 51.18 51.07 50.84 727.93 726.87 723.13 51.03 725.98
13B_f16 30.06 30.38 29.81 692.74 764.51 670.4 30.08 709.22
30B_q4_0 28.38 28.26 28.21 353.16 379 343.37 28.28 358.51
30B_f16 14.45 14.43 14.4 335.53 335.07 334.34 14.43 334.98
65B_q4_0 17.51 17.22 17.26 215.63 206.24 205.88 17.33 209.25
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB * 6 7B_q4_0 28.03 28.4 28.71 1208.49 89.73 89.94 28.38 462.72
7B_f16 26.16 27.04 25.86 88.51 88.52 88.29 26.35 88.44
13B_q4_0 21.44 20.98 21.38 49.34 48.93 49.29 21.27 49.19
13B_f16 20.24 20.32 20.53 47.97 48.06 47.95 20.36 47.99
30B_q4_0 14.17 14.37 14.35 24.88 24.82 24.92 14.3 24.87
30B_f16 11.99 12.13 11.94 23.92 24.48 23.84 12.02 24.08
65B_q4_0 10.37 10.29 10.33 16.11 16.11 15.58 10.33 15.93
65B_f16 7.87 7.88 7.79 15.81 15.84 15.84 7.85 15.83
4090 24GB * 6 7B_q4_0 35.85 37.17 37.01 443.26 455.17 455.58 36.68 451.34
7B_f16 30.47 37.63 37.75 463.42 446.24 431.44 35.28 447.03
13B_q4_0 30.13 30.47 30.01 289.08 285.65 288.98 30.2 287.9
13B_f16 29.24 29 29.03 282.11 267.74 267.48 29.09 272.44
30B_q4_0 19.88 21.4 21.5 154.9 155.45 152.41 20.93 154.25
30B_f16 18.93 18.9 18.76 150.76 136.95 137.75 18.86 141.82
65B_q4_0 15.94 15.88 15.79 95.75 95.21 95.29 15.87 95.42
65B_f16 11.31 12.47 12.6 82.26 85.95 84.87 12.13 84.36

Apple Silicon (snapshots for M1 Max in Aug 2023, others in Dec 2023)

GPU Model TG [t/s] (run 1, run 2, run 3) PP [t/s] (run 1, run 2, run 3) mean TG [t/s] mean PP [t/s]
M1 Max 24-Core GPU 32GB 7B_q4_0 48.88 48.9 46.65 199.33 199.19 199.43 48.14 199.32
7B_f16 13.99 13.96 13.96 213.61 213.37 213.8 13.97 213.59
13B_q4_0 28.59 27.14 24.64 111.92 95.93 92.92 26.79 100.26
13B_f16 (CPU) 4.49 4.33 4.49 26.55 28.15 18.65 4.44 24.45
30B_q4_0 13.18 13.13 13.12 49.29 49.33 49.3 13.14 49.31
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
M2 Ultra 76-Core GPU 192GB 7B_q4_0 91.81 92.37 92.22 1208.49 1212.66 1211.63 92.13 1210.93
7B_f16 41.18 41.36 41.41 1378.83 1377.31 1380.34 41.32 1378.83
13B_q4_0 55.86 55.79 55.75 660.85 660.91 660.65 55.8 660.8
13B_f16 22.14 22.17 22.17 749.67 748.86 749.73 22.16 749.42
30B_q4_0 26.75 26.72 26.69 273.56 272.89 272.67 26.72 273.04
30B_f16 9.87 9.81 9.83 310.12 309.66 309.94 9.84 309.91
65B_q4_0 15 14.59 14.61 139.04 139.06 139.22 14.73 139.11
65B_f16 5.01 5.01 5.02 158.01 157.91 157.7 5.01 157.87
M3 10-Core GPU 16GB 7B_q4_0 19.48 19.54 20.03 182.59 184.21 184.26 19.68 183.69
7B_f16 (CPU) 4.09 2.89 2.67 28 29.56 30.36 3.22 29.31
13B_q4_0 11.05 11.11 11.1 97.14 97.03 96.87 11.09 97.01
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
30B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
M3 Max 40-Core GPU 48GB 7B_q4_0 63.68 62.95 62.56 743.8 741.39 741.28 63.06 742.16
7B_f16 24.77 24.7 24.5 767.62 722.39 686.75 24.66 725.59
13B_q4_0 35.99 35.94 36.01 378.94 370.76 373.19 35.98 374.3
13B_f16 13.3 13.36 13.34 360.11 376.68 365.01 13.33 367.27
30B_q4_0 16.2 16.24 16.27 152.22 153.36 154.6 16.24 153.39
30B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
65B_q4_0 3.17 3.13 3.12 10.08 9.73 9.91 3.14 9.91
65B_f16 OOM OOM OOM OOM OOM OOM OOM OOM

LLaMA 2 🦙🦙:

NVIDIA GPUs (CPU: AMD EPYC, OS: Ubuntu 22.04.2 LTS, PyTorch: 2.1.1, Python: 3.10, CUDA: 12.1.1 or 11.8.0, on RunPod) (snapshots in Dec 2023)

GPU Model TG [t/s] (run 1, run 2, run 3) PP [t/s] (run 1, run 2, run 3) mean TG [t/s] mean PP [t/s]
4080 16GB 7B_q4_0 118.05 117.98 118.01 3763.15 3777.83 3783.38 118.01 3774.79
7B_f16 44.21 44.21 44.19 5153.89 5125.39 5195.3 44.2 5158.19
13B_q4_0 68.1 68.13 67.97 2240.63 2253 2271.43 68.07 2255.02
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 4000 Ada 20GB 7B_q4_0 64.07 64.64 64.89 1788.29 1836.82 1834.47 64.53 1819.86
7B_f16 22.89 23 22.99 2285.26 2409.71 2412.29 22.96 2369.09
13B_q4_0 36.58 36.32 36.59 1064.43 1047.79 1053.87 36.5 1055.36
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB 7B_q4_0 120.48 120.7 120.62 2605.87 2612.02 2587.55 120.6 2601.81
7B_f16 51.87 51.82 51.85 3090.47 3096.25 3039.38 51.85 3075.37
13B_q4_0 75.11 74.62 74.64 1652.83 1638.8 1667.26 74.79 1652.96
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
4090 24GB 7B_q4_0 150.09 148.67 149.35 5560.5 5528.08 5504.98 149.37 5531.19
7B_f16 60.75 61 60.6 7331.23 7338.41 7376.44 60.78 7348.69
13B_q4_0 88.41 88.51 88.75 3163.32 3162.81 3156.39 88.56 3160.84
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 5000 Ada 32GB 7B_q4_0 99.07 99.23 99.27 3525.95 3702.68 3348.47 99.19 3525.7
7B_f16 36.41 36.22 36.29 5021.4 4915.77 4872.61 36.31 4936.59
13B_q4_0 56.06 56.25 56.38 2158.19 2025.15 2172.96 56.23 2118.77
13B_f16 19.21 19.23 19.22 2715.48 3212.32 3187.01 19.22 3038.27
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 4000 Ada 20GB * 2 7B_q4_0 60.83 60.24 60.98 510.63 527.62 528.69 60.68 522.31
7B_f16 31.88 31.75 31.87 503.6 488.38 506.83 31.83 499.6
13B_q4_0 39.72 39.6 39.64 324.77 324.84 324.78 39.65 324.8
13B_f16 18.37 18.33 18.36 308.54 303.42 306.68 18.35 306.21
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB * 2 7B_q4_0 53.44 52.98 53.34 251.78 252.35 252.33 53.25 252.15
7B_f16 38.68 38.72 38.8 247.18 247.32 247.42 38.73 247.31
13B_q4_0 38.6 38.65 38.25 173.08 172.65 173.01 38.5 172.91
13B_f16 25.72 25.74 25.8 169.65 169.76 170.14 25.75 169.85
70B_q4_0 14.27 14.2 14.36 48.37 48.34 48.68 14.28 48.46
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
4090 24GB * 2 7B_q4_0 66.37 66.21 66.21 1720.22 1696.08 1716.8 66.26 1711.03
7B_f16 49.25 49.04 49.31 1730.6 1725.1 1732.34 49.2 1729.35
13B_q4_0 48.8 48.8 48.86 1015.44 660.63 1072.3 48.82 916.12
13B_f16 32.4 32.43 32.43 1035.41 1037.49 1035.18 32.42 1036.03
70B_q4_0 18.49 18.55 18.46 254.72 258.2 258 18.5 256.97
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX A6000 48GB 7B_q4_0 111.52 111.67 111.52 2970.46 2979.12 2987.78 111.57 2979.12
7B_f16 44.06 44.02 43.96 3729.26 3704.12 3690.26 44.01 3707.88
13B_q4_0 67.1 67.14 67.08 1787.32 1799.81 1793.19 67.11 1793.44
13B_f16 24.68 24.68 24.66 2393.86 2411.16 2386.01 24.67 2397.01
70B_q4_0 15.18 15.18 15.17 404.13 408.22 405.55 15.18 405.97
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 6000 Ada 48GB 7B_q4_0 125.92 128.18 128.24 3689.46 4021.62 4022.47 127.45 3911.18
7B_f16 50.58 50.58 50.53 5443.89 5494.02 5272.88 50.56 5403.6
13B_q4_0 74.29 74.32 74.42 2317.18 2227.36 2251.14 74.34 2265.23
13B_f16 26.92 26.91 26.89 3063.26 3028.63 3029.11 26.91 3040.33
70B_q4_0 17.03 17.01 17.02 506.74 474.26 461.78 17.02 480.93
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB * 3 7B_q4_0 46.11 46.06 46.25 236.75 235.82 236.91 46.14 236.49
7B_f16 35.84 35.83 35.83 232.06 230.81 231.78 35.83 231.55
13B_q4_0 34.45 34.28 33.67 147.44 147.06 147.45 34.13 147.32
13B_f16 25.15 25.2 25.21 145.01 144.67 144.73 25.19 144.8
70B_q4_0 13.52 13.76 13.65 41.26 41.32 41.24 13.64 41.27
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
4090 24GB * 3 7B_q4_0 62.07 62.2 62.15 1011.64 970.65 1006.45 62.14 996.25
7B_f16 50.04 50.59 50.55 1007.3 1004.61 1006.65 50.39 1006.19
13B_q4_0 47.02 32.64 47.27 433.19 594.15 473.47 42.31 500.27
13B_f16 34.75 35.03 35.16 594.69 588.52 596.4 34.98 593.2
70B_q4_0 14.9 15.02 14.83 180.08 177.07 177.92 14.92 178.36
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX 4000 Ada 20GB * 4 7B_q4_0 55.3 54.83 56.1 279.26 280.56 281.66 55.41 280.49
7B_f16 40.55 40.49 40.52 293.28 293.5 293.08 40.52 293.29
13B_q4_0 40.36 41.7 41.82 146.25 147.51 147.36 41.29 147.04
13B_f16 24.41 24.44 24.28 177.98 143.95 147.04 24.38 156.32
70B_q4_0 15.46 15.48 15.49 41.81 47.4 51.07 15.48 46.76
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
A100 80GB 7B_q4_0 136.09 136.17 136.47 3224.44 3613.25 3494.25 136.24 3443.98
7B_f16 73.86 73.79 73.84 4892.15 5166.66 5258.51 73.83 5105.77
13B_q4_0 92.13 92.1 92.12 2343.26 2335.05 2335.44 92.12 2337.92
13B_f16 44.97 45.02 45.03 3662.05 3672.21 3757.6 45.01 3697.29
70B_q4_0 25.72 25.67 25.65 618.99 621.06 597.18 25.68 612.41
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
H100 PCIe 80GB 7B_q4_0 131.44 134.81 135.07 4773.18 4975.78 4856.8 133.77 4868.59
7B_f16 83.19 83.36 83.4 7464.15 7287.64 7349.22 83.32 7367
13B_q4_0 90.42 89.93 90.44 3040.04 2984.15 3012.51 90.26 3012.23
13B_f16 50.86 51.33 50.78 5000.66 5052.73 5090.47 50.99 5047.95
70B_q4_0 26.08 25.8 25.76 863.08 855.95 848.17 25.88 855.73
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
RTX A6000 48GB * 2 7B_q4_0 74.87 74.06 74.37 1321.94 1309.73 1297.81 74.43 1309.83
7B_f16 47.8 46.28 47.15 1275.84 1073.5 1086.1 47.08 1145.15
13B_q4_0 52.08 51 51.71 806.56 810.14 822.28 51.6 812.99
13B_f16 30.39 30.16 30.3 777.89 770.82 773.55 30.28 774.09
70B_q4_0 17.32 17.3 17.32 198.61 198.21 197.89 17.31 198.24
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
3090 24GB * 6 7B_q4_0 26.61 28.23 28.28 89.71 90.15 89.36 27.71 89.74
7B_f16 27.21 27.43 27.46 87.8 87.61 87.45 27.37 87.62
13B_q4_0 22.61 22.36 21.99 49.56 49.57 49.81 22.32 49.65
13B_f16 20.3 20.23 20.15 48.36 48.4 48.31 20.23 48.36
70B_q4_0 10.18 10.12 10.15 14.78 14.81 14.8 10.15 14.8
70B_f16 7.57 7.59 7.47 14.78 14.81 14.8 7.54 14.8
4090 24GB * 6 7B_q4_0 38.34 38 38.24 431.76 440.64 440.91 38.19 437.77
7B_f16 37.5 37.58 37.43 425.93 424.68 426.99 37.5 425.87
13B_q4_0 30.34 30.24 29.97 279.47 290.72 283.99 30.18 284.73
13B_f16 29.43 29.26 29.09 275.42 275.61 265.37 29.26 272.13
70B_q4_0 15.59 15.62 15.57 88.16 88.64 88.62 15.59 88.47
70B_f16 9.75 8.63 12.34 86.7 87.36 81.25 10.24 85.1

Apple Silicon (snapshots for M1 Max in Aug 2023, others in Dec 2023)

GPU Model TG [t/s] (run 1, run 2, run 3) PP [t/s] (run 1, run 2, run 3) mean TG [t/s] mean PP [t/s]
M1 Max 24-Core GPU 32GB 7B_q4_0 48.81 48.43 48.71 199.6 199.66 199.7 48.65 199.65
7B_f16 13.98 13.77 13.73 212.96 213.6 212.85 13.83 213.14
13B_q4_0 26.1 26.77 26.31 111.73 100.97 91.17 26.39 101.29
13B_f16 (CPU) 4.4 4.43 4.37 27.57 26.38 24.93 4.4 26.29
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
M2 Ultra 76-Core GPU 192GB 7B_q4_0 92 91.96 91.72 1213.54 1219.65 1217.91 91.89 1217.03
7B_f16 40.7 40.66 40.69 1379.46 1379.77 1379.39 40.68 1379.54
13B_q4_0 55.3 55.41 55.32 660.36 659.66 659.47 55.34 659.83
13B_f16 22.14 22.14 22.16 749.52 748.59 749.4 22.15 749.17
70B_q4_0 14.37 14.37 14.37 133.27 133.24 133.03 14.37 133.18
70B_f16 4.81 4.81 4.81 150.66 151.22 151.05 4.81 150.98
M3 10-Core GPU 16GB 7B_q4_0 20.86 20.74 20.77 184.59 184.49 184.45 20.79 184.51
7B_f16 (CPU) 4.03 4.08 2.73 31.49 31.62 12.51 3.61 25.21
13B_q4_0 11.26 10.96 11.32 97.08 96.97 97.14 11.18 97.06
13B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM
M3 Max 40-Core GPU 48GB 7B_q4_0 63.51 63.39 62.42 741.79 740.78 740.35 63.11 740.97
7B_f16 24.77 24.77 24.73 767.87 766.13 750.19 24.76 761.4
13B_q4_0 36.55 36.32 36.61 345.06 351.87 387.76 36.49 361.56
13B_f16 13.28 13.35 13.38 379.93 406.5 409.13 13.34 398.52
70B_q4_0 (CPU) 3.11 3.11 3.11 9.27 8.42 8.35 3.11 8.68
70B_f16 OOM OOM OOM OOM OOM OOM OOM OOM

Conclusion

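LLaMA and LLaMA 2 models of the same size and quantization perform about the same. Splitting a model across multiple NVIDIA GPUs can reduce speed: on LLaMA 2 7B_q4_0, a single 4090 averaged 149.37 tokens/s for text generation, while two 4090s averaged 66.26 tokens/s.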
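To make the multi-GPU effect concrete, this small sketch expresses the multi-card eval speeds as a fraction of the single-card speed. The numbers are the average LLaMA 2 7B_q4_0 eval speeds for the 4090 configurations in the overview table. The dictionary and function names are made up for this example.

```python
# Average LLaMA 2 7B_q4_0 eval speed (tokens/s) from the overview table,
# keyed by the number of 4090 24GB cards used.
EVAL_TPS = {1: 149.37, 2: 66.26, 3: 62.14, 6: 38.19}


def relative_speed(n_gpus: int) -> float:
    """Speed with n_gpus cards as a fraction of the single-card speed."""
    return EVAL_TPS[n_gpus] / EVAL_TPS[1]


for n in sorted(EVAL_TPS):
    print(f"{n} x 4090: {relative_speed(n):.2f}x of single-GPU speed")
```

This comparison only covers a model small enough to fit on one card. Extra GPUs matter when the model does not fit in a single card's memory, as with the 70B models.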

For LLM inference, buy 3090s to save money. Buy 4090s if you want more speed. Buy A100s if you are rich. Buy a Mac Studio if you want a quiet, energy-efficient computer that sits on your desk and needs no maintenance. (If you want to train LLMs, choose NVIDIA.)

If you find this information helpful, please give me a star. ⭐️ Feel free to contact me if you have any advice. Thank you. 🤗

Appendix

All the latest results are welcome! 🥰 (Thanks to: MichaelDays ❤️)

Mac

LLaMA 2 🦙🦙:

GPU Model eval time (ms/token) (run 1, run 2, run 3) prompt eval time (ms/token) (run 1, run 2, run 3) mean eval time (ms/token) mean prompt eval time (ms/token)
Mac Pro (2019 intel/16 core/384GB) CPU 7B_q4_0 54.78 61.74 59.77 35.72 37.29 35.88 58.76 36.30
13B_q4_0 102.20 101.97 100.26 63.28 62.77 62.93 101.48 62.99
70B_q4_0 547.89 456.45 457.12 295.31 296.16 295.12 487.15 295.53
M2 Mini Pro (12/19/32GB) CPU 7B_q4_0 48.04 45.78 49.61 17.34 16.81 16.82 47.81 16.99
13B_q4_0 75.93 72.56 71.66 29.39 28.50 28.46 73.38 28.78
70B_q4_0 225956.54 OOM OOM 27220.19 OOM OOM OOM OOM
7B_f16 118.10 116.46 118.95 17.11 16.62 17.84 117.84 17.19
M2 Mini Pro (12/19/32GB) GPU 7B_q4_0 28.92 28.54 29.27 5.89 5.95 5.86 28.91 5.90
13B_q4_0 50.76 50.71 50.63 10.30 10.29 10.28 50.70 10.29
70B_q4_0 OOM OOM OOM OOM OOM OOM OOM OOM
7B_f16 92.43 91.91 92.11 5.56 5.53 5.56 92.15 5.55
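The appendix reports time per token in milliseconds, while the main tables report tokens per second. A minimal conversion sketch, with a hypothetical function name, using the mean values of the M2 Mini Pro GPU 7B_q4_0 row above:

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


# Mean values for the M2 Mini Pro (12/19/32GB) GPU, 7B_q4_0 row.
print(round(tokens_per_second(28.91), 2))  # eval (text generation)
print(round(tokens_per_second(5.90), 2))   # prompt eval (prompt processing)
```

This works out to roughly 34.59 tokens/s for eval and 169.49 tokens/s for prompt eval, which can be compared with the Apple Silicon rows in the Benchmarks section.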
