This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] First-token inference optimization #271

Merged

merged 37 commits into main from pctemplate on May 31, 2024

Conversation

luoyu-intel (Contributor) commented May 29, 2024

Type of Change

  • Higher performance for comp_int8 group=-1 kernels
  • Improved vector mul and add kernels
  • Reduced template combinations to speed up compilation

Performance change

llama2-7B, int4, group=-1, sym, comp_int8:

Alder Lake, 16 cores
PR:

model_print_timings:        load time =  3765.44 ms
model_print_timings:      sample time =     2.79 ms /    16 runs   (    0.17 ms per token)
model_print_timings: prompt eval time =  3745.27 ms /  1024 tokens (    3.66 ms per token)
model_print_timings:        eval time =  1086.39 ms /    15 runs   (   72.43 ms per token)
model_print_timings:       total time =  4861.68 ms
========== eval time log of each prediction ==========
prediction   0, time: 3745.27ms
prediction   1, time: 72.77ms
prediction   2, time: 72.35ms
prediction   3, time: 72.35ms
prediction   4, time: 72.31ms

Main:

model_print_timings:        load time =  7380.84 ms
model_print_timings:      sample time =     3.81 ms /    16 runs   (    0.24 ms per token)
model_print_timings: prompt eval time =  5029.46 ms /  1024 tokens (    4.91 ms per token)
model_print_timings:        eval time =  1086.52 ms /    15 runs   (   72.43 ms per token)
model_print_timings:       total time =  8486.67 ms
========== eval time log of each prediction ==========
prediction   0, time: 5029.46ms
prediction   1, time: 72.87ms
prediction   2, time: 72.38ms
prediction   3, time: 72.17ms
prediction   4, time: 72.36ms

Sapphire Rapids, 56 cores:
PR:

model_print_timings:        load time =   385.17 ms
model_print_timings:      sample time =     8.08 ms /    16 runs   (    0.50 ms per token)
model_print_timings: prompt eval time =   382.27 ms /  1023 tokens (    0.37 ms per token)
model_print_timings:        eval time =   338.05 ms /    15 runs   (   22.54 ms per token)
model_print_timings:       total time =   735.46 ms
========== eval time log of each prediction ==========
prediction   0, time: 382.27ms
prediction   1, time: 23.75ms
prediction   2, time: 22.91ms
prediction   3, time: 22.69ms
prediction   4, time: 22.63ms

model_print_timings:        load time =   740.27 ms
model_print_timings:      sample time =     8.00 ms /    16 runs   (    0.50 ms per token)
model_print_timings: prompt eval time =   734.54 ms /  2024 tokens (    0.36 ms per token)
model_print_timings:        eval time =   385.63 ms /    15 runs   (   25.71 ms per token)
model_print_timings:       total time =  1137.81 ms
========== eval time log of each prediction ==========
prediction   0, time: 734.54ms
prediction   1, time: 27.92ms
prediction   2, time: 26.25ms
prediction   3, time: 25.76ms
prediction   4, time: 25.60ms

Main:

model_print_timings:      sample time =     8.59 ms /    16 runs   (    0.54 ms per token)
model_print_timings: prompt eval time =   407.88 ms /  1023 tokens (    0.40 ms per token)
model_print_timings:        eval time =   348.30 ms /    15 runs   (   23.22 ms per token)
model_print_timings:       total time =   771.64 ms
========== eval time log of each prediction ==========
prediction   0, time: 407.88ms
prediction   1, time: 24.32ms
prediction   2, time: 23.61ms
prediction   3, time: 23.46ms
prediction   4, time: 23.34ms

model_print_timings:        load time =   865.02 ms
model_print_timings:      sample time =     8.57 ms /    16 runs   (    0.54 ms per token)
model_print_timings: prompt eval time =   859.30 ms /  2024 tokens (    0.42 ms per token)
model_print_timings:        eval time =   386.94 ms /    15 runs   (   25.80 ms per token)
model_print_timings:       total time =  1264.64 ms
========== eval time log of each prediction ==========
prediction   0, time: 859.30ms
prediction   1, time: 27.38ms
prediction   2, time: 27.00ms
prediction   3, time: 26.05ms
prediction   4, time: 25.91ms

Mistral-7B, int4, group=-1, sym, comp_int8:
Cascade Lake, 20 cores
PR:

model_print_timings:        load time =  2572.34 ms
model_print_timings:      sample time =     9.15 ms /    16 runs   (    0.57 ms per token)
model_print_timings: prompt eval time =  2571.85 ms /  1008 tokens (    2.55 ms per token)
model_print_timings:        eval time =   711.59 ms /    15 runs   (   47.44 ms per token)
model_print_timings:       total time =  3298.07 ms
========== eval time log of each prediction ==========
prediction   0, time: 2571.85ms
prediction   1, time: 48.90ms
prediction   2, time: 47.35ms
prediction   3, time: 47.37ms
prediction   4, time: 47.35ms

Main:

model_print_timings:        load time =  2933.11 ms
model_print_timings:      sample time =     9.31 ms /    16 runs   (    0.58 ms per token)
model_print_timings: prompt eval time =  2932.60 ms /  1008 tokens (    2.91 ms per token)
model_print_timings:        eval time =   708.29 ms /    15 runs   (   47.22 ms per token)
model_print_timings:       total time =  3655.52 ms
========== eval time log of each prediction ==========
prediction   0, time: 2932.60ms
prediction   1, time: 48.08ms
prediction   2, time: 47.02ms
prediction   3, time: 46.99ms
prediction   4, time: 47.08ms

@luoyu-intel luoyu-intel requested review from yuchengliu1 and DDEle May 29, 2024 07:34
yuchengliu1 (Contributor)

There are also add and mul in custom::epilogue. Should we keep just one add and mul?

luoyu-intel (Contributor, Author)

> There are also add and mul in custom::epilogue. Should we keep just one add and mul?

custom::epilogue uses the reference code; it should call kernel::wrapper::Add and Mul.

@luoyu-intel
Copy link
Contributor Author

The Windows CI server can't connect to the proxy server. The Windows build was verified on a local Windows machine.

@luoyu-intel luoyu-intel merged commit 3757fda into main May 31, 2024
15 of 16 checks passed
@luoyu-intel luoyu-intel deleted the pctemplate branch May 31, 2024 06:39