This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] First-token inference optimization #271

Merged

merged 37 commits into main from pctemplate on May 31, 2024

Conversation

luoyu-intel (Contributor) commented May 29, 2024

Type of Change

  • Higher performance for comp_int8 group=-1 kernels
  • Improved vector mul and add kernels
  • Reduced template combinations to speed up compilation

Performance change

llama2-7B, int4, group=-1, sym, comp_int8:

Alder Lake, 16 cores
PR:

model_print_timings:        load time =  3765.44 ms
model_print_timings:      sample time =     2.79 ms /    16 runs   (    0.17 ms per token)
model_print_timings: prompt eval time =  3745.27 ms /  1024 tokens (    3.66 ms per token)
model_print_timings:        eval time =  1086.39 ms /    15 runs   (   72.43 ms per token)
model_print_timings:       total time =  4861.68 ms
========== eval time log of each prediction ==========
prediction   0, time: 3745.27ms
prediction   1, time: 72.77ms
prediction   2, time: 72.35ms
prediction   3, time: 72.35ms
prediction   4, time: 72.31ms

Main:

model_print_timings:        load time =  7380.84 ms
model_print_timings:      sample time =     3.81 ms /    16 runs   (    0.24 ms per token)
model_print_timings: prompt eval time =  5029.46 ms /  1024 tokens (    4.91 ms per token)
model_print_timings:        eval time =  1086.52 ms /    15 runs   (   72.43 ms per token)
model_print_timings:       total time =  8486.67 ms
========== eval time log of each prediction ==========
prediction   0, time: 5029.46ms
prediction   1, time: 72.87ms
prediction   2, time: 72.38ms
prediction   3, time: 72.17ms
prediction   4, time: 72.36ms

Sapphire Rapids, 56 cores:
PR:

model_print_timings:        load time =   385.17 ms
model_print_timings:      sample time =     8.08 ms /    16 runs   (    0.50 ms per token)
model_print_timings: prompt eval time =   382.27 ms /  1023 tokens (    0.37 ms per token)
model_print_timings:        eval time =   338.05 ms /    15 runs   (   22.54 ms per token)
model_print_timings:       total time =   735.46 ms
========== eval time log of each prediction ==========
prediction   0, time: 382.27ms
prediction   1, time: 23.75ms
prediction   2, time: 22.91ms
prediction   3, time: 22.69ms
prediction   4, time: 22.63ms

model_print_timings:        load time =   740.27 ms
model_print_timings:      sample time =     8.00 ms /    16 runs   (    0.50 ms per token)
model_print_timings: prompt eval time =   734.54 ms /  2024 tokens (    0.36 ms per token)
model_print_timings:        eval time =   385.63 ms /    15 runs   (   25.71 ms per token)
model_print_timings:       total time =  1137.81 ms
========== eval time log of each prediction ==========
prediction   0, time: 734.54ms
prediction   1, time: 27.92ms
prediction   2, time: 26.25ms
prediction   3, time: 25.76ms
prediction   4, time: 25.60ms

Main:

model_print_timings:      sample time =     8.59 ms /    16 runs   (    0.54 ms per token)
model_print_timings: prompt eval time =   407.88 ms /  1023 tokens (    0.40 ms per token)
model_print_timings:        eval time =   348.30 ms /    15 runs   (   23.22 ms per token)
model_print_timings:       total time =   771.64 ms
========== eval time log of each prediction ==========
prediction   0, time: 407.88ms
prediction   1, time: 24.32ms
prediction   2, time: 23.61ms
prediction   3, time: 23.46ms
prediction   4, time: 23.34ms

model_print_timings:        load time =   865.02 ms
model_print_timings:      sample time =     8.57 ms /    16 runs   (    0.54 ms per token)
model_print_timings: prompt eval time =   859.30 ms /  2024 tokens (    0.42 ms per token)
model_print_timings:        eval time =   386.94 ms /    15 runs   (   25.80 ms per token)
model_print_timings:       total time =  1264.64 ms
========== eval time log of each prediction ==========
prediction   0, time: 859.30ms
prediction   1, time: 27.38ms
prediction   2, time: 27.00ms
prediction   3, time: 26.05ms
prediction   4, time: 25.91ms

Mistral-7B, int4, group=-1, sym, comp_int8:
Cascade Lake, 20 cores
PR:

model_print_timings:        load time =  2572.34 ms
model_print_timings:      sample time =     9.15 ms /    16 runs   (    0.57 ms per token)
model_print_timings: prompt eval time =  2571.85 ms /  1008 tokens (    2.55 ms per token)
model_print_timings:        eval time =   711.59 ms /    15 runs   (   47.44 ms per token)
model_print_timings:       total time =  3298.07 ms
========== eval time log of each prediction ==========
prediction   0, time: 2571.85ms
prediction   1, time: 48.90ms
prediction   2, time: 47.35ms
prediction   3, time: 47.37ms
prediction   4, time: 47.35ms

Main:

model_print_timings:        load time =  2933.11 ms
model_print_timings:      sample time =     9.31 ms /    16 runs   (    0.58 ms per token)
model_print_timings: prompt eval time =  2932.60 ms /  1008 tokens (    2.91 ms per token)
model_print_timings:        eval time =   708.29 ms /    15 runs   (   47.22 ms per token)
model_print_timings:       total time =  3655.52 ms
========== eval time log of each prediction ==========
prediction   0, time: 2932.60ms
prediction   1, time: 48.08ms
prediction   2, time: 47.02ms
prediction   3, time: 46.99ms
prediction   4, time: 47.08ms

@luoyu-intel luoyu-intel requested review from yuchengliu1 and DDEle May 29, 2024 07:34
yuchengliu1 (Contributor)

There are also add and mul in custom::epilogue. Should we keep just one add and mul?

luoyu-intel (Contributor, Author)

> There are also add and mul in custom::epilogue. Should we keep just one add and mul?

custom::epilogue uses the reference code; it should call kernel::wrapper::Add and Mul.

@luoyu-intel
Copy link
Contributor Author

The Windows CI server can't connect to the proxy server. The Windows build was verified on a local Windows machine.

@luoyu-intel luoyu-intel merged commit 3757fda into main May 31, 2024
15 of 16 checks passed
@luoyu-intel luoyu-intel deleted the pctemplate branch May 31, 2024 06:39