This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] New thread pool and hybrid dispatcher #118

Merged
merged 45 commits into from
Mar 8, 2024

Conversation

@yuchengliu1 (Contributor) commented Feb 4, 2024

Type of Change

  • New thread pool without OpenMP
  • Higher FLOPs on hybrid CPUs
  • Enable the BesTLA benchmark
  • Synchronize the thread pools of ne_layers and bestla; remove the existing thread pool
  • Model integration
  • Model performance optimization

@luoyu-intel luoyu-intel changed the title Thread pool [BesTLA] New thread pool and hybrid dispatcher Feb 29, 2024
@yuchengliu1 yuchengliu1 marked this pull request as ready for review March 1, 2024 03:20
Review threads on:
  • bestla/bestla/bestla_parallel.h
  • bestla/bestla/ut/bestla_gemm.cpp
  • bestla/bestla/ut/bestla_parallel.cpp
@zhewang1-intc (Contributor)

Just a suggestion: the default CMake configuration disables our thread pool, so customers who install ns via pip can't get the performance benefit of the thread pool on client platforms.
Could our dispatcher detect the CPU architecture and, when it detects a hybrid CPU, dispatch to our thread pool, otherwise to OpenMP?

@luoyu-intel (Contributor)

> just a suggestion: default cmake disable our thread-pool, if customer install ns via pip, then they can't get the perf benefit from our thread-pool on client platform. can our dispatcher detect CPU arch, if detected hybrid CPUs then dispatch to our thread-pool otherwise to omp?

@zhewang1-intc This PR initially enabled the thread pool, which gives good first-token speed but may slow down next-token generation, so it's disabled for now.

@luoyu-intel (Contributor) commented Mar 7, 2024

Some performance data on a 12900K:
---------------input = 1949--------------------
::ThreadPool::
model_print_timings: prompt eval time = 12259.06 ms / 1949 tokens ( 6.29 ms per token)
model_print_timings: eval time = 1185.61 ms / 15 runs ( 79.04 ms per token)
model_print_timings: total time = 13567.22 ms
========== eval time log of each prediction ==========
prediction 0, time: 12259.06ms
prediction 1, time: 79.18ms
prediction 2, time: 78.56ms
prediction 3, time: 78.58ms
prediction 4, time: 78.68ms
prediction 5, time: 78.55ms
::OMP::
model_print_timings: prompt eval time = 16296.21 ms / 1949 tokens ( 8.36 ms per token)
model_print_timings: eval time = 1294.00 ms / 15 runs ( 86.27 ms per token)
model_print_timings: total time = 17624.78 ms
========== eval time log of each prediction ==========
prediction 0, time: 16296.21ms
prediction 1, time: 80.91ms
prediction 2, time: 84.77ms
prediction 3, time: 86.91ms
prediction 4, time: 97.11ms
prediction 5, time: 88.35ms

---------------input = 32--------------------
::ThreadPool::
model_print_timings: prompt eval time = 184.27 ms / 32 tokens ( 5.76 ms per token)
model_print_timings: eval time = 898.96 ms / 15 runs ( 59.93 ms per token)
model_print_timings: total time = 1091.83 ms
========== eval time log of each prediction ==========
prediction 0, time: 184.27ms
prediction 1, time: 59.38ms
prediction 2, time: 59.29ms
prediction 3, time: 62.48ms
prediction 4, time: 59.49ms
prediction 5, time: 59.23ms
::OMP::
model_print_timings: prompt eval time = 278.05 ms / 32 tokens ( 8.69 ms per token)
model_print_timings: eval time = 1021.67 ms / 15 runs ( 68.11 ms per token)
model_print_timings: total time = 1308.71 ms
========== eval time log of each prediction ==========
prediction 0, time: 278.05ms
prediction 1, time: 62.68ms
prediction 2, time: 66.06ms
prediction 3, time: 62.36ms
prediction 4, time: 61.10ms
prediction 5, time: 62.53ms

@luoyu-intel (Contributor)

> just a suggestion: default cmake disable our thread-pool, if customer install ns via pip, then they can't get the perf benefit from our thread-pool on client platform. can our dispatcher detect CPU arch, if detected hybrid CPUs then dispatch to our thread-pool otherwise to omp?

> @zhewang1-intc In this PR, it initially enables thread pool with good first-token speed. But it may slow down the next token. So it's disabled for now.

@zhewang1-intc The next-token performance issue is fixed, but the new thread pool needs more validation, so it will stay disabled by default.

@airMeng airMeng merged commit fd19a44 into main Mar 8, 2024
12 checks passed
@yuchengliu1 (Contributor, Author)

CPU: MTL 6P+8E, 65 W (12 threads used to keep performance stable)
Memory: DDR5-5600

| model  | thread pool | tokens | first token (ms) | next token (ms) |
|--------|-------------|--------|------------------|-----------------|
| qwen   | this branch | 1024   | 14205.99         | 161.87          |
| qwen   | main (OMP)  | 1024   | 16472.06         | 188.22          |
| llama2 | this branch | 1024   | 12780.52         | 136.74          |
| llama2 | main (OMP)  | 1024   | 15572.36         | 144.12          |

@yuchengliu1 yuchengliu1 deleted the thread_pool branch March 12, 2024 07:54