This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] New thread pool and hybrid dispatcher #118

Merged
merged 45 commits into from
Mar 8, 2024

Conversation

@yuchengliu1 (Contributor) commented Feb 4, 2024

Type of Change

  • New thread pool without OpenMP
  • Higher FLOPs on hybrid CPUs
  • Enable the BesTLA benchmark
  • Synchronize the thread pools of ne_layers and bestla; remove the existing thread pool
  • Model integration
  • Model performance optimization

@luoyu-intel luoyu-intel changed the title Thread pool [BesTLA] New thread pool and hybrid dispatcher Feb 29, 2024
@yuchengliu1 yuchengliu1 marked this pull request as ready for review March 1, 2024 03:20
Review threads on:
  • bestla/bestla/bestla_parallel.h
  • bestla/bestla/ut/bestla_gemm.cpp
  • bestla/bestla/ut/bestla_parallel.cpp
@zhewang1-intc (Contributor)

Just a suggestion: the default CMake configuration disables our thread pool, so customers who install ns via pip can't get the performance benefit of the thread pool on client platforms.
Could our dispatcher detect the CPU architecture and, when it detects a hybrid CPU, dispatch to our thread pool, otherwise to OpenMP?

@luoyu-intel (Contributor)

> just a suggestion: default cmake disable our thread-pool, if customer install ns via pip, then they can't get the perf benefit from our thread-pool on client platform. can our dispatcher detect CPU arch, if detected hybrid CPUs then dispatch to our thread-pool otherwise to omp?

@zhewang1-intc This PR initially enabled the thread pool, which gives good first-token speed but may slow down next-token generation, so it's disabled for now.

@luoyu-intel (Contributor) commented Mar 7, 2024

Some performance data on a 12900K:
---------------input = 1949--------------------
::ThreadPool::
model_print_timings: prompt eval time = 12259.06 ms / 1949 tokens ( 6.29 ms per token)
model_print_timings: eval time = 1185.61 ms / 15 runs ( 79.04 ms per token)
model_print_timings: total time = 13567.22 ms
========== eval time log of each prediction ==========
prediction 0, time: 12259.06ms
prediction 1, time: 79.18ms
prediction 2, time: 78.56ms
prediction 3, time: 78.58ms
prediction 4, time: 78.68ms
prediction 5, time: 78.55ms
::OMP::
model_print_timings: prompt eval time = 16296.21 ms / 1949 tokens ( 8.36 ms per token)
model_print_timings: eval time = 1294.00 ms / 15 runs ( 86.27 ms per token)
model_print_timings: total time = 17624.78 ms
========== eval time log of each prediction ==========
prediction 0, time: 16296.21ms
prediction 1, time: 80.91ms
prediction 2, time: 84.77ms
prediction 3, time: 86.91ms
prediction 4, time: 97.11ms
prediction 5, time: 88.35ms

---------------input = 32--------------------
::ThreadPool::
model_print_timings: prompt eval time = 184.27 ms / 32 tokens ( 5.76 ms per token)
model_print_timings: eval time = 898.96 ms / 15 runs ( 59.93 ms per token)
model_print_timings: total time = 1091.83 ms
========== eval time log of each prediction ==========
prediction 0, time: 184.27ms
prediction 1, time: 59.38ms
prediction 2, time: 59.29ms
prediction 3, time: 62.48ms
prediction 4, time: 59.49ms
prediction 5, time: 59.23ms
::OMP::
model_print_timings: prompt eval time = 278.05 ms / 32 tokens ( 8.69 ms per token)
model_print_timings: eval time = 1021.67 ms / 15 runs ( 68.11 ms per token)
model_print_timings: total time = 1308.71 ms
========== eval time log of each prediction ==========
prediction 0, time: 278.05ms
prediction 1, time: 62.68ms
prediction 2, time: 66.06ms
prediction 3, time: 62.36ms
prediction 4, time: 61.10ms
prediction 5, time: 62.53ms

@luoyu-intel (Contributor)

> just a suggestion: default cmake disable our thread-pool, if customer install ns via pip, then they can't get the perf benefit from our thread-pool on client platform. can our dispatcher detect CPU arch, if detected hybrid CPUs then dispatch to our thread-pool otherwise to omp?

> @zhewang1-intc In this PR, it initially enables thread pool with good first-token speed. But it may slow down the next token. So it's disabled for now.

@zhewang1-intc The next-token performance issue is fixed, but the new thread pool needs more validation, so it will stay disabled by default.

@airMeng airMeng merged commit fd19a44 into main Mar 8, 2024
12 checks passed
@yuchengliu1 (Contributor, Author)

CPU: MTL 6P+8E, 65 W (12 threads used to keep performance stable)
Memory: DDR5-5600

| model  | thread pool | tokens | first token (ms) | next token (ms) |
|--------|-------------|--------|------------------|-----------------|
| qwen   | this branch | 1024   | 14205.99         | 161.87          |
| qwen   | main (OMP)  | 1024   | 16472.06         | 188.22          |
| llama2 | this branch | 1024   | 12780.52         | 136.74          |
| llama2 | main (OMP)  | 1024   | 15572.36         | 144.12          |

@yuchengliu1 yuchengliu1 deleted the thread_pool branch March 12, 2024 07:54