This repository has been archived by the owner on Aug 30, 2024. It is now read-only.
Conversation
Signed-off-by: zhenwei-intel <[email protected]>
* move hidden files
* update readme path

Signed-off-by: zhenwei-intel <[email protected]>
* Unify scripts for converting, quantizing and chatting
* move folder
* update script with subprocess

Signed-off-by: zhenwei-intel <[email protected]>
* initial commit of n_head_kv in MQA
* add attention layernorm
* reorder QKV weight when converting
* fix typo
* cherry-pick ggml MQA
* fix KV cache and reduce hand-managed memory buffer size

Signed-off-by: Yu, Zhentao <[email protected]>
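For background, the n_head_kv parameter enables multi-query attention, where many query heads share a smaller set of KV heads (which is also why the KV cache shrinks). A minimal NumPy sketch of the grouping idea, with illustrative names and shapes rather than the repository's actual kernel code:

```python
import numpy as np

def mqa_attention(q, k, v, n_head, n_head_kv):
    """Grouped attention: n_head query heads share n_head_kv KV heads.

    q: (n_head, seq, head_dim); k, v: (n_head_kv, seq, head_dim).
    Illustrative only (causal mask omitted); the real kernels
    operate on quantized buffers.
    """
    group = n_head // n_head_kv              # query heads per KV head
    head_dim = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_head):
        kv = h // group                      # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(head_dim)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ v[kv]
    return out
```

With n_head_kv < n_head, only n_head_kv K/V tensors need caching, which matches the reduced buffer size mentioned above.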
* Update README.md: update the readme
* Update README.md
* Update README.md
* Update README.md
* Refine Inference Workflow Readme

Signed-off-by: hshen14 <[email protected]>
Co-authored-by: lvliang-intel <[email protected]>
Co-authored-by: Wang, Chang <[email protected]>
* add s8 per-channel quantization and kernel
* add QKV; add fusion support for s8 per-N
* add amx_int8 per-N GELU fusion
* add GELU-add fusion for VNNI
* split jblas file; add compute type fp32
* add comp_type fp32 for FFN fusion
* add bf16 for s4 and s4 FFN fusion
* add workspace for jblas functions
* keep one jblas codebase
* disable mmap by default; change arg --no_mmap to --use_mmap
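As a rough sketch of what symmetric s8 per-channel quantization computes — one scale per output channel of the weight — here is an illustrative helper (hypothetical names, not the jblas kernel API):

```python
import numpy as np

def quantize_s8_per_channel(w):
    """Symmetric int8 quantization, one scale per output channel (row).

    w: (out_channels, in_channels) fp32 weight. Illustrative sketch only.
    """
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_s8_per_channel(q, scales):
    """Recover an fp32 approximation of the original weight."""
    return q.astype(np.float32) * scales
```

Per-channel scales track each row's dynamic range separately, which typically loses less accuracy than one scale for the whole tensor.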
* refine readme
* refine readme
* refine table
* Refine LLM Runtime readme
* Continue updating the readme
* Simplify the readme
* add back run_llm.py
* change script arg name
* rename arg
* fix
* add description
* add another way to convert the model
* remove additional line
* refine readme
* refine readme (the convert script still needs modification later)
* fix model_maps
* fix convert_gptj
* refine readme
* refine

Signed-off-by: hshen14 <[email protected]>
Signed-off-by: zhenwei-intel <[email protected]>
Co-authored-by: hshen14 <[email protected]>
Co-authored-by: zhenwei-intel <[email protected]>
Signed-off-by: Dong, Bo1 <[email protected]>
* support bloom

Signed-off-by: Dong, Bo1 <[email protected]>
* add length_penalty and min_new_tokens_logits_process
* revert V cache reorder
* refactor beam_search code architecture
* fix n_threads
* make beam_kv_cache_reorder a class
* clean code

Signed-off-by: Yu, Zhentao <[email protected]>
Co-authored-by: Haihao Shen <[email protected]>
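The two generation controls named above presumably follow the standard techniques: min_new_tokens masks the EOS token until enough tokens have been generated, and length_penalty normalizes a beam's cumulative log-probability by its length. A hedged sketch under those assumptions (function and parameter names are illustrative):

```python
import numpy as np

def suppress_eos(logits, n_generated, min_new_tokens, eos_token_id):
    """Forbid EOS until at least min_new_tokens tokens are generated.

    logits: (vocab_size,) next-token scores. Illustrative sketch only.
    """
    if n_generated < min_new_tokens:
        logits[eos_token_id] = -np.inf   # EOS can never be selected
    return logits

def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-penalized beam score; penalty > 1 favors longer outputs."""
    return sum_logprobs / (length ** length_penalty)
```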
* fix q8 per-N QKV fusion for VNNI
* add SiLU JIT kernel; add SiLU fusion
* fix the result of LLaMA SiLU fusion
* enable JIT swish for higher performance
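For reference, the swish activation being JIT-compiled here is x * sigmoid(alpha * x), with alpha = 1 giving SiLU; the fused kernel computes it elementwise on the matmul output. A plain reference version for comparison (not the JIT code):

```python
import numpy as np

def swish(x, alpha=1.0):
    """Swish: x * sigmoid(alpha * x); alpha = 1.0 is SiLU."""
    return x / (1.0 + np.exp(-alpha * x))
```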
* rename llm chat application
* rename CI test script

Signed-off-by: Yu, Zhentao <[email protected]>
Co-authored-by: Dong, Bo <[email protected]>
Signed-off-by: zhenwei-intel <[email protected]>
* update jblas to b3c75b2
* MHA refactor changes
* full fp16 MHA draft
* support fp32fp16fp16fp32 jblas MHA with fp16 kernels
* add fp16 MHA fusion
* fix the fp16 issue on older gcc versions
* keep the same permute for bf16 and fp16 MHA
* fix params for fp16 MHA
* MHA amxbf16 supports reo-k
* prepare fwd args for int8 inference
* int8 MHA draft
* draft of bf16 MHA with KV update
* disable fp16 MHA by default
* fix MHA NaN
* fall back to bf16 when unsupported
* check MHA support
* update swish alpha value
* fix fp32 SiLU bug
* disable MHA on compilers without bf16 intrinsics

Signed-off-by: Ding, Yi1 <[email protected]>
Co-authored-by: luoyu-intel <[email protected]>
Signed-off-by: intellinjun <[email protected]>
* add TP and gptj model support
  1. add TP_1D algorithm
  2. add parallel_context for broadcast/reduce
  3. support all data types
  4. support the gptj model

Signed-off-by: Clark Chin <[email protected]>
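Assuming TP_1D means the usual one-dimensional tensor-parallel split — the first weight sliced by columns, the second by rows, with an all-reduce summing the partial outputs — here is a toy single-process sketch of the arithmetic (no real communication; a ReLU FFN stands in for the real layers, and names are illustrative, not the parallel_context API):

```python
import numpy as np

def tp_1d_ffn(x, w1, w2, world_size):
    """Simulate a 1D tensor-parallel FFN: w1 split by columns, w2 by rows.

    Each 'rank' computes a partial output; summing the partials (the
    all-reduce step) reproduces the serial result exactly.
    """
    cols = np.array_split(w1, world_size, axis=1)   # column-parallel slices
    rows = np.array_split(w2, world_size, axis=0)   # row-parallel slices
    partials = [np.maximum(x @ c, 0) @ r for c, r in zip(cols, rows)]
    return sum(partials)                            # stands in for all-reduce

# Sanity check: parallel result matches the serial computation.
x = np.random.randn(4, 16)
w1, w2 = np.random.randn(16, 32), np.random.randn(32, 16)
assert np.allclose(tp_1d_ffn(x, w1, w2, 2), np.maximum(x @ w1, 0) @ w2)
```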
Signed-off-by: Ding, Yi1 <[email protected]>
* chatglm-2 q4_j inference passes with correct accuracy
* unify convert scripts
* specify chatglm2; remove the ambiguous chatglm
* initialize glm1
* initialize glm1
* fix kernel issues for glm1
* adapt to the latest main; chatglm2 inference passes
* add parameters for all convert.py scripts
* add parameters for bloom
* update README and clean code
* disable chatglm1

Signed-off-by: Zhenzhong1 <[email protected]>
Signed-off-by: intellinjun <[email protected]>
Signed-off-by: Haihao Shen <[email protected]>
Signed-off-by: Dong, Bo1 <[email protected]>
…generation (#700)
Signed-off-by: Dong, Bo1 <[email protected]>
Co-authored-by: kevinintel <[email protected]>
* Baichuan13B FP32 inference bug fix
Co-authored-by: luoyu-intel <[email protected]>
Co-authored-by: Ding, Yi1 <[email protected]>
Co-authored-by: zhenwei-intel <[email protected]>
Co-authored-by: yuchengliu1 <[email protected]>
Co-authored-by: Meng, Hengyu <[email protected]>
* add fp8 in llm frontend

Signed-off-by: Yu, Zhentao <[email protected]>
Co-authored-by: luoyu-intel <[email protected]>
migrate CI
refine CI for neuralspeed
add more CI scripts
minor fix
remove runner.name when running on ubuntu-latest
update CI to share the system
rename jblas to bestla; reorganize directories
remove itrex dependency
fix script path; remove python dependency
remove python tests; disable percentage; disable monitor
fix naming; fix threadpool conflict
restore percentage

Signed-off-by: Hengyu Meng <[email protected]>
Signed-off-by: Wenxin Zhang <[email protected]>
add bestla workflow image

Signed-off-by: Hengyu Meng <[email protected]>
VincyZhang approved these changes on Dec 20, 2023.
DDEle pushed a commit to DDEle/neural-speed that referenced this pull request on Feb 15, 2024.
Co-authored-by: Jiaxingla <[email protected]>
Type of Change
feature, bug fix, documentation, or others
API changed or not
Description
detail description
Issues: xxx
Expected Behavior & Potential Risk
the expected behavior triggered by this PR
How has this PR been tested?
how to reproduce the test (including hardware information)
Dependency Change?
any library dependency introduced or removed