* Refactor folder structure (#170)
* Fix StarCoder quantization bug (#159). Signed-off-by: zhenwei-intel <[email protected]>
* Update readme path and copy hidden files (#185): move hidden files; update readme path. Signed-off-by: zhenwei-intel <[email protected]>
* Unify scripts for converting, quantizing and chatting (#161): move folder; update the script to chain the steps via subprocess (a driver sketch follows this group of entries). Signed-off-by: zhenwei-intel <[email protected]>
* [CPP Graph] Falcon 40B (#175): initial commit of n_head_kv in MQA; add attention layer norm; reorder QKV weights during conversion; fix typo; cherry-pick ggml MQA; fix the KV cache and reduce the handmade memory buffer size. Signed-off-by: Yu, Zhentao <[email protected]>
* Update README.md (#198)
* Refine readme of LLM Runtime (#200)
* Refine Inference Workflow readme (#214). Signed-off-by: hshen14 <[email protected]> Co-authored-by: lvliang-intel <[email protected]> Co-authored-by: Wang, Chang <[email protected]>
* [CPP Graph] Add s8 per-channel quantization and kernel (#181): add QKV and fusion support for s8 per-N; add amx_int8 per-N GELU fusion; add GELU add-fusion for VNNI; split the jblas file and add compute type fp32; add comp_type fp32 for FFN fusion; add bf16 for s4 and s4 FFN fusion; add a workspace for jblas functions; keep a single copy of the jblas code; disable mmap by default and change the --no_mmap argument to --use_mmap.
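
The #161 entry above unifies conversion, quantization and chat behind one Python driver that shells out to each stage. A minimal sketch of that pattern, assuming illustrative script names and flags (the log does not spell out the exact CLI, so convert.py, quantize.py, the chat binary and their arguments are placeholders):

```python
# Hypothetical driver for the convert -> quantize -> chat pipeline (#161).
# Script names and flags are assumptions, not the repository's exact CLI.
import subprocess
import sys

def run(cmd):
    """Echo and execute one pipeline stage, stopping on the first failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

model_dir = sys.argv[1]
fp32_bin, q4_bin = "ne-f32.bin", "ne-q4.bin"
run(["python", "convert.py", model_dir, "--outfile", fp32_bin])  # HF model -> fp32 graph file
run(["python", "quantize.py", "--model_file", fp32_bin, "--out_file", q4_bin])  # fp32 -> int4
run(["./build/bin/chat", "-m", q4_bin, "-p", "Hello"])  # interactive chat on the quantized file
```
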
* Force CMake to add --std=cxx/c++xx (#205)
* Fix graph model quantization on AVX2-only platforms (#221)
* Refine readme (#211): refine readme and table; refine, continue updating and simplify the LLM Runtime readme; add back run_llm.py; rename script arguments; add description; add another way to convert the model; remove an extra line; note that the convert script still needs later modification; fix model_maps; fix convert_gptj. Signed-off-by: hshen14 <[email protected]> Signed-off-by: zhenwei-intel <[email protected]> Co-authored-by: hshen14 <[email protected]> Co-authored-by: zhenwei-intel <[email protected]>
* Fix scan (#224). Signed-off-by: Dong, Bo1 <[email protected]>
* Support Bloom for cpp (#207). Signed-off-by: Dong, Bo1 <[email protected]>
* Move Neural Engine to deprecated (#206)
* [CPP Graph] Enhance beam search (length_penalty + min_new_tokens) (#173): add length_penalty and min_new_tokens logits processors (parameter semantics are sketched after this group); revert the V-cache reorder; refactor the beam_search code architecture; fix n_threads; turn beam_kv_cache_reorder into a class; clean up code. Signed-off-by: Yu, Zhentao <[email protected]> Co-authored-by: Haihao Shen <[email protected]>
* [Cpp Graph] Fix q8 per-N QKV fusion for VNNI (#230): add a SiLU JIT kernel and SiLU fusion; fix the result of the LLaMA SiLU fusion; enable JIT Swish for higher performance.
* [CPP Graph] Rename the LLM chat application (#236): rename the CI test script. Signed-off-by: Yu, Zhentao <[email protected]> Co-authored-by: Dong, Bo <[email protected]>
* [CPP Graph] Add OPT cpp graph and chat application (#133)
* Update oneDNN to v3.3-pc (#187). Signed-off-by: zhenwei-intel <[email protected]>
* [CPP Graph] AMX-BF16 MHA with KV update (#179): update jblas to b3c75b2; refactor MHA; full fp16 MHA draft; support fp32-fp16-fp16-fp32 jblas MHA with fp16 kernels; add fp16 MHA fusion; fix fp16 on old GCC versions; keep the same permute for bf16 and fp16 MHA; fix fp16 MHA parameters; AMX-BF16 MHA supports reordered K; prepare forward args for int8 inference; int8 MHA draft; draft of bf16 MHA with KV update; disable fp16 MHA by default; fix MHA NaN; fall back to bf16 when unsupported; check MHA support; update the Swish alpha value; fix an fp32 SiLU bug; disable MHA on compilers without bf16 intrinsics. Signed-off-by: Ding, Yi1 <[email protected]> Co-authored-by: luoyu-intel <[email protected]>
* [CPP Graph] Enable FFN fusion (#160)
* Fix the conversion errors for Bloom and OPT (#254). Signed-off-by: intellinjun <[email protected]>
* Add TP and GPT-J model support (#223): add the TP_1D algorithm; add parallel_context for broadcast/reduce; support all data types; support the GPT-J model. Signed-off-by: Clark Chin <[email protected]>
* Fix models without jblas-based KV-cache support (#260). Signed-off-by: Ding, Yi1 <[email protected]>
* Update transformers version (#259)
* [CPP Graph] ChatGLM-2 enabling (#210): ChatGLM-2 q4_j inference passes with correct accuracy; unify convert scripts; specify chatglm2 and remove the ambiguous chatglm; initialize glm1 and fix its kernel issues; adapt to the latest main, ChatGLM-2 inference passes; add parameters for all convert.py scripts and for Bloom; update README and clean code; disable chatglm1. Signed-off-by: Zhenzhong1 <[email protected]>
* [CPP Graph] Fix broken format (#262)
* Add one-click script for cpp graph running (#203)
* Fix 3rdparty (rebase) (#239)
* Q4 per-channel (#271): add s4 per-channel quantization and inner-product code.
* Add weight_only support for the PyTorch framework (#234)
* Fix q40 GPT-J with MHA fusion enabled and remove logits.txt (#285). Signed-off-by: Ding, Yi1 <[email protected]>
* Revert "Add weight_only support for PyTorch framework (#234)". This reverts commit cea3a582fa6ac7afa0d8e679b80b04389aa18abc.
* Disable building oneDNN examples and tests (#288). Signed-off-by: Ding, Yi1 <[email protected]>
* Fix the Bloom and Dolly FFN fusion error (#284)
* Build wheel from cached local DNNL (#303). Signed-off-by: Ding, Yi1 <[email protected]> Signed-off-by: Wenxin Zhang <[email protected]> Co-authored-by: Ding, Yi1 <[email protected]>
* [CPP Graph] ChatGLM enabling and ChatGLM-2 issue fixes (#278)
* [Graph] Windows build (#312): fix a Windows build error; add a Windows header; update the MD; clang-format 14.
* [CPP Graph] Asym model (#306)
* Add weight_only support for the PyTorch framework (#297)
* Update oneDNN to v3.3-pc (#332). Signed-off-by: zhenwei-intel <[email protected]>
* Add ChatGLM-6B to README.md (#344)
* Python API for cpp model (#252)
* New avx512_vnni kernel (#343): update avx512_vnni kernels. Co-authored-by: ZheWang <[email protected]>
* Refine script and args for Cpp Graph (#320)
* Restrict onnxruntime version (#350)
* Add a dnnl_dim_t cast to fix an executor failure on Windows (#347)
* Update LLM Runtime parameters (#362): rename one-click run to run; rename compute_type to compute_dtype; add a use-ggml flag (store_true); fix run, use-ggml and strcasecmp; update format. Signed-off-by: zhenwei-intel <[email protected]>
* Try aspell spell checking (#368)
* [CPP Graph] KV-update optimization (#369)
* Fix a Python API bug (#382)
* Change mainpage (#340)
* [CPP Graph] Enable llama2-70b (#213)
* Add readme for LLM kernels (#386). Co-authored-by: VincyZhang <[email protected]>
* Update README.md for Llama2 70B (#391)
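
The beam-search options added in #173 mirror the Hugging Face generation parameters of the same names. A reference sketch using the real transformers GenerationConfig to show the intended semantics; the C++ graph's own flags are not spelled out in this log:

```python
# Semantics of the #173 beam-search options, shown via the transformers API.
from transformers import GenerationConfig

cfg = GenerationConfig(
    num_beams=4,         # beam search width
    length_penalty=1.1,  # >1.0 favors longer hypotheses, <1.0 shorter ones
    min_new_tokens=16,   # suppress EOS until at least 16 new tokens are emitted
)
```
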
* Refine LLM Runtime readme (#395). Signed-off-by: hshen14 <[email protected]>
* Use the transformers tokenizer and streamer for the Python API (#388) (usage sketch after this group)
* [Cpp Graph] Align Cpp beam search (#322)
* Do not compile the Python API in cpp graph by default (#401): use AutoModelCausalLM; do not compile the Python API of the cpp model. Signed-off-by: zhenwei-intel <[email protected]> Co-authored-by: Dong, Bo <[email protected]>
* [CPP Graph] Falcon MHA support (#422)
* Reinit cpp model and infinite text generation (#413)
* [CPP Graph] ChatGLM2 MHA support (#435)
* Update post-processing with num_beams and do_sample (#430): use the MPT post-process. Signed-off-by: zhenwei-intel <[email protected]>
* Update jblas (#433): pass compilation before the model test; upgrade QBits. Co-authored-by: ZheWang <[email protected]>
* Fix the transformers version (#437)
* [Cpp Graph] Update Falcon HF parameters and support Falcon-180B (#414)
* [CPP Graph] MPT MHA support (#453)
* [CPP Graph] Baichuan & Baichuan2 enabling (#376): enable Baichuan and Baichuan2 in LLM Runtime.
* GitHub Actions workflow speedup (#456)
* Read special token IDs from the tokenizer (#463). Signed-off-by: zhenwei-intel <[email protected]>
* GELU support (#424). Co-authored-by: intellinjun <[email protected]>
* Fix MSVC compile issues (#477)
* [Cpp Graph] Beam search pybind (model archs: gptj and gptneox) (#449)
* Fix post-processing with top-k/top-p in the Python API (#476)
* [CPP Graph] Optimize QBits dequantization (#465)
* [RUNTIME] Enable StreamingLLM for Runtime (#501): support StreamingLLM on CPU. Signed-off-by: zhenwei-intel <[email protected]>
* Support AVX2 (#493): support Memcpy2D; support GELU fusion. Co-authored-by: luoyu-intel <[email protected]>
* Fix typo in README.md (#516): convertion -> conversion. Signed-off-by: Ikko Eltociear Ashimine <[email protected]>
* Improve AVX2 (#511)
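
Several entries above (#252, #388, #430, #501) build out a Python API that pairs the C++ graph model with a Hugging Face tokenizer and streamer. A hedged usage sketch: the transformers classes are real, but the AutoModelForCausalLM import path and its from_pretrained/generate keywords are assumptions inferred from the commit titles, not verified against the repository:

```python
# Assumed Python-API flow: HF tokenizer + TextStreamer driving the cpp model.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM  # assumed path

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer, skip_prompt=True)  # prints tokens as they decode
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# load_in_4bit stands in for the weight-only loading described above;
# #688's load_in_nbit is the generalized variant.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, do_sample=True,
                         top_k=40, top_p=0.95, max_new_tokens=128)
```
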
* Reduce unnecessary tests (#521): update Python API readme (#504) with several README.md updates (Signed-off-by: Haihao Shen <[email protected]>); revert "update python api readme (#504)" (this reverts commit 5f4175ad754fb2e3c1f0f2f49a5a8356c1c3e170); reduce unnecessary tests (Signed-off-by: Wenxin Zhang <[email protected]>). Signed-off-by: Haihao Shen <[email protected]> Signed-off-by: Wenxin Zhang <[email protected]> Co-authored-by: liuzhenwei <[email protected]> Co-authored-by: Haihao Shen <[email protected]>
* [LLM Runtime] Update Python API readme (#525)
* [LLM Runtime] Baichuan FFN & MHA support (#497)
* [CPP Graph] Fused attention doc (#443): add a doc for fused attention.
* [NeuralChat] Add NeuralChat UT for cache and memory (#502). Signed-off-by: Liangyx2 <[email protected]>
* [Documentation] Upload StreamingLLM video (#533). Signed-off-by: zhenwei-intel <[email protected]>
* Support attention-block TP and add gptj/llama models (#361)
* [LLM Runtime] Enable Mistral-7b (#552). Signed-off-by: intellinjun <[email protected]>
* Add ITREX LLM runtime graph int4 notebook (#399)
* [LLM Runtime] Enable interactive mode of the Python API (#548)
* [LLM Runtime] Streaming-LLM based on shift-RoPE (#580)
* [LLM Runtime] Enable MHA fusion for gptneox, dolly, starcoder and llama2-70b (#567)
* [Doc] Change the structure of the LLM Runtime readme (#596): add a warning about graph build; add more info.
* Add a script to merge a PEFT adapter for quantization of LLMs tuned with PEFT (#615). Signed-off-by: Ye, Xinyu <[email protected]>
* Fix Bloom FFN fusion (#620)
* [DOC] Add LLM Runtime developer document (#609). Signed-off-by: intellinjun <[email protected]>
* [Document] Update LLM Runtime readme (#623)
* [LLM Runtime] Fix LLaMA after discarding KV-cache (#625)
* [LLM Runtime] Shift-RoPE-based Streaming-LLM for fused attention (#608): sync jblas 6656837; shift-RoPE with MHA.
* Restrict the transformers version (#627)
* [LLM Runtime] Integrate AVX_VNNI (#565)
* [LLM Runtime] Multi-round chat with ChatGLM2 (#646)
* [LLM Runtime] Unify KV_cache and support batch-dim processing in beam search (#583)
* [LLM Runtime] Allow CompileBF16 on GCC11 (#655)
* Fix a bf16 error in convert_llama.py (#661)
* [Doc] Add readme (#663): add a support matrix.
* Disable bf16 scale for jblas (#662). Signed-off-by: Hengyu Meng <[email protected]>
* [LLM Runtime] Fix gptneox bug (#671). Signed-off-by: intellinjun <[email protected]>
* [LLM Runtime] Refine Python API (#665)
* [LLM Runtime] Add Python API for Mistral (#684). Signed-off-by: intellinjun <[email protected]>
* Fix typo: the graph_developer_document branch no longer exists (#686). Signed-off-by: sangjune.park <[email protected]>
* [LLM Runtime] Support load_in_nbit in LLM Runtime (#688). Signed-off-by: zhenwei-intel <[email protected]>
* [LLM Runtime] Update README (#696)
* Update readme (#708): update the LLM Runtime readme.
* [LLM Runtime] Add script for PPL evaluation (#685)
* [LLM Runtime] Optimize tests of LLM Runtime (#718)
* Separate optimize UT and improve UT infra (#729)
* [LLM Runtime] Enable Qwen graph (#669). Signed-off-by: intellinjun <[email protected]>
* [LLM Runtime] Enable GPTQ models (#611): enable GPTQ for the Bloom model. Signed-off-by: zhenwei-intel <[email protected]>
* [LLM Runtime] Add jblas split-weight interface and support jblas models (#639). Signed-off-by: Clark Chin <[email protected]>
* [LLM Runtime] Beam search support for fused attention (#734)
* Update GPTQ in README (#781). Signed-off-by: Dong, Bo <[email protected]>
* Fix: max output tokens (#788). Signed-off-by: sangjune.park <[email protected]>
* Docs: reinforce the LLM runtime graph developer guide (#786). Signed-off-by: sangjune.park <[email protected]>
* [LLM Runtime] Check weight dtype and compute dtype (#778) (a configuration sketch follows the trailer block below)
* [LLM Runtime] Fix develop doc and convert.py (#794). Signed-off-by: Yu, Zhentao <[email protected]>
* Fix: init_from_bin example (#789)
* [LLM Runtime] Enable new Whisper app (#682)
* [Engine] Apply the STS task to BGE models (#673)
* [LLM Runtime] Fix format (#812)
* [LLM Runtime] Fix added_tokens error (#793). Signed-off-by: intellinjun <[email protected]>
* Update README.md. Signed-off-by: Haihao Shen <[email protected]>
* Update (#823). Signed-off-by: Dong, Bo1 <[email protected]>
* [Doc] Update README for Qwen chat (#808)
* [LLM Runtime] ChatGLM-V1 multi-batch inference and batched greedy-search generation (#700)
* [LLM Runtime] Remove use_cache in WOQ (#818)
* Cast void to char to avoid the unknown size (#856). Signed-off-by: Dong, Bo1 <[email protected]>
* [Infra] Enhance CI scan (#834)
* Fix kernels softmax in int8 MHA (#869). Co-authored-by: kevinintel <[email protected]>
* [LLM Runtime] Baichuan-13B inference bug fix (#891): fix Baichuan-13B FP32 inference.
* [LLM Runtime] Remove the identical branch (#894)
* [LLM Runtime] Make rms_norm_eps and freq_base parameters (#903)
* [LLM Runtime] Refactor the ITREX backend based on the latest jblas (#769). Co-authored-by: luoyu-intel <[email protected]> Co-authored-by: Ding, Yi1 <[email protected]> Co-authored-by: zhenwei-intel <[email protected]> Co-authored-by: yuchengliu1 <[email protected]> Co-authored-by: Meng, Hengyu <[email protected]>
* [Doc] Add Gaudi2 to the doc (#799)
* [LLM Runtime] Add MX formats (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4) (#872): add fp8 to the LLM frontend. Signed-off-by: Yu, Zhentao <[email protected]>
* [LLM Runtime] Fix PPL test (#937)
* [LLM Runtime] Add MatMul data-type combinations table (#945)
* [LLM Runtime] Decouple weight_type and scale_type in QBits (#940)
* [LLM Runtime] Convert Hugging Face GPTQ models to jblas (#927). Co-authored-by: luoyu-intel <[email protected]>
* Reorganize directory and migrate CI (Signed-off-by: Hengyu Meng <[email protected]>): refine CI for Neural Speed; add more CI scripts; minor fixes; remove runner.name when running on ubuntu-latest; update CI to the shared system (Signed-off-by: Wenxin Zhang <[email protected]>); rename jblas to bestla, directory reorg; remove the itrex dependency; fix script paths and remove the Python dependency; remove Python tests, disable percentage, disable monitor; fix naming; fix a threadpool conflict; restore percentage.
* Fix bestla typo and add bestla workflow image. Signed-off-by: Hengyu Meng <[email protected]>
* Fix scripts path
* Fix pylint and cpplint

Signed-off-by: zhenwei-intel <[email protected]>
Signed-off-by: Yu, Zhentao <[email protected]>
Signed-off-by: hshen14 <[email protected]>
Signed-off-by: Dong, Bo1 <[email protected]>
Signed-off-by: intellinjun <[email protected]>
Signed-off-by: Clark Chin <[email protected]>
Signed-off-by: Ding, Yi1 <[email protected]>
Signed-off-by: Zhenzhong1 <[email protected]>
Signed-off-by: Wenxin Zhang <[email protected]>
Signed-off-by: Ikko Eltociear Ashimine <[email protected]>
Signed-off-by: Haihao Shen <[email protected]>
Signed-off-by: Liangyx2 <[email protected]>
Signed-off-by: Ye, Xinyu <[email protected]>
Signed-off-by: Hengyu Meng <[email protected]>
Signed-off-by: sangjune.park <[email protected]>
Signed-off-by: Dong, Bo <[email protected]>
Co-authored-by: Cheng, Penghui <[email protected]>
Co-authored-by: liuzhenwei <[email protected]>
Co-authored-by: zhentaoyu <[email protected]>
Co-authored-by: Dong, Bo <[email protected]>
Co-authored-by: kevinintel <[email protected]>
Co-authored-by: Haihao Shen <[email protected]>
Co-authored-by: lvliang-intel <[email protected]>
Co-authored-by: Wang, Chang <[email protected]>
Co-authored-by: luoyu-intel <[email protected]>
Co-authored-by: Yi DING <[email protected]>
Co-authored-by: zhenwei-intel <[email protected]>
Co-authored-by: intellinjun <[email protected]>
Co-authored-by: Chen Xi <[email protected]>
Co-authored-by: Zhenzhong1 <[email protected]>
Co-authored-by: CeciliaWwq <[email protected]>
Co-authored-by: Wenxin Zhang <[email protected]>
Co-authored-by: ZheWang <[email protected]>
Co-authored-by: yuchengliu1 <[email protected]>
Co-authored-by: Ikko Eltociear Ashimine <[email protected]>
Co-authored-by: Liangyx2 <[email protected]>
Co-authored-by: XinyuYe-Intel <[email protected]>
Co-authored-by: akarX23 <[email protected]>
Co-authored-by: sangjune.park <[email protected]>
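
The MX-format (#872), dtype-check (#778) and combination-table (#945) entries describe choosing a stored weight format, a compute dtype and a scale dtype independently. A hedged sketch of one such combination; WeightOnlyQuantConfig and its field names are assumptions drawn from the weight-only-quantization entries above, not a verified API:

```python
# Assumed weight-only quantization config illustrating one dtype combination.
from intel_extension_for_transformers.transformers import (  # assumed path
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

config = WeightOnlyQuantConfig(
    weight_dtype="nf4",    # stored format: int4 / nf4 / fp4_e2m1 / fp8_e4m3 / fp8_e5m2 ...
    compute_dtype="bf16",  # matmul math: int8 / bf16 / fp32
    scale_dtype="fp32",    # storage type of the quantization scales
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=config)
```

Per #778, not every pairing is valid; the weight dtype and compute dtype are checked against each other, and #945's table in the README enumerates the supported combinations.
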