
Refact convert scripts #35

Closed
wants to merge 7 commits

Conversation

@zhenwei-intel (Contributor) commented Jan 8, 2024

Type of Change

Feature

Description

  • Convert LLaMA online, without saving to a local path (see the usage sketch below)
  • Support online quantization (q4_0 / jblas) without writing an intermediate fp32.bin
  • Enable GPTQ and AWQ for all other models
  • Support bf16/fp16 conversion
  • Integrate the GGUF conversion function
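
A minimal sketch of the new online convert-and-quantize path, assuming a `Model` class with `init`/`generate` methods along the lines of what `scripts/python_api_example.py` exercises; the method names and keyword arguments here are assumptions inferred from the run log below, not a confirmed API:

```python
# Hypothetical driver for the online path; Model/init/generate and the
# keyword arguments are assumptions based on the QuantConfig in the log.
import os
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_path = os.path.expanduser("~/models/llama/Llama-2-7b-chat-hf/")
prompt = "Once upon a time"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# Convert and quantize in one step: no intermediate fp32.bin is written.
model.init(model_path,
           weight_dtype="int4", alg="sym", group_size=32,
           scale_dtype="fp32", compute_dtype="int8")
model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

The run below exercises this path end to end: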

```
zhenweil@icx-1 ~/c/neural-speed (lzw/online_llama) [1]> python scripts/python_api_example.py ~/models/llama/Llama-2-7b-chat-hf/
QuantConfig(weight_dtype='int4', alg='sym', group_size=32, scale_dtype='fp32', compute_dtype='int8', use_ggml=False, not_quant=False, use_gptq=False, use_awq=False)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.70it/s]
Loading vocab file /home/zhenweil/models/llama/Llama-2-7b-chat-hf/tokenizer.model
Processing layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:57<00:00,  1.79s/it]
Success! saved as runtime_outs/ne_llama_q.bin
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.950000
model.cpp: loading model from runtime_outs/ne_llama_q.bin
init: n_vocab    = 32000
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 32
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ne ctx size = 5271.06 MB
load: mem required  = 7321.06 MB (+ memory per state)
....................................................................................
model_init_from_file: support_bestla_kv = 0
model_init_from_file: kv self size =  128.00 MB
<s> Once upon a time, a little girl named Lily lived in a small village nestled between two great mountains. everyone in the village loved Lily, and she was known for her kindness and
```
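
For context on the QuantConfig printed above (weight_dtype='int4', alg='sym', group_size=32, scale_dtype='fp32'), here is a minimal NumPy sketch of symmetric 4-bit group-wise quantization. It is illustrative only; the real q4_0/jblas kernels use their own packing and memory layout:

```python
import numpy as np

def quantize_q4_sym(weights: np.ndarray, group_size: int = 32):
    """Symmetric int4 group-wise quantization (illustrative sketch)."""
    w = weights.reshape(-1, group_size)                 # one group per row
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map max |w| onto int4 range
    scale[scale == 0] = 1.0                             # guard all-zero groups
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float32)                  # fp32 scales, per the config

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)            # size divisible by group_size
q, s = quantize_q4_sym(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```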

Expected Behavior & Potential Risk

Models convert and quantize online without intermediate files on disk; the run log above shows the end-to-end flow producing normal generation output.

How has this PR been tested?

Reproduced by running `python scripts/python_api_example.py ~/models/llama/Llama-2-7b-chat-hf/` on an AVX512_VNNI-capable Intel CPU (host icx-1, no AMX); see the log above.

Dependency Change?

None noted in this PR.

@zhenwei-intel marked this pull request as draft on January 9, 2024 at 02:49
@zhenwei-intel changed the title from "convert llama online without saving to local path" to "Refact convert scripts" on Jan 11, 2024
Signed-off-by: zhenwei-intel <[email protected]>