GGUF
=======

Neural Speed also supports GGUF models generated by [llama.cpp](https://github.com/ggerganov/llama.cpp): download the original model and use llama.cpp to convert it into a GGUF file.

Validated models: [llama2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [falcon-40b](https://huggingface.co/tiiuae/falcon-40b), [mpt-7b](https://huggingface.co/mosaicml/mpt-7b), [mpt-40b](https://huggingface.co/mosaicml/mpt-40b), and [bloom-7b1](https://huggingface.co/bigscience/bloomz-7b1).

For more validated GGUF models from Hugging Face, see the [supported models list](./docs/supported_models.md).

## Examples

How to create a GGUF file in Neural Speed:

```bash
# Provide the local model path as the argument, which means you need to run
# `git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf` first.
python neural_speed/convert/convert-hf-to-gguf.py /model_path/Llama-2-7b-chat-hf/
```

How to load a GGUF file with the Neural Speed Python API:

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# args.model_path, args.model_name, and gguf_path come from the example script's
# command-line arguments (see the invocation in the comment below).
prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
model.init_from_bin(args.model_name, gguf_path)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, do_sample=True)

# Please check this script for more details and input parameters:
# python scripts/python_api_example_for_gguf.py --model_name falcon --model_path /home/model/falcon-7b -m /home/model/falcon-7b/ggml-model-f32.gguf
```
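
The `args.*` values and `gguf_path` above are supplied by the example script's argument parser. Below is a minimal sketch of that wiring, assuming flag names matching the invocation shown in the comment; the actual parser in `python_api_example_for_gguf.py` may differ.

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Run a GGUF model with Neural Speed")
parser.add_argument("--model_name", type=str, required=True,
                    help="model family, e.g. llama or falcon")
parser.add_argument("--model_path", type=Path, required=True,
                    help="local Hugging Face model directory (used for the tokenizer)")
parser.add_argument("-m", "--model", type=Path, required=True,
                    help="path to the GGUF (or quantized .bin) file")
args = parser.parse_args()

gguf_path = args.model  # passed to model.init_from_bin above
```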

Note: These GGUF models can be accelerated with [Neural Speed BestTLA](https://github.com/intel/neural-speed/blob/c0312283f528d4a9ffebc283cd0f15a7a8eabf1a/bestla/README.md#L1).

How to accelerate a GGUF model with BestTLA:

```bash
# Quantize the GGUF model, then re-run the python_api_example_for_gguf.py step above
# with the quantized file.
./build/bin/quant_falcon --model_file /home/model/falcon-7b/ggml-model-f32.gguf --out_file ne-falcon-q4_j.bin --weight_dtype int4 --compute_dtype int8

python scripts/python_api_example_for_gguf.py --model_name falcon --model_path /home/model/falcon-7b -m ne-falcon-q4_j.bin
```
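
If you prefer to drive the quantized file from the Python API instead of the example script, the loading snippet shown earlier carries over unchanged. A minimal sketch, assuming the falcon paths used above and that `ne-falcon-q4_j.bin` is in the current directory:

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained("/home/model/falcon-7b", trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# Load the BestTLA-quantized file produced by quant_falcon above.
model = Model()
model.init_from_bin("falcon", "ne-falcon-q4_j.bin")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, do_sample=True)
```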

How to load a GGUF file in [intel-extension-for-transformers](https://github.com/intel/intel-extension-for-transformers/pull/1151):

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# Specify the GGUF repo on Hugging Face.
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo.
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
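
If you also want the generated text as a plain string in addition to the streamed output, you can decode the returned ids. A minimal sketch, assuming `generate` returns token ids as in the standard transformers API:

```python
# Decode the generated token ids into text.
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(text)
```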