GGUF
=======

Neural Speed also supports GGUF models generated by [llama.cpp](https://github.com/ggerganov/llama.cpp); download the original model first and use llama.cpp to convert it into a GGUF file.
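
For reference, a typical llama.cpp conversion looks roughly like the sketch below (script names and flags vary between llama.cpp versions, so treat the exact command as an assumption and check your checkout's documentation):

```bash
# Minimal sketch: convert a locally downloaded Hugging Face model to GGUF with llama.cpp.
# The convert script name and its flags differ across llama.cpp releases; adjust as needed.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert.py /model_path/Llama-2-7b-chat-hf/ --outfile llama-2-7b-chat.gguf
```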

Validated models: [llama2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [falcon-40b](https://huggingface.co/tiiuae/falcon-40b), [mpt-7b](https://huggingface.co/mosaicml/mpt-7b), [mpt-40b](https://huggingface.co/mosaicml/mpt-40b) and [bloom-7b1](https://huggingface.co/bigscience/bloomz-7b1).

Please find more validated GGUF models from Hugging Face in the [supported models list](./supported_models.md).

## Examples

How to create the GGUF file in Neural Speed:
```bash
# Pass the local model path as the argument,
# i.e. run `git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf` first.
python neural_speed/convert/convert-hf-to-gguf.py /model_path/Llama-2-7b-chat-hf/
```

How to load the GGUF model file in Neural Speed:

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# `args.model_name`, `args.model_path`, and `gguf_path` come from the command-line
# arguments parsed in scripts/python_api_example_for_gguf.py.
prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
model.init_from_bin(args.model_name, gguf_path)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300, do_sample=True)

# Please check this script for more details and input parameters, e.g.:
# python scripts/python_api_example_for_gguf.py --model_name falcon --model_path /home/model/falcon-7b -m /home/model/falcon-7b/ggml-model-f32.gguf
```

Note: These GGUF models can be accelerated by [Neural Speed BestTLA](https://github.com/intel/neural-speed/blob/c0312283f528d4a9ffebc283cd0f15a7a8eabf1a/bestla/README.md#L1).

How to accelerate a GGUF model with BestTLA:
```bash
# Quantize the GGUF model first, then re-run python_api_example_for_gguf.py with the quantized file.
./build/bin/quant_falcon --model_file /home/model/falcon-7b/ggml-model-f32.gguf --out_file ne-falcon-q4_j.bin --weight_dtype int4 --compute_dtype int8

python scripts/python_api_example_for_gguf.py --model_name falcon --model_path /home/model/falcon-7b -m ne-falcon-q4_j.bin
```

How to load the GGUF model file in [intel-extension-for-transformers](https://github.com/intel/intel-extension-for-transformers/pull/1151):
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# Specify the GGUF repo on Hugging Face.
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo.
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
