# Neural Speed

Neural Speed is an innovative library designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization and sparsity, powered by [Intel Neural Compressor](https://github.com/intel/neural-compressor) and [llama.cpp](https://github.com/ggerganov/llama.cpp). Highlights of this project:

- Modular design to support new models
- [Highly optimized low precision kernels](neural_speed/core/README.md) that utilize the AMX, VNNI, AVX512F, AVX_VNNI and AVX2 instruction sets
- Support for CPU (x86 platforms only) and Intel GPU (WIP)
- Support for 4-bit and 8-bit quantization
- Tensor parallelism across sockets/nodes: [tensor_parallelism.md](./docs/tensor_parallelism.md)

> Neural Speed is under active development, so APIs are subject to change.

Up to 40x speedup compared with [llama.cpp](https://github.com/ggerganov/llama.cpp). Some performance data has been published at NeurIPS 2023: [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502). You can refer to [this blog](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176) for more performance information.

## Installation

### Build Python package (recommended)
```shell
pip install -r requirements.txt
pip install .
```

> **Note**: Please make sure your GCC version is higher than GCC 10.
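
You can verify the compiler with a quick check (a generic sanity check, not specific to this project):

```shell
gcc --version   # the reported major version should be above 10
```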

## Quick Start
There are two approaches for utilizing Neural Speed: 1. Transformer-like usage, for which you need to install [ITREX (Intel Extension for Transformers)](https://github.com/intel/intel-extension-for-transformers); 2. llama.cpp-like usage.

### 1. Transformer-like usage

PyTorch format HF model
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

GGUF format HF model
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

You can also use the [Transformer-based API](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#llm-runtime-example-code) in [ITREX (Intel Extension for Transformers)](https://github.com/intel/intel-extension-for-transformers), but you need to install Intel Extension for Transformers first.

### 2. llama.cpp-like usage

#### One-click Python scripts
Run an LLM with a one-click Python script that covers conversion, quantization and inference.

```bash
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```

#### Quantize and Inference Step by Step
Neural Speed supports 1. GGUF models generated by [llama.cpp](https://github.com/ggerganov/llama.cpp), 2. GGUF models from Hugging Face, and 3. PyTorch models from Hugging Face that have been quantized by Neural Speed.
If the GGUF model is from Hugging Face or was generated by llama.cpp, you can run inference on it directly.
For converting a model yourself, Neural Speed offers two scripts: 1) convert and quantize, and 2) inference.

##### 1. Convert and Quantize LLM
Convert the model by following the steps below:

```bash
# convert the model directly using the model id from Hugging Face (recommended)
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
```

Neural Speed assumes [Q4_0, Q5_0 and Q8_0 GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) as its compatible model formats. You can use GGUF models from Hugging Face directly instead of converting them. Validated GGUF models from Hugging Face: [llama2-7b](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF), [Mistral-7B](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF), [CodeLlama-7B](https://huggingface.co/TheBloke/CodeLlama-7B-GGUF), [CodeLlama-13B](https://huggingface.co/TheBloke/CodeLlama-13B-GGUF).
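
For instance, a downloaded GGUF file can be loaded directly through the low-level `Model` API. This is a minimal sketch based on the `init_from_bin` usage that appears elsewhere in this repository; the model type and file path are illustrative:

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# model_type, e.g. "llama"; model_file, e.g. "/PATH/ggml-model-q4_0.gguf"
model.init_from_bin("llama", "llama-2-7b-chat.Q4_0.gguf")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30, do_sample=True)
```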
##### 2. Inference

We provide an LLM inference script to run the quantized model. Please reach [us](mailto:[email protected]) if you want to use the C++ API directly.

Linux and WSL
```bash
# recommended: use numactl to bind cores on Intel CPUs for better performance
# if you use a different number of cores, also change the -t arg value
# when running `StarCoder`, use a code prompt, for example -p "def fibonnaci("
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"

# to generate fixed outputs, set the --seed arg, for example:
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see" --seed 12

# to reduce repeated generated text, set --repeat_penalty (value > 1.0, default = 1.0), for example:
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see" --repeat_penalty 1.2
```

Windows
```bash
# recommended: build and run this project in WSL for better and more stable performance
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores|P-cores> --color -p "She opened the door and see"
```

## Supported Hardware

| Hardware | Supported |
|---|:---:|
| Intel Xeon Scalable Processors | ✔ |
| Intel Xeon CPU Max Series | ✔ |
| Intel Core Processors | ✔ |
| Intel Arc GPU Series | WIP |
| Intel Data Center GPU Max Series | WIP |
| Intel Gaudi2 | Not yet |

## Supported Models
LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, MagicCoder and StarCoder.
You can find more details, such as validated GGUF models from Hugging Face, in the [supported models list](./docs/supported_models.md).

Neural Speed also supports GGUF models generated by [llama.cpp](https://github.com/ggerganov/llama.cpp); you need to download the model and use llama.cpp to convert it. Validated models: [llama2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [falcon-40b](https://huggingface.co/tiiuae/falcon-40b), [mpt-7b](https://huggingface.co/mosaicml/mpt-7b), [mpt-40b](https://huggingface.co/mosaicml/mpt-40b) and [bloom-7b1](https://huggingface.co/bigscience/bloomz-7b1).

## Advanced Usage

### 1. Quantization and inference
More parameters for llama.cpp-like usage are described in [Advanced Usage](./docs/advanced_usage.md).

### 2. Tensor Parallelism across nodes/sockets

We support a tensor parallelism strategy for distributed inference/training on multi-node and multi-socket systems. You can refer to [tensor_parallelism.md](./docs/tensor_parallelism.md) to enable this feature.
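
As a purely illustrative sketch of the idea (the real script names and launch flags are documented in [tensor_parallelism.md](./docs/tensor_parallelism.md), so treat this invocation as hypothetical), a two-socket run might use an MPI-style launcher:

```bash
# hypothetical: one rank per socket; see docs/tensor_parallelism.md for the actual invocation
mpirun -np 2 -bind-to socket python scripts/inference.py --model_name llama -m ne-q4_j.bin -p "She opened the door and see"
```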
### 3. Custom Stopping Criteria

You can customize the stopping criteria according to your own needs by processing the input_ids to determine whether text generation should stop, as in the sketch below.
Here is the documentation for custom stopping criteria: [simple example with a minimum generation length of 80 tokens](./docs/customized_stop.md)
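
The following is a minimal sketch of that idea, reusing `tokenizer`, `inputs`, `streamer` and `model` from the Quick Start examples and assuming `model.generate` accepts a transformers-style `stopping_criteria` list (the linked document has the exact interface); the class name and thresholds are illustrative:

```python
from typing import List

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopAfterMinLength(StoppingCriteria):
    """Stop on a stop token, but only after `min_length` new tokens have been generated."""
    def __init__(self, min_length: int, start_length: int, stop_token_ids: List[int]):
        self.min_length = min_length          # minimum number of newly generated tokens
        self.start_length = start_length      # prompt length, so only new tokens are counted
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # input_ids holds the prompt plus everything generated so far
        if input_ids.shape[-1] - self.start_length > self.min_length:
            return input_ids[0][-1].item() in self.stop_token_ids
        return False

stopping_criteria = StoppingCriteriaList(
    [StopAfterMinLength(80, inputs.shape[1], [tokenizer.eos_token_id])]
)
outputs = model.generate(inputs, streamer=streamer, stopping_criteria=stopping_criteria)
```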
### 4. Verbose Mode

Enable verbose mode and control tracing information using the `NEURAL_SPEED_VERBOSE` environment variable (a usage sketch follows the list below).

Available modes:
- 0: Print all tracing information. Comprehensive output, including evaluation time and operator profiling (requires setting `NS_PROFILING` to ON and recompiling).
- 1: Print evaluation time. Time taken for each evaluation.
- 2: Profile individual operators. Identify performance bottlenecks within the model (requires setting `NS_PROFILING` to ON and recompiling).
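
As a usage sketch (reusing the `scripts/inference.py` invocation shown earlier; the mode value is illustrative):

```bash
# print the evaluation time of each generation step
NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -n 256 -p "She opened the door and see"
```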
## Enable New Model
You can consider adding your own models; please follow the [graph developer document](./developer_document.md).

---

Our Neural Speed supports INT4 / INT8 / FP8 (E4M3, E5M2) / FP4 (E2M1) / NF4 weight quantization.
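
These data types are selected through quantization parameters. The sketch below is based on the `Model.init` call that appears in the earlier revision of the Quick Start within this commit; the argument values are illustrative:

```python
from neural_speed import Model

model = Model()
# weight_dtype selects the low-bit weight format (e.g. "int4", "int8"),
# compute_dtype the computation precision, and group_size the quantization granularity
model.init("Intel/neural-chat-7b-v3-1", weight_dtype="int4", compute_dtype="int8", group_size=32)
```
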
### 2. Inference

We provide an LLM inference script to run the quantized model. Please reach [us](mailto:[email protected]) if you want to use the C++ API directly.
```bash
# recommended: use numactl to bind cores on Intel CPUs for better performance
# if you use a different number of cores, also change the -t arg value
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"
```

---

## Install

### Build Python package
```shell
pip install -r requirements.txt
pip install .
```

### Build executable only

```shell
# Linux and WSL
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja
```

```powershell
# Windows
# Install Visual Studio 2022 and open 'Developer PowerShell for VS 2022'
mkdir build
cd build
cmake ..
cmake --build . -j --config Release
```