# Neural Speed

Neural Speed is an innovative library designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization and sparsity, powered by [Intel Neural Compressor](https://github.com/intel/neural-compressor) and [llama.cpp](https://github.com/ggerganov/llama.cpp). Highlights of this project:

- Modular design to support new models
- [Highly optimized low precision kernels](neural_speed/core/README.md) that utilize the AMX, VNNI, AVX512F, AVX_VNNI and AVX2 instruction sets
- Support for CPU (x86 platforms only) and Intel GPU (WIP)
- Support for 4-bit and 8-bit quantization
- Tensor parallelism across sockets/nodes: [tensor_parallelism.md](./docs/tensor_parallelism.md)

> Neural Speed is under active development, so APIs are subject to change.

Up to 40x speedup compared with [llama.cpp](https://github.com/ggerganov/llama.cpp). Some performance data has been published at NeurIPS 2023: [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502). You can refer to [this blog](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176) for more performance information.

## Installation

### Build Python package (recommended)
```shell
pip install -r requirements.txt
pip install .
```

> **Note**: Please make sure your GCC version is higher than GCC 10.
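
You can verify the compiler with a quick check (a generic sanity check, not specific to this project):

```shell
gcc --version   # the reported major version should be above 10
```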

## Quick Start
There are two approaches for utilizing Neural Speed: 1. Transformer-like usage, for which you need to install [ITREX (Intel Extension for Transformers)](https://github.com/intel/intel-extension-for-transformers); 2. llama.cpp-like usage.

### 1. Transformer-like usage

PyTorch format HF model
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

GGUF format HF model
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

You can also use the [Transformer-based API](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#llm-runtime-example-code) in [ITREX (Intel Extension for Transformers)](https://github.com/intel/intel-extension-for-transformers), but you need to install Intel Extension for Transformers first.

### 2. llama.cpp-like usage

#### One-click Python scripts
Run an LLM with a one-click Python script that covers conversion, quantization and inference.

```bash
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```

#### Quantize and Inference Step by Step
Neural Speed supports 1. GGUF models generated by [llama.cpp](https://github.com/ggerganov/llama.cpp), 2. GGUF models from Hugging Face, and 3. PyTorch models from Hugging Face that have been quantized by Neural Speed.
If the GGUF model is from Hugging Face or was generated by llama.cpp, you can run inference on it directly.
For converting a model yourself, Neural Speed offers two scripts: 1) convert and quantize, and 2) inference.

##### 1. Convert and Quantize LLM
Convert the model by following the steps below:

```bash
# convert the model directly using the model id from Hugging Face (recommended)
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
```

Neural Speed assumes [Q4_0, Q5_0 and Q8_0 GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) as its compatible model formats. You can use GGUF models from Hugging Face directly instead of converting them. Validated GGUF models from Hugging Face: [llama2-7b](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF), [Mistral-7B](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF), [CodeLlama-7B](https://huggingface.co/TheBloke/CodeLlama-7B-GGUF), [CodeLlama-13B](https://huggingface.co/TheBloke/CodeLlama-13B-GGUF).
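
For instance, a downloaded GGUF file can be loaded directly through the low-level `Model` API. This is a minimal sketch based on the `init_from_bin` usage that appears elsewhere in this repository; the model type and file path are illustrative:

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# model_type, e.g. "llama"; model_file, e.g. "/PATH/ggml-model-q4_0.gguf"
model.init_from_bin("llama", "llama-2-7b-chat.Q4_0.gguf")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30, do_sample=True)
```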
##### 2. Inference

We provide an LLM inference script to run the quantized model. Please reach [us](mailto:[email protected]) if you want to use the C++ API directly.

Linux and WSL
```bash
# recommended: use numactl to bind cores on Intel CPUs for better performance
# if you use a different number of cores, also change the -t arg value
# when running `StarCoder`, use a code prompt, for example -p "def fibonnaci("
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"

# to generate fixed outputs, set the --seed arg, for example:
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see" --seed 12

# to reduce repeated generated text, set --repeat_penalty (value > 1.0, default = 1.0), for example:
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see" --repeat_penalty 1.2
```

Windows
```bash
# recommended: build and run this project in WSL for better and more stable performance
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores|P-cores> --color -p "She opened the door and see"
```

## Supported Hardware

| Hardware | Supported |
|---|:---:|
| Intel Xeon Scalable Processors | ✔ |
| Intel Xeon CPU Max Series | ✔ |
| Intel Core Processors | ✔ |
| Intel Arc GPU Series | WIP |
| Intel Data Center GPU Max Series | WIP |
| Intel Gaudi2 | Not yet |

## Supported Models
LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, MagicCoder and StarCoder.
You can find more details, such as validated GGUF models from Hugging Face, in the [supported models list](./docs/supported_models.md).

Neural Speed also supports GGUF models generated by [llama.cpp](https://github.com/ggerganov/llama.cpp); you need to download the model and use llama.cpp to convert it. Validated models: [llama2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [falcon-40b](https://huggingface.co/tiiuae/falcon-40b), [mpt-7b](https://huggingface.co/mosaicml/mpt-7b), [mpt-40b](https://huggingface.co/mosaicml/mpt-40b) and [bloom-7b1](https://huggingface.co/bigscience/bloomz-7b1).

## Advanced Usage

### 1. Quantization and inference
More parameters for llama.cpp-like usage are described in [Advanced Usage](./docs/advanced_usage.md).

### 2. Tensor Parallelism across nodes/sockets

We support a tensor parallelism strategy for distributed inference/training on multi-node and multi-socket systems. You can refer to [tensor_parallelism.md](./docs/tensor_parallelism.md) to enable this feature.
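
As a purely illustrative sketch of the idea (the real script names and launch flags are documented in [tensor_parallelism.md](./docs/tensor_parallelism.md), so treat this invocation as hypothetical), a two-socket run might use an MPI-style launcher:

```bash
# hypothetical: one rank per socket; see docs/tensor_parallelism.md for the actual invocation
mpirun -np 2 -bind-to socket python scripts/inference.py --model_name llama -m ne-q4_j.bin -p "She opened the door and see"
```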
### 3. Custom Stopping Criteria

You can customize the stopping criteria according to your own needs by processing the input_ids to determine whether text generation should stop, as in the sketch below.
Here is the documentation for custom stopping criteria: [simple example with a minimum generation length of 80 tokens](./docs/customized_stop.md)
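
The following is a minimal sketch of that idea, reusing `tokenizer`, `inputs`, `streamer` and `model` from the Quick Start examples and assuming `model.generate` accepts a transformers-style `stopping_criteria` list (the linked document has the exact interface); the class name and thresholds are illustrative:

```python
from typing import List

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopAfterMinLength(StoppingCriteria):
    """Stop on a stop token, but only after `min_length` new tokens have been generated."""
    def __init__(self, min_length: int, start_length: int, stop_token_ids: List[int]):
        self.min_length = min_length          # minimum number of newly generated tokens
        self.start_length = start_length      # prompt length, so only new tokens are counted
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # input_ids holds the prompt plus everything generated so far
        if input_ids.shape[-1] - self.start_length > self.min_length:
            return input_ids[0][-1].item() in self.stop_token_ids
        return False

stopping_criteria = StoppingCriteriaList(
    [StopAfterMinLength(80, inputs.shape[1], [tokenizer.eos_token_id])]
)
outputs = model.generate(inputs, streamer=streamer, stopping_criteria=stopping_criteria)
```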
### 4. Verbose Mode

Enable verbose mode and control tracing information using the `NEURAL_SPEED_VERBOSE` environment variable (a usage sketch follows the list below).

Available modes:
- 0: Print all tracing information. Comprehensive output, including evaluation time and operator profiling (requires setting `NS_PROFILING` to ON and recompiling).
- 1: Print evaluation time. Time taken for each evaluation.
- 2: Profile individual operators. Identify performance bottlenecks within the model (requires setting `NS_PROFILING` to ON and recompiling).
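
As a usage sketch (reusing the `scripts/inference.py` invocation shown earlier; the mode value is illustrative):

```bash
# print the evaluation time of each generation step
NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -n 256 -p "She opened the door and see"
```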
## Enable New Model
You can consider adding your own models; please follow the [graph developer document](./developer_document.md).

---

Our Neural Speed supports INT4 / INT8 / FP8 (E4M3, E5M2) / FP4 (E2M1) / NF4 weight quantization.
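
These data types are selected through quantization parameters. The sketch below is based on the `Model.init` call that appears in the earlier revision of the Quick Start within this commit; the argument values are illustrative:

```python
from neural_speed import Model

model = Model()
# weight_dtype selects the low-bit weight format (e.g. "int4", "int8"),
# compute_dtype the computation precision, and group_size the quantization granularity
model.init("Intel/neural-chat-7b-v3-1", weight_dtype="int4", compute_dtype="int8", group_size=32)
```
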
### 2. Inference

We provide an LLM inference script to run the quantized model. Please reach [us](mailto:[email protected]) if you want to use the C++ API directly.
```bash
# recommended: use numactl to bind cores on Intel CPUs for better performance
# if you use a different number of cores, also change the -t arg value
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"
```

---

## Install

### Build Python package
```shell
pip install -r requirements.txt
pip install .
```

### Build executable only

```shell
# Linux and WSL
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja
```

```powershell
# Windows
# Install Visual Studio 2022 and open 'Developer PowerShell for VS 2022'
mkdir build
cd build
cmake ..
cmake --build . -j --config Release
```