
Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization and sparsity, powered by Intel Neural Compressor and llama.cpp. Highlights of this project:

Support LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, MagicCoder and StarCoder
Highly optimized low-precision kernels that utilize the AMX, VNNI, AVX512F, AVX_VNNI and AVX2 instruction sets
Up to 40x speedup compared with llama.cpp; see the blog for performance details
NeurIPS 2023: Efficient LLM Inference on CPUs
Support 4-bit and 8-bit quantization
Tensor Parallelism across sockets/nodes: tensor_parallelism.md
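
To give a rough idea of what low-bit quantization means in the highlights above, here is a minimal pure-Python sketch of symmetric group-wise quantization. It is not Neural Speed's kernel implementation: the function names, the group size of 4 and the symmetric rounding scheme are all assumptions chosen for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric quantization of one group: returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def quantize(weights, group_size=4, bits=4):
    """Split the weights into groups and quantize each group with its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate float weights from (scale, codes) pairs."""
    restored = []
    for scale, codes in groups:
        restored.extend(scale * c for c in codes)
    return restored
```

Each group keeps one floating-point scale plus small integer codes, so the reconstruction error per weight is bounded by half of that group's scale.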

Neural Speed is under active development so APIs are subject to change.

Installation

Build Python package (Recommended way)

pip install -r requirements.txt
pip install .

Note: Make sure your GCC version is higher than 10.
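
If you want to check the compiler before building, a small helper like the following may be useful. It is only a sketch: the function names are hypothetical, it assumes gcc is on your PATH, and it reads the note above literally (major version strictly greater than 10).

```python
import re
import subprocess


def parse_gcc_major(version_text):
    """Extract the major version from `gcc -dumpversion` output such as '11.4.0'."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError(f"unrecognized GCC version string: {version_text!r}")
    return int(match.group(1))


def gcc_is_new_enough(minimum_major=10):
    """Return True if the gcc found on PATH has a major version above minimum_major."""
    result = subprocess.run(
        ["gcc", "-dumpversion"], capture_output=True, text=True, check=True
    )
    return parse_gcc_major(result.stdout) > minimum_major
```

Calling gcc_is_new_enough() before pip install . gives an early hint if the toolchain is too old; it raises FileNotFoundError when gcc is not installed.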

Quick Start

There are two ways to use Neural Speed:
1. Transformer-like usage, which requires installing ITREX (Intel Extension for Transformers)
2. llama.cpp-like usage

1. Transformer-like usage

PyTorch format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

GGUF format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Please refer to this link to check the supported models.

If you want to use the Transformer-based API in ITREX (Intel Extension for Transformers), please refer to the ITREX Installation Page.

2. llama.cpp-like usage:

One-click Python scripts

Run an LLM with a one-click Python script that covers conversion, quantization and inference.

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
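
If you prefer to launch the one-click script from your own Python code, a thin wrapper along these lines may help. It is a sketch only: build_run_command and run_model are hypothetical helper names, only the flags shown in the command above are used, and it assumes you run it from the repository root.

```python
import subprocess
import sys


def build_run_command(model_path, prompt, weight_dtype="int4"):
    """Assemble the argument list for scripts/run.py using the flags shown above."""
    return [
        sys.executable, "scripts/run.py", model_path,
        "--weight_dtype", weight_dtype,
        "-p", prompt,
    ]


def run_model(model_path, prompt, weight_dtype="int4"):
    """Run the one-click script and raise CalledProcessError if it fails."""
    subprocess.run(build_run_command(model_path, prompt, weight_dtype), check=True)
```

Passing the arguments as a list avoids shell quoting issues when the prompt contains spaces or quotes.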

Quantize and Inference Step By Step

Neural Speed supports:
1. GGUF models generated by llama.cpp
2. GGUF models from HuggingFace
3. PyTorch models from HuggingFace, quantized by Neural Speed

Neural Speed offers scripts to 1) convert and quantize, and 2) run inference, for converting the model yourself. If the GGUF model is from HuggingFace or generated by llama.cpp, you can run inference on it directly.

1. Convert and Quantize LLM

Convert the model by following the steps below:

# Convert the model directly using its Hugging Face model ID (recommended)
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b

2. Inference

Linux and WSL

OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"

Windows

python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores|P-cores> --color -p "She opened the door and see"
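
The two command lines above differ only in the numactl prefix and the OMP_NUM_THREADS variable. A hypothetical helper (build_inference_command is not part of Neural Speed) could assemble either form. The physical core count is passed in explicitly, since os.cpu_count() reports logical CPUs rather than physical cores.

```python
import os
import sys


def build_inference_command(model_file, prompt, physical_cores, use_numactl=True):
    """Return (argv, env) mirroring the Linux/WSL or Windows command shown above."""
    argv = [
        sys.executable, "scripts/inference.py",
        "--model_name", "llama",
        "-m", model_file,
        "-c", "512", "-b", "1024", "-n", "256",
        "-t", str(physical_cores),
        "--color",
        "-p", prompt,
    ]
    env = dict(os.environ)
    if use_numactl:
        # Linux/WSL: bind memory to NUMA node 0 and threads to cores 0..N-1.
        argv = ["numactl", "-m", "0", "-C", f"0-{physical_cores - 1}"] + argv
        env["OMP_NUM_THREADS"] = str(physical_cores)
    return argv, env
```

The result can be passed to subprocess.run(argv, env=env, check=True); set use_numactl=False on Windows, where numactl is not available.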

For details please refer to Advanced Usage.

Supported Hardware

Hardware	Optimization
Intel Xeon Scalable Processors	✔
Intel Xeon CPU Max Series	✔
Intel Core Processors	✔

Supported Models

LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, MagicCoder and StarCoder. You can find more details, such as validated GGUF models from HuggingFace, in the list.

Neural Speed also supports GGUF models generated by llama.cpp; you need to download the model and use llama.cpp to create the GGUF file. Validated models: llama2-7b-chat-hf, falcon-7b, falcon-40b, mpt-7b, mpt-40b and bloom-7b1.

Advanced Usage

1. Quantization and inference

More parameters in llama.cpp-like usage: Advanced Usage.

2. Tensor Parallelism across nodes/sockets

We support a tensor parallelism strategy for distributed inference/training on multi-node and multi-socket systems. Refer to tensor_parallelism.md to enable this feature.

3. Custom Stopping Criteria

You can customize the stopping criteria according to your own needs by processing the input_ids to determine whether text generation should stop. Here is the document for Custom Stopping Criteria: a simple example with a minimum generation length of 80 tokens.
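
The idea can be sketched without any framework dependency. The class below is a hypothetical illustration, not the example from the linked document: it assumes the generation loop calls the criterion with a Hugging Face-style signature (input_ids, scores) and stops when it returns True, and it takes input_ids as a plain nested list (batch of token-id sequences). With tensors, you would convert first, for example with .tolist().

```python
class MinLengthStopOnTokens:
    """Stop on a stop token, but only after `min_length` new tokens were generated."""

    def __init__(self, min_length, start_length, stop_token_ids):
        self.min_length = min_length        # minimum number of generated tokens
        self.start_length = start_length    # prompt length in tokens
        self.stop_token_ids = set(stop_token_ids)

    def __call__(self, input_ids, scores=None, **kwargs):
        sequence = input_ids[0]             # first sequence in the batch
        generated = len(sequence) - self.start_length
        if generated < self.min_length:
            return False                    # too short: keep generating
        return sequence[-1] in self.stop_token_ids


# Example: a 3-token prompt, stop token id 2, at least 80 generated tokens.
criterion = MinLengthStopOnTokens(min_length=80, start_length=3, stop_token_ids=[2])
```

Tracking start_length separately keeps the prompt tokens from counting toward the minimum generation length.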

4. Verbose Mode

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable.

Available modes:

0: Print all tracing information. Comprehensive output, including evaluation time and operator profiling. (Requires setting NS_PROFILING to ON and recompiling.)
1: Print evaluation time. Time taken for each evaluation.
2: Profile individual operators. Identify performance bottlenecks within the model. (Requires setting NS_PROFILING to ON and recompiling.)
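
As a small sketch, the variable can also be set from Python before launching inference. The helper name set_verbose_mode is hypothetical, and it is an assumption here that the variable must be set before the Neural Speed process (or library) starts reading it.

```python
import os

# Modes as listed above; 0 and 2 additionally need a build with NS_PROFILING=ON.
VERBOSE_MODES = {
    0: "all tracing information",
    1: "evaluation time",
    2: "per-operator profiling",
}


def set_verbose_mode(mode, env=None):
    """Set NEURAL_SPEED_VERBOSE in `env` (defaults to os.environ) and return it."""
    if mode not in VERBOSE_MODES:
        raise ValueError(f"mode must be one of {sorted(VERBOSE_MODES)}, got {mode!r}")
    env = os.environ if env is None else env
    env["NEURAL_SPEED_VERBOSE"] = str(mode)
    return env
```

Passing a copy such as dict(os.environ) lets you hand the result to subprocess.run(..., env=env) without changing the current process.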

Enable New Model

To add your own model, please follow the graph developer document.
