Intel® Extension for Transformers

An innovative toolkit to accelerate Transformer-based models on Intel platforms

🏭Architecture | 💬NeuralChat | 😃Inference | 💻Examples | 📖Documentation

🚀Latest News

NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers, is now available for you to create your own chatbot within minutes! It supports a rich set of plugins, including Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail, and runs on multiple architectures such as Intel® Xeon® Scalable Processors and Habana Gaudi® Accelerator. Try the sample code below:

# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")

💬NeuralChat v1.1, a fine-tuned chat model based on MPT-7B using a mixed set of instruction datasets, is available on Hugging Face, together with the release of INT8 quantization recipes and benchmark results.

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, please refer to the Installation Page.

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms. It is particularly effective on the 4th Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:

Seamless user experience of model compressions on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper Fast Distilbert on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and NeurIPS 2021's paper Prune Once for All: Sparse Pre-Trained Language Models)
Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, Flan-T5 and end-to-end workflows such as SetFit-based text classification and document level sentiment analysis (DLSA)
NeuralChat, a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins and SOTA optimizations
Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels. Supported models already include GPT-NEOX, LLAMA, MPT, FALCON, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B and Dolly-v2-3B
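
To make the idea of weight-only quantization in the last item more concrete, here is a minimal pure-Python sketch, assuming symmetric group-wise 4-bit rounding. It is not the toolkit's C/C++ kernel: the helper names `quantize_group`, `dequantize_group` and `quantize_weights`, the group size, and the sample weights are all hypothetical and chosen only for illustration. The point it shows is that only the weights are stored as small integers plus one scale per group, and they are expanded back to floats when used.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetrically quantize one group of weights to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integers and their scale."""
    return [v * scale for v in q]


def quantize_weights(
    weights: List[float], group_size: int = 4, bits: int = 4
) -> List[Tuple[List[int], float]]:
    """Split a flat weight list into groups, each with its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Illustrative weights only; real models hold millions of values per layer.
weights = [0.70, -0.35, 0.10, 0.00, 1.40, -1.40, 0.20, 0.60]
packed = quantize_weights(weights, group_size=4, bits=4)
restored = [w for q, scale in packed for w in dequantize_group(q, scale)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(packed)} max_error={max_error:.4f}")
```

Using one scale per small group, rather than one per tensor, keeps the rounding error of each weight bounded by half of its own group's step size, which is why group-wise schemes are a common choice for low-bit weights.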

🌱Getting Started

Sentiment Analysis with Quantization

Prepare Dataset

from datasets import load_dataset
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Load SST-2 and tokenize every sentence to a fixed length of 128 tokens
raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_datasets = raw_datasets.map(
    lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)
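
The tokenizer call above combines `truncation=True`, `padding='max_length'` and `max_length=128` so that every example becomes a sequence of the same length. A rough pure-Python sketch of that effect on a list of token IDs follows. `pad_or_truncate` is a hypothetical helper, the token IDs are illustrative, and the pad ID of 0 is an assumption rather than a value read from the tokenizer. A real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
from typing import Dict, List


def pad_or_truncate(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, then right-pad it up to max_length."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * padding,
    }


short = pad_or_truncate([101, 1045, 2066, 102], max_length=8)
long = pad_or_truncate(list(range(1, 21)), max_length=8)
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

Fixed-length inputs make batching simple, at the cost of some wasted computation on padding for short sentences.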

Quantization

from intel_extension_for_transformers.transformers import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.transformers.trainer import NLPTrainer

# Load the fine-tuned FP32 model that will be quantized
config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", config=config)
model.config.label2id = {0: 0, 1: 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(
    model=model,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer,
)
# Use evaluation loss (lower is better) as the quantization metric
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)

# Run the quantized model on a sample sentence; `output` is the predicted class index
inputs = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**inputs).logits.argmax().item()
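
The final two lines reduce the model's logits to a class index. As a small sketch of how that index maps back to a readable label through the `id2label` dictionary set earlier, the snippet below uses made-up logits rather than real model output; `predict_label` is a hypothetical helper and is not part of the toolkit.

```python
from typing import Dict, List

# Same mapping as model.config.id2label in the example above.
ID2LABEL: Dict[int, str] = {0: "NEGATIVE", 1: "POSITIVE"}


def predict_label(logits: List[float], id2label: Dict[int, str]) -> str:
    """Return the label whose logit is largest (a plain-Python argmax)."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]


# Made-up logits for illustration: the second class scores higher.
label = predict_label([-2.1, 3.4], ID2LABEL)
print(label)
```

With the real model, the equivalent lookup would be `model.config.id2label[output]`.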

For more quick samples, please refer to the Get Started Page. For more validated examples, please refer to the Supported Model Matrix.

🎯Validated Performance

Model	FP32	BF16	INT8
EleutherAI/gpt-j-6B	4163.67 (ms)	1879.61 (ms)	1612.24 (ms)
CompVis/stable-diffusion-v1-4	10.33 (s)	3.02 (s)	N/A

Note*: For the GPT-J-6B software/hardware configuration, please refer to text-generation. For the Stable Diffusion software/hardware configuration, please refer to text-to-image.
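
To read the table as relative gains, the short sketch below divides each FP32 figure by the corresponding lower-precision figure. The numbers are copied from the table above; the `speedups` helper is hypothetical and only does the arithmetic, assuming that the reported values are latencies where lower is better.

```python
from typing import Dict

# Values copied from the table: GPT-J-6B in milliseconds, Stable Diffusion in seconds.
latencies: Dict[str, Dict[str, float]] = {
    "EleutherAI/gpt-j-6B": {"FP32": 4163.67, "BF16": 1879.61, "INT8": 1612.24},
    "CompVis/stable-diffusion-v1-4": {"FP32": 10.33, "BF16": 3.02},  # INT8 is N/A
}


def speedups(row: Dict[str, float]) -> Dict[str, float]:
    """Speedup of each precision relative to FP32 (FP32 time / other time)."""
    baseline = row["FP32"]
    return {
        precision: round(baseline / value, 2)
        for precision, value in row.items()
        if precision != "FP32"
    }


for model_name, row in latencies.items():
    print(model_name, speedups(row))
```

For GPT-J-6B this works out to roughly 2.2x for BF16 and 2.6x for INT8 over FP32, and for Stable Diffusion to roughly 3.4x for BF16.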

📖Documentation

OVERVIEW
Model Compression	NeuralChat	Neural Engine	Kernel Libraries
MODEL COMPRESSION
Quantization	Pruning	Distillation	Orchestration
Neural Architecture Search	Export	Metrics/Objectives	Pipeline
NEURAL ENGINE
Model Compilation	Custom Pattern	Deployment	Profiling
KERNEL LIBRARIES
Sparse GEMM Kernels	Custom INT8 Kernels	Profiling	Benchmark
ALGORITHMS
Length Adaptive		Data Augmentation
TUTORIALS AND RESULTS
Tutorials	Supported Models	Model Performance	Kernel Performance

📃Selected Publications/Events

Blog published on Medium: Faster Stable Diffusion Inference with Intel Extension for Transformers (July 2023)
Blog of Intel Developer News: The Moat Is Trust, Or Maybe Just Responsible AI (July 2023)
Blog of Intel Developer News: Create Your Own Custom Chatbot (July 2023)
Blog of Intel Developer News: Accelerate Llama 2 with Intel AI Hardware and Software Optimizations (July 2023)
arXiv: An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (June 2023)
Blog published on Medium: Simplify Your Custom Chatbot Deployment (June 2023)
Blog published on Medium: Create Your Own Custom Chatbot (April 2023)

View Full Publication List.

Additional Content

💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating on Intel Extension for Transformers!
