- NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers, is now available for you to create your own chatbot within minutes! It supports a rich set of plugins such as Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail, and multiple architectures such as Intel® Xeon® Scalable Processors and Habana Gaudi® Accelerator. Check out the sample code below and give it a try!
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
# Build a chatbot with the default configuration (default model, no extra plugins)
chatbot = build_chatbot()
# Ask a question and get the generated response as a string
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
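To enable one of the plugins mentioned above, a minimal sketch might look like the following. It assumes the PipelineConfig and plugins helpers exported by neural_chat are available in your installed version, and uses a hypothetical local ./docs folder as the knowledge source:
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, plugins
# Assumption: enable the Knowledge Retrieval plugin and point it at a local document folder
plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "./docs"   # hypothetical path to your own documents
# Build a chatbot whose answers are grounded in the retrieved documents
config = PipelineConfig(plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict("Summarize the attached documents.")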
- 💬NeuralChat v1.1, a fine-tuned chat model based on MPT-7B using a mixed set of instruction datasets, is available on Hugging Face, together with the release of INT8 quantization recipes and benchmark results.
pip install intel-extension-for-transformers
For more installation methods, please refer to the Installation Page.
Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on the 4th Gen Intel® Xeon® Scalable processor (codenamed Sapphire Rapids). The toolkit provides the key features and examples below:
- Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
- Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
- Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NeoX, BLOOM-176B, T5, Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
- NeuralChat, a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins and SOTA optimizations
- Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels; it already supports GPT-NeoX, LLaMA, MPT, Falcon, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B and Dolly-v2-3B (see the sketch after this list)
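As a rough illustration of the last item, here is a minimal sketch of running an LLM with weight-only quantization through the toolkit's Transformers-style API. It assumes the AutoModelForCausalLM wrapper with a load_in_4bit option is available in your installed version, and uses the Hugging Face model id "EleutherAI/gpt-j-6b" purely as an example:
from transformers import AutoTokenizer, TextStreamer
# Assumption: this AutoModelForCausalLM (from the toolkit, not vanilla transformers)
# routes generation through the C/C++ weight-only-quantization runtime
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "EleutherAI/gpt-j-6b"        # example model id; any supported LLM listed above should work
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)        # stream generated tokens to stdout as they are produced
# load_in_4bit requests weight-only INT4 quantization at load time (an assumption for this sketch)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=64)
Separately from LLM inference, the sample below shows how to quantize a Hugging Face classification model with the toolkit's extended Trainer API (NLPTrainer).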
from datasets import load_dataset, load_metric
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
# Prepare the SST-2 dataset and tokenize it
raw_datasets = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
raw_datasets = raw_datasets.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
from intel_extension_for_transformers.transformers import QuantizationConfig, metrics, objectives
from intel_extension_for_transformers.transformers.trainer import NLPTrainer
# Load the FP32 model to be quantized
config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", config=config)
model.config.label2id = {0: 0, 1: 1}
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
# Replace transformers.Trainer with NLPTrainer
# trainer = transformers.Trainer(...)
trainer = NLPTrainer(model=model,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation"],
    tokenizer=tokenizer
)
# Quantize the model, tuning against eval_loss (lower is better)
q_config = QuantizationConfig(metrics=[metrics.Metric(name="eval_loss", greater_is_better=False)])
model = trainer.quantize(quant_config=q_config)
# Run the quantized model on a sample sentence; output is the predicted class index
input = tokenizer("I like Intel Extension for Transformers", return_tensors="pt")
output = model(**input).logits.argmax().item()
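To read the prediction as a label, the class index can be mapped back through the same id-to-label mapping configured above (a small illustrative addition, not part of the original sample):
# Map the predicted class index back to a human-readable label
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}   # same mapping set on the model config above
print(id2label[output])                      # prints 'POSITIVE' for the sample sentence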
For quick samples, please refer to the Get Started Page. For more validated examples, please refer to the Supported Model Matrix.
| Model | FP32 | BF16 | INT8 |
|---|---|---|---|
| EleutherAI/gpt-j-6B | 4163.67 (ms) | 1879.61 (ms) | 1612.24 (ms) |
| CompVis/stable-diffusion-v1-4 | 10.33 (s) | 3.02 (s) | N/A |
Note: For the GPT-J-6B software/hardware configuration, please refer to text-generation. For the Stable Diffusion software/hardware configuration, please refer to text-to-image.
| OVERVIEW | | | |
|---|---|---|---|
| Model Compression | NeuralChat | Neural Engine | Kernel Libraries |
| **MODEL COMPRESSION** | | | |
| Quantization | Pruning | Distillation | Orchestration |
| Neural Architecture Search | Export | Metrics/Objectives | Pipeline |
| **NEURAL ENGINE** | | | |
| Model Compilation | Custom Pattern | Deployment | Profiling |
| **KERNEL LIBRARIES** | | | |
| Sparse GEMM Kernels | Custom INT8 Kernels | Profiling | Benchmark |
| **ALGORITHMS** | | | |
| Length Adaptive | Data Augmentation | | |
| **TUTORIALS AND RESULTS** | | | |
| Tutorials | Supported Models | Model Performance | Kernel Performance |
- Blog published on Medium: Faster Stable Diffusion Inference with Intel Extension for Transformers (July 2023)
- Blog of Intel Developer News: The Moat Is Trust, Or Maybe Just Responsible AI (July 2023)
- Blog of Intel Developer News: Create Your Own Custom Chatbot (July 2023)
- Blog of Intel Developer News: Accelerate Llama 2 with Intel AI Hardware and Software Optimizations (July 2023)
- Arxiv: An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (June 2023)
- Blog published on Medium: Simplify Your Custom Chatbot Deployment (June 2023)
- Blog published on Medium: Create Your Own Custom Chatbot (April 2023)
View Full Publication List.
We welcome your ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!