🤗 Hugging Face | 📝 Blog | 📑 Technical Report
We introduce EXAONE 3.5, a collection of instruction-tuned bilingual (English and Korean) generative models ranging from 2.4B to 32B parameters, developed and released by LG AI Research. The EXAONE 3.5 language models include: 1) a 2.4B model optimized for deployment on small or resource-constrained devices, 2) a 7.8B model that matches the size of its predecessor but offers improved performance, and 3) a 32B model that delivers the strongest performance of the series. All models support long-context processing of up to 32K tokens. Each model demonstrates state-of-the-art performance in real-world use cases and long-context understanding, while remaining competitive in general domains compared to recently released models of similar sizes.
Our documentation consists of the following sections:
- Performance: Experimental results of EXAONE 3.5 models.
- Quickstart: A basic guide to using EXAONE 3.5 models with Transformers.
- Quantized Models: An explanation of quantized EXAONE 3.5 weights in AWQ and GGUF format.
- Run Locally: A guide to running EXAONE 3.5 models locally with llama.cpp and Ollama frameworks.
- Deployment: A guide to running EXAONE 3.5 models with TensorRT-LLM, vLLM, and SGLang deployment frameworks.
- 2024.12.11: EXAONE 3.5 is now available in the Ollama model library. You can now install the AutoAWQ library via pip without using the git repository.
- 2024.12.10: We updated the EXAONE Modelfile for Ollama. Please use the new one.
- 2024.12.09: We release the EXAONE 3.5 language model series including 2.4B, 7.8B, and 32B instruction-tuned models. Check out the 📑 Technical Report!
Some experimental results are shown below. The full evaluation results can be found in the Technical Report.
Models | MT-Bench | LiveBench | Arena-Hard | AlpacaEval | IFEval | KoMT-Bench[1] | LogicKor |
---|---|---|---|---|---|---|---|
EXAONE 3.5 32B | 8.51 | 43.0 | 78.6 | 60.6 | 81.7 | 8.05 | 9.06 |
Qwen 2.5 32B | 8.49 | 50.6 | 67.0 | 41.0 | 78.7 | 7.75 | 8.89 |
C4AI Command R 32B | 7.38 | 29.7 | 17.0 | 25.9 | 26.1 | 6.72 | 8.24 |
Gemma 2 27B | 8.28 | 40.0 | 57.5 | 52.2 | 59.7 | 7.19 | 8.56 |
Yi 1.5 34B | 7.64 | 26.2 | 23.1 | 34.8 | 55.5 | 4.88 | 6.33 |
EXAONE 3.5 7.8B | 8.29 | 39.8 | 68.7 | 54.2 | 78.9 | 7.96 | 9.08 |
Qwen 2.5 7B | 6.48 | 35.6 | 48.9 | 31.7 | 72.5 | 5.19 | 6.38 |
Llama 3.1 8B | 7.59 | 28.3 | 27.7 | 25.7 | 74.5 | 4.85 | 5.99 |
Gemma 2 9B | 7.64 | 32.1 | 43.6 | 47.3 | 54.7 | 7.10 | 8.05 |
Phi 3 small (7B) | 7.63 | 27.9 | 26.8 | 29.2 | 59.5 | 3.22 | 3.99 |
EXAONE 3.5 2.4B | 7.81 | 33.0 | 48.2 | 37.1 | 73.6 | 7.24 | 8.51 |
Qwen 2.5 3B | 7.21 | 25.7 | 26.4 | 17.4 | 60.8 | 5.68 | 5.21 |
Qwen 2.5 1.5B | 5.72 | 19.2 | 10.6 | 8.4 | 40.7 | 3.87 | 3.60 |
Llama 3.2 3B | 6.94 | 24.0 | 14.2 | 18.7 | 70.1 | 3.16 | 2.86 |
Gemma 2 2B | 7.20 | 20.0 | 19.1 | 29.1 | 50.5 | 4.83 | 5.29 |
- [1] KoMT-Bench is a dataset created by translating MT-Bench into Korean; see README for more details.
- You need to install transformers>=4.43.0 for the EXAONE 3.5 models. We recommend using the latest version.
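A quick way to confirm that the installed version meets this requirement is sketched below. It assumes a plain numeric version string and ignores pre-release suffixes; the helper names version_tuple and transformers_is_recent_enough are ours, not part of any library.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.43.0"  # minimum stated in this README


def version_tuple(text):
    """Turn '4.43.0' into (4, 43, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def transformers_is_recent_enough(minimum=MIN_TRANSFORMERS):
    """Return True if the installed transformers meets the minimum version."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False  # transformers is not installed at all
    return version_tuple(installed) >= version_tuple(minimum)
```

Calling transformers_is_recent_enough() before loading a model gives an early, readable failure instead of an import error deep inside the model code.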
The following example code shows how to use EXAONE 3.5 models.
Tip
In all examples below, you can use a different model size by changing 7.8B to 32B or 2.4B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"

# Load the weights in bfloat16 and place them on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Choose your prompt (the second assignment overrides the first)
prompt = "Explain how wonderful you are"  # English example
prompt = "스스로를 자랑해 봐"  # Korean example

messages = [
    {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Apply the chat template and append the assistant turn marker
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

# Greedy decoding; move the inputs to the device that holds the model
output = model.generate(
    input_ids.to(model.device),
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.decode(output[0]))
Note
The EXAONE 3.5 instruction-tuned language models were trained to utilize the system prompt, so we highly recommend using the system prompts provided in the code snippet above.
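The Quickstart snippet prints the whole sequence, prompt included. If only the model's reply is wanted, one option is to slice off the prompt tokens before decoding. The helper below is a small sketch of that idea; reply_token_ids is a hypothetical name, and it works on any sliceable sequence of token IDs (a Python list or a 1-D tensor).

```python
def reply_token_ids(output_ids, prompt_length):
    """Return the tokens generated after the prompt.

    output_ids: one generated sequence (prompt followed by the reply).
    prompt_length: number of prompt tokens, e.g. input_ids.shape[-1].
    """
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# With the Quickstart variables this would look like (not executed here):
#   reply = reply_token_ids(output[0], input_ids.shape[-1])
#   print(tokenizer.decode(reply, skip_special_tokens=True))
```

Passing skip_special_tokens=True to decode also drops markers such as the end-of-turn token from the printed text.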
We introduce a series of quantized weights of EXAONE 3.5 models.
We provide AWQ-quantized weights of EXAONE 3.5 models, quantized using the AutoAWQ library. Please refer to the AutoAWQ documentation for more details.
You need to install the latest version of the AutoAWQ library (autoawq>=0.2.7.post3) to load the AWQ-quantized version of EXAONE 3.5 models.
pip install autoawq
You can load the AWQ model in the same way as the original models by changing only the model name; the AWQ configuration stored with the model is applied automatically. Please check the Quickstart section above for more details.
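As a rough sketch, loading an AWQ variant could look like the following. Two things here are assumptions rather than facts from this README: that the AWQ repositories are named by appending -AWQ to the original model name (please check the EXAONE 3.5 collection for the exact names), and that float16 is the appropriate dtype for the AWQ weights. The helper names exaone_repo_id and load_exaone are hypothetical.

```python
SIZES = ("2.4B", "7.8B", "32B")  # model sizes listed in this README


def exaone_repo_id(size="7.8B", awq=False):
    """Build a Hugging Face repo ID; the '-AWQ' suffix is an assumed naming pattern."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    repo_id = f"LGAI-EXAONE/EXAONE-3.5-{size}-Instruct"
    return repo_id + "-AWQ" if awq else repo_id


def load_exaone(size="7.8B", awq=False):
    """Load model and tokenizer as in the Quickstart (AWQ also needs autoawq)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = exaone_repo_id(size, awq)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        # Assumption: float16 for AWQ weights, bfloat16 otherwise (as in the Quickstart)
        torch_dtype=torch.float16 if awq else torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(name)
    return model, tokenizer
```

Only the repo ID changes between the two paths, which is the point the paragraph above makes.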
We provide weights in BF16 format and quantized weights in Q8_0, Q6_K, Q5_K_M, Q4_K_M, and IQ4_XS.
The example below is for the 7.8B model in BF16 format. Please refer to the EXAONE 3.5 collection to find quantized models. You may need to install huggingface_hub to download the GGUF weights.
# (optional) install huggingface_hub
pip install huggingface_hub
# Download the GGUF weights
huggingface-cli download LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct-GGUF \
--include "EXAONE-3.5-7.8B-Instruct-BF16*.gguf" \
--local-dir .
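The same download can also be done from Python with huggingface_hub. The sketch below mirrors the CLI call above; gguf_pattern and download_gguf are hypothetical helper names, and the assumption that the other formats (for example Q4_K_M) follow the same file naming as BF16 should be checked against the repository's file list.

```python
def gguf_pattern(size="7.8B", quant="BF16"):
    """Glob pattern matching the GGUF file(s) for one model size and format."""
    return f"EXAONE-3.5-{size}-Instruct-{quant}*.gguf"


def download_gguf(size="7.8B", quant="BF16", local_dir="."):
    """Download the matching GGUF files and return the local directory path."""
    from huggingface_hub import snapshot_download  # imported only when needed

    return snapshot_download(
        repo_id=f"LGAI-EXAONE/EXAONE-3.5-{size}-Instruct-GGUF",
        allow_patterns=[gguf_pattern(size, quant)],  # same role as --include
        local_dir=local_dir,
    )
```

The trailing wildcard in the pattern also picks up split files, which matters for the 32B model in BF16.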
For end users, we introduce two ways to run EXAONE 3.5 models locally.
Note
We highly recommend using a repetition penalty not exceeding 1.0 for better generation quality.
You can run EXAONE models with llama.cpp as follows:
- Install llama.cpp. Please refer to the llama.cpp repository for more details.
- Download the EXAONE 3.5 model in GGUF format.
huggingface-cli download LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct-GGUF \
--include "EXAONE-3.5-7.8B-Instruct-BF16*.gguf" \
--local-dir .
- Run the model with llama.cpp in conversational mode.
llama-cli -cnv -m ./EXAONE-3.5-7.8B-Instruct-BF16.gguf \
-p "You are EXAONE model from LG AI Research, a helpful assistant."
- If you use the EXAONE 3.5 32B model with BF16 precision, you may need to download all split files and merge them before running the model.
# Download all split files
huggingface-cli download LGAI-EXAONE/EXAONE-3.5-32B-Instruct-GGUF \
--include "EXAONE-3.5-32B-Instruct-BF16*.gguf" \
--local-dir .
# Merge all split files
llama-gguf-split --merge \
./EXAONE-3.5-32B-Instruct-BF16-00001-of-00002.gguf \
./EXAONE-3.5-32B-Instruct-BF16.gguf
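If the merge step needs to be scripted, the command can be assembled in Python. This is only a sketch: merge_command and merge_shards are hypothetical helpers, the shard count has to match the files actually downloaded (two in the example above), and llama-gguf-split must be on the PATH.

```python
import subprocess


def merge_command(stem, num_shards, directory="."):
    """Build the llama-gguf-split argument list for merging split GGUF files."""
    # Split files are named <stem>-00001-of-0000N.gguf; only the first is passed
    first_shard = f"{directory}/{stem}-00001-of-{num_shards:05d}.gguf"
    merged = f"{directory}/{stem}.gguf"
    return ["llama-gguf-split", "--merge", first_shard, merged]


def merge_shards(stem, num_shards, directory="."):
    """Run the merge; raises CalledProcessError if llama-gguf-split fails."""
    subprocess.run(merge_command(stem, num_shards, directory), check=True)
```

For the 32B BF16 weights, merge_shards("EXAONE-3.5-32B-Instruct-BF16", 2) would run the same merge as the shell command above.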
EXAONE 3.5 models are available in the Ollama model library. You can use them as follows:
- Install Ollama. Please refer to the Ollama repository for more details.
- Run the EXAONE 3.5 model as follows:
ollama run exaone3.5:7.8b
Note
In the above example, the model exaone3.5:7.8b is quantized in Q4_K_M. For a list of available models, please refer to the EXAONE 3.5 Ollama page.
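Besides the interactive CLI, a running Ollama server can be queried over its local REST API. The sketch below uses only the standard library and assumes Ollama's default address (http://localhost:11434) and the exaone3.5:7.8b tag from the example above; build_chat_payload and chat are hypothetical helper names. It sets repeat_penalty to 1.0 in line with the recommendation earlier in this section.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are EXAONE model from LG AI Research, a helpful assistant."


def build_chat_payload(prompt, model="exaone3.5:7.8b"):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        "stream": False,  # ask for one complete JSON response
        "options": {"repeat_penalty": 1.0},
    }


def chat(prompt, host="http://localhost:11434"):
    """Send one chat request to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

With the server running, print(chat("Explain how wonderful you are")) would print the assistant's reply.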
Alternatively, you can create and run EXAONE 3.5 models from GGUF weights if you want to customize them.
- Install Ollama. Please refer to the Ollama repository for more details.
- Download the EXAONE 3.5 model in GGUF format. Please refer to the GGUF section for more details.
- Write the Modelfile for EXAONE 3.5.
Important
The EXAONE Modelfile has been updated for better generation quality. We strongly recommend using the new one.
# Model path (choose appropriate GGUF weights on your own)
FROM ./EXAONE-3.5-7.8B-Instruct-BF16.gguf
# Parameter values
PARAMETER stop "[|endofturn|]"
PARAMETER repeat_penalty 1.0
# PARAMETER num_ctx 32768 # if you need a long context
# Chat template
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{ if eq .Role "system" }}[|system|]{{ .Content }}[|endofturn|]
{{ continue }}
{{ else if eq .Role "user" }}[|user|]{{ .Content }}
{{ else if eq .Role "assistant" }}[|assistant|]{{ .Content }}[|endofturn|]
{{ end }}
{{- if and (ne .Role "assistant") $last }}[|assistant|]{{ end }}
{{- end -}}"""
# System prompt
SYSTEM """You are EXAONE model from LG AI Research, a helpful assistant."""
# License
LICENSE """EXAONE AI Model License Agreement 1.1 - NC """
- Convert the model to Ollama.
ollama create exaone -f Modelfile
- Run the model with Ollama.
ollama run exaone
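To see what prompt string the TEMPLATE in the Modelfile produces, its logic can be mirrored in a few lines of Python. This is our reading of the Go template, including its whitespace trimming, and not an official reference; render_exaone_prompt is a hypothetical name, and the tokenizer's own chat template remains the authoritative source.

```python
def render_exaone_prompt(messages):
    """Render chat messages the way the Modelfile TEMPLATE appears to."""
    parts = []
    for index, message in enumerate(messages):
        role, content = message["role"], message["content"]
        is_last = index == len(messages) - 1
        if role == "system":
            parts.append(f"[|system|]{content}[|endofturn|]\n")
            continue  # like the template, skip the generation-prompt check
        if role == "user":
            parts.append(f"[|user|]{content}\n")
        elif role == "assistant":
            parts.append(f"[|assistant|]{content}[|endofturn|]\n")
        if role != "assistant" and is_last:
            parts.append("[|assistant|]")  # cue the model to answer
    return "".join(parts)


example = render_exaone_prompt([
    {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
```

Printing example shows the system turn closed by [|endofturn|], the user turn on its own line, and a trailing [|assistant|] marker where generation starts, which is also why the Modelfile sets [|endofturn|] as the stop sequence.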
EXAONE 3.5 models have been integrated into various deployment frameworks.
Note
We highly recommend using a repetition penalty not exceeding 1.0 for better generation quality.
TensorRT-LLM has supported EXAONE language models since EXAONE 3.0. We recommend using TensorRT-LLM for the best performance. You can run EXAONE 3.5 models with TensorRT-LLM by following the instructions in the TensorRT-LLM EXAONE example.
Note
TensorRT-LLM also supports AWQ through its own quantization methods. If you want to use AWQ with TensorRT-LLM, please refer to the AWQ section in the TensorRT-LLM EXAONE example.
You can easily run EXAONE 3.5 models with vLLM.
- Install vLLM (vllm>=0.6.0). Please refer to the vLLM quickstart guide for more details.
pip install vllm
- Run the models with vLLM.
vllm serve LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
- Send a request with the following curl command after the server starts.
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
"messages": [
{"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
{"role": "user", "content": "Explain how wonderful you are"}
],
"max_tokens": 128,
"temperature": 0.7
}'
Note
If you want to serve GGUF quantized models with vLLM, please refer to the vLLM GGUF documentation.
You can also run EXAONE 3.5 models with SGLang.
- Install SGLang. Please refer to the SGLang documentation for more details.
- Run the server with the following command.
python -m sglang.launch_server --model-path LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct \
--port 30000 --host 0.0.0.0
Note
If you use the EXAONE 3.5 2.4B model, you need to install sglang>=0.3.6 and use the --attention-backend triton option.
- Send a request with the following curl command after the server starts.
curl -s http://0.0.0.0:30000/v1/chat/completions \
-d '{
"model": "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
"messages": [
{"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
{"role": "user", "content": "Explain how wonderful you are"}
],
"max_tokens": 128,
"temperature": 0.7
}'
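Both the vLLM and SGLang servers above are called through the same /v1/chat/completions route, so a single client can talk to either. The sketch below builds the request with the standard library only; build_request and chat_completion are hypothetical helper names, and the base URLs are the ones used in the curl examples (port 8000 for vLLM, port 30000 for SGLang).

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are EXAONE model from LG AI Research, a helpful assistant."


def build_request(base_url, user_prompt,
                  model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
                  max_tokens=128, temperature=0.7):
    """Build a POST request for an OpenAI-style chat completions endpoint."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )


def chat_completion(base_url, user_prompt, **kwargs):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(base_url, user_prompt, **kwargs)) as response:
        reply = json.loads(response.read())
    return reply["choices"][0]["message"]["content"]
```

For example, chat_completion("http://localhost:8000", "Explain how wonderful you are") targets the vLLM server, and swapping in "http://localhost:30000" targets SGLang.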
The EXAONE language model has certain limitations and may occasionally generate inappropriate responses. The language model generates responses based on the output probabilities of tokens, which are determined during training on the training data. While we have made every effort to exclude personal, harmful, and biased information from the training data, some problematic content may still be included, potentially leading to undesirable responses. Please note that the text generated by the EXAONE language model does not reflect the views of LG AI Research.
- Inappropriate answers may be generated, which contain personal, harmful or other inappropriate information.
- Biased responses may be generated, which are associated with age, gender, race, and so on.
- The generated responses rely heavily on statistics from the training data, which can result in the generation of semantically or syntactically incorrect sentences.
- Since the model does not reflect the latest information, the responses may be false or contradictory.
LG AI Research strives to reduce potential risks that may arise from EXAONE language models. Users are not allowed to engage in any malicious activities (e.g., keying in illegal information) that may induce the creation of inappropriate outputs violating LG AI’s ethical principles when using EXAONE language models.
The model is licensed under the EXAONE AI Model License Agreement 1.1 - NC.
@article{exaone-3.5,
title={EXAONE 3.5: Series of Large Language Models for Real-world Use Cases},
author={LG AI Research},
journal={arXiv preprint arXiv:2412.04862},
year={2024}
}
LG AI Research Technical Support: [email protected]