EXAONE 3.5

🤗 Hugging Face | 📝 Blog | 📑 Technical Report

Introduction

We introduce EXAONE 3.5, a collection of instruction-tuned bilingual (English and Korean) generative models ranging from 2.4B to 32B parameters, developed and released by LG AI Research. The EXAONE 3.5 language models include: 1) a 2.4B model optimized for deployment on small or resource-constrained devices, 2) a 7.8B model that matches the size of its predecessor but offers improved performance, and 3) a 32B model that delivers the strongest performance in the series. All models support long-context processing of up to 32K tokens. Each model demonstrates state-of-the-art performance in real-world use cases and long-context understanding, while remaining competitive in general domains compared to recently released models of similar sizes.

Our documentation consists of the following sections:

Performance: Experimental results of EXAONE 3.5 models.
Quickstart: A basic guide to using EXAONE 3.5 models with Transformers.
Quantized Models: An explanation of quantized EXAONE 3.5 weights in AWQ and GGUF formats.
Run Locally: A guide to running EXAONE 3.5 models locally with the llama.cpp and Ollama frameworks.
Deployment: A guide to running EXAONE 3.5 models with TensorRT-LLM, vLLM, and SGLang deployment frameworks.

News

2024.12.11: EXAONE 3.5 is now available in the Ollama model library.
You can now install the AutoAWQ library via pip without using the git repository.
2024.12.10: We updated the EXAONE Modelfile for Ollama. Please use the new one.
2024.12.09: We released the EXAONE 3.5 language model series, including 2.4B, 7.8B, and 32B instruction-tuned models. Check out the 📑 Technical Report!

Performance

Some experimental results are shown below. The full evaluation results can be found in the Technical Report.

Models	MT-Bench	LiveBench	Arena-Hard	AlpacaEval	IFEval	KoMT-Bench[1]	LogicKor
EXAONE 3.5 32B	8.51	43.0	78.6	60.6	81.7	8.05	9.06
Qwen 2.5 32B	8.49	50.6	67.0	41.0	78.7	7.75	8.89
C4AI Command R 32B	7.38	29.7	17.0	25.9	26.1	6.72	8.24
Gemma 2 27B	8.28	40.0	57.5	52.2	59.7	7.19	8.56
Yi 1.5 34B	7.64	26.2	23.1	34.8	55.5	4.88	6.33

EXAONE 3.5 7.8B	8.29	39.8	68.7	54.2	78.9	7.96	9.08
Qwen 2.5 7B	6.48	35.6	48.9	31.7	72.5	5.19	6.38
Llama 3.1 8B	7.59	28.3	27.7	25.7	74.5	4.85	5.99
Gemma 2 9B	7.64	32.1	43.6	47.3	54.7	7.10	8.05
Phi 3 small (7B)	7.63	27.9	26.8	29.2	59.5	3.22	3.99

EXAONE 3.5 2.4B	7.81	33.0	48.2	37.1	73.6	7.24	8.51
Qwen 2.5 3B	7.21	25.7	26.4	17.4	60.8	5.68	5.21
Qwen 2.5 1.5B	5.72	19.2	10.6	8.4	40.7	3.87	3.60
Llama 3.2 3B	6.94	24.0	14.2	18.7	70.1	3.16	2.86
Gemma 2 2B	7.20	20.0	19.1	29.1	50.5	4.83	5.29

[1] KoMT-Bench is a dataset created by translating MT-Bench into Korean; see README for more details.

Quickstart

EXAONE 3.5 models require transformers>=4.43.0. We recommend using the latest version.

Here is example code showing how to use EXAONE 3.5 models.

Tip

In all examples below, you can use a different model size by changing 7.8B to 32B or 2.4B.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Choose your prompt (the second assignment overrides the first; keep only the one you need)
prompt = "Explain how wonderful you are"  # English example
prompt = "스스로를 자랑해 봐"       # Korean example

messages = [
    {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": prompt}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
)

output = model.generate(
    input_ids.to(model.device),  # follow the device selected by device_map="auto"
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.decode(output[0]))

Note

The EXAONE 3.5 instruction-tuned language models were trained to utilize the system prompt, so we highly recommend using the system prompt provided in the code snippet above.

Quantized Models

We introduce a series of quantized weights of EXAONE 3.5 models.

AWQ

We provide AWQ-quantized weights of EXAONE 3.5 models, quantized using the AutoAWQ library. Please refer to the AutoAWQ documentation for more details.

You need to install the latest version of the AutoAWQ library (autoawq>=0.2.7.post3) to load the AWQ-quantized versions of EXAONE 3.5 models.

pip install autoawq

You can load the model in the same way as the original models by changing only the model name. The AWQ configuration of the model is loaded automatically. Please check the Quickstart section above for more details.

GGUF

We provide weights in BF16 format and quantized weights in the Q8_0, Q6_K, Q5_K_M, Q4_K_M, and IQ4_XS formats.

The example below is for the 7.8B model in BF16 format. Please refer to the EXAONE 3.5 collection to find quantized models. You may need to install huggingface_hub to download the GGUF weights.

# (optional) install huggingface_hub
pip install huggingface_hub

# Download the GGUF weights
huggingface-cli download LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct-GGUF \
    --include "EXAONE-3.5-7.8B-Instruct-BF16*.gguf" \
    --local-dir .
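The other sizes and formats listed above can be fetched by changing the repository name and the file pattern. As a rough sketch, the Python helper below assembles the huggingface-cli arguments for a given size and format. The function name gguf_download_command is hypothetical, and the idea that quantized files follow the same EXAONE-3.5-<size>-Instruct-<format>*.gguf naming as the BF16 example is an assumption; please check the EXAONE 3.5 collection for the actual file names.

```python
import shlex

# Sizes and formats as listed in this README.
SIZES = ("2.4B", "7.8B", "32B")
GGUF_FORMATS = ("BF16", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "IQ4_XS")


def gguf_download_command(size="7.8B", fmt="BF16", local_dir="."):
    """Build the huggingface-cli argument list for one GGUF variant.

    Assumption: every format uses the same file naming as the BF16 example.
    """
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if fmt not in GGUF_FORMATS:
        raise ValueError(f"unknown GGUF format: {fmt!r}")
    base = f"EXAONE-3.5-{size}-Instruct"
    return [
        "huggingface-cli", "download", f"LGAI-EXAONE/{base}-GGUF",
        "--include", f"{base}-{fmt}*.gguf",
        "--local-dir", local_dir,
    ]


# Print a shell-ready command line for the 7.8B BF16 weights.
print(shlex.join(gguf_download_command("7.8B", "BF16")))
```

The returned list can be passed directly to subprocess.run, or printed and pasted into a shell as shown on the last line.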

Run Locally

For end users, we introduce two ways to run EXAONE 3.5 models locally.

Note

We highly recommend using a repetition penalty not exceeding 1.0 for better generation quality.
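For intuition about what this setting controls, here is a minimal sketch of a repetition penalty applied to raw logits. It assumes the common formulation in which logits of already-generated tokens are divided by the penalty when positive and multiplied by it when negative; individual runtimes may implement the penalty differently. The function name apply_repetition_penalty is hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.0):
    """Return a copy of `logits` with previously seen tokens penalized.

    Assumed formulation: positive logits are divided by `penalty`,
    negative logits are multiplied by it. A penalty of 1.0 is a no-op.
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        value = adjusted[token_id]
        adjusted[token_id] = value / penalty if value > 0 else value * penalty
    return adjusted


logits = [2.0, -1.0, 0.5]
# With the recommended upper bound of 1.0, the distribution is left untouched.
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=1.0))
# A penalty above 1.0 pushes down tokens 0 and 1, which were already generated.
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=2.0))
```

Under this formulation, a value of 1.0 leaves the model's own token probabilities unchanged, which is the upper bound recommended above.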

llama.cpp

You can run EXAONE models with llama.cpp as follows:

Install llama.cpp. Please refer to the llama.cpp repository for more details.
Download the EXAONE 3.5 model in GGUF format.

huggingface-cli download LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct-GGUF \
    --include "EXAONE-3.5-7.8B-Instruct-BF16*.gguf" \
    --local-dir .

Run the model with llama.cpp in conversational mode.

llama-cli -cnv -m ./EXAONE-3.5-7.8B-Instruct-BF16.gguf \
    -p "You are EXAONE model from LG AI Research, a helpful assistant."

If you use the EXAONE 3.5 32B model with BF16 precision, you may need to download all split files and merge them before running the model.

# Download all split files
huggingface-cli download LGAI-EXAONE/EXAONE-3.5-32B-Instruct-GGUF \
    --include "EXAONE-3.5-32B-Instruct-BF16*.gguf" \
    --local-dir .

# Merge all split files
llama-gguf-split --merge \
    ./EXAONE-3.5-32B-Instruct-BF16-00001-of-00002.gguf \
    ./EXAONE-3.5-32B-Instruct-BF16.gguf

Ollama

EXAONE 3.5 models are available in the Ollama model library. You can use them as follows:

Install Ollama. Please refer to the Ollama repository for more details.
Run the EXAONE 3.5 model:

ollama run exaone3.5:7.8b

Note

In the example above, the model exaone3.5:7.8b is quantized in Q4_K_M. For a list of available models, please refer to the EXAONE 3.5 Ollama page.

Alternatively, you can create and run a customized EXAONE 3.5 model from GGUF weights.

Install Ollama. Please refer to the Ollama repository for more details.
Download the EXAONE 3.5 model in GGUF format. Please refer to the GGUF section for more details.
Write the Modelfile for EXAONE 3.5.

Important

The EXAONE Modelfile has been updated for better generation quality. We strongly recommend using the new one.

# Model path (choose appropriate GGUF weights on your own)
FROM ./EXAONE-3.5-7.8B-Instruct-BF16.gguf

# Parameter values
PARAMETER stop "[|endofturn|]"
PARAMETER repeat_penalty 1.0
# PARAMETER num_ctx 32768  # if you need a long context

# Chat template
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{ if eq .Role "system" }}[|system|]{{ .Content }}[|endofturn|]
{{ continue }}
{{ else if eq .Role "user" }}[|user|]{{ .Content }}
{{ else if eq .Role "assistant" }}[|assistant|]{{ .Content }}[|endofturn|]
{{ end }}
{{- if and (ne .Role "assistant") $last }}[|assistant|]{{ end }}
{{- end -}}"""

# System prompt
SYSTEM """You are EXAONE model from LG AI Research, a helpful assistant."""

# License
LICENSE """EXAONE AI Model License Agreement 1.1 - NC """

Convert the model to Ollama.

ollama create exaone -f Modelfile

Run the model with Ollama.

ollama run exaone
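To make the chat template in the Modelfile easier to follow, the sketch below re-implements it in plain Python. It reflects our reading of the template's whitespace trimming and is for illustration only; the function name render_exaone_prompt is hypothetical, and the template shipped with the model remains the authoritative definition.

```python
def render_exaone_prompt(messages):
    """Render chat messages into the prompt layout used by the Modelfile template.

    Each system or assistant turn ends with "[|endofturn|]" and a newline,
    each user turn ends with a newline, and "[|assistant|]" is appended when
    the final message is a user message so the model continues from there.
    """
    parts = []
    for index, message in enumerate(messages):
        is_last = index == len(messages) - 1
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append(f"[|system|]{content}[|endofturn|]\n")
            continue  # mirrors the template's {{ continue }}
        if role == "user":
            parts.append(f"[|user|]{content}\n")
        elif role == "assistant":
            parts.append(f"[|assistant|]{content}[|endofturn|]\n")
        if role != "assistant" and is_last:
            parts.append("[|assistant|]")  # generation prompt
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
    {"role": "user", "content": "Explain how wonderful you are"},
]
print(render_exaone_prompt(messages))
```

The stop parameter in the Modelfile matches the [|endofturn|] marker that closes each assistant turn here, which is why generation halts at the end of a reply.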

Deployment

EXAONE 3.5 models have been integrated into various deployment frameworks.

Note

We highly recommend using a repetition penalty not exceeding 1.0 for better generation quality.

TensorRT-LLM

TensorRT-LLM has supported EXAONE language models since EXAONE 3.0. We recommend using TensorRT-LLM for the best performance. You can run EXAONE 3.5 models with TensorRT-LLM by following the instructions in the TensorRT-LLM EXAONE Example.

Note

TensorRT-LLM also supports AWQ through its own method. If you want to use AWQ with TensorRT-LLM, please refer to the AWQ section in the TensorRT-LLM EXAONE Example.

vLLM

You can easily run EXAONE 3.5 models with vLLM.

Install vLLM (vllm>=0.6.0). Please refer to the vLLM quickstart guide for more details.

pip install vllm

Run the models with vLLM.

vllm serve LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct

Send a request with the following curl command after the server starts.

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
            {"role": "user", "content": "Explain how wonderful you are"}
        ],
        "max_tokens": 128,
        "temperature": 0.7
    }'

Note

If you want to serve GGUF quantized models with vLLM, please refer to the vLLM GGUF documentation.
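The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl command above; it assumes the server from the previous step is listening on localhost:8000 and returns an OpenAI-compatible response body. The helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are EXAONE model from LG AI Research, a helpful assistant."


def build_chat_request(user_prompt,
                       base_url="http://localhost:8000",
                       model="LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
                       max_tokens=128,
                       temperature=0.7):
    """Build the POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response with choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(user_prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the server is running, calling chat("Explain how wonderful you are") returns the generated reply as a string. Passing a different base_url lets the same helper target another OpenAI-compatible server.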

SGLang

You can also run EXAONE 3.5 models with SGLang.

Install SGLang. Please refer to the SGLang documentation for more details.
Run the server with the following command.

python -m sglang.launch_server --model-path LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct \
    --port 30000 --host 0.0.0.0

Note

If you use the EXAONE 3.5 2.4B model, you need to install sglang>=0.3.6 and pass the --attention-backend triton option.

Send a request with the following curl command after the server starts.

curl -s http://0.0.0.0:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are EXAONE model from LG AI Research, a helpful assistant."},
            {"role": "user", "content": "Explain how wonderful you are"}
        ],
        "max_tokens": 128,
        "temperature": 0.7
    }'

Limitation

The EXAONE language model has certain limitations and may occasionally generate inappropriate responses. The language model generates responses based on the output probabilities of tokens, which are determined during training on the training data. While we have made every effort to exclude personal, harmful, and biased information from the training data, some problematic content may still be included, potentially leading to undesirable responses. Please note that the text generated by the EXAONE language model does not reflect the views of LG AI Research.

Inappropriate answers may be generated, which contain personal, harmful or other inappropriate information.
Biased responses may be generated, which are associated with age, gender, race, and so on.
The generated responses rely heavily on statistics from the training data, which can result in the generation of semantically or syntactically incorrect sentences.
Since the model does not reflect the latest information, the responses may be false or contradictory.

LG AI Research strives to reduce potential risks that may arise from EXAONE language models. Users are not allowed to engage in any malicious activities (e.g., keying in illegal information) that may induce the creation of inappropriate outputs violating LG AI’s ethical principles when using EXAONE language models.

License

The model is licensed under the EXAONE AI Model License Agreement 1.1 - NC.

Citation

@article{exaone-3.5,
  title={EXAONE 3.5: Series of Large Language Models for Real-world Use Cases},
  author={LG AI Research},
  journal={arXiv preprint arXiv:2412.04862},
  year={2024}
}

Contact

LG AI Research Technical Support: contact_us@lgresearch.ai
