
SeaBench: Benchmarking LLMs for Southeast Asian Languages with Open-ended Questions

🤗 Dataset    🤗 Leaderboard

This repository contains evaluation code for SeaBench, a comprehensive benchmark designed to assess the capabilities of large language models (LLMs) in Southeast Asian (SEA) languages. Specifically, SeaBench evaluates models' multi-turn and instruction-following abilities in Indonesian, Thai, and Vietnamese through carefully crafted evaluation tasks.

Data

All the data is available here. Currently, only public-questions.jsonl is released; the private questions are withheld to avoid data contamination.
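As a quick sanity check, the released questions can be inspected with a few lines of Python (a minimal sketch, assuming public-questions.jsonl has been downloaded into the working directory; the printed field names are whatever the file actually contains):

# Minimal sketch: inspect the released SeaBench questions.
# Assumes public-questions.jsonl is in the current directory.
import json

with open("public-questions.jsonl", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f]

print(f"Loaded {len(questions)} questions")
print("Fields per row:", sorted(questions[0].keys()))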

Please also check the SeaExam dataset here for more evaluation tasks on SEA languages.

Evaluation

Set up the environment

git clone https://github.com/DAMO-NLP-SG/SeaBench.git
cd SeaBench
conda create -n SeaBench python=3.9
conda activate SeaBench
pip install -r requirement.txt

1. Run inference to get the model's predictions

First generate the model's responses; you can directly run python gen_responses.py. It supports both open-source and commercial models.

Example:

python gen_responses.py --model_id SeaLLMs/SeaLLMs-v3-7B-Chat

Please pay attention to:

  • different models express system_prompt, user_turn, and assistant_turn in different ways
  • the model needs to support at least two-turn inference

Either way, the model predictions will be written to ./outputs/{model_name}.jsonl

  • Compared to public-questions.jsonl, each row should gain two extra keys, modelname_1 and modelname_2, containing the model's responses at the 1st and 2nd turns (see the sketch below)
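The snippet below is a rough sketch of that output format, not the actual gen_responses.py logic: it assumes each row keeps its two questions under a turns field (an assumption, not guaranteed by the dataset), and generate_reply is a hypothetical placeholder for your model's own chat call.

# Illustrative sketch of producing ./outputs/{model_name}.jsonl.
# ASSUMPTIONS: questions live in row["turns"] (not guaranteed), and
# generate_reply() is a placeholder for your model's chat interface.
import json
import os

model_name = "SeaLLMs-v3-7B-Chat"  # example name, matching the command above

def generate_reply(history):
    # Placeholder: replace with a real call to your model (open-source or API).
    return "model response"

os.makedirs("outputs", exist_ok=True)
with open("public-questions.jsonl", encoding="utf-8") as fin, \
     open(f"outputs/{model_name}.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        row = json.loads(line)
        # Turn 1: answer the first user question.
        history = [{"role": "user", "content": row["turns"][0]}]
        reply_1 = generate_reply(history)
        # Turn 2: continue the same conversation with the second question.
        history += [{"role": "assistant", "content": reply_1},
                    {"role": "user", "content": row["turns"][1]}]
        reply_2 = generate_reply(history)
        # Attach the two required keys and write the row back out.
        row[f"{model_name}_1"] = reply_1
        row[f"{model_name}_2"] = reply_2
        fout.write(json.dumps(row, ensure_ascii=False) + "\n")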

2. Judge-model evaluation

Specify your OpenAI API key with export OPENAI_API_KEY=xxx, then run python gen_judgements.py --testing_model model_name (see the example below)

  • by default, it will use gpt-4o-2024-08-06 as the evaluator
  • the judgements will be written to the ./model_judgement/ directory
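For example, to judge the responses generated in step 1 (assuming the testing_model value matches the ./outputs/{model_name}.jsonl filename from step 1):

export OPENAI_API_KEY=xxx
python gen_judgements.py --testing_model SeaLLMs-v3-7B-Chat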

3. Extract and summarize the results

Running python gen_results.py will print the results for a given model.
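For example (the --testing_model flag here is an assumption, mirroring gen_judgements.py; check the script for its actual arguments):

python gen_results.py --testing_model SeaLLMs-v3-7B-Chat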

Pipeline

You can also specify testing_model, judge_model, and OPENAI_API_KEY in pipeline.sh and run the whole pipeline with

source pipeline.sh
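For reference, the variables to edit inside pipeline.sh look roughly like this (illustrative values only; the actual file may differ, so edit the real pipeline.sh rather than copying this):

testing_model=SeaLLMs/SeaLLMs-v3-7B-Chat
judge_model=gpt-4o-2024-08-06
OPENAI_API_KEY=xxx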

Leaderboard

You can find our interactive leaderboard 🤗 Here. The leaderboard showcases results from two complementary benchmarks: SeaExam and SeaBench. Each benchmark evaluates different aspects of model capabilities through distinct question types, providing a comprehensive assessment of model performance.

Citation

If you find SeaBench useful for your research, please consider citing our papers:

@article{damonlp2024seallm3,
  author = {Wenxuan Zhang*, Hou Pong Chan*, Yiran Zhao*, Mahani Aljunied*,
            Jianyu Wang*, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu,
            Yew Ken Chia, Xin Li, Lidong Bing},
  title = {SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages},
  year = {2024},
  url = {https://arxiv.org/abs/2407.19672}
}

@inproceedings{damonlpsg2023seallm,
  author = {Xuan-Phi Nguyen*, Wenxuan Zhang*, Xin Li*, Mahani Aljunied*,
            Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang,
            Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang,
            Chaoqun Liu, Hang Zhang, Lidong Bing},
  title = {SeaLLMs - Large Language Models for Southeast Asia},
  year = {2024},
  booktitle = {ACL 2024 System Demonstrations},
  url = {https://arxiv.org/pdf/2312.00738},
}
