This folder contains the evaluation harness for the MINT benchmark, which tests LLMs' ability to solve tasks with multi-turn interactions.
We support evaluation on the Eurus subset, which focuses on math and code reasoning and includes MATH, MMLU, TheoremQA, HumanEval, and MBPP.
Please follow the instructions here to set up your local development environment and LLM.

We use the MINT dataset hosted on Hugging Face.
Following is the basic command to start the evaluation. Currently, the only agent supported with MINT is `CodeActAgent`.
```bash
./evaluation/mint/scripts/run_infer.sh [model_config] [git-version] [subset] [eval_limit]
```
where `model_config` is mandatory, while the others are optional.
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml` (see the sketch after this list).
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenDevin version you would like to evaluate. It can also be a release tag like `0.6.2`.
- `subset`, e.g. `math`, is the subset of the MINT benchmark to evaluate on, defaulting to `math`. It can be one of: `math`, `gsm8k`, `mmlu`, `theoremqa`, `mbpp`, `humaneval`.
- `eval_limit`, e.g. `2`, limits the evaluation to the first `eval_limit` instances, defaulting to all instances.

Note: in order to use `eval_limit`, you must also set `subset`.
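For reference, a minimal `config.toml` sketch for such a config group might look like the following. The group name `eval_gpt4_1106_preview` matches the example below; the exact keys (`model`, `api_key`, `temperature`) are assumptions based on typical OpenDevin LLM config groups, so check your own `config.toml` documentation.

```toml
# Hypothetical LLM config group for evaluation; key names are
# assumptions based on common OpenDevin config.toml layouts.
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"  # model identifier passed to the provider
api_key = "sk-..."            # your provider API key
temperature = 0.0             # deterministic decoding is typical for evals
```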
For example,

```bash
./evaluation/mint/scripts/run_infer.sh eval_gpt4_1106_preview 0.6.2 gsm8k 3
```
```bibtex
@misc{wang2024mint,
  title={MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback},
  author={Xingyao Wang and Zihan Wang and Jiateng Liu and Yangyi Chen and Lifan Yuan and Hao Peng and Heng Ji},
  year={2024},
  eprint={2309.10691},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```