support multi modal evaluation #1540

Merged · 15 commits · Aug 5, 2024
1 change: 1 addition & 0 deletions README.md
@@ -55,6 +55,7 @@ You can contact us and communicate with us by adding our group:
<img src="asset/discord_qr.jpg" width="200" height="200"> | <img src="asset/wechat.png" width="200" height="200">

## 🎉 News
- 🔥2024.07.30: Support evaluation for multi-modal models! It uses the same command as text-only evaluation, with [new datasets](https://swift.readthedocs.io/en/latest/LLM/LLM-eval.html#introduction); see the sketch after this list.
- 🔥2024.07.29: Support the use of lmdeploy for inference acceleration of LLM and VLM models. Documentation can be found [here](docs/source_en/Multi-Modal/LmDeploy-inference-acceleration.md).
- 🔥2024.07.24: Support DPO/ORPO/SimPO/CPO alignment algorithms for vision MLLMs; training scripts can be found in the [document](docs/source_en/Multi-Modal/human-preference-alignment-training-documentation.md). Support for the RLAIF-V dataset is included.
- 🔥2024.07.24: Support using Megatron for CPT and SFT on the Qwen2 series. You can refer to the [Megatron training documentation](docs/source_en/LLM/Megatron-training.md).
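
As a quick illustration of the multi-modal entry above (a minimal sketch with assumed names: `eval_main` and `EvalArguments` as the Python equivalents of the `swift eval` command, and placeholder model/dataset values):

```python
# Sketch (assumed API): evaluate a vision-language model on multi-modal
# benchmarks with the same entry point used for text-only evaluation.
from swift.llm import EvalArguments, eval_main

eval_main(EvalArguments(
    model_type='qwen-vl-chat',     # placeholder: any supported VLM
    eval_dataset=['MME', 'POPE'],  # multi-modal datasets are routed to VLMEvalKit
))
```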
1 change: 1 addition & 0 deletions README_CN.md
@@ -56,6 +56,7 @@ SWIFT has rich and comprehensive documentation; please visit our documentation site:


## 🎉 News
- 🔥2024.07.30: Support evaluation for multi-modal datasets! The command line stays exactly the same, and many new [multi-modal datasets](https://swift.readthedocs.io/zh-cn/latest/LLM/LLM%E8%AF%84%E6%B5%8B%E6%96%87%E6%A1%A3.html#id2) have been added.
- 🔥2024.07.29: Support using lmdeploy for inference acceleration of LLM and VLM models. Documentation can be found [here](docs/source/Multi-Modal/LmDeploy推理加速文档.md).
- 🔥2024.07.24: Human preference alignment algorithms now support vision MLLMs, including DPO/ORPO/SimPO/CPO; see the training [documentation](docs/source/Multi-Modal/人类偏好对齐训练文档.md). The RLAIF-V dataset is supported.
- 🔥2024.07.24: Support CPT and SFT on the Qwen2 series using Megatron. See the [Megatron training documentation](docs/source/LLM/Megatron训练文档.md).
30 changes: 28 additions & 2 deletions docs/source/LLM/LLM评测文档.md
@@ -13,12 +13,38 @@ SWIFT supports the eval (evaluation) capability for standardized evaluation of original and trained models.

SWIFT's eval capability uses the ModelScope community's [EvalScope evaluation framework](https://github.com/modelscope/eval-scope) together with [Open-Compass](https://hub.opencompass.org.cn/home), with a high-level wrapper to support the evaluation needs of all kinds of models. We currently support the evaluation workflow for **standard evaluation sets** as well as **user-defined** evaluation sets. The **standard evaluation sets** include:

Text-only evaluation:
```text
'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada',
'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze',
'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval',
'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench',
'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```

Detailed descriptions of these datasets are available at: https://hub.opencompass.org.cn/home

Multi-modal evaluation:
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN',
'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11',
'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2',
'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL',
'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar',
'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL',
'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI',
'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST',
'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500',
'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL',
'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME'
```
Detailed descriptions of these datasets are available at: https://github.com/open-compass/VLMEvalKit


> The dataset files are downloaded automatically the first time an evaluation runs: https://www.modelscope.cn/datasets/swift/evalscope_resource/files
> If the download fails, you can download them manually and place them in a local path; see the log output of eval for details.
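
For example, a single run can mix text and multi-modal evaluation sets (a minimal sketch; `eval_main`/`EvalArguments` from `swift.llm` are assumed here, and the model and dataset choices are illustrative):

```python
# Sketch (assumed API): one invocation covers both kinds of benchmarks; SWIFT
# routes each dataset to the matching backend (OpenCompass for text,
# VLMEvalKit for multi-modal).
from swift.llm import EvalArguments, eval_main

eval_main(EvalArguments(
    model_type='qwen-vl-chat',
    eval_dataset=['gsm8k', 'MME'],  # 'gsm8k' -> OpenCompass, 'MME' -> VLMEvalKit
    eval_limit='20',                # 20 samples per dataset, as a quick smoke test
))
```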

5 changes: 1 addition & 4 deletions docs/source/LLM/命令行参数.md
@@ -341,10 +341,7 @@ The export parameters inherit from the infer parameters; in addition, the following parameters are added:

The eval parameters inherit from the infer parameters, with the following additions. (Note: the generation_config parameters from infer will not take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope).)

- `--eval_dataset`: The official evaluation datasets. Defaults to empty, meaning full evaluation. Note that this parameter does not take effect when custom_eval_config is specified.
```text
Currently supported datasets include: 'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```
- `--eval_dataset`: The official evaluation datasets. Defaults to empty, meaning full evaluation. Note that this parameter does not take effect when custom_eval_config is specified. [See all supported evaluation sets](./LLM评测文档.md#能力介绍).
- `--eval_few_shot`: The number of few-shot examples for each sub-dataset of every evaluation set. Defaults to `None`, which uses the dataset's default configuration. **This parameter is deprecated for now.**
- `--eval_limit`: The number of samples drawn from each sub-dataset of every evaluation set. Defaults to `None`, meaning full evaluation. You can pass an integer (the number of samples evaluated per dataset) or a string such as `[10:20]` (a slice); see the sketch after this list.
- `--name`: Used to distinguish the result storage paths of evaluations that share the same configuration, as in `{eval_output_dir}/{name}`. Defaults to `eval_outputs/defaults`; inside it, folders named by timestamp hold the result of each evaluation run.
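
A short sketch of how these parameters combine (illustrative values; the `EvalArguments` dataclass is assumed to back the CLI flags above):

```python
# Sketch (assumed mapping): the fields below mirror the CLI flags documented above.
from swift.llm import EvalArguments

args = EvalArguments(
    model_type='qwen2-7b-instruct',  # placeholder model
    eval_dataset=['gsm8k', 'mmlu'],
    eval_limit='[10:20]',            # slice syntax: evaluate samples 10..20 of each dataset
    name='my-exp',                   # results land in {eval_output_dir}/my-exp
)
```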
5 changes: 1 addition & 4 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -341,10 +341,7 @@ export parameters inherit from infer parameters, with the following added parameters:

The eval parameters inherit from the infer parameters, and additionally include the following parameters. (Note: the generation_config parameters in infer will not take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope).)

- `--eval_dataset`: The official evaluation dataset. Default is `None`, which means all datasets. If `custom_eval_config` is specified, this argument is ignored.
```text
Currently supported datasets include: 'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```
- `--eval_dataset`: The official evaluation dataset. Default is `None`, which means all datasets. If `custom_eval_config` is specified, this argument is ignored. [Check all supported eval datasets](./LLM-eval.md#introduction).
- `--eval_few_shot`: The number of few-shot examples for each sub-dataset of every evaluation set, with a default value of `None`, meaning the dataset's default configuration is used. **This parameter is currently deprecated.**
- `--eval_limit`: The sampling quantity for each sub-dataset of the evaluation set, with a default value of `None` indicating full-scale evaluation. You can pass an integer (the number of samples from each eval dataset) or a string such as `[10:20]` (a slice); see the sketch after this list.
- `--name`: Used to differentiate the result storage path when evaluating the same configuration, e.g. `{eval_output_dir}/{name}`; the default is `eval_outputs/defaults`, inside which a timestamp-named folder holds each eval result.
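
For example (a hedged sketch; `EvalArguments`/`eval_main` are assumed to mirror the CLI flags above):

```python
# Sketch (assumed mapping): integer sampling plus result caching.
from swift.llm import EvalArguments, eval_main

eval_main(EvalArguments(
    model_type='qwen2-7b-instruct',  # placeholder model
    eval_dataset=['ARC_c', 'ceval'],
    eval_limit='100',       # 100 samples per dataset; '[10:20]' would take a slice
    eval_use_cache=True,    # reuse the latest cached OpenCompass results
))
```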
30 changes: 28 additions & 2 deletions docs/source_en/LLM/LLM-eval.md
@@ -13,12 +13,38 @@ SWIFT supports the eval (evaluation) capability to provide standardized evaluations for original and fine-tuned models.

SWIFT's eval capability utilizes the [EvalScope evaluation framework](https://github.com/modelscope/eval-scope) from the ModelScope community and [Open-Compass](https://hub.opencompass.org.cn/home) and provides advanced encapsulation to support evaluation needs for various models. Currently, we support the evaluation process for **standard evaluation sets** and **user-defined evaluation sets**. The **standard evaluation sets** include:

NLP eval datasets:
```text
'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada',
'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze',
'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval',
'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench',
'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```

Check out the detailed descriptions of these datasets: https://hub.opencompass.org.cn/home

Multi-modal eval datasets:
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN',
'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11',
'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2',
'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL',
'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar',
'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL',
'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI',
'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST',
'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500',
'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL',
'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME'
```
Check out the detailed descriptions of these datasets: https://github.com/open-compass/VLMEvalKit


> The first time eval runs, a resource dataset is downloaded automatically: https://www.modelscope.cn/datasets/swift/evalscope_resource/files
> If the download fails, you can manually download the dataset to your local disk; check the log of the `eval` command for details.
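
To check programmatically which dataset names each backend accepts, the evalscope backend managers introduced by this PR expose list helpers (the imports below appear verbatim in `swift/llm/eval.py`):

```python
# List the dataset names each evalscope backend supports; SWIFT intersects the
# requested --eval_dataset values with these lists to pick a backend per dataset.
from evalscope.backend.opencompass import OpenCompassBackendManager
from evalscope.backend.vlm_eval_kit import VLMEvalKitBackendManager

print(OpenCompassBackendManager.list_datasets())           # text benchmarks
print(VLMEvalKitBackendManager.list_supported_datasets())  # multi-modal benchmarks
```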

3 changes: 1 addition & 2 deletions requirements/eval.txt
@@ -1,2 +1 @@
llmuses>=0.4.0
ms-opencompass
evalscope[all]>=0.5.0
106 changes: 73 additions & 33 deletions swift/llm/eval.py
@@ -7,12 +7,14 @@
from typing import Any, Dict, List, Optional, Tuple

import json
from llmuses.config import TaskConfig
from llmuses.constants import DEFAULT_ROOT_CACHE_DIR
from llmuses.models.custom import CustomModel
from llmuses.run import run_task
from llmuses.summarizer import Summarizer
from llmuses.utils import EvalBackend
from evalscope.backend.opencompass import OpenCompassBackendManager
from evalscope.backend.vlm_eval_kit import VLMEvalKitBackendManager
from evalscope.config import TaskConfig
from evalscope.constants import DEFAULT_ROOT_CACHE_DIR
from evalscope.models.custom import CustomModel
from evalscope.run import run_task
from evalscope.summarizer import Summarizer
from evalscope.utils import EvalBackend
from modelscope import GenerationConfig
from openai import APIConnectionError
from tqdm import tqdm
@@ -190,8 +192,64 @@ def get_model_type(port, timeout):
time.sleep(1)


def opencompass_runner(args: EvalArguments, dataset: List[str], model_type: str, is_chat: bool, url: str):
eval_limit = args.eval_limit
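# eval_limit arrives as a CLI string: a plain integer such as '100' is cast to
# int below, while slice syntax such as '[10:20]' is passed through unchanged.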
if eval_limit is not None and '[' not in eval_limit:
eval_limit = int(eval_limit)
limit_config = {'limit': eval_limit} if eval_limit else {}
task_cfg = dict(
eval_backend='OpenCompass',
eval_config={
'datasets': dataset,
'reuse': 'latest' if args.eval_use_cache else None,
'batch_size': args.eval_batch_size,
'work_dir': args.eval_output_dir,
'models': [
{
'path': model_type,
'openai_api_base': url,
'is_chat': is_chat,
'key': args.eval_token,
},
],
**limit_config,
},
)
with EvalDatasetContext():
run_task(task_cfg=task_cfg)

return Summarizer.get_report_from_cfg(task_cfg=task_cfg)


def vlmeval_runner(args: EvalArguments, dataset: List[str], model_type: str, is_chat: bool, url: str):
eval_limit = args.eval_limit
if eval_limit is not None and '[' not in eval_limit:
eval_limit = int(eval_limit)
limit_config = {'limit': eval_limit} if eval_limit else {}
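# VLMEvalKit manages batching and result caching itself, so these flags are ignored: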
if args.eval_batch_size or args.eval_use_cache:
logger.warning('VLMEval does not support `batch_size` or `eval_use_cache`')
task_cfg = dict(
eval_backend='VLMEvalKit',
eval_config={
'data': dataset,
'work_dir': args.eval_output_dir,
'model': [
{
'name': 'CustomAPIModel',
'api_base': url,
'key': args.eval_token,
'type': model_type,
},
],
**limit_config,
},
)
run_task(task_cfg=task_cfg)
return Summarizer.get_report_from_cfg(task_cfg=task_cfg)


def eval_opencompass(args: EvalArguments) -> List[Dict[str, Any]]:
from llmuses.run import run_task
from evalscope.run import run_task
from swift.utils.torch_utils import _find_free_port
logger.info(f'args: {args}')
if args.eval_few_shot:
@@ -224,34 +282,16 @@ def eval_opencompass(args: EvalArguments) -> List[Dict[str, Any]]:
url += '/completions'
model_type = args.model_type
is_chat = args.eval_is_chat_model
eval_limit = args.eval_limit
if eval_limit is not None and '[' not in eval_limit:
eval_limit = int(eval_limit)
limit_config = {'limit': eval_limit} if eval_limit else {}
task_cfg = dict(
eval_backend='OpenCompass',
eval_config={
'datasets': args.eval_dataset,
'work_dir': args.eval_output_dir,
'reuse': 'latest' if args.eval_use_cache else None,
'batch_size': args.eval_batch_size,
'models': [
{
'path': model_type,
'openai_api_base': url,
'is_chat': is_chat,
'key': args.eval_token,
},
],
**limit_config
},
)

with EvalDatasetContext():
run_task(task_cfg=task_cfg)
nlp_datasets = set(OpenCompassBackendManager.list_datasets()) & set(args.eval_dataset)
mm_datasets = set(VLMEvalKitBackendManager.list_supported_datasets()) & set(args.eval_dataset)

final_report: List[dict] = Summarizer.get_report_from_cfg(task_cfg=task_cfg)
logger.info(f'Final report:{final_report}\n')
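# Run each backend that matched at least one requested dataset; the returned
# report is the one produced by the last runner that executed.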
for dataset, runner in zip([list(nlp_datasets), list(mm_datasets)], [opencompass_runner, vlmeval_runner]):
if not dataset:
continue

final_report = runner(args, dataset, model_type, is_chat, url)
logger.info(f'Final report:{final_report}\n')
if process:
process.kill()
return final_report