* bib
* Add MixEval-X section to README with citation
* Add citation section and links to MixEval-X homepage and arXiv in README
* Add usage instructions and new audio-to-text hard task configuration
* Update README to include new mix evaluation tasks for image, video, and audio to text
* Remove deprecated image and video evaluation tasks from README
* Update README to enhance task listing and provide usage instructions for MixEval-X
* Add links to README for additional details and documentation
* Update README to include row counts for mix evaluation tasks
* Update README to improve formatting of task row counts in MixEval-X
* Update README to reflect removal of HowToQA and Social-IQ-2.0 from Video2Text benchmark and add final results calculation details
Showing 2 changed files with 93 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
# MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures | ||
|
||
[Homepage](https://mixeval-x.github.io/) / [arXiv](https://arxiv.org/abs/2410.13754) | ||
|
||
## Usage | ||
|
||
Here is the list of tasks in MixEval-X: | ||
|
||
``` | ||
mix_evals_image2text | ||
├── mix_evals_image2text_freeform ---------- 998 rows | ||
└── mix_evals_image2text_mc ---------------- 990 rows | ||
mix_evals_image2text_hard | ||
├── mix_evals_image2text_freeform_hard ----- 498 rows | ||
└── mix_evals_image2text_mc_hard ----------- 500 rows | ||
mix_evals_video2text | ||
├── mix_evals_video2text_freeform ---------- 968 rows | ||
└── mix_evals_video2text_mc ---------------- 634 rows | ||
mix_evals_video2text_hard | ||
├── mix_evals_video2text_freeform_hard ----- 499 rows | ||
└── mix_evals_video2text_mc_hard ----------- 324 rows | ||
mix_evals_audio2text | ||
└── mix_evals_audio2text_freeform ---------- 962 rows | ||
mix_evals_audio2text_hard | ||
└── mix_evals_audio2text_freeform_hard ----- 505 rows | ||
``` | ||
|
||
HowToQA and Social-IQ-2.0 were removed from the Video2Text benchmark pool due to annotation issues. A key advantage of MixEval-X is its capacity for self-refinement, which lets the benchmark pool adapt and grow over time. | ||
|
||
Run an evaluation with the following command, substituting your own model, model arguments, and task: | ||
|
||
```bash | ||
lmms-eval --model=<MODEL> \ | ||
--model_args=<MODEL_ARGS> \ | ||
--tasks=<TASK> \ | ||
--batch_size=1 \ | ||
--log_samples \ | ||
--output_path=./logs/ | ||
``` | ||
|
||
The available models are listed [here](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/0589d0fba2efbcb526321f23ab0587599fd3c4c9/lmms_eval/models/__init__.py#L13). | ||
|
||
For example, to evaluate `llava_vid` on `mix_evals_video2text` (including `mix_evals_video2text_freeform` and `mix_evals_video2text_mc`): | ||
|
||
```bash | ||
lmms-eval --model=llava_vid \ | ||
--model_args=pretrained=lmms-lab/LLaVA-NeXT-Video-7B \ | ||
--tasks=mix_evals_video2text \ | ||
--batch_size=1 \ | ||
--log_samples \ | ||
--output_path=./logs/ | ||
``` | ||
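
To run the same model over several pools, the command above can be assembled programmatically. The following is a minimal sketch, not part of lmms-eval: `build_command` and `VIDEO_POOLS` are hypothetical names introduced here, and only the flags shown in the command above are used. It prints the commands rather than executing them.

```python
import shlex

# Pool names copied from the task list in this README (video pools only,
# since the example model is a video model).
VIDEO_POOLS = ["mix_evals_video2text", "mix_evals_video2text_hard"]


def build_command(model, model_args, task, output_path="./logs/"):
    """Return the lmms-eval argument list for one task (hypothetical helper)."""
    return [
        "lmms-eval",
        f"--model={model}",
        f"--model_args={model_args}",
        f"--tasks={task}",
        "--batch_size=1",
        "--log_samples",
        f"--output_path={output_path}",
    ]


if __name__ == "__main__":
    for pool in VIDEO_POOLS:
        cmd = build_command(
            "llava_vid",
            "pretrained=lmms-lab/LLaVA-NeXT-Video-7B",
            pool,
        )
        # Print a copy-pasteable shell command for each pool.
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to launch the evaluation instead of printing it.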
|
||
For more details, please refer to the [readme](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main) and [documentation](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/docs). | ||
|
||
## Final Results Calculation | ||
|
||
The final result for each benchmark pool is the weighted average of the results of its tasks, with each task weighted by its number of rows. The audio pools contain a single task, so their final result is simply that task's result. For example, the final result for `mix_evals_video2text` is calculated as follows: | ||
|
||
```python | ||
NUM_ROWS = { | ||
"mix_evals_video2text_freeform": 968, | ||
"mix_evals_video2text_mc": 634, | ||
} | ||
|
||
results = { | ||
"mix_evals_video2text_freeform": 0.5, | ||
"mix_evals_video2text_mc": 0.6, | ||
} | ||
|
||
final_result = sum([results[task] * NUM_ROWS[task] for task in NUM_ROWS]) / sum(NUM_ROWS.values()) | ||
``` | ||
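
The same weighting extends to every benchmark pool. The sketch below is an illustration rather than code from lmms-eval: `ROWS` and `pool_score` are hypothetical names, the row counts are copied from the task list in the Usage section, and any scores passed in are placeholders, not real results.

```python
# Row counts copied from the task list in the Usage section.
ROWS = {
    "mix_evals_image2text": {
        "mix_evals_image2text_freeform": 998,
        "mix_evals_image2text_mc": 990,
    },
    "mix_evals_image2text_hard": {
        "mix_evals_image2text_freeform_hard": 498,
        "mix_evals_image2text_mc_hard": 500,
    },
    "mix_evals_video2text": {
        "mix_evals_video2text_freeform": 968,
        "mix_evals_video2text_mc": 634,
    },
    "mix_evals_video2text_hard": {
        "mix_evals_video2text_freeform_hard": 499,
        "mix_evals_video2text_mc_hard": 324,
    },
    "mix_evals_audio2text": {
        "mix_evals_audio2text_freeform": 962,
    },
    "mix_evals_audio2text_hard": {
        "mix_evals_audio2text_freeform_hard": 505,
    },
}


def pool_score(pool, scores):
    """Row-weighted average of per-task scores for one pool (hypothetical helper)."""
    rows = ROWS[pool]
    missing = set(rows) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_rows = sum(rows.values())
    return sum(scores[task] * n for task, n in rows.items()) / total_rows
```

For a single-task pool such as `mix_evals_audio2text`, the weighted average reduces to the task's own score.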
|
||
## Citation | ||
|
||
```bib | ||
@article{ni2024mixevalx, | ||
title={MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures}, | ||
author={Ni, Jinjie and Song, Yifan and Ghosal, Deepanway and Li, Bo and Zhang, David Junhao and Yue, Xiang and Xue, Fuzhao and Zheng, Zian and Zhang, Kaichen and Shah, Mahir and Jain, Kabir and You, Yang and Shieh, Michael}, | ||
journal={arXiv preprint arXiv:2410.13754}, | ||
year={2024} | ||
} | ||
@article{ni2024mixeval, | ||
title={MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures}, | ||
author={Ni, Jinjie and Xue, Fuzhao and Yue, Xiang and Deng, Yuntian and Shah, Mahir and Jain, Kabir and Neubig, Graham and You, Yang}, | ||
journal={arXiv preprint arXiv:2406.06565}, | ||
year={2024} | ||
} | ||
``` |
3 changes: 3 additions & 0 deletions
lmms_eval/tasks/mix_evals/audio2text/mix_evals_audio2text_hard.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
group: mix_evals_audio2text_hard | ||
task: | ||
- mix_evals_audio2_text_freeform_hard |