From 59e81b0c1e189165add2d30a990bb5456d76f41e Mon Sep 17 00:00:00 2001 From: Kaichen Zhang - NTU Date: Wed, 27 Nov 2024 14:04:15 +0800 Subject: [PATCH] Add lmms-eval-0.3 docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Update lmms-eval-0.3.md fix errors in markdown and add hyperlinks proofread markdown and fix errors rewrite some parts to fix errors rewrite some parts to fix errors rewrite some parts to fix errors try optimize the table format using html try optimize the table format using html try optimize the table 2 format final proofread final proofread final proofread add explanantion for AIF and ASR standardize WER to WER(↓) final proofread final proofread final proofread final proofread correct hyperlink errors modify readme to support lmms-eval0.3.0 release modify icon fix typos Co-Authored-By: KairuiHu --- README.md | 17 ++- docs/lmms-eval-0.3.md | 265 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 278 insertions(+), 4 deletions(-) create mode 100644 docs/lmms-eval-0.3.md diff --git a/README.md b/README.md index b256fb88..71606405 100644 --- a/README.md +++ b/README.md @@ -19,18 +19,27 @@ --- ## Annoucement + +- [2024-11] πŸ”ˆπŸ”Š The `lmms-eval/v0.3.0` has been upgraded to support audio evaluations for audio models like Qwen2-Audio and Gemini_Audio across tasks such as AIR-Bench, Clotho-AQA, LibriSpeech, and more. Please refer to the [blog](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md) for more details! + +- [2024-07] πŸŽ‰πŸŽ‰ We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)! + +- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details! + +- [2024-03] πŸ“πŸ“ We have released the first version of `lmms-eval`, please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details! + +
+We warmly welcome contributions from the open-source community! Below is a chronological list of recent tasks, models, and features added by our amazing contributors. + - [2024-10] πŸŽ‰πŸŽ‰ We welcome the new task [NaturalBench](https://huggingface.co/datasets/BaiqiL/NaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery. - [2024-10] πŸŽ‰πŸŽ‰ We welcome the new task [TemporalBench](https://huggingface.co/datasets/microsoft/TemporalBench) for fine-grained temporal understanding and reasoning for videos, which reveals a huge (>30%) human-AI gap. - [2024-10] πŸŽ‰πŸŽ‰ We welcome the new tasks [VDC](https://rese1f.github.io/aurora-web/) for video detailed captioning, [MovieChat-1K](https://rese1f.github.io/MovieChat/) for long-form video understanding, and [Vinoground](https://vinoground.github.io/), a temporal counterfactual LMM benchmark composed of 1000 short natural video-caption pairs. We also welcome the new models: [AuroraCap](https://github.com/rese1f/aurora) and [MovieChat](https://github.com/rese1f/MovieChat). - [2024-09] πŸŽ‰πŸŽ‰ We welcome the new tasks [MMSearch](https://mmsearch.github.io/) and [MME-RealWorld](https://mme-realworld.github.io/) for inference acceleration - [2024-09] βš™οΈοΈβš™οΈοΈοΈοΈ We upgrade `lmms-eval` to `0.2.3` with more tasks and features. We support a compact set of language tasks evaluations (code credit to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)), and we remove the registration logic at start (for all models and tasks) to reduce the overhead. Now `lmms-eval` only launches necessary tasks/models. Please check the [release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/releases/tag/v0.2.3) for more details. - [2024-08] πŸŽ‰πŸŽ‰ We welcome the new model [LLaVA-OneVision](https://huggingface.co/papers/2408.03326), [Mantis](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/162), new tasks [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [LongVideoBench](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/117), [MMStar](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/158). We provide new feature of SGlang Runtime API for llava-onevision model, please refer the [doc](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/commands.md) for inference acceleration -- [2024-07] πŸŽ‰πŸŽ‰ We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)! - [2024-07] πŸ‘¨β€πŸ’»πŸ‘¨β€πŸ’» The `lmms-eval/v0.2.1` has been upgraded to support more models, including [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA), [InternVL-2](https://github.com/OpenGVLab/InternVL), [VILA](https://github.com/NVlabs/VILA), and many more evaluation tasks, e.g. [Details Captions](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/136), [MLVU](https://arxiv.org/abs/2406.04264), [WildVision-Bench](https://huggingface.co/datasets/WildVision/wildvision-arena-data), [VITATECS](https://github.com/lscpku/VITATECS) and [LLaVA-Interleave-Bench](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/). -- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. 
Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details - -- [2024-03] πŸ“πŸ“ We have released the first version of `lmms-eval`, please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details +
 ## Why `lmms-eval`?
diff --git a/docs/lmms-eval-0.3.md b/docs/lmms-eval-0.3.md
new file mode 100644
index 00000000..b402b70c
--- /dev/null
+++ b/docs/lmms-eval-0.3.md
@@ -0,0 +1,265 @@
+# Integration of Audio Evaluation in LMMs-Eval
+
+
+## **Introduction**
+
+Humans perceive the world through both sight and sound, integrating visual cues with auditory signals such as speech, environmental sounds, and emotional tones.
+
+This dual sensory input enhances decision-making and overall understanding. Similarly, for multimodal models to achieve human-like comprehension, it is essential that they process visual and auditory data together.
+
+While many models have made progress in integrating audio understanding, there is still no reproducible and efficient evaluation toolkit to fairly assess their capabilities.
+
+To address this, we introduce an upgrade to the `lmms-eval` framework, focusing on audio understanding. Building on the success of `lmms-eval/v0.2.0`, the new `lmms-eval/v0.3.0` includes dedicated modules and designs for audio tasks, ensuring consistent evaluation across audio and visual modalities.
+
+This upgrade includes multiple benchmarks for audio understanding and instruction following, enabling standardized and reproducible comparisons of various audio models.
+
+## Audio Evaluation Pipeline
+
+1. **Improved Pipeline for Audio Evaluations**
+
+    Here is a breakdown of how audio dataset support is added.
+
+    1. **Load Audio:** Audio files are stored on Hugging Face and can be loaded via the `doc_to_audio` function.
+        - The code below demonstrates how we handle audio datasets in lmms-eval.
+
+        ```python
+        def air_bench_doc_to_audio(doc):
+            return [doc["audio"]]
+        ```
+
+    2. **Format questions:** Questions and instructions are defined in each task's `utils.py`. For some Audio Instruction Following (AIF) tasks, we create custom prompts and try to align with Qwen2-Audio's evaluation format, since the default dataset instructions are sometimes not clear enough. Model-specific prompts can be added alongside the default instruction.
+        - The code below shows an example of formatting the question.
+
+        ```python
+        # This is the place where you format your question
+        def common_voice_15_doc_to_text(doc, lmms_eval_specific_kwargs):
+            pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+            post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+            return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}"
+        ```
+
+    3. **Process results:** Model outputs are evaluated using metrics that either follow the official dataset implementations or align with the implementation in [AudioBench](https://github.com/AudioLLMs/AudioBench). We primarily adopt three types of metrics (a simplified sketch of per-sample scoring and aggregation appears at the end of this section):
+
+        **a. Accuracy:** Used for tasks with definitive ground-truth answers, such as multiple-choice questions.
+
+        **b. WER:** Applied to some Audio Speech Recognition (ASR) tasks.
+
+        **c. GPT-4 Eval:** Applied to open-ended responses. We align the evaluation prompt with the implementation in [AudioBench](https://github.com/AudioLLMs/AudioBench).
+
+        - The code below shows an example prompt for GPT-4 evaluation.
+
+        ```python
+        eval_prompt = """
+            [Question]
+            {question}
+
+            [Reference Answer]
+            {ground_truth}
+
+            [Model Answer]
+            {model_response}
+
+            [Task]
+            Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details.
+            Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance.
+            Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference.
+            Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference.
+            Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance.
+            Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects.
+            Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed.
+            Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail.
+
+            Your response should be formatted as follows:
+            Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
+            Rating: (int)"""
+        ```
+
+    4. **Aggregate results:**
+       After evaluating each data instance, we aggregate the individual results to generate the overall evaluation metrics. Finally, we provide a summary table that consolidates all the evaluation results, similar to the one in [Google's Gemini report](https://arxiv.org/abs/2312.11805).
+    5. **Grouped Tasks:**
+       For tasks with multiple subsets, we group all subset tasks together. For example, the AirBench-Chat dataset includes 4 subsets: sound, music, speech, and mixed. By running `--tasks air_bench_chat`, all 4 subsets can be evaluated together, eliminating the need to specify each subset individually. We summarize all the grouped task names in Table 1. This pipeline ensures a thorough and standardized evaluation process for audio tasks, facilitating consistent and reliable performance assessment across various tasks and datasets.
+        - The snippet below shows an example YAML file for task grouping.
+
+        ```yaml
+        group: air_bench_chat
+        task:
+          - air_bench_chat_sound
+          - air_bench_chat_music
+          - air_bench_chat_speech
+          - air_bench_chat_mixed
+        ```
+
+2. **Audio-based Capabilities**
+
+    Our selected benchmarks collectively evaluate various essential audio-based capabilities, as inspired by [AudioBench](https://github.com/AudioLLMs/AudioBench):
+
+    1. **Audio Captioning:** The ability to accurately transcribe human speech and convert audio content into text.
+    2. **Speech Understanding:** The capability to comprehend the semantic meaning of human speech, enabling appropriate responses to questions and audio instructions.
+    3. **Audio Scene Understanding:** The ability to interpret non-human sounds, such as environmental sounds.
+    4. **Voice Understanding:** The capability to analyze non-speech human vocal information, including emotional states, accents, and speaker characteristics.
+    5. **Specialized Audio Processing:** The ability to analyze other audio types, such as musical compositions and multilingual content.
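+To make steps 3 and 4 of the pipeline concrete, below is a minimal, self-contained sketch of how a WER-scored ASR task could implement per-sample result processing and corpus-level aggregation. It is only an illustration: the function names, the `gt`/`wer` keys, and the plain word-level Levenshtein WER are assumptions made for this example, not the exact `lmms-eval` implementation, which may apply additional text normalization (e.g., to align with Qwen-Audio) before scoring.
+
+```python
+from typing import Dict, List
+
+
+def edit_distance(ref: List[str], hyp: List[str]) -> int:
+    """Word-level Levenshtein distance between a reference and a hypothesis."""
+    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
+    for i in range(len(ref) + 1):
+        dp[i][0] = i
+    for j in range(len(hyp) + 1):
+        dp[0][j] = j
+    for i in range(1, len(ref) + 1):
+        for j in range(1, len(hyp) + 1):
+            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
+            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
+                           dp[i][j - 1] + 1,        # insertion
+                           dp[i - 1][j - 1] + sub)  # substitution
+    return dp[len(ref)][len(hyp)]
+
+
+def asr_process_result(doc: Dict, result: List[str]) -> Dict:
+    """Score one sample; the returned key selects the aggregation function below."""
+    return {"wer": {"ref": doc["gt"].lower().split(), "pred": result[0].lower().split()}}
+
+
+def asr_aggregate_wer(items: List[Dict]) -> float:
+    """Corpus-level WER: total edit operations divided by total reference words."""
+    total_edits = sum(edit_distance(x["ref"], x["pred"]) for x in items)
+    total_words = sum(len(x["ref"]) for x in items)
+    return 100.0 * total_edits / max(total_words, 1)
+
+
+# Example: two short utterances scored together.
+items = [
+    asr_process_result({"gt": "hello world"}, ["hello word"])["wer"],
+    asr_process_result({"gt": "good morning"}, ["good morning"])["wer"],
+]
+print(f"WER: {asr_aggregate_wer(items):.2f}%")  # 25.00%
+```
+
+Separating per-sample scoring from corpus-level aggregation mirrors steps 3 and 4 above; note that the corpus-level WER is computed over total reference words rather than averaged per utterance.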
+
+### **Meta Information for Audio Datasets**
+
+#### Table 1: Meta information for audio datasets
+
+| **Dataset** | **Year** | **Task Name in lmms-eval** | **Split** | **Task Format** | **Evaluation Metric** | **Number of QAs** | **Feature** |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| **AIR-Bench** | 2024 | air_bench_chat \| air_bench_foundation | chat, foundation | AIF | GPT-4 Eval (chat) \| Accuracy (foundation) | 2k (chat) \| 19k (foundation) | 1. Comprehensive tasks and audio types |
+| **Alpaca Audio** | 2024 | alpaca_audio | test | AIF | GPT-4 Eval | 100 | 1. Synthetic voice |
+| **Clotho-AQA** | 2022 | clotho_aqa | test \| val | AIF | Accuracy | test_v2 (2.06k), test \| val (1.44k \| 1.05k) | 1. Audio Question Answering<br>2. Single word answer<br>3. Text-based question |
+| **Common_voice** | 2023 | common_voice_15 | test | ASR | WER(↓) (aligned with Qwen-Audio) | en (16.4k) \| fr (16.1k) \| zh (10.6k) | 1. Real people voice<br>2. Captioning |
+| **GigaSpeech** | 2021 | gigaspeech | test \| dev | ASR | WER(↓) | dev (6.75k) \| test (25.6k) | 1. Transcription<br>2. Audio book<br>3. YouTube<br>4. Podcasts |
+| **LibriSpeech** | 2015 | librispeech | dev-clean \| dev-other \| test-clean \| test-other | ASR | WER(↓) | dev-clean (~2.48k) \|<br>dev-other (~2.66k) \|<br>test-clean (~2.55k) \|<br>test-other (~2.70k) | 1. Transcription (audio book) |
+| **OpenHermes** | 2024 | openhermes | test | AIF | GPT-Eval | 100 | 1. Synthetic voice |
+| **MuchoMusic** | 2024 | muchomusic | test | AIF | Accuracy | 1.19k | 1. Music understanding |
+| **People_speech** | 2021 | people_speech_val | val | ASR | WER(↓) | 18.6k | 1. Real people voice<br>2. Captioning |
+| **Tedium v3** | 2018 | tedlium_dev_test | val | ASR | WER(↓) | 591 | 1. TED talk<br>2. Real people ASR<br>3. Captioning |
+| **VocalSound** | 2022 | vocalsound_test | test \| val | AIF | Accuracy | test (3.59k) \| val (1.86k) | 1. Vocal sound recognition<br>2. Non-speech |
+| **WavCaps** | 2024 | wavcaps | test | ASR | GPT-4 Eval | 1.73k | 1. Audio Captioning<br>2. ChatGPT-augmented captions |
+
+AIF refers to Audio Instruction Following, and ASR refers to Audio Speech Recognition.
+
+### Alignment Check for Audio Datasets
+
+#### Table 2: Alignment check for audio datasets
+
+| | | **Metric** | **Qwen2-Audio-Instruct (lmms-eval)** | **Qwen2-Audio (lmms-eval)** |
+| --- | --- | --- | --- | --- |
+| **AIR-Bench-Chat** | Speech | GPT-Eval | 7.16 | |
+| | Sound | | 6.14 | |
+| | Music | | 6.66 | |
+| | Mixed | | 5.75 | |
+| **AIR-Bench-Foundation** | Speech | Acc | 62.89 | |
+| | Sound | | 55.42 | |
+| | Music | | 56.77 | |
+| **Alpaca** | test | GPT-Eval | 51.8 | |
+| **Clotho_aqa** | test | GPT-Eval | 0.7587 | |
+| **Common_voice** | zh | WER(↓) | 15.78 | 6.7 |
+| | en | | 36.01 | 27.9 |
+| | fr | | 39.88 | 34.8 |
+| **GigaSpeech** | dev | WER(↓) | 19.45 | 14 |
+| | test | | 22.6 | 15.01 |
+| **LibriSpeech** | dev-clean | WER(↓) | 4.24 | 1.66 |
+| | dev-others | | 6.54 | 3.66 |
+| | test-clean | | 3.59 | 1.74 |
+| | test-others | | 7.46 | 3.87 |
+| **MuchoMusic** | test | Acc | 68.32 | 45.07 |
+| **OpenHermes** | test | GPT-Eval | 46.8 | |
+| **People_speech** | val | WER(↓) | 25.86 | 17.1 |
+| **Tedium** | val | WER(↓) | 10.92 | 8.29 |
+| **VocalSound** | test | Acc | 0.936 | 0.81 |
+| | val | | 0.9288 | 0.8 |
+| **WavCaps** | test | GPT-Eval | 1.73 | |
+
+These results might be inconsistent with the originally reported ones, as we do not have access to the original prompts and we aim to maintain a fair evaluation environment for all models. For the base model, we do not test on the chat benchmarks.
+
+Certain datasets face alignment challenges: datasets scored with WER, CIDEr, or BLEU cannot be aligned precisely because of their rigid output formats. Model responses are also sensitive to the prompt, which we investigate further in the section [Robustness of the model](#robustness-of-the-model).
+
+## Evaluation Analysis and Thinking
+
+During our implementation, we observed several interesting phenomena that may be valuable to discuss. We believe that reflecting on these aspects can help accelerate the development of truly robust audio evaluations.
+
+### Robustness of the model
+
+While trying to align the results, we found that the choice of chat template significantly impacts model performance, even for instruction-tuned models. This finding emerged while analyzing the Qwen2-Audio model. The original Qwen2-Audio repository uses a minimal prompt format: `"<|audio_bos|><|AUDIO|><|audio_eos|>"`.
+
+This basic format is then combined with various question prompts for different evaluation scenarios. However, it is not an instruction-style format, and when a chat template is applied, the model's performance can change significantly.
+
+#### Table 3: Impact of Chat Template on Qwen2-Audio-Instruct's Performance
+
+| **Impact of Chat Template** | **Split** | **Metric** | **Chat Template (Off)** | **Chat Template (On)** |
+| --- | --- | --- | --- | --- |
+| **LibriSpeech** | dev-clean | WER(↓) | 2.65 | 4.24 |
+| | dev-others | | 5.36 | 6.54 |
+| | test-clean | | 2.91 | 3.59 |
+| | test-others | | 5.14 | 7.46 |
+| **People_speech** | val | WER(↓) | 21.92 | 25.86 |
+| **Tedium** | dev_test | WER(↓) | 9.56 | 10.92 |
+
+As the table above shows, the influence of the chat template is substantial. We believe this reflects the limited robustness of current audio models: they may not yet be stable enough to cope with different text inputs. The snippet below illustrates how the chat template changes the model input.
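+The following snippet illustrates the two prompt styles compared in Table 3. It follows the usage example from the Hugging Face Qwen2-Audio-7B-Instruct model card; the question text and audio URL are placeholders, and the exact prompts used in `lmms-eval` may differ.
+
+```python
+# "Chat template off" vs. "chat template on" for Qwen2-Audio (illustrative only).
+from transformers import AutoProcessor
+
+processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
+question = "Please recognize the speech and only output the recognized content:"
+
+# Off: the minimal format used by the original Qwen2-Audio repository,
+# concatenated directly with the task question.
+raw_prompt = f"<|audio_bos|><|AUDIO|><|audio_eos|>{question}"
+
+# On: the same request wrapped in the model's instruction markup by the chat template.
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio_url": "https://example.com/sample.wav"},
+            {"type": "text", "text": question},
+        ],
+    }
+]
+chat_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+
+print(raw_prompt)   # <|audio_bos|><|AUDIO|><|audio_eos|>Please recognize the speech ...
+print(chat_prompt)  # adds system/user turns and audio markers around the same request
+```
+
+Both variants carry the same audio and question; only the surrounding markup differs, yet that difference alone shifts WER as shown in Table 3.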
+This also leads us to another question: are current metrics good at evaluating a model's performance?
+
+### Rethinking the evaluation metrics
+
+Traditional fixed-format metrics like WER, CIDEr, and BLEU face several limitations in audio model evaluation:
+
+1. **Format Rigidity:** Fixed metrics struggle to properly assess responses that are semantically correct but differ in format from the reference answers.
+2. **Prompt Sensitivity:** These metrics are highly sensitive to variations in input prompts, leading to inconsistent evaluation results.
+
+Due to these limitations, the scores reported in `lmms-eval` might differ slightly from those reported in the original papers, highlighting the challenge of maintaining consistent evaluation standards across different frameworks.
+
+Looking ahead, model-based evaluators such as GPT-4 could offer a more flexible and robust evaluation approach. Such evaluators can better understand semantic meaning, handle diverse response formats, and provide more consistent scoring across different implementations. This shift from rigid metrics to intelligent evaluation systems may better capture the true capabilities of audio processing models.
+
+## Additional Experiments
+
+### Batch Size
+
+We performed an exploratory batch-inference experiment on Qwen2-Audio with the following results:
+
+#### Table 4: Impact of batch size
+
+| | **Split** | **Metric** | **Qwen2-Audio (BS=4)** | **Qwen2-Audio (BS=1)** |
+| --- | --- | --- | --- | --- |
+| **LibriSpeech** | dev-clean | WER(↓) | 1.66 | 1.66 |
+| | dev-others | | 4.4 | 3.66 |
+| | test-clean | | 1.75 | 1.74 |
+| | test-others | | 4.06 | 3.87 |
+| **Total Time** | | | 10 min 50 s | 5 min 23 s |
+
+As shown above, while batch inference (BS=4) significantly reduces inference time, it can lead to evaluation inconsistencies compared to single-sample processing (BS=1). This is a known issue in the `transformers` library for which there is currently no solution.
+
+### More Details and Feature Updates with `v0.3.0`
+
+1. **Supported Audio Tasks**
+    1. [AirBench](https://github.com/OFA-Sys/AIR-Bench)
+    2. [Alpaca Audio](https://tango2-web.github.io/)
+    3. [Clotho-AQA](https://github.com/partha2409/AquaNet)
+    4. [Common_voice_15](https://github.com/common-voice/common-voice)
+    5. [GigaSpeech](https://github.com/SpeechColab/GigaSpeech)
+    6. [LibriSpeech](https://www.openslr.org/12)
+    7. [OpenHermes](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test)
+    8. [MuchoMusic](https://github.com/mulab-mir/muchomusic)
+    9. [Peoples_speech](https://mlcommons.org/datasets/peoples-speech/)
+    10. [Tedium v3](https://www.openslr.org/51/)
+    11. [VocalSound](https://github.com/YuanGongND/vocalsound)
+    12. [WavCaps](https://github.com/XinhaoMei/WavCaps)
+2. **Supported Audio Models**
+
+    1. [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio)
+    2. [Gemini_Audio](https://arxiv.org/abs/2312.11805)
+3. **Multi-Round Evaluation Support**
+    1. [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by [@CaraJ7](https://github.com/CaraJ7) in [#277](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/277)
+4. **Regression Test**
+    1. [Feat] add regression test and change saving logic related to `output_path` by [@Luodian](https://github.com/Luodian) in [#259](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/259)
+5. **Speed-Up by Loading Only Required Tasks and Models**
+    1. [feat] remove registeration logic and adding language evaluation tasks. 
by [@Luodian](https://github.com/Luodian) in [#218](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/218)
+6. **LMMs-Eval Analysis Tool**
+    1. Lite/Core-set Selection by Kaichen Zhang
+
+        https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/tools/lite
+
+    2. LiveBench by Fanyi Pu
+
+        https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/tools/live_bench
+
+7. **SGLang Evaluation**
+    1. [Feat] SGLang SRT commands in one go, async input for openai server by [@kcz358](https://github.com/kcz358) in [#212](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/212)
+    2. [Fix] Fix async append result in different order issue by [@kcz358](https://github.com/kcz358) in [#244](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/244)
+
+## Contributors
+
+> Listed in order of contribution significance.
+>
+
+**Core Contributors**
+
+Pengyun Wang, Cong Pham Ba, Yingluo Li, Fanyi Pu
+
+**Release Managers**
+
+Kairui Hu, Kaichen Zhang, Bo Li