diff --git a/docs/lmms-eval-0.3.md b/docs/lmms-eval-0.3.md index 324f692f..9c67a19d 100644 --- a/docs/lmms-eval-0.3.md +++ b/docs/lmms-eval-0.3.md @@ -40,11 +40,11 @@ This upgrade includes multiple benchmarks for audio understanding and instructio return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}" ``` - 3. **Process results:** Model outputs are evaluated using metrics from either official dataset implementations or aligning with the implementation in AudioBench. We primarily use three types of metrics: + 3. **Process results:** Model outputs are evaluated using metrics taken either from the official dataset implementations or aligned with the implementation in [AudioBench](https://github.com/AudioLLMs/AudioBench). We primarily adopt three types of metrics: **a. Accuracy:** Used for tasks with definitive ground truth answers, such as multiple-choice questions - - **b. GPT-4 Eval:** Applied to open-ended responses. We align the evaluation prompt with the implementation in AudioBench. + **b. WER:** Applied to some Automatic Speech Recognition (ASR) tasks. + **c. GPT-4 Eval:** Applied to open-ended responses. We align the evaluation prompt with the implementation in [AudioBench](https://github.com/AudioLLMs/AudioBench). - The code specifically demonstrates an example prompt for GPT-4 Evaluation. @@ -75,10 +75,10 @@ This upgrade includes multiple benchmarks for audio understanding and instructio ``` - **c. WER:** Applied to some Audio Speech Recognition (ASR) tasks. + 4. **Aggregate results:** - After evaluating each data instance, we aggregate the individual results to generate the overall evaluation metrics. Finally, we provide a summary table that consolidates all the evaluation results, similar to the one in Google’s Gemini report. + After evaluating each data instance, we aggregate the individual results to generate the overall evaluation metrics. Finally, we provide a summary table that consolidates all the evaluation results, similar to the one in [Google’s Gemini report](https://arxiv.org/abs/2312.11805). 5. **Grouped Tasks:** For tasks with multiple subsets, we group all subset tasks together. For example, the AirBench-Chat dataset includes 4 subsets: sound, music, speech, mixed. By running `--task air_bench_chat`, all 4 subsets can be evaluated together, eliminating the need to specify each subset individually. We summarize all the grouped task names in Table 1. This pipeline ensures a thorough and standardized evaluation process for Audio, facilitating consistent and reliable performance assessment across various tasks and datasets. - The code specifically demonstrates an example yaml file of task grouping. @@ -95,35 +95,35 @@ This upgrade includes multiple benchmarks for audio understanding and instructio 2. **Audio-based Capabilities** -Our selected benchmarks assess the following key audio processing abilities, as inspired by AudioBench [AudioBench](https://github.com/AudioLLMs/AudioBench): + Our selected benchmarks assess the following key audio processing abilities, as inspired by [AudioBench](https://github.com/AudioLLMs/AudioBench): -1. **Audio Captioning:** The ability to accurately transcribe human speech and convert audio content into text -2. **Speech Understanding:** The capability to comprehend the semantic meaning of human speech, enabling appropriate responses to questions and audio instructions -3. **Audio Scene Understanding:** The ability to interpret non-human sounds, such as environment sounds -4. 
**Voice Understanding:** The capability to analyze non-speech human vocal information, including emotional states, accents, and speaker characteristics -5. **Specialized Audio Processing:** The ability to analyze other audio types, such as musical compositions and multilingual content + 1. **Audio Captioning:** The ability to accurately transcribe human speech and convert audio content into text + 2. **Speech Understanding:** The capability to comprehend the semantic meaning of human speech, enabling appropriate responses to questions and audio instructions + 3. **Audio Scene Understanding:** The ability to interpret non-human sounds, such as environmental sounds + 4. **Voice Understanding:** The capability to analyze non-speech human vocal information, including emotional states, accents, and speaker characteristics + 5. **Specialized Audio Processing:** The ability to analyze other audio types, such as musical compositions and multilingual content -Our goal is to provide a comprehensive evaluation framework that assesses models' abilities to interpret audio content and respond appropriately to user queries, ranging from basic transcription to complex audio-based reasoning tasks. + Our goal is to provide a comprehensive evaluation framework that assesses models' abilities to interpret audio content and respond appropriately to user queries, ranging from basic transcription to complex audio-based reasoning tasks. ### **Meta Information for Audio Datasets** - +Table 1: Meta information for audio datasets | **Dataset** | **Year** | **Task Name in lmms-eval** | **Split** | **Task Format** | **Evaluation Metric** | **Number of QAs** | **Feature** | | --- | --- | --- | --- | --- | --- | --- | --- | -| **AIRBench** | 2024 | air_bench_chat \| air_bench_foundation | chat, foundation | AIF | GPT-4 Eval (chat) \| Accuracy (foundation) | 2k (chat) \| 19k (foundation) | comprhensive tasks and audio types | +| **AIRBench** | 2024 | air_bench_chat \| air_bench_foundation | chat, foundation | AIF | GPT-4 Eval (chat) \| Accuracy (foundation) | 2k (chat) \| 19k (foundation) | Comprehensive tasks and audio types | | **Alpaca Audio** | 2024 | alpaca_audio | test | AIF | GPT-4 Eval | 100 | synthetic voice | -| **Clotho-AQA** | 2022 | clotho_aqa | test \| val | AIF | Accuracy | test_v2 (2.06k), test \| val (1.44k \| 1.05k) | Audio Question Answering, single word answer, text based question | -| **Common_voice** | 2023 | common_voice_15 | test | ASR | WER (align with Qwen-audio) | en (16.4k) \| fr (16.1k) \| zh (10.6k) | real people voice, captioning | -| **GigaSpeech** | 2021 | gigaspeech | test \| dev | ASR | WER | dev (6.75k) \| test (25.6k) | transciption, audio book, YouTube, podcasts | +| **Clotho-AQA** | 2022 | clotho_aqa | test \| val | AIF | Accuracy | test_v2 (2.06k), test \| val (1.44k \| 1.05k) | 1. Audio Question Answering
2. Single-word answer
3. Text-based question | +| **Common_voice** | 2023 | common_voice_15 | test | ASR | WER (align with Qwen-audio) | en (16.4k) \| fr (16.1k) \| zh (10.6k) | 1. Real human voice
2. Captioning | +| **GigaSpeech** | 2021 | gigaspeech | test \| dev | ASR | WER | dev (6.75k) \| test (25.6k) | 1. Transcription
2. Audio books
3. YouTube
4. Podcasts | | **LibriSpeech** | 2015 | librispeech | dev-clean \| dev-other \| test-clean \| test-other | ASR | WER | dev-clean (~2.48k) \|dev-other (~2.66k) \|test-clean(~2.55k) \| test-other (~2.70k) | Transcription (audio book) | | **OpenHermes** | 2024 | openhermes | test | AIF | GPT-Eval | 100 | synthetic voice | | **MuchoMusic** | 2024 | muchomusic | test | AIF | Accuracy | 1.19k | Music understanding | -| **People_speech** | 2021 | people_speech_val | val | ASR | WER | 18.6k | real people voice, captioning | -| **Tedium v3** | 2018 | tedlium_dev_test | val | ASR | WER | 591 | ted talk, real people asr, captioning | +| **People_speech** | 2021 | people_speech_val | val | ASR | WER | 18.6k | 1. Real human voice
2. Captioning | +| **Tedium v3** | 2018 | tedlium_dev_test | val | ASR | WER | 591 | 1. TED talk
2. Real human ASR
3. Captioning | | **VocalSound** | 2022 | vocalsound_test | test \| val | AIF | Accuracy | test (3.59k) | val (1.86k) | 1. Vocal sound recognition
2. Non-speech | | **WavCaps** | 2024 | wavcaps | test | ASR | GPT-4 Eval | 1.73k | 1. Audio Captioning
2. ChatGPT-augmented captions | ### Alignment Check for Audio Datasets - +Table 2: Alignment check for audio datasets | | | **metric** | **Qwen2-Audio-Instruct (lmms-eval)** | **Qwen2-Audio (lmms-eval)** | | --- | --- | --- | --- | --- | | **AIRBench-Chat** | Speech | GPT-Eval | 7.16 | | | | Sound | | 6.14 | | | | Music | | 6.66 | | | | Mixed | | 5.75 | | | **AIRBench-Foundation** | Speech | Acc | 62.89 | | | | Sound | | 55.42 | | | | Music | | 56.77 | | | **Alpaca** | test | GPT-Eval | 51.8 | | | **Clotho_aqa** | test | GPT-Eval | 0.7587 | | -| **Common_voice** | zh | wer | 15.78 | 6.7 | +| **Common_voice** | zh | WER | 15.78 | 6.7 | | | en | | 36.01 | 27.9 | | | fr | | 39.88 | 34.8 | -| **GigaSpeech** | dev | wer | 19.45 | 14 | +| **GigaSpeech** | dev | WER | 19.45 | 14 | | | test | | 22.6 | 15.01 | -| **LibriSpeech** | dev-clean | wer | 4.24 | 1.66 | +| **LibriSpeech** | dev-clean | WER | 4.24 | 1.66 | | | dev-others | | 6.54 | 3.66 | | | test-clean | | 3.59 | 1.74 | | | test-others | | 7.46 | 3.87 | | **MuchoMusic** | test | Acc | 68.32 | 45.07 | | **OpenHermes** | test | GPT-Eval | 46.8 | | -| **People_speech** | val | wer | 25.86 | 17.1 | -| **Tedium** | val | wer | 10.92 | 8.29 | +| **People_speech** | val | WER | 25.86 | 17.1 | +| **Tedium** | val | WER | 10.92 | 8.29 | | **VocalSound** | test | Acc | 0.936 | 0.81 | | | val | | 0.9288 | 0.8 | | **WavCaps** | test | GPT-Eval | 1.73 | | -The result might be inconsistent with the reported result as we do not have the original prompt and we have to maintain the fair environment for all the models. For the base model, we do not test on the Chat Benchmarks. + The results might be inconsistent with the reported ones, as we do not have the original prompts and we have to maintain a fair environment for all the models. For the base model, we do not test on the Chat Benchmarks. -Certain datasets face alignment challenge: Datasets with WER, CIDEr, BLEU as metrics cannot accurately align due to their rigid, reference-based formats. Model response sensitive to prompt, we will investigate more deeply in Section [Robustness of the model](https://www.notion.so/Robustness-of-the-model-b89c005d3e044cb6aff51165929cea45?pvs=21) . + Certain datasets face alignment challenges: datasets with WER, CIDEr, or BLEU as metrics cannot be aligned exactly due to their rigid, reference-based formats. Model responses are also sensitive to the prompt, which we investigate more deeply in the section [Robustness of the model](https://www.notion.so/Robustness-of-the-model-b89c005d3e044cb6aff51165929cea45?pvs=21). ## Evaluation Analysis and Thinking: -During our implementation, we observe several interesting phenomena that may be valuable to discuss. We believe that reflecting on these aspects deeply can help accelerate the development of truly robust audio evaluations. + During our implementation, we observed several interesting phenomena that may be valuable to discuss. We believe that reflecting deeply on these aspects can help accelerate the development of truly robust audio evaluations. ### Robustness of the model -As we trying to align the results, our investigation revealed that the choice of chat template significantly impacts model performance, even for instruction-tuned models. + While trying to align the results, we found that the choice of chat template significantly impacts model performance, even for instruction-tuned models. 
This finding emerged while analyzing the Qwen2-Audio model. The original Qwen2-Audio repository uses a minimal prompt format: `"<|audio_bos|><|AUDIO|><|audio_eos|>"`. -This basic format is then combined with various question prompts for different evaluation scenarios. However, this prompt format is not in an instruction format and when applying a chat template, the performance of the model may changes significantly. + This basic format is then combined with various question prompts for different evaluation scenarios. However, this prompt is not in an instruction format, and when a chat template is applied, the performance of the model may change significantly. -##### Table 3: Impact of Chat Template on Qwen-7B-Instruct's Performance -{: .table-caption} +Table 3: Impact of Chat Template on Qwen2-Audio-Instruct's Performance | Impact of Chat Template | | | Chat Template (Off) | Chat Template (On) | | --- | --- | --- | --- | --- | -| LibriSpeech | dev-clean | wer(↓) | 2.65 | 4.24 | +| LibriSpeech | dev-clean | WER(↓) | 2.65 | 4.24 | | | dev-others | | 5.36 | 6.54 | | | test-clean | | 2.91 | 3.59 | | | test-others | | 5.14 | 7.46 | -| People_speech | val | wer(↓) | 21.92 | 25.86 | -| Tedium | dev_test | wer(↓) | 9.56 | 10.92 | +| People_speech | val | WER(↓) | 21.92 | 25.86 | +| Tedium | dev_test | WER(↓) | 9.56 | 10.92 | -More specifically, we founds out that as shown in the above table, the influence of the chat template is very huge. We believe that these demonstrate the actual robustness of the model and signifies that current audio model may eventually not being stable enough when coping different text input. Also, it again leads us into another thinking: “Is current metrics good at evaluating a model’s performance? + More specifically, as shown in the above table, the influence of the chat template is substantial. We believe this reveals the actual robustness of the model and suggests that current audio models may not yet be stable enough when coping with different text inputs. It also leads us to another question: “Are current metrics good at evaluating a model’s performance?” ### Rethinking the evaluation metrics -Traditional fixed-format metrics like Word Error Rate (WER), CIDEr, and BLEU face several limitations in audio model evaluation: + Traditional fixed-format metrics like WER, CIDEr, and BLEU face several limitations in audio model evaluation: -1. **Format Rigidity:** Fixed metrics struggle to properly assess responses that are semantically correct but differ in format from reference answers -2. **Prompt Sensitivity:** These metrics are highly sensitive to variations in input prompts, leading to inconsistent evaluation results + 1. **Format Rigidity:** Fixed metrics struggle to properly assess responses that are semantically correct but differ in format from reference answers + 2. **Prompt Sensitivity:** These metrics are highly sensitive to variations in input prompts, leading to inconsistent evaluation results -Due to these limitations, the scores reported in `lmms-eval` might slightly differ from those reported in original papers, highlighting the challenge of maintaining consistent evaluation standards across different frameworks. + Due to these limitations, the scores reported in `lmms-eval` might slightly differ from those reported in original papers, highlighting the challenge of maintaining consistent evaluation standards across different frameworks. 
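To make the format-rigidity point concrete, here is a minimal, self-contained sketch (our illustration, not the WER implementation used in `lmms-eval`) showing how a plain WER computation penalizes an answer whose transcription is correct but which adds a conversational lead-in and different casing:

```python
# Minimal WER sketch (illustration only, not lmms-eval's implementation).
# It shows how a semantically correct answer is penalized for surface-form
# differences such as casing, punctuation, and a conversational lead-in.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

reference = "the quick brown fox jumps over the lazy dog"
# Same content, different surface form: a lead-in phrase, capitalization, punctuation.
hypothesis = "Sure! The transcription is: The quick brown fox jumps over the lazy dog."

print(f"WER = {word_error_rate(reference, hypothesis):.2f}")  # well above 0 despite a correct transcription
```

Even though the transcription itself is correct, the extra wording pushes WER well above zero, which is exactly the behavior that makes rigid metrics sensitive to prompt-induced formatting changes.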
-Looking ahead, model-based evaluators such as GPT-4 could offer a more flexible and robust evaluation approach. Such evaluators can better understand semantic meaning, handle diverse response formats, and provide more consistent scoring across different implementations. This shift from rigid metrics to intelligent evaluation systems may better capture the true capabilities of audio processing models. + Looking ahead, model-based evaluators such as GPT-4 could offer a more flexible and robust evaluation approach. Such evaluators can better understand semantic meaning, handle diverse response formats, and provide more consistent scoring across different implementations. This shift from rigid metrics to intelligent evaluation systems may better capture the true capabilities of audio processing models (an illustrative sketch of such a judge appears at the end of this section). ## Additional Experiments ### Batch Size -We perform an exploratory batch inference experiment on Qwen2-Audio with the following results: - + We perform an exploratory batch inference experiment on Qwen2-Audio with the following results: +Table 4: Impact of batch size | | **Split** | **Metric** | **Qwen2-Audio (BS=4)** | **Qwen2-Audio (BS=1)** | | --- | --- | --- | --- | --- | | **LibriSpeech** | dev-clean | wer(↓) | 1.66 | 1.66 | @@ -205,7 +204,7 @@ We perform an exploratory batch inference experiment on Qwen2-Audio with the fol | | test-others | | 4.06 | 3.87 | | **Total Time** | | | 10 mins 50 seconds | 5 min 23 seconds | -As shown in the above results, the batch inference (BS=4) can significantly saves the inference time, it could lead to evaluation inconsistencies compared to single-sample processing (BS=1). This is a known issue in the `transformers` library that currently lacks a solution. + As shown in the above results, while batch inference (BS=4) can significantly save inference time, it could lead to evaluation inconsistencies compared to single-sample processing (BS=1). This is a known issue in the `transformers` library that currently lacks a solution. ### More Details and Feature Updates with `v0.3.0` @@ -225,7 +224,7 @@ As shown in the above results, the batch inference (BS=4) can significantly save 2. **Support Audio Models** 1. [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio) - 2. [Gemini_Audio] + 2. [Gemini_Audio](https://arxiv.org/abs/2312.11805) 3. **Supporting Multi-Round Evaluation** 1. [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by [@CaraJ7](https://github.com/CaraJ7) in [#277](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/277) 4. **Regression Test**
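To close, here is a hedged sketch of the model-based evaluation direction discussed above. It is not the AudioBench or `lmms-eval` judging prompt (which we do not reproduce here); the prompt wording, the 0-5 scale, and the `judge` helper are illustrative assumptions, and the example assumes the `openai` Python package with an `OPENAI_API_KEY` set in the environment.

```python
# Illustrative sketch of a model-based (LLM-as-judge) evaluator.
# Not the actual AudioBench / lmms-eval prompt; the scale and wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = (
    "You are grading the response of an audio-language model.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {candidate}\n"
    "Rate the model answer from 0 to 5 for semantic correctness, ignoring "
    "differences in wording or formatting. Reply with the number only."
)

def judge(question: str, reference: str, candidate: str) -> int:
    """Ask a judge model for a 0-5 semantic-correctness score."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model could be substituted
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# A paraphrased but semantically correct answer should score highly here,
# even though WER- or BLEU-style metrics would penalize it heavily.
print(judge("What sound is heard in the clip?",
            "A dog barking in the distance.",
            "It sounds like a dog barking far away."))
```

Unlike rigid reference-based metrics, such a judge scores the paraphrased answer on meaning, which is the robustness property this section argues current fixed metrics lack.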