diff --git a/docs/lmms-eval-0.3.md b/docs/lmms-eval-0.3.md
index 2cd6c0d9..8117ae58 100644
--- a/docs/lmms-eval-0.3.md
+++ b/docs/lmms-eval-0.3.md
@@ -116,7 +116,7 @@ This upgrade includes multiple benchmarks for audio understanding and instructio
 | **Clotho-AQA** | 2022 | clotho_aqa | test \| val | AIF | Accuracy | test_v2 (2.06k), test \| val (1.44k \| 1.05k) | 1. Audio Question Answering<br>2. Single-word answer<br>3. Text-based question |
 | **Common_voice** | 2023 | common_voice_15 | test | ASR | WER (align with Qwen-Audio) | en (16.4k) \| fr (16.1k) \| zh (10.6k) | 1. Real people voice<br>2. Captioning |
 | **GigaSpeech** | 2021 | gigaspeech | test \| dev | ASR | WER | dev (6.75k) \| test (25.6k) | 1. Transcription<br>2. Audio book<br>3. YouTube<br>4. Podcasts |
-| **LibriSpeech** | 2015 | librispeech | dev-clean \| dev-other \| test-clean \| test-other | ASR | WER | dev-clean (~2.48k) \|dev-other (~2.66k) \|test-clean(~2.55k) \| test-other (~2.70k) | 1. Transcription (audio book) |
+| **LibriSpeech** | 2015 | librispeech | dev-clean \| dev-other \| test-clean \| test-other | ASR | WER | dev-clean (~2.48k) \|<br>dev-other (~2.66k) \|<br>test-clean (~2.55k) \|<br>test-other (~2.70k) | 1. Transcription (audio book) |
 | **OpenHermes** | 2024 | openhermes | test | AIF | GPT-Eval | 100 | 1. Synthetic voice |
 | **MuchoMusic** | 2024 | muchomusic | test | AIF | Accuracy | 1.19k | 1. Music understanding |
 | **People_speech** | 2021 | people_speech_val | val | ASR | WER | 18.6k | 1. Real people voice<br>2. Captioning |
@@ -126,11 +126,11 @@ This upgrade includes multiple benchmarks for audio understanding and instructio
 ### Alignment Check for Audio Datasets
-
-#### Table 2: Alignment check for audio datasets
+| **WavCaps** | test | GPT-Eval | 1.73 | |
+
 The results might be inconsistent with the reported results, as we do not have the original prompts and we have to maintain a fair environment for all the models.
 
 For the base model, we do not test on the Chat Benchmarks.
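Most of the ASR tasks listed above (common_voice_15, gigaspeech, librispeech, people_speech_val) are scored with word error rate (WER). As a rough, self-contained sketch of what that metric measures, the snippet below computes the word-level edit distance between hypothesis and reference, divided by the reference length. The function name is arbitrary and the snippet skips any text normalization a harness may apply before scoring (e.g. to align with Qwen-Audio), so treat it as an illustration rather than the exact scoring code used in lmms-eval.

```python
# Minimal WER sketch: word-level Levenshtein distance / reference length.
# Illustrative only; real ASR scoring pipelines may normalize text
# (casing, punctuation) before comparison.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # One substitution over six reference words -> WER of about 0.167.
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
```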