#### Table 2: Alignment check for audio datasets

<table>
<tr>
<th>Dataset</th>
<th>Split / Subset</th>
<th>Metric</th>
<th>Qwen2-Audio-Instruct (lmms-eval)</th>
<th>Qwen2-Audio (lmms-eval)</th>
</tr>
<tr>
<td rowspan="4" align="center">AIRBench-Chat</td>
<td>Speech</td>
<td rowspan="4" align="center">GPT-Eval</td>
<td>7.16</td>
<td></td>
</tr>
<tr>
<td>Sound</td>
<td>6.14</td>
<td></td>
</tr>
<tr>
<td>Music</td>
<td>6.66</td>
<td></td>
</tr>
<tr>
<td>Mixed</td>
<td>5.75</td>
<td></td>
</tr>
<tr>
<td rowspan="3" align="center">AIRBench-Foundation</td>
<td>Speech</td>
<td rowspan="3" align="center">Acc</td>
<td>62.89</td>
<td></td>
</tr>
<tr>
<td>Sound</td>
<td>55.42</td>
<td></td>
</tr>
<tr>
<td>Music</td>
<td>56.77</td>
<td></td>
</tr>
<tr>
<td>Alpaca</td>
<td>test</td>
<td>GPT-Eval</td>
<td>51.8</td>
<td></td>
</tr>
<tr>
<td>Clotho_aqa</td>
<td>test</td>
<td>GPT-Eval</td>
<td>0.7587</td>
<td></td>
</tr>
<tr>
<td rowspan="3" align="center">Common_voice</td>
<td>zh</td>
<td rowspan="3" align="center">WER</td>
<td>15.78</td>
<td>6.7</td>
</tr>
<tr>
<td>en</td>
<td>36.01</td>
<td>27.9</td>
</tr>
<tr>
<td>fr</td>
<td>39.88</td>
<td>34.8</td>
</tr>
<tr>
<td rowspan="2" align="center">GigaSpeech</td>
<td>dev</td>
<td rowspan="2" align="center">WER</td>
<td>19.45</td>
<td>14</td>
</tr>
<tr>
<td>test</td>
<td>22.6</td>
<td>15.01</td>
</tr>
<tr>
<td rowspan="4" align="center">LibriSpeech</td>
<td>dev-clean</td>
<td rowspan="4" align="center">WER</td>
<td>4.24</td>
<td>1.66</td>
</tr>
<tr>
<td>dev-other</td>
<td>6.54</td>
<td>3.66</td>
</tr>
<tr>
<td>test-clean</td>
<td>3.59</td>
<td>1.74</td>
</tr>
<tr>
<td>test-other</td>
<td>7.46</td>
<td>3.87</td>
</tr>
<tr>
<td>MuchoMusic</td>
<td>test</td>
<td>Acc</td>
<td>68.32</td>
<td>45.07</td>
</tr>
<tr>
<td>OpenHermes</td>
<td>test</td>
<td>GPT-Eval</td>
<td>46.8</td>
<td></td>
</tr>
<tr>
<td>People_speech</td>
<td>val</td>
<td>WER</td>
<td>25.86</td>
<td>17.1</td>
</tr>
<tr>
<td>Tedium</td>
<td>val</td>
<td>WER</td>
<td>10.92</td>
<td>8.29</td>
</tr>
<tr>
<td rowspan="2" align="center">VocalSound</td>
<td>test</td>
<td rowspan="2" align="center">Acc</td>
<td>0.936</td>
<td>0.81</td>
</tr>
<tr>
<td>val</td>
<td>0.9288</td>
<td>0.8</td>
</tr>
<tr>
<td>WavCaps</td>
<td>test</td>
<td>GPT-Eval</td>
<td>1.73</td>
<td></td>
</tr>
</table>


These results may be inconsistent with the originally reported numbers, as we do not have access to the original prompts and must maintain an identical, fair evaluation environment for all models. For the base model, we do not evaluate on the chat benchmarks.
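
For reference, below is a minimal sketch of how one row of Table 2 might be reproduced with the lmms-eval CLI. The model and task identifiers (`qwen2_audio`, `librispeech`) and the checkpoint name are assumptions based on this release's naming; check the task registry of your lmms-eval installation for the exact names.

```bash
# Hypothetical invocation: the model/task identifiers below are assumptions
# and should be verified against the tasks registered in your installation.
python3 -m lmms_eval \
    --model qwen2_audio \
    --model_args pretrained=Qwen/Qwen2-Audio-7B-Instruct \
    --tasks librispeech \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```

Swapping `--tasks` for the other benchmark names in Table 2 should yield the corresponding rows, subject to the prompt and environment caveats above.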