Hi, everyone.
I’ve noticed a significant discrepancy between the evaluation results on the MM Math dataset obtained with VLMEvalKit and the results reported in the original paper.
In the original MM Math paper, GPT-4o achieves an accuracy of 31.8%, whereas the evaluation with VLMEvalKit yields only 22.5%.
This gap does not appear to be caused by randomness in the model's outputs. After reviewing the code, I found that VLMEvalKit reuses the answer extraction code provided by the original MM Math repository.
However, the current code appears to match every occurrence of an answer enclosed in \boxed{} in the output (see this line), which can lead to incorrect judgments when the model outputs a boxed answer more than once.
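To make the concern concrete, here is a minimal sketch of what I mean. I'm not reproducing the exact extraction code; the function names, the simplified regex (which ignores nested braces), and the example response are just illustrative assumptions about behaviour along these lines:

```python
import re

def extract_boxed_all(response: str) -> list[str]:
    # Roughly the behaviour I'm describing: collect the content of every
    # \boxed{...} in the response (nested braces ignored for brevity).
    return re.findall(r"\\boxed\{([^{}]*)\}", response)

def extract_boxed_last(response: str) -> str | None:
    # One possible alternative: keep only the last \boxed{...}, which is
    # usually the model's final answer.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

# Hypothetical model response that boxes an intermediate value and the final answer.
response = (
    "First, the side length is \\boxed{3} cm, so the area is "
    "3 * 3 = 9, i.e. \\boxed{9}."
)
print(extract_boxed_all(response))   # ['3', '9'] -> ambiguous which one gets judged
print(extract_boxed_last(response))  # '9'        -> only the final answer
```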
I believe this is likely the cause of the performance discrepancy relative to the original MM Math paper.
Additionally, since the answers in this dataset are open-ended, using an LLM to extract and compare answers might be a better option than the current hard-coded matching approach.
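For illustration, a rough sketch of the LLM-as-judge idea is below. I'm not claiming this matches VLMEvalKit's existing judge infrastructure; the prompt wording, the `llm_judge` helper, and the judge model name are all assumptions, and the OpenAI client is used only as a stand-in for whatever judge backend the maintainers prefer:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any judge endpoint would do

JUDGE_PROMPT = """You are grading a math answer.
Ground-truth answer: {gt}
Model response: {response}
Extract the final answer from the model response and decide whether it is
mathematically equivalent to the ground truth. Reply with only "correct" or "incorrect"."""

def llm_judge(gt: str, response: str, model: str = "gpt-4o-mini") -> bool:
    # Let the judge model extract and compare the final answer itself,
    # instead of relying on hard-coded \boxed{} matching.
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(gt=gt, response=response)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```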