
Discrepancy in MM Math Evaluation Results: Possible Issue with Answer Extraction in VLMEvalkit #638

Open
mantle2048 opened this issue Dec 1, 2024 · 2 comments

@mantle2048

Hi, everyone.

I’ve noticed a significant discrepancy between the MM Math evaluation results produced by VLMEvalKit and the results reported in the original paper.

In the original MM Math paper, GPT-4o achieved an accuracy of 31.8%, whereas the evaluation with VLMEvalKit yields only 22.5%.

This difference doesn’t seem to be caused by randomness in the model's outputs. Upon reviewing the code, I found that VLMEvalKit uses the answer extraction code provided by the original MM Math repository.

However, the current code seems to match every occurrence of \boxed{} in the output (see this line). If the model restates the answer more than once, the extraction returns multiple matches instead of a single final answer, which can cause a correct response to be judged as wrong.

I believe this might be the reason for the performance discrepancy compared to the original MM Math paper.
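
For illustration, here is a minimal Python sketch of the failure mode; the response string and regex are simplified assumptions, not the actual extraction code:

```python
import re

# Hypothetical model output that restates the boxed answer twice.
response = (
    "The side length is \\boxed{5}. "
    "Therefore, the final answer is \\boxed{5} cm."
)

# Matching every occurrence of \boxed{...} yields a list of answers.
all_matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
print(all_matches)   # ['5', '5']

# A simple mitigation: keep only the last boxed expression,
# which is usually the model's final answer.
final_answer = all_matches[-1] if all_matches else None
print(final_answer)  # 5
```

With a list-based comparison, a verbose but correct response like the one above can be marked wrong; keeping only the final \boxed{} occurrence (or accepting a match on any occurrence) would avoid penalizing it.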

Additionally, since the answers in this dataset are open-ended, using an LLM for answer extraction and comparison might be a better option than the hardcoded matching approach.
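
As a rough sketch of what LLM-based judging could look like (the prompt, model name, and use of the OpenAI client are my own assumptions, not a concrete proposal for VLMEvalKit):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading a math answer.\n"
    "Ground-truth answer: {gt}\n"
    "Model response: {pred}\n"
    "Reply with exactly 'correct' if the final answer in the response matches "
    "the ground truth (allowing equivalent forms, e.g. 0.5 vs 1/2); "
    "otherwise reply 'incorrect'."
)


def llm_judge(gt: str, pred: str, model: str = "gpt-4o-mini") -> bool:
    """Ask an LLM whether the predicted response matches the ground truth."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(gt=gt, pred=pred)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```

This trades determinism and cost for robustness to answer formatting, so it would probably make sense as an optional judge alongside the rule-based extraction rather than a replacement.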

@kennymckormick
Member

Hi @mantle2048,
we will look into the evaluation code and refine it. Could you please provide some error cases to help with debugging?

@mantle2048
Author

Sure, I would be happy to!
