Hi, everyone.
I’ve noticed a significant discrepancy between the evaluation results on the MM Math dataset obtained with VLMEvalKit and the results reported in the original paper.
In the original MM Math paper, GPT-4o achieves an accuracy of 31.8%, whereas the evaluation with VLMEvalKit yields only 22.5%.
This gap does not appear to be caused by randomness in the model's outputs. After reviewing the code, I found that VLMEvalKit reuses the answer extraction code provided by the original MM Math repository.
However, the current code appears to match every occurrence of an answer enclosed in \boxed{} in the output (see this line), which can lead to incorrect judgments when the model outputs a boxed answer more than once.
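To make the concern concrete, here is a minimal sketch of what I mean. I'm not reproducing the exact extraction code; the function names, the simplified regex (which ignores nested braces), and the example response are just illustrative assumptions about behaviour along these lines:

```python
import re

def extract_boxed_all(response: str) -> list[str]:
    # Roughly the behaviour I'm describing: collect the content of every
    # \boxed{...} in the response (nested braces ignored for brevity).
    return re.findall(r"\\boxed\{([^{}]*)\}", response)

def extract_boxed_last(response: str) -> str | None:
    # One possible alternative: keep only the last \boxed{...}, which is
    # usually the model's final answer.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

# Hypothetical model response that boxes an intermediate value and the final answer.
response = (
    "First, the side length is \\boxed{3} cm, so the area is "
    "3 * 3 = 9, i.e. \\boxed{9}."
)
print(extract_boxed_all(response))   # ['3', '9'] -> ambiguous which one gets judged
print(extract_boxed_last(response))  # '9'        -> only the final answer
```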
I believe this is likely the cause of the performance discrepancy relative to the original MM Math paper.
Additionally, since the answers in this dataset are open-ended, using an LLM to extract and compare answers might be a better option than the current hard-coded matching approach.
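For illustration, a rough sketch of the LLM-as-judge idea is below. I'm not claiming this matches VLMEvalKit's existing judge infrastructure; the prompt wording, the `llm_judge` helper, and the judge model name are all assumptions, and the OpenAI client is used only as a stand-in for whatever judge backend the maintainers prefer:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any judge endpoint would do

JUDGE_PROMPT = """You are grading a math answer.
Ground-truth answer: {gt}
Model response: {response}
Extract the final answer from the model response and decide whether it is
mathematically equivalent to the ground truth. Reply with only "correct" or "incorrect"."""

def llm_judge(gt: str, response: str, model: str = "gpt-4o-mini") -> bool:
    # Let the judge model extract and compare the final answer itself,
    # instead of relying on hard-coded \boxed{} matching.
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(gt=gt, response=response)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```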