Qwen-VL-Max-0809: measured MME celebrity score differs from the leaderboard result by about 10 points #571

Open
lihua8848 opened this issue Nov 2, 2024 · 7 comments

Comments

@lihua8848

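All of the test code is VLMEvalKit. I only changed the API to qwen-vl-max-0809 and evaluated the MME celebrity subset; I did not touch the prompt or the score computation. Why is the difference from the leaderboard so large?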
image
image

image
@lihua8848
Author

The third image is from the leaderboard as it appeared a few days ago. Why did that entry disappear in the 1101 update?

@kennymckormick
Member

Hi, @lihua8848 ,
We recently added Qwen2-VL-72B to the leaderboard. According to the Qwen team, the model behind Qwen-VL-Max-0809 is Qwen2-VL-72B, so we removed the Qwen-VL-Max-0809 entry. We may add it back to the OpenVLM Leaderboard later.

As for the performance gap, I can re-run the evaluation and check the numbers. Also note that Qwen-VL-Max is a proprietary API model and each individual MME capability has only a small number of samples, so the evaluation results can vary from run to run.
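To illustrate how sensitive a single MME capability is to a handful of answers, here is a minimal sketch of the MME-style subtask score (question-level accuracy plus image-level "accuracy+", each in percent, for a maximum of 200). The function name `mme_subtask_score`, the figure of 170 images for the celebrity subset, and the two synthetic result lists are assumptions for illustration only; this is not the VLMEvalKit implementation and not real model output.

```python
def mme_subtask_score(results):
    """Compute an MME-style subtask score.

    results: list of (q1_correct, q2_correct) booleans, one pair per image,
    since each MME image comes with two yes/no questions.
    Returns acc + acc_plus, both in percent (maximum 200).
    """
    n_images = len(results)
    n_questions = 2 * n_images
    # acc: fraction of individual questions answered correctly
    acc = 100.0 * sum(int(a) + int(b) for a, b in results) / n_questions
    # acc_plus: fraction of images where BOTH questions are correct
    acc_plus = 100.0 * sum(1 for a, b in results if a and b) / n_images
    return acc + acc_plus


# Assumed subset size (170 images, 340 questions); synthetic data only.
run_a = [(True, True)] * 150 + [(True, False)] * 20
run_b = [(True, True)] * 140 + [(True, False)] * 30  # 10 more single misses

score_a = mme_subtask_score(run_a)
score_b = mme_subtask_score(run_b)
print(f"run A: {score_a:.2f}, run B: {score_b:.2f}, gap: {score_a - score_b:.2f}")
```

In this synthetic case, flipping just 10 of 340 answers moves the subtask score by roughly 8.8 points, because every miss lowers both the question-level and the image-level term.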

@lihua8848
Author

It would be great if you could help me reproduce the MME celebrity results. How can I assist you? I'm using the following API for testing.
image

@kennymckormick
Member

@lihua8848

Basically, my current evaluation results align with yours.

image

@kennymckormick
Member

The results of Qwen-VL-Max-0809 have been added back to the leaderboard.

@lihua8848
Author

image
The celebrity metric is still incorrect.

@kennymckormick
Member

@lihua8848
To clarify, for API models there is no so-called "correct" performance, since the underlying model may change frequently. An entry on the leaderboard only represents the performance at the time of evaluation. I will add a new field, Evaluation Time, to the leaderboard to make this clearer.
