Thanks for your contribution in providing a collection of VLM datasets and models. I'm wondering why the tsv versions of datasets in this repository are smaller than the official versions. For example, the RealWorldQA dataset downloaded from the official website RealWorldQA is 677 MB, while the tsv version in this repo RealWorldQA_tsv is only 175 MB. You are using base64 to encode images into text and store them directly in tsv columns, which should be lossless. So why has the data size been reduced so significantly? Other datasets seem to be in a similar situation.
Hello, @ywwynm
The images in the dataset on the official RealWorldQA website are in webp format. When we converted the original dataset to tsv, we uniformly re-encoded the images as JPEG during encoding; you can refer to the code here in our repo.
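For readers curious what that kind of conversion looks like, here is a minimal sketch of re-encoding images as base64 JPEG text in a tsv column. It assumes Pillow and pandas; the function and column names (and `sample.webp`) are illustrative, not the repo's actual code. JPEG is lossy, so a webp source will usually shrink noticeably after this step.

```python
import base64
import io

import pandas as pd
from PIL import Image


def image_to_jpeg_base64(path: str, quality: int = 95) -> str:
    """Open an image (e.g. webp) and re-encode it as base64-encoded JPEG text."""
    img = Image.open(path).convert("RGB")  # JPEG has no alpha channel
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


# Store one base64-encoded image per row in a tsv (illustrative columns).
records = [
    {"index": 0, "question": "What is in the image?", "image": image_to_jpeg_base64("sample.webp")},
]
pd.DataFrame(records).to_csv("dataset.tsv", sep="\t", index=False)
```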
@SYuan03 Thanks for your explanation. Is the same processing also applied to other datasets like SeedBench or MMTBench? If the original images are already in JPEG format, do you compress them again using the same code?
Hello, @ywwynm
In fact, if the original images are already in JPEG format, there is no such significant change in data size after our processing. We only convert them to the tsv format we need, for the convenience of unified processing.
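If you want to sanity-check how much the re-encoding changes a given dataset, one rough way (assuming the same pandas/Pillow setup and the illustrative column names from the sketch above) is to decode the base64 column back into images and compare their sizes against the originals:

```python
import base64
import io

import pandas as pd
from PIL import Image

# Decode the base64 column back into images and report their format and size.
df = pd.read_csv("dataset.tsv", sep="\t")
for _, row in df.iterrows():
    img_bytes = base64.b64decode(row["image"])
    img = Image.open(io.BytesIO(img_bytes))
    print(row["index"], img.format, img.size, f"{len(img_bytes) / 1024:.1f} KiB")
```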