Thanks for your contribution in providing a collection of VLM datasets and models. I'm wondering why the tsv versions of datasets in this repository are smaller than the official versions. For example, the RealWorldQA dataset downloaded from the official website RealWorldQA is 677 MB, while the tsv version in this repo RealWorldQA_tsv is only 175 MB. You are using base64 to encode images into text and store them directly in tsv columns, which should be lossless. So why has the data size been reduced so significantly? Other datasets seem to be in a similar situation.
Hello, @ywwynm
The images in the dataset on the official RealWorldQA website are in webp format. When we converted the original dataset to tsv, we uniformly re-encoded the images as JPEG during encoding; you can refer to the code here in our repo.
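For readers curious what that kind of conversion looks like, here is a minimal sketch of re-encoding images as base64 JPEG text in a tsv column. It assumes Pillow and pandas; the function and column names (and `sample.webp`) are illustrative, not the repo's actual code. JPEG is lossy, so a webp source will usually shrink noticeably after this step.

```python
import base64
import io

import pandas as pd
from PIL import Image


def image_to_jpeg_base64(path: str, quality: int = 95) -> str:
    """Open an image (e.g. webp) and re-encode it as base64-encoded JPEG text."""
    img = Image.open(path).convert("RGB")  # JPEG has no alpha channel
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


# Store one base64-encoded image per row in a tsv (illustrative columns).
records = [
    {"index": 0, "question": "What is in the image?", "image": image_to_jpeg_base64("sample.webp")},
]
pd.DataFrame(records).to_csv("dataset.tsv", sep="\t", index=False)
```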
@SYuan03 Thanks for your explanation. Is the same processing also applied to other datasets like SeedBench or MMTBench? If the original images are already in JPEG format, do you compress them again using the same code?
Hello, @ywwynm
In fact, if the original images are already in JPEG format, there is no such significant change in data size after our processing. We only convert them to the tsv format we need, for the convenience of unified processing.
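If you want to sanity-check how much the re-encoding changes a given dataset, one rough way (assuming the same pandas/Pillow setup and the illustrative column names from the sketch above) is to decode the base64 column back into images and compare their sizes against the originals:

```python
import base64
import io

import pandas as pd
from PIL import Image

# Decode the base64 column back into images and report their format and size.
df = pd.read_csv("dataset.tsv", sep="\t")
for _, row in df.iterrows():
    img_bytes = base64.b64decode(row["image"])
    img = Image.open(io.BytesIO(img_bytes))
    print(row["index"], img.format, img.size, f"{len(img_bytes) / 1024:.1f} KiB")
```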