Feint6K dataset for video-text understanding, from the following paper:
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, Alan Yuille
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset, to better assess the capabilities of current video-text models and understand their limitations. To succeed on our new task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented data and are far behind human-level performance.
From our experiments on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach to learn action semantics by leveraging knowledge obtained from a pretrained large language model.
- Download the Feint6K data (`.csv` files with counterfactually augmented captions) from here. A minimal loading sketch is given after this list.
- Download video data for MSR-VTT and VATEX to a video data folder, e.g., `./videos`:

  ```
  ./videos
  |- msrvttvideo
  |  |- *.mp4
  |- vatexvideo
     |- *.mp4
  ```
- Compute the video-text similarity matrix, e.g., with LanguageBind. Similarity matrices will be saved to `sim_mat_msrvtt.npy` and `sim_mat_vatex.npy` for RCAD on MSR-VTT and VATEX, respectively. A model-agnostic sketch for producing such a matrix with your own model follows this list.

  ```bash
  # install and activate the conda environment for LanguageBind
  # see: https://github.com/PKU-YuanGroup/LanguageBind?tab=readme-ov-file#%EF%B8%8F-requirements-and-installation
  conda activate languagebind
  python3 compute_sim_mat_languagebind.py --video_path videos
  ```
- Compute RCAD metrics for any video-text model, given its saved similarity matrix (see the metric sketch after this list):

  ```bash
  python3 eval_rcad.py
  ```

  The RCAD results will be printed to the console, e.g.,

  ```
  RCAD on msrvtt: R@1=41.7 R@3=76.5 meanR=2.4 medianR=2.0
  RCAD on vatex: R@1=43.2 R@3=77.2 meanR=2.3 medianR=2.0
  ```
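For convenience, here is a minimal sketch of inspecting the downloaded annotation files with pandas. The file name below is a placeholder; point it at the `.csv` files you downloaded, and rely on the released files for the actual column schema.

```python
# Minimal sketch: inspect a downloaded Feint6K .csv file.
# "feint6k_msrvtt.csv" is a placeholder name; use the actual downloaded file path.
import pandas as pd

df = pd.read_csv("feint6k_msrvtt.csv")
print(len(df), "rows")
print("columns:", list(df.columns))
print(df.head())
```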
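To evaluate a model other than LanguageBind, the similarity-matrix step only requires a saved `.npy` matrix. A model-agnostic sketch is below; the embedding variables are hypothetical, and the exact row/column layout expected by `eval_rcad.py` should be checked against `compute_sim_mat_languagebind.py`.

```python
# Model-agnostic sketch: build and save a video-text similarity matrix.
# video_emb (num_videos, d) and text_emb (num_captions, d) are assumed to be
# L2-normalized embeddings from your own model; the names are hypothetical.
import numpy as np

def save_sim_mat(video_emb: np.ndarray, text_emb: np.ndarray, out_path: str) -> None:
    # cosine similarity reduces to a dot product for L2-normalized embeddings
    sim_mat = video_emb @ text_emb.T
    np.save(out_path, sim_mat)  # e.g., "sim_mat_msrvtt.npy" or "sim_mat_vatex.npy"
```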
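For intuition about the reported numbers, the sketch below computes retrieval metrics of the kind printed by `eval_rcad.py` (R@1, R@3, mean and median rank) from a saved similarity matrix. It assumes each row scores one video against its candidate captions with the ground-truth caption in column 0; this layout is an assumption for illustration, not necessarily the format used by the script.

```python
# Illustrative sketch of rank-based metrics like those printed by eval_rcad.py.
# Assumption: sim_mat[i] holds video i's scores over its candidate captions, with
# the correct (original) caption in column 0. The real script may differ.
import numpy as np

def rcad_metrics(sim_mat: np.ndarray) -> dict:
    # rank (1 = best) of the correct caption for each video
    order = np.argsort(-sim_mat, axis=1)
    ranks = np.argmax(order == 0, axis=1) + 1
    return {
        "R@1": 100.0 * np.mean(ranks <= 1),
        "R@3": 100.0 * np.mean(ranks <= 3),
        "meanR": float(np.mean(ranks)),
        "medianR": float(np.median(ranks)),
    }

if __name__ == "__main__":
    for split in ["msrvtt", "vatex"]:
        metrics = rcad_metrics(np.load(f"sim_mat_{split}.npy"))
        print(f"RCAD on {split}:", " ".join(f"{k}={v:.1f}" for k, v in metrics.items()))
```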
All data collection and experiments in this work were conducted at JHU.
Ethics. We follow the ethics guidelines of ECCV 2024 and obtained Institutional Review Board (IRB) approval prior to the start of our work. We described potential risks to the annotators, such as being exposed to inappropriate videos from public video datasets, and explained the purpose of the study and how the collected data would be used. All annotators agreed to join this project voluntarily and were paid a fair amount, as required by our institution.
If you find this dataset helpful, please cite:
```bibtex
@inproceedings{ma2024rethinking,
  title={Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data},
  author={Ma, Wufei and Li, Kai and Jiang, Zhongshi and Meshry, Moustafa and Liu, Qihao and Wang, Huiyu and H{\"a}ne, Christian and Yuille, Alan},
  booktitle={European Conference on Computer Vision},
  year={2024},
  organization={Springer}
}
```
Feint6K is CC-BY-NC 4.0 licensed, as found in the LICENSE file.