Feint6K Dataset

The Feint6K dataset for video-text understanding, introduced in the following paper:

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, Alan Yuille
Meta Reality Labs · Johns Hopkins University · Meta AI

ECCV 2024 · Project Page · arXiv

We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset, to better assess the capabilities of current video-text models and understand their limitations. To succeed on our new task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented data and are far behind human-level performance.
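To make the task concrete, here is a hypothetical example (captions invented for illustration, not drawn from Feint6K): each video is scored against its original caption and counterfactually augmented captions that alter the described action, and the model must rank the original caption highest.

    # Hypothetical captions illustrating the RCAD setup: counterfactual
    # edits swap the action while keeping the rest of the sentence intact.
    candidates = [
        "a man opens the door and walks into the room",   # original caption
        "a man closes the door and walks into the room",  # counterfactual edit
        "a man kicks the door and walks into the room",   # counterfactual edit
    ]
    # A video-text model scores each candidate against the video; RCAD asks
    # whether the original caption is ranked first (R@1) or in the top k (R@k).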

From our experiments on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach for learning action semantics by leveraging knowledge obtained from a pretrained large language model.
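As a purely conceptual sketch of how LLM-derived hard negatives could enter a contrastive objective (the function below is an assumption for illustration, not the paper's LLM-teacher implementation):

    # Conceptual sketch only: a contrastive loss where, in addition to the
    # usual in-batch negatives, each video is scored against K hard-negative
    # captions (e.g., action swaps suggested by a pretrained LLM).
    import torch
    import torch.nn.functional as F

    def contrastive_loss_with_llm_negatives(video_emb, text_emb, neg_text_emb,
                                            temperature=0.07):
        # video_emb:    (B, D) video embeddings, L2-normalized
        # text_emb:     (B, D) ground-truth caption embeddings, L2-normalized
        # neg_text_emb: (B, K, D) embeddings of K LLM-generated hard negatives
        B = video_emb.size(0)
        sim_pos = video_emb @ text_emb.t()                             # (B, B)
        sim_neg = torch.einsum('bd,bkd->bk', video_emb, neg_text_emb)  # (B, K)
        logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature
        labels = torch.arange(B, device=video_emb.device)  # diagonal positives
        return F.cross_entropy(logits, labels)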

Data Preparation

  1. Download Feint6K data (.csv files with counterfactually augmented captions) from here.

  2. Download video data for MSR-VTT and VATEX to a video data folder, e.g., ./videos, organized as shown below (a quick sanity check follows the tree):

    ./videos
      |- msrvttvideo
      |   |- *.mp4
      |- vatexvideo
          |- *.mp4
    
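A quick way to sanity-check the layout (illustrative snippet, not part of the repository):

    # Count the downloaded .mp4 files in each expected folder.
    import glob, os

    video_root = "./videos"
    for folder in ("msrvttvideo", "vatexvideo"):
        n = len(glob.glob(os.path.join(video_root, folder, "*.mp4")))
        print(f"{folder}: {n} mp4 files")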

Example RCAD Evaluation on Feint6K Dataset

  1. Compute the video-text similarity matrix, e.g., with LanguageBind. The similarity matrices will be saved to sim_mat_msrvtt.npy and sim_mat_vatex.npy for RCAD on MSR-VTT and VATEX, respectively.

    # install and activate conda environment for LanguageBind
    # see: https://github.com/PKU-YuanGroup/LanguageBind?tab=readme-ov-file#%EF%B8%8F-requirements-and-installation
    conda activate languagebind
    
    python3 compute_sim_mat_languagebind.py --video_path videos
  2. Given the saved similarity matrices, compute RCAD metrics for any video-text model (a sketch of the metric computation follows this example):

    python3 eval_rcad.py

    The RCAD results will be printed to the console, e.g.,

    RCAD on msrvtt: R@1=41.7 R@3=76.5 meanR=2.4 medianR=2.0
    RCAD on vatex: R@1=43.2 R@3=77.2 meanR=2.3 medianR=2.0
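
For intuition, here is a minimal sketch of how such retrieval metrics can be computed from a saved similarity matrix; the actual eval_rcad.py may organize this differently, and placing the ground-truth caption on the diagonal is an assumption made here for illustration:

    # Load a saved (num_videos x num_captions) similarity matrix and compute
    # R@1, R@3, mean rank, and median rank of the ground-truth captions.
    import numpy as np

    sim_mat = np.load("sim_mat_msrvtt.npy")
    gt = np.arange(sim_mat.shape[0])      # assumption: ground truth on diagonal
    order = np.argsort(-sim_mat, axis=1)  # candidates sorted by similarity
    ranks = np.array([np.where(order[i] == gt[i])[0][0] + 1
                      for i in range(len(gt))])  # 1-based rank of true caption

    print(f"R@1={np.mean(ranks <= 1) * 100:.1f} "
          f"R@3={np.mean(ranks <= 3) * 100:.1f} "
          f"meanR={ranks.mean():.1f} medianR={np.median(ranks):.1f}")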
    

Statements

All data collection and experiments in this work were conducted at JHU.

Ethics. We follow the ethics guidelines of ECCV 2024 and obtained Institutional Review Board (IRB) approval prior to the start of our work. We described potential risks to the annotators, such as exposure to inappropriate videos from public video datasets, and explained the purpose of the study and how the collected data would be used. All annotators joined this project voluntarily and were paid fairly, as required by our institution.

Citation

If you find this dataset helpful, please cite:

@inproceedings{ma2024rethinking,
  title={Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data},
  author={Ma, Wufei and Li, Kai and Jiang, Zhongshi and Meshry, Moustafa and Liu, Qihao and Wang, Huiyu and H{\"a}ne, Christian and Yuille, Alan},
  booktitle={European Conference on Computer Vision},
  year={2024},
  organization={Springer}
}

License

Feint6K is CC-BY-NC 4.0 licensed, as found in the LICENSE file.

