To address the limitations of existing benchmarks for long-form video language understanding, we draw on recent developments in the field and introduce a new benchmark designed for the task of identifying specific content within long videos, a challenge we refer to as the Multimodal Needle In A Video Haystack (NIAVH). The benchmark is unique in its flexibility: it supports text, images, and videos as needles, and it accommodates video content of any length, enabling a more comprehensive assessment of a model's video understanding capabilities.
In our benchmark, we use ego-centric videos from the Ego4D dataset as the "haystack". Within this haystack, we seek to locate the "needle", which we provide in three distinct modalities. For the text modality, we supply a crafted description. For the image modality, we use DALL-E to create an image that visually represents this description. For the video modality, we use Sora to generate a short video clip based on the same description. In each case, the needle, whether text, image, or video, is set to a duration of 1 second.
Currently supported models: VideoLLaMB, LongVA, LLaVA-NeXT, PLLaVA, MA-LMM
For additional models, please refer to the baselines we have provided for setup.
Installation
- VideoLLaMB-Mem
  - install the environment following its instructions
  - download the checkpoint into needlehaystack/baselines/checkpoints/videollamb-mem-llava-1.5-7b
- LongVA
  - install the environment following its instructions
  - download the checkpoint into needlehaystack/baselines/checkpoints/LongVA-7B (see the example download command after this list)
- LLaVA-NeXT-Video
  - install the environment following its instructions
  - download the checkpoint into needlehaystack/baselines/checkpoints/LLaVA-NeXT-Video/LLaVA-NeXT-Video-7B-DPO
- PLLaVA
  - install the environment following its instructions
  - download the checkpoint into needlehaystack/baselines/checkpoints/PLLaVA/pllava-7b
- MA-LMM
  - install the environment following its instructions
  - download the checkpoints into needlehaystack/baselines/checkpoints/MA-LMM, including vicuna-7b-v1.1, eva_vit_g.pth, and instruct_blip_vicuna7b_trimmed.pth
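As one hedged example of the checkpoint step, the LongVA-7B weights can usually be fetched with huggingface-cli; the repository id lmms-lab/LongVA-7B is an assumption here, so double-check it against LongVA's own instructions:
huggingface-cli download lmms-lab/LongVA-7B --local-dir needlehaystack/baselines/checkpoints/LongVA-7B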
Install the additional packages in the environment of the model to be tested:
pip install -r requirements.txt
Single needle debug
python -m needlehaystack.run --provider LongVA --video_depth_percents "[55]" --context_lengths "[9]"
Pressure test (sweeps every combination of context length and video depth percent)
python -m needlehaystack.run --provider LongVA
Parameters
provider
- currently supported models: VideoLLaMB, LongVA, LLaVA-NeXT, PLLaVA, MA-LMM
evaluator_model_name
- currently supported evaluation API: gpt-35-turbo-0125
needle
- the needle, which can be a string, a video file name, or an image file name
needle_modality
- currently supported: text, image, video
needle_desc
- required for an image or video needle (the answer to the question)
retrieval_question
- required for an image or video needle (the question)
needle_dir
- required for an image or video needle (directory in which the needle is saved)
haystack_dir
- required for an image or video haystack (directory in which the haystacks are saved)
context_lengths
- the video context lengths
video_depth_percents
- the needle depth percents
context_lengths_min
- The minimum length of the context. Default is 1 second.
context_lengths_max
- The maximum length of the context. Default is 320 seconds.
context_lengths_num_intervals
- The number of intervals for the context length. Default is 40.
video_depth_percent_min
- The minimum depth percent of the video. Default is 0.
video_depth_percent_max
- The maximum depth percent of the video. Default is 100.
video_depth_percent_intervals
- The number of intervals for the video depth percent. Default is 12.
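For illustration, assuming the command-line flags mirror the parameter names above (only --provider, --video_depth_percents, and --context_lengths appear in the commands shown earlier, so the remaining flag names and all values below are illustrative assumptions), an image-needle run might look like this:
python -m needlehaystack.run --provider LongVA --needle balloon.png --needle_modality image --needle_desc "a red balloon floating above a lighthouse" --retrieval_question "What is floating above the lighthouse?" --needle_dir needlehaystack/needle --haystack_dir needlehaystack/haystack --evaluator_model_name gpt-35-turbo-0125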
Note: you can add more videos to the needlehaystack/haystack directory to obtain a longer haystack video.
Use viz/visualization.ipynb to visualize your results.
Given the limitations of current methods in understanding long videos, we designed an experiment in which the "haystack" is a 320-second video. The "needle" is a 1-second video clip generated by Sora from the description "the young man seated on a cloud in the sky is reading a book". The associated question is "What is the young man seated on a cloud in the sky doing?". We divided the context length into 40 intervals and the video depth into 12 intervals.
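Assuming again that the flag names mirror the parameters above, that LongVA is used as the example provider, and that the Sora-generated clip is saved under a hypothetical name such as sora_needle.mp4, this experiment could be reproduced roughly as follows:
python -m needlehaystack.run --provider LongVA --needle sora_needle.mp4 --needle_modality video --needle_desc "the young man seated on a cloud in the sky is reading a book" --retrieval_question "What is the young man seated on a cloud in the sky doing?" --context_lengths_min 1 --context_lengths_max 320 --context_lengths_num_intervals 40 --video_depth_percent_intervals 12
These values match the documented defaults, so in practice the length and depth flags could be omitted.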
This code is built on LLMTest_NeedleInAHaystack. Many thanks to them for their work.
If you find our work helpful, please consider citing it.
@misc{mm-niavh,
title={MLLM Pressure Test: Needle In A Video Haystack},
author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
journal={github},
year={2024}
}
@article{videollamb,
title={VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges},
author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
journal={arxiv},
year={2024}
}