COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

This repository is the official PyTorch implementation of our paper, published at NeurIPS 2020.

Method

[Figure: Model outline]

Development Roadmap

Current version features

  • Reproduce the evaluation results on Video-Text Retrieval either with the provided models or by training them from scratch. Configurations and weights for the COOT models described in Tables 2 and 3 of the paper are provided.

Planned features

  • Upload COOT feature output. See this issue for an explanation of how to extract the features yourself with this version.
  • Reproduce the Video Captioning results described in Tables 4 and 5.
  • Improve the code to make it easier to input a custom dataset.

Prerequisites

Requires Python>=3.6 and PyTorch>=1.4. Tested on Ubuntu. At least 8GB of free RAM is needed to load the text features into memory. GPU training is recommended for speed (requires 2x11GB of GPU memory).
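
As a quick sanity check of the environment, a minimal sketch assuming a standard PyTorch installation:

import sys
import torch

# Check Python/PyTorch versions and GPU availability (sketch).
assert sys.version_info >= (3, 6), "Python>=3.6 required"
print("PyTorch:", torch.__version__)  # should be >= 1.4
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f}GB")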

Installation

  1. Install Python and PyTorch.
  2. Clone the repository: git clone https://github.com/gingsi/coot-videotext
  3. Set the working directory to the repository root: cd coot-videotext
  4. Install the remaining requirements: pip install -r requirements.txt

All further commands in this Readme assume that the current working directory is the root of this repository.

Prepare datasets

ActivityNet Captions

Download: Please download the file via P2P using this torrent and kindly keep seeding after you are done. See Troubleshoot / Downloading torrents below for help.

Alternative Google Drive download: Download Link or Mirror Link

# 1) download ~52GB zipped features to data/activitynet/
# 2) unzip
# after extraction, the folder structure should look like this:
# data/activitynet/features/ICEP_V3_global_pool_skip_8_direct_resize/v_XXXXXXXXXXX.npz
tar -C data/activitynet/features -xvzf data/activitynet/ICEP_V3_global_pool_skip_8_direct_resize.tar.gz
# 3) preprocess dataset and compute BERT features
python prepare_activitynet.py
python run_bert.py activitynet --cuda
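
To verify the extraction, you can inspect one of the feature files; a minimal sketch (the array names inside the .npz archives are not documented here, so the code simply lists them):

import glob
import numpy as np

# List extracted ActivityNet feature files and inspect the first one (sketch).
files = sorted(glob.glob(
    "data/activitynet/features/ICEP_V3_global_pool_skip_8_direct_resize/*.npz"))
print(len(files), "feature files found")
with np.load(files[0]) as data:
    for key in data.files:  # array names stored in the archive
        print(key, data[key].shape)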

YouCook2 with ImageNet/Kinetics features

Download: Please download the file via P2P using this torrent and kindly keep seeding after you are done. See Troubleshoot / Downloading torrents below for help.

Alternative Google Drive download: Download Link

# 1) download ~13GB zipped features to data/youcook2/
# 2) unzip
tar -C data/youcook2 -xzvf data/youcook2/video_feat_2d3d.tar.gz
# after extraction, the folder structure should look like this:
# data/youcook2/video_feat_2d3d.h5
# 3) preprocess dataset and compute BERT features
python prepare_youcook2.py
python run_bert.py youcook2 --cuda --metadata_name 2d3d
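
To check the extracted file, a minimal sketch using h5py (the internal layout is an assumption; the code only lists top-level entries, and the same check works for video_feat_100m.h5 below):

import h5py

# Inspect the extracted YouCook2 feature file (sketch; only lists
# the top-level entries, since the internal layout is not documented here).
with h5py.File("data/youcook2/video_feat_2d3d.h5", "r") as f:
    print(len(f), "entries")
    for name in list(f)[:5]:
        print(name, f[name])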

YouCook2 with HowTo100M features

Download: Please download the file via P2P using this torrent and kindly keep seeding after you are done. See Troubleshoot / Downloading torrents below for help.

Alternative Google Drive download: Download Link

# 1) download ~623MB zipped features to data/youcook2/
# 2) unzip
tar -C data/youcook2 -xzvf data/youcook2/video_feat_100m.tar.gz
# after extraction, the folder structure should look like this:
# data/youcook2/video_feat_100m.h5
# 3) preprocess dataset and compute BERT features
python prepare_youcook2.py --howto100m
python run_bert.py youcook2 --cuda --metadata_name 100m

Download provided models for evaluation

Google Drive download: Download Link

# 1) download ~100MB zipped models
# 2) unzip
tar -xzvf provided_models.tar.gz
# after extraction, the folder structure should look like this:
# provided_models/MODEL_NAME.pth
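
To confirm a downloaded model loads correctly, a minimal sketch (the checkpoint layout is an assumption; this only prints the top-level keys):

import torch

# Load a provided checkpoint on CPU and show what it contains (sketch).
ckpt = torch.load("provided_models/anet_coot.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
else:
    print(type(ckpt))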

Run

Script flags

--preload_vid  # preload video features to RAM (~110GB needed for ActivityNet, 60GB for YouCook2 ResNet/ResNeXt, 20GB for YouCook2 HowTo100M)
--workers N    # set the number of parallel dataloader workers, default: min(10, N_CPU - 1)
--cuda         # run on GPU
--single_gpu   # run on only one GPU

Notes for training

  • We use early stopping. Models are evaluated automatically and results are printed during training. To evaluate a model again after training, check the end of the script output or the logfile at runs/MODEL_NAME/log_DATE_TIME.log to find the best epoch, then run python eval.py config/MODEL_NAME.yaml runs/MODEL_NAME/ckpt_ep##.pth
  • When training from scratch, actual results may vary due to randomness (no fixed seeds).
  • The reported train times assume data preloading with --preload_vid and vary due to early stopping.

Table 2: Video-paragraph retrieval results on the ActivityNet Captions dataset (val1).

# train from scratch
python train.py config/anet_coot.yaml --cuda --log_dir runs/anet_coot

# evaluate provided model
python eval.py config/anet_coot.yaml provided_models/anet_coot.pth --cuda --workers 10

Model | Paragraph->Video R@1 | R@5 | R@50 | Video->Paragraph R@1 | R@5 | R@50 | Train time
COOT | 61.3 | 86.7 | 98.7 | 60.6 | 87.9 | 98.7 | ~70min

Table 3: Retrieval results on the YouCook2 dataset.

# train from scratch (row 1, model with ResNet/ResNext features)
python train.py config/yc2_2d3d_coot.yaml --cuda --log_dir runs/yc2_2d3d_coot

# evaluate provided model (row 1)
python eval.py config/yc2_2d3d_coot.yaml provided_models/yc2_2d3d_coot.pth --cuda

# train from scratch (row 2, model with HowTo100m features)
python train.py config/yc2_100m_coot.yaml --cuda --log_dir runs/yc2_100m_coot

# evaluate provided model (row 2)
python eval.py config/yc2_100m_coot.yaml provided_models/yc2_100m_coot.pth --cuda

Model | Paragraph->Video R@1 | R@5 | R@10 | MR | Sentence->Clip R@1 | R@5 | R@10 | MR | Train time
COOT with ResNet/ResNeXt features | 51.2 | 79.9 | 88.2 | 1 | 6.6 | 17.3 | 25.1 | 48 | ~180min
COOT with HowTo100M features | 78.3 | 96.2 | 97.8 | 1 | 16.9 | 40.5 | 52.5 | 9 | ~16min

Additional information

The default datasets folder is data/. To use a different folder, pass the flag --dataroot new_path to all Python scripts and adjust the dataset preprocessing commands accordingly.

Preprocessing steps (done automatically)

  • ActivityNet
    • Switch start and stop timestamps when start > stop. Affects 2 videos.
    • Convert start/stop timestamps to start/stop frames by multiplying with the FPS of the features and applying floor/ceil, respectively (see the sketch after this list).
    • Captions: Replace all newlines/tabs/multiple spaces with a single space.
    • Truncate captions that are too long (>512 BERT tokens in the paragraph), retaining at least 4 tokens plus the [SEP] token for each sentence. Affects 1 video.
    • Expand clips to be at least 10 frames long (also covered in the sketch below):
      • train/val_1: 2823 changed / 54926 total
      • train/val_2: 2750 changed / 54452 total
  • ActivityNet and YouCook2
    • Add [CLS] at the start of each paragraph and [SEP] at the end of each sentence before encoding with the BERT model.
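
The timestamp-to-frame conversion and clip expansion described above, as a minimal sketch (illustrative only; the function and variable names are not the repository's actual code):

import math

def timestamps_to_frames(start_sec, stop_sec, fps, num_frames, min_len=10):
    # swap inverted annotations (start after stop)
    if start_sec > stop_sec:
        start_sec, stop_sec = stop_sec, start_sec
    # multiply timestamps with the FPS of the features; floor the start, ceil the stop
    start = int(math.floor(start_sec * fps))
    stop = int(math.ceil(stop_sec * fps))
    # expand clips that are shorter than min_len frames
    if stop - start < min_len:
        center = (start + stop) // 2
        start = max(0, center - min_len // 2)
        stop = min(num_frames, start + min_len)
    return start, stop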

Troubleshoot

Downloading Torrents

If you have problems downloading our torrents, try following this tutorial:

  1. Download and install the torrent client qBittorrent.
  2. Download the torrent files from the links and open them with qBittorrent.
  3. Options -> Advanced, check the fields "Always announce to all trackers in a tier" and "Always announce to all tiers".
  4. Options -> BitTorrent, disable "Torrent Queueing"
  5. Options -> Connection, disable "Use UPnp..." and everything under "Connection Limits" and set Proxy Server to "(None)"
  6. Options -> Speed, make sure speed is unlimited.
  7. Right click your torrent and "Force reannounce"
  8. Right click your torrent and "Force resume"
  9. Let it run for at least 24 hours.
  10. If it still doesn't download after that, feel free to open an issue.
  11. Once you are done, please keep seeding.

Acknowledgements

For the full references, see our paper. We especially thank the creators of several GitHub repositories for providing helpful code.

We also thank the authors of all packages in the requirements.txt file.

Credit for the bird image goes to Laurie Boyle (Australia).

License

Code is licensed under Apache 2.0 (Copyright 2020 S. Ging). Dataset features are licensed under Apache 2.0 (copyright held by the respective owners).

Citation

If you find our work or code useful, please consider citing our paper:

@inproceedings{ging2020coot,
  title={COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning},
  author={Simon Ging and Mohammadreza Zolfaghari and Hamed Pirsiavash and Thomas Brox},
  booktitle={Conference on Neural Information Processing Systems},
  year={2020}
}
