A baseline code for LSMDC19 Task 1: "Multi Sentence Video Description".
This code is written in pytorch and is based on adv-inf and self-critical.pytorch repositories.
Python 2.7 (because there is no coco-caption version for python 3)
PyTorch 0.4
coco-caption (located in movie_eval; run ./get_stanford_models.sh
to set up SPICE evaluation)
Clone the repository recursively to get the submodules.
git clone --recursive https://github.com/jamespark3922/lsmdc
Download caption csv files from the website. Then, run the following command to preprocess the captions.
python prepro_vocab.py --input_path $CSV_PATH
This will result in two files, LSMDC16_info_someone.json
and LSMDC16_labels_someone.h5
, which will be used as inputs for training and testing.
Then, move the csv files to movie_eval/data
to be used as reference for evaluation.
mkdir movie_eval/data
mv *.csv movie_eval/data
The website shares precomputed visual features in the download page.
- i3d (3.3GB)
- resnet152 (6.8GB)
After downloading them all, unzip them to your preferred feature directory.
python train.py --input_json LSMDC16_info_someone.json --input_fc_dir feats/i3d/ --input_img_dir feats/resnet152/ --input_label_h5 LSMDC16_labels_someone.h5 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --val_videos_use -1 --losses_print_every 10 --batch_size 16
The training scheme follows the format in self-critical.pytorch repository.
The train script will dump checkpoints into the folder specified by --checkpoint_path
(default = save/).
We save the model every epoch, along with the best-performing checkpoint on validation and the latest checkpoint.
To resume training, you can specify --g_start_from
with corresponding --g_start_epoch
(default = 'latest') option to be the path saving infos.pkl
and model_$G_START_EPOCH.pth
(usually you could just set --g_start_from
and --checkpoint_path
to be the same).
If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to validation cross entropy loss, use --language_eval 1
option, but don't forget to download the coco-caption code.
You can see more details in opts.py
Context: The generator model uses the hidden state of previous sentence as "context", starting at epoch --g_context_epoch
(default = after 10 epochs); before then, the context feature is zero-padded.
We share pretrained models here.
Result for the public test set is:
CIDER | METEOR | BLEU-4 | SPICE | |
---|---|---|---|---|
gen_60.pth (ep 60) | 0.119 | 0.0825 | 0.013 | 0.151 |
(Note that METEOR score is a bit higher than previous 2017 challenge submissions, becaue captions have been updated for the current version).
After training is done, one can evaluate the captions in the public test set.
The normal inference using greedy max or beamsearch can be run with the following command:
python eval.py --g_model_path save/gen_best.pth --infos_path save/infos.pkl --sample_max 1 --id $id --beam_size $beam_size
You can also do random sampling by setting --sample_max 0
and specifying --top_num n
option. (n > 1 performs top-k sampling, 0 < n < 1 performs nucleus sampling)
If you are using the pretrained model, then you will need to update the path to your input files.
python eval.py --g_model_path save/gen_60.pth --infos_path save/infos.pkl --sample_max 1 --id $id --beam_size $beam_size --input_json LSMDC16_info_someone.json --input_fc_dir feats/i3d/ --input_img_dir feats/resnet152/ --input_label_h5 LSMDC16_labels_someone.h5
To run on the blind test split, simply add --split blind_test
option.
The captions will be saved in movie_eval/captions/caption_$id.json
and results in movie_eval/results/result_$id.json
@article{lsmdc,
author = {Rohrbach, Anna and Torabi, Atousa and Rohrbach, Marcus and Tandon, Niket and Pal, Chris and Larochelle, Hugo and Courville, Aaron and Schiele, Bernt},
title = {Movie Description},
journal={International Journal of Computer Vision},
year = {2017},
url = {http://link.springer.com/article/10.1007/s11263-016-0987-1?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst}}
@article{park2019advinf,
title= Adversarial Inference for Multi-Sentence Video Descriptions,
author={Park, Jae Sung and Rohrbach, Marcus and Darrell, Trevor and Rohrbach, Anna},
jorunal={CVPR 2019},
year={2019}
}