Dataset preparation is the same as for All-in-one-B.

Co-training on two datasets (one video, one image):

- Video: WebVid
- Image: CC3M

cd CoTraining
python run.py with data_root=DataSet num_gpus=8 num_nodes=1 \
num_frames=3 \
task_mlm_vtm_cotrain whole_word_masking=True step200k per_gpu_batchsize=4 backend='v100'
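
As a quick sanity check on the settings above, the number of samples processed in one forward/backward pass across all devices is the product of `per_gpu_batchsize`, `num_gpus`, and `num_nodes`. The sketch below only does that arithmetic. The helper name `samples_per_pass` is hypothetical and not part of the repository, and any gradient accumulation applied by the training script would raise the effective batch size beyond this figure.

```python
def samples_per_pass(per_gpu_batchsize: int, num_gpus: int, num_nodes: int) -> int:
    """Samples seen in one forward/backward pass across all devices."""
    return per_gpu_batchsize * num_gpus * num_nodes


# Values taken from the command above.
print(samples_per_pass(per_gpu_batchsize=4, num_gpus=8, num_nodes=1))  # prints 32
```

If GPU memory allows, raising `per_gpu_batchsize` increases this number without changing the number of devices.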

Co-training on seven datasets (three video, four image):

- Video: WebVid, YT-Temporal, HowTo100M
- Image: CC3M, CC12M, COCO, Visual Genome

cd CoTraining
python run.py with data_root=DataSet num_gpus=8 num_nodes=1 \
num_frames=3 \
task_mlm_vtm_cotrain_seven whole_word_masking=True step200k per_gpu_batchsize=4 backend='v100'
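
The two commands in this section differ only in the named task config (`task_mlm_vtm_cotrain` versus `task_mlm_vtm_cotrain_seven`). The following is a minimal sketch of assembling both from shared settings, which can help when scripting several runs. The helper `build_command` and the constants `COMMON_ARGS` and `TRAILING_ARGS` are hypothetical names introduced for illustration, not part of the repository. The argument values are copied from the commands above.

```python
import shlex

# Settings shared by both commands in this section.
COMMON_ARGS = ["data_root=DataSet", "num_gpus=8", "num_nodes=1", "num_frames=3"]
TRAILING_ARGS = [
    "whole_word_masking=True",
    "step200k",
    "per_gpu_batchsize=4",
    "backend=v100",  # the shell form backend='v100' yields this same argument
]


def build_command(task: str) -> list[str]:
    """Return the argument list for one co-training run (hypothetical helper)."""
    return ["python", "run.py", "with", *COMMON_ARGS, task, *TRAILING_ARGS]


# Print both commands in a copy-pasteable form.
for task in ("task_mlm_vtm_cotrain", "task_mlm_vtm_cotrain_seven"):
    print(shlex.join(build_command(task)))
```

To launch a run from Python instead of printing it, the list can be passed to `subprocess.run(build_command(task), cwd="CoTraining", check=True)`, which mirrors the `cd CoTraining` step.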