GitHub - haoshuai714/ZeroVL: The implementation of ZeroVL

This repository contains source code necessary to reproduce the results presented in the paper ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources.

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevent researchers with limited resources from reproduction and further exploration. To this end, we provide a comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources. Meanwhile, we provide a reproducible strong baseline of competitive results, namely ZeroVL, with publicly accessible academic datasets and a popular experimental environment.

Performance

Image-text retreival RSUM scores on MSCOCO and Flickr30K datasets:

method	computation	data	COCO(zs.)	COCO(ft.)	F30K(zs.)	F30K(ft.)
CLIP	256 V100	400M	400.2	-	540.6	-
ALIGN	1024 TPUv3	1800M	425.3	500.4	553.3	576.0
baseline	8 V100	14.2M	363.5	471.9	476.8	553.0
ZeroVL	8 V100	14.2M	425.0	485.0	536.2	561.6
ZeroVL	8 V100	100M	442.1	500.5	546.5	573.6

zs.: zero-shot setting, ft.: fine-tuned setting.

ImageNet-1K linear probing results:

method	data	backbone	top-1
CLIP	400M	ViT-B/16	80.2
ZeroVL	100M	ViT-B/16	80.6

Installation

Requirements:

Python 3.7
Pytorch 1.8.1
torchvision 0.9.1
cuda 11.1

Install requirements:

pip3 install -r requirements.txt

Pre-training

Check PRETRAINING.md for codebase usage.

Model Zoo

ZeroVL 14M weights: Google Drive, Baidu Pan

ZeroVL 100M weights: Google Drive, Baidu Pan

Evaluation

Check EVALUATION.md for codebase usage.

Linear Probing

Check LINEAR.md for codebase usage.

Citing ZeroVL

If you use ZeroVL in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@article{cui2021zerovl,
  title={ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources},
  author={Cui, Quan and Zhou, Boyan and Guo, Yu and Yin, Weidong and Wu, Hao and Yoshie, Osamu},
  journal={arXiv preprint arXiv:2112.09331},
  year={2021}
}

License

ZeroVL is released under the MIT license. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
ckpts		ckpts
configs		configs
data		data
docs		docs
tools		tools
zerovl		zerovl
.gitignore		.gitignore
EVALUATION.md		EVALUATION.md
LICENSE		LICENSE
LINEAR.md		LINEAR.md
PRETRAINING.md		PRETRAINING.md
README.md		README.md
launch.py		launch.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Performance

Installation

Pre-training

Model Zoo

Evaluation

Linear Probing

Citing ZeroVL

License

About

Releases

Packages

Languages

License

haoshuai714/ZeroVL

Folders and files

Latest commit

History

Repository files navigation

Performance

Installation

Pre-training

Model Zoo

Evaluation

Linear Probing

Citing ZeroVL

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages