The official PyTorch implementation of MAGVLT: Masked Generative Vision-and-Language Transformer (CVPR 2023).
MAGVLT is a unified non-autoregressive generative Vision-and-Language (VL) model which is trained via 1) three multimodal masked token prediction tasks along with two sub-tasks, 2) step-unrolled masked prediction and 3) MixSel.
We have tested our code in the following environment:
- PyTorch 1.10.0
- Python 3.7.11
- Ubuntu 18.04
Run the following command to install the remaining dependencies:
pip install -r requirements.txt
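As a quick sanity check before running anything, the sketch below compares the local Python and PyTorch versions against the tested ones listed above. It is not part of this repository; the helper names (`major_minor`, `check_environment`, `installed_versions`) are hypothetical, and comparing only major.minor is an assumption about what counts as "close enough".

```python
import platform

# Versions this README reports as tested.
TESTED = {"python": "3.7.11", "torch": "1.10.0"}


def major_minor(version):
    """Return (major, minor) as ints, ignoring the patch level and local suffixes such as '+cu113'."""
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def check_environment(installed):
    """Return one warning per entry whose major.minor differs from TESTED (assumed tolerance)."""
    warnings = []
    for name, tested in TESTED.items():
        found = installed.get(name)
        if found is None:
            warnings.append(f"{name}: not found (tested with {tested})")
        elif major_minor(found) != major_minor(tested):
            warnings.append(f"{name}: found {found}, tested with {tested}")
    return warnings


def installed_versions():
    """Collect the running Python version and, if importable, the PyTorch version."""
    versions = {"python": platform.python_version()}
    try:
        import torch  # only reported when PyTorch is installed

        versions["torch"] = torch.__version__
    except ImportError:
        pass
    return versions


if __name__ == "__main__":
    problems = check_environment(installed_versions())
    for line in problems or ["environment matches the tested versions"]:
        print(line)
```

A mismatch does not necessarily mean the code will fail; it only flags that the setup differs from the one that was tested.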
- Implementation of MAGVLT
- Pretrained checkpoints of MAGVLT-base and MAGVLT-large
- Sampling pipelines of MAGVLT:
  - Generate image from text
  - Generate text from image
  - Generate image from text and image (inpainting)
  - Generate text from text and image (infilling)
  - Generate text and image (unconditional generation)
- Evaluation pipelines of MAGVLT on downstream tasks
- Training pipeline with data preparation example
MAGVLT uses VQGAN (vqgan_imagenet_f16_16384) as the image encoder which can be downloaded from this repo.
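The checkpoint name follows the taming-transformers convention, where `f16` denotes a spatial downsampling factor of 16 and `16384` the codebook size. The sketch below shows how many discrete image tokens that implies for a given input size. The function name is hypothetical, and the 256×256 resolution is only an illustrative assumption, since the input resolution is not stated here.

```python
CODEBOOK_SIZE = 16384  # from the checkpoint name; each image token is an index in [0, 16384)


def vqgan_token_grid(height, width, factor=16):
    """Return (rows, cols, num_tokens) for an image encoded with the given downsampling factor."""
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    rows, cols = height // factor, width // factor
    return rows, cols, rows * cols


# Assumed example resolution: a 256x256 image maps to a 16x16 grid of tokens.
print(vqgan_token_grid(256, 256))  # (16, 16, 256)
```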
Model | #Parameters | CIDEr (↑, COCO) | CIDEr (↑, NoCaps) | FID (↓, COCO) |
---|---|---|---|---|
MAGVLT-base | 371M | 60.4 | 46.3 | 12.08 |
MAGVLT-large | 840M | 68.1 | 55.8 | 10.14 |
We provide the following sampling scripts:
python sampling_t2i.py --prompt="[YOUR PROMPT]" \
    --config_path=configs/magvlt-it2it-base-sampling.yaml \
    --model_path=[MAGVLT_MODEL_PATH] \
    --stage1_model_path=[VQGAN_MODEL_PATH]

python sampling_i2t.py --source_img_path=[YOUR_IMAGE_PATH] \
    --config_path=configs/magvlt-it2it-base-sampling.yaml \
    --model_path=[MAGVLT_MODEL_PATH] \
    --stage1_model_path=[VQGAN_MODEL_PATH]

python sampling_it2i.py --prompt="[YOUR PROMPT]" \
    --source_img_path=[YOUR_IMAGE_PATH] \
    --config_path=configs/magvlt-it2it-base-sampling.yaml \
    --model_path=[MAGVLT_MODEL_PATH] \
    --stage1_model_path=[VQGAN_MODEL_PATH]
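When calling these scripts from Python (for example, to loop over many prompts), it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the script names and flags shown above. The helper `build_sampling_command` and the checkpoint file names are hypothetical and not part of this repository.

```python
import shlex

# Script names as listed above, keyed by a short task label (labels are our own).
SCRIPTS = {
    "t2i": "sampling_t2i.py",    # text -> image
    "i2t": "sampling_i2t.py",    # image -> text
    "it2i": "sampling_it2i.py",  # text + image -> image (inpainting)
}


def build_sampling_command(task, model_path, stage1_model_path,
                           prompt=None, source_img_path=None,
                           config_path="configs/magvlt-it2it-base-sampling.yaml"):
    """Return the argument list for one sampling run, validating required inputs."""
    if task not in SCRIPTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(SCRIPTS)}")
    needs_prompt = task in ("t2i", "it2i")
    needs_image = task in ("i2t", "it2i")
    if needs_prompt and not prompt:
        raise ValueError(f"task {task!r} requires a prompt")
    if needs_image and not source_img_path:
        raise ValueError(f"task {task!r} requires a source image path")

    cmd = ["python", SCRIPTS[task]]
    if needs_prompt:
        cmd.append(f"--prompt={prompt}")
    if needs_image:
        cmd.append(f"--source_img_path={source_img_path}")
    cmd += [
        f"--config_path={config_path}",
        f"--model_path={model_path}",
        f"--stage1_model_path={stage1_model_path}",
    ]
    return cmd


if __name__ == "__main__":
    # "magvlt.ckpt" and "vqgan.ckpt" are placeholder file names, not real checkpoint names.
    cmd = build_sampling_command("t2i", "magvlt.ckpt", "vqgan.ckpt", prompt="a red bicycle")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with prompts that contain spaces.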
@InProceedings{Kim_2023_CVPR,
author = {Kim, Sungwoong and Jo, Daejin and Lee, Donghoon and Kim, Jongmin},
title = {MAGVLT: Masked Generative Vision-and-Language Transformer},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {23338-23348}
}
Donghoon Lee, [email protected]
Jongmin Kim, [email protected]
This project is released under the MIT license.