MAGVLT: Masked Generative Vision-and-Language Transformer


The official PyTorch implementation of Masked Generative Vision-and-Language Transformer, CVPR 2023

MAGVLT is a unified non-autoregressive generative Vision-and-Language (VL) model which is trained via 1) three multimodal masked token prediction tasks along with two sub-tasks, 2) step-unrolled masked prediction and 3) MixSel.
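To make the idea of masked token prediction concrete, here is a minimal, framework-free sketch of how a training example could be corrupted. It is not the repository's implementation: `MASK_ID`, `mask_tokens`, the toy token ids, and the fixed mask ratio are all hypothetical, and step-unrolled prediction and MixSel are not covered.

```python
import random

MASK_ID = -1  # hypothetical id standing in for a [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns (corrupted, targets). `targets` holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the masked positions.
    """
    n_mask = max(1, round(len(tokens) * mask_ratio))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK_ID if i in masked_positions else t
                 for i, t in enumerate(tokens)]
    targets = [t if i in masked_positions else None
               for i, t in enumerate(tokens)]
    return corrupted, targets


rng = random.Random(0)
text_tokens = [7, 8, 9]                                   # toy text token ids
image_tokens = [101, 102, 103, 104, 105, 106, 107, 108]   # toy image token ids

# Text-to-image direction: the text is kept intact as the condition,
# and half of the image tokens are masked for the model to predict.
corrupted_img, targets = mask_tokens(image_tokens, 0.5, rng)
model_input = text_tokens + corrupted_img
print(model_input)
```

Swapping the roles of the two sequences gives the image-to-text direction. Because all masked positions are predicted in parallel, generation of this kind is non-autoregressive.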

Requirements

We have tested our code in the following environment:

PyTorch 1.10.0
Python 3.7.11
Ubuntu 18.04

Please run the following command to install the remaining dependencies:

pip install -r requirements.txt
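Before installing, it can help to compare the local interpreter and PyTorch build against the tested versions listed above. The following is a small sketch for that purpose; the helper names `major_minor` and `report` are illustrative and not part of this repository. A mismatch does not necessarily mean the code will fail, only that the combination is untested.

```python
import sys

TESTED_PYTHON = (3, 7)   # from the tested environment above
TESTED_TORCH = (1, 10)   # from the tested environment above


def major_minor(version):
    """Parse a version string such as '1.10.0+cu113' into (1, 10)."""
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def report(name, found, tested):
    """Return a one-line comparison of a found version with the tested one."""
    status = "matches" if tuple(found) == tuple(tested) else "differs from"
    return f"{name} {found[0]}.{found[1]} {status} tested {tested[0]}.{tested[1]}"


print(report("Python", sys.version_info[:2], TESTED_PYTHON))
try:
    import torch
    print(report("PyTorch", major_minor(torch.__version__), TESTED_TORCH))
except ImportError:
    print("PyTorch is not installed")
```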

Coverage of Released Code

  • Implementation of MAGVLT
  • Pretrained checkpoints of MAGVLT-base and MAGVLT-large
  • Sampling pipelines of MAGVLT:
    • Generate image from text
    • Generate text from image
    • Generate image from text and image (inpainting)
    • Generate text from text and image (infilling)
    • Generate text and image (unconditional generation)
  • Evaluation pipelines of MAGVLT on downstream tasks
  • Training pipeline with data preparation example

Pretrained Checkpoints

MAGVLT uses VQGAN (vqgan_imagenet_f16_16384) as the image encoder, which can be downloaded from this repo.
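The checkpoint name follows the usual VQGAN naming, where `f16` denotes a spatial downsampling factor of 16 and `16384` the codebook size. As a rough illustration of what that means for sequence length, the sketch below computes the token grid for a given image size. The function name is hypothetical, and the 256x256 input resolution is an assumption that this README does not state.

```python
import math


def vqgan_token_grid(image_size, downsample_factor=16, codebook_size=16384):
    """Return the token grid implied by a square image and an f-factor VQGAN."""
    side = image_size // downsample_factor
    return {
        "grid": (side, side),
        "num_tokens": side * side,
        # Each token indexes one codebook entry, i.e. log2(codebook_size) bits.
        "bits_per_token": math.log2(codebook_size),
    }


# Assumed resolution of 256x256 (not stated in this README).
info = vqgan_token_grid(256)
print(info)
```

Under that assumption, each image becomes a 16x16 grid of discrete tokens, which is the sequence the transformer operates on.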

| Model | #Parameters | CIDEr (↑, coco) | CIDEr (↑, NoCaps) | FID (↓, coco) |
|---|---|---|---|---|
| MAGVLT-base | 371M | 60.4 | 46.3 | 12.08 |
| MAGVLT-large | 840M | 68.1 | 55.8 | 10.14 |

Sampling

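We provide the following sampling scripts for text-to-image generation, image-to-text generation, and image inpainting conditioned on text, respectively.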

python sampling_t2i.py  --prompt="[YOUR PROMPT]" \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]

python sampling_i2t.py  --source_img_path=[YOUR_IMAGE_PATH] \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]

python sampling_it2i.py --prompt="[YOUR PROMPT]" \
                        --source_img_path=[YOUR_IMAGE_PATH] \
                        --config_path=configs/magvlt-it2it-base-sampling.yaml \
                        --model_path=[MAGVLT_MODEL_PATH] \
                        --stage1_model_path=[VQGAN_MODEL_PATH]
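When running many samples, it may be convenient to assemble these commands programmatically. The sketch below builds the argument list for each of the three scripts using only the flags shown above. The helper `build_sampling_command` and the checkpoint file names `magvlt.ckpt` and `vqgan.ckpt` are hypothetical, not part of this repository.

```python
import shlex

# Script names taken from the commands above.
SCRIPTS = {
    "t2i": "sampling_t2i.py",    # text -> image
    "i2t": "sampling_i2t.py",    # image -> text
    "it2i": "sampling_it2i.py",  # text + image -> image (inpainting)
}

DEFAULT_CONFIG = "configs/magvlt-it2it-base-sampling.yaml"


def build_sampling_command(task, model_path, stage1_model_path,
                           config_path=DEFAULT_CONFIG,
                           prompt=None, source_img_path=None):
    """Return the argv list for one sampling script (hypothetical helper)."""
    if task not in SCRIPTS:
        raise ValueError(f"unknown task: {task!r}")
    needs_prompt = task in ("t2i", "it2i")
    needs_image = task in ("i2t", "it2i")
    if needs_prompt and prompt is None:
        raise ValueError(f"task {task!r} requires a prompt")
    if needs_image and source_img_path is None:
        raise ValueError(f"task {task!r} requires a source image path")

    argv = ["python", SCRIPTS[task]]
    if needs_prompt:
        argv.append(f"--prompt={prompt}")
    if needs_image:
        argv.append(f"--source_img_path={source_img_path}")
    argv += [
        f"--config_path={config_path}",
        f"--model_path={model_path}",
        f"--stage1_model_path={stage1_model_path}",
    ]
    return argv


# Placeholder checkpoint paths; substitute your own.
cmd = build_sampling_command("t2i", "magvlt.ckpt", "vqgan.ckpt",
                             prompt="a red bus")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with prompts that contain spaces.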

Citation

@InProceedings{Kim_2023_CVPR,
    author    = {Kim, Sungwoong and Jo, Daejin and Lee, Donghoon and Kim, Jongmin},
    title     = {MAGVLT: Masked Generative Vision-and-Language Transformer},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {23338-23348}
}

Contact

Donghoon Lee, [email protected]
Jongmin Kim, [email protected]

License

This project is released under the MIT license.