Official PyTorch Implementation of Recognize Anything: A Strong Image Tagging Model and Tag2Text: Guiding Vision-Language Model via Image Tagging.
- Recognize Anything Model (RAM) is an image tagging model that can recognize any common category with high accuracy.
- Tag2Text is a vision-language model guided by tagging, which supports captioning, retrieval, and tagging.
Both Tag2Text and RAM exhibit strong recognition ability. We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) and developed a strong visual semantic analysis pipeline in the Grounded-SAM project.
RAM is a strong image tagging model that can recognize any common category with high accuracy.
- Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization;
- RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
- RAM even surpasses fully supervised models (ML-Decoder).
- RAM is competitive with the Google tagging API.
- Reproducible and affordable. RAM has a low reproduction cost thanks to its open-source, annotation-free dataset;
- Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.
(Green indicates fully supervised learning; blue indicates zero-shot performance.)
RAM significantly improves tagging ability over the Tag2Text framework.
- Accuracy. RAM utilizes a data engine to generate additional annotations and clean incorrect ones, yielding higher accuracy than Tag2Text.
- Scope. RAM expands the number of fixed tags from 3,400+ to 6,400+ (reduced to 4,500+ distinct semantic tags after merging synonyms), covering more valuable categories. Moreover, RAM is equipped with open-set capability and can recognize tags not seen during training.
Tag2Text is an efficient and controllable vision-language model with tagging guidance.
- Tagging. Tag2Text recognizes 3,400+ commonly used categories without manual annotations.
- Captioning. Tag2Text integrates tag information into text generation as guiding elements, resulting in more controllable and comprehensive descriptions.
- Retrieval. Tag2Text provides tags as additional visible alignment indicators for image-text retrieval.
- Release Tag2Text demo.
- Release checkpoints.
- Release inference code.
- Release RAM demo and checkpoints.
- Release training code.
- Release training datasets.
| | Name | Backbone | Data | Illustration | Checkpoint |
|---|---|---|---|---|---|
| 1 | RAM-14M | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Provide strong image tagging ability. | Download link |
| 2 | Tag2Text-14M | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Support comprehensive captioning and tagging. | Download link |
- Install the dependencies:
pip install -r requirements.txt
- Download RAM pretrained checkpoints.
- (Optional) To use RAM and Tag2Text in other projects, it is recommended to install recognize-anything as a package:
pip install -e .
Then the RAM and Tag2Text models can be imported in other projects:
from ram.models import ram, tag2text
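For reference, here is a minimal sketch of programmatic use after installing the package. The constructor arguments (pretrained, image_size, vit), the inference_ram helper, and the English/Chinese return values follow the repository's inference scripts and are assumptions that may differ between releases:

```python
# Minimal usage sketch (constructor arguments and the inference_ram helper are
# assumptions based on inference_ram.py and may differ between releases).
import torch
from PIL import Image
from torchvision import transforms

from ram import inference_ram
from ram.models import ram

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load the RAM-14M checkpoint downloaded above.
model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l").eval().to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)
english_tags, chinese_tags = inference_ram(image, model)
print(english_tags)
```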
Get the English and Chinese tag outputs for an image:
python inference_ram.py --image images/demo/demo1.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
First, customize the recognition categories in build_openset_label_embedding, then get the tags of the images:
python inference_ram_openset.py --image images/openset_example.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
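For intuition, the sketch below shows conceptually what the open-set customization amounts to: each custom category is turned into a text embedding that the open-set head can score image features against. It uses OpenAI's clip package with a single prompt template purely for illustration; the actual build_openset_label_embedding may use a different text encoder, multiple templates, and a different category list format.

```python
# Conceptual sketch only: NOT the repository's build_openset_label_embedding.
# It illustrates how custom categories become label embeddings for open-set tagging.
import torch
import clip  # illustrative dependency: pip install git+https://github.com/openai/CLIP.git

custom_categories = ["corgi", "espresso machine", "paraglider"]  # your own tags

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in custom_categories]).to(device)
    label_embeddings = model.encode_text(tokens)
    label_embeddings = label_embeddings / label_embeddings.norm(dim=-1, keepdim=True)

print(label_embeddings.shape)  # (num_custom_categories, embedding_dim)
```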
Get the tagging and captioning results:
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth
Or get the tagging and specified captioning results (optional):
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth \
--specified-tags "cloud,sky"
We release two datasets, OpenImages-common (214 seen classes) and OpenImages-rare (200 unseen classes). Copy or sym-link test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs/.
To evaluate RAM on OpenImages-common:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/ram
To evaluate RAM's open-set capability on OpenImages-rare:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--open-set \
--dataset openimages_rare_200 \
--output-dir outputs/ram_openset
To evaluate Tag2Text on OpenImages-common:
python batch_inference.py \
--model-type tag2text \
--checkpoint pretrained/tag2text_swin_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/tag2text
Please refer to batch_inference.py for more options. To reproduce the precision/recall numbers in Table 3 of our paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.
To run batch inference on custom images, you can set up your own dataset following the two given datasets.
At present, we can only open-source the forward function of Tag2Text. To train or fine-tune Tag2Text on a custom dataset, you can refer to the complete training codebase of BLIP and make the following modifications:
- Replace the "models/blip.py" file with the current "tag2text.py" model file;
- Extend the original dataloader to additionally load the tags for each image (a hedged sketch follows below).
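As a hedged illustration of the second point, a BLIP-style caption dataset extended to also return per-image tags might look like the sketch below. The annotation keys (image, caption, tag_ids), the multi-hot encoding, and the 3,429-tag vocabulary size are assumptions for illustration only; adapt them to your own annotation format and to the tag list shipped with tag2text.py.

```python
# Hedged sketch: a BLIP-style image-caption dataset extended to also return tags.
# Annotation format and field names are assumptions, not the repository's format.
import json

import torch
from PIL import Image
from torch.utils.data import Dataset


class TagCaptionDataset(Dataset):
    def __init__(self, ann_file, image_root, transform, num_tags=3429):
        # ann_file: JSON list of {"image": ..., "caption": ..., "tag_ids": [...]}
        self.anns = json.load(open(ann_file))
        self.image_root = image_root
        self.transform = transform
        self.num_tags = num_tags  # size of the tag vocabulary used by tag2text.py

    def __len__(self):
        return len(self.anns)

    def __getitem__(self, idx):
        ann = self.anns[idx]
        image = Image.open(f"{self.image_root}/{ann['image']}").convert("RGB")
        image = self.transform(image)

        # Multi-hot tag vector consumed by the tagging head, alongside the caption.
        tags = torch.zeros(self.num_tags)
        tags[torch.tensor(ann["tag_ids"], dtype=torch.long)] = 1.0

        return image, ann["caption"], tags
```

The training loop would then forward both the caption (for generation) and the tag vector (for the tagging loss), mirroring the tagging guidance described above.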
The training code of RAM cannot be open-sourced for now, as it is still going through the company's internal process.
If you find our work useful for your research, please consider citing:
@article{zhang2023recognize,
title={Recognize Anything: A Strong Image Tagging Model},
author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
journal={arXiv preprint arXiv:2306.03514},
year={2023}
}
@article{huang2023tag2text,
title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
journal={arXiv preprint arXiv:2303.05657},
year={2023}
}
This work is built upon the amazing codebase of BLIP, thanks very much!
We want to thank @Cheng Rui, @Shilong Liu, and @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.
We also want to thank Ask-Anything and Prompt-can-anything for integrating RAM/Tag2Text, which greatly expands its application boundaries.