This repository contains the implementation of the paper:
MoCLE: Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning
Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang
- Install LAVIS, the primary codebase on which MoCLE is built, into the current directory.
  ```bash
  conda create -n lavis python=3.8
  conda activate lavis
  git clone https://github.com/salesforce/LAVIS.git
  cd LAVIS
  pip install -e .
  ```
- Clone the repository of MoCLE.
  ```bash
  git clone https://github.com/gyhdog99/mocle.git
  ```
- Build our modified PEFT package.
  ```bash
  cd mocle
  cd peft-main
  pip install -e .
  ```
- Copy `mocle.py` and `mocle.yaml` from this repository into the LAVIS directory following the architecture below:

  ```bash
  cd ../
  cp mocle.py ../lavis/models/blip2_models
  cp mocle.yaml ../lavis/configs/models/blip2
  ```
- Modify `../lavis/models/__init__.py` in LAVIS as follows (see the sketch below):
  - Add `from lavis.models.blip2_models.mocle import MoCLE` at the beginning of the file.
  - Add `"MoCLE"` to `__all__ = [..., ...]`.
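  For reference, a minimal sketch of what the two edits might look like inside `__init__.py`; the surrounding entries are placeholders, not the actual contents of the file:

  ```python
  # ../lavis/models/__init__.py (sketch; other imports and entries omitted)
  from lavis.models.blip2_models.mocle import MoCLE  # new import at the top of the file

  __all__ = [
      # ... existing model names ...
      "MoCLE",  # export MoCLE alongside the existing models
  ]
  ```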
- MoCLE is based on Vicuna-7B-v1.1. Download the corresponding LLM checkpoint here.
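  If it is more convenient, the checkpoint can also be fetched from the Hugging Face Hub. This is only a sketch: the repository id below is a placeholder, so substitute the source from which you actually obtain the Vicuna-7B-v1.1 weights.

  ```python
  # Sketch: fetch a Vicuna-7B-v1.1 checkpoint via huggingface_hub.
  # The repo id below is a placeholder; point it at wherever you obtained the weights.
  from huggingface_hub import snapshot_download

  local_path = snapshot_download(
      repo_id="your-org/vicuna-7b-v1.1",        # placeholder repository id
      local_dir="checkpoints/vicuna-7b-v1.1",   # where to store the weights locally
  )
  print(local_path)  # use this path for the llm_model field in the next step
  ```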
- Set the `llm_model` argument in `../lavis/configs/mocle.yaml` to the local path of the downloaded Vicuna checkpoint (see the combined config sketch below).
- Download the pre-trained checkpoints of MoCLE.

  | # Clusters | Temperature | Main Model | Clustering Model |
  | --- | --- | --- | --- |
  | 16 | 0.05 | c16_t005 | c16 |
  | 64 | 0.05 | c64_t005 | c64 |
  | 64 | 0.10 | c64_t010 | c64 |
- Set `finetuned` and `kmeans_ckpt` in `../lavis/configs/mocle.yaml` to the weights of the downloaded main model and clustering model, respectively. (Please adjust the `total_tasks` and `gates_tmp` parameters according to `# Clusters` and `Temperature`, respectively.)
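  If you prefer to script these edits (including `llm_model` from the earlier step), here is a minimal sketch. It assumes the fields live under a top-level `model` section of `mocle.yaml` and uses the 64-cluster, temperature-0.05 checkpoints as an example; all paths and file names are placeholders, so adjust them to your local files.

  ```python
  # Sketch: patch ../lavis/configs/mocle.yaml in place (assumed key layout, placeholder paths).
  import yaml

  cfg_path = "../lavis/configs/mocle.yaml"
  with open(cfg_path) as f:
      cfg = yaml.safe_load(f)

  model_cfg = cfg["model"]                                  # assumed top-level "model" section
  model_cfg["llm_model"] = "checkpoints/vicuna-7b-v1.1"     # local Vicuna checkpoint
  model_cfg["finetuned"] = "checkpoints/c64_t005.pth"       # main model weights (placeholder name)
  model_cfg["kmeans_ckpt"] = "checkpoints/c64.pkl"          # clustering model (placeholder name)
  model_cfg["total_tasks"] = 64                             # match "# Clusters" of the checkpoint
  model_cfg["gates_tmp"] = 0.05                             # match "Temperature" of the checkpoint

  with open(cfg_path, "w") as f:
      yaml.safe_dump(cfg, f, sort_keys=False)
  ```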
- Load an image locally
  ```python
  import torch
  from PIL import Image

  # setup device to use
  device = torch.device("cuda") if torch.cuda.is_available() else "cpu"

  # load sample image
  raw_image = Image.open(".../path_to_images/").convert("RGB")
  ```
- Load the models
  ```python
  from lavis.models import load_model_and_preprocess

  # loads MoCLE model
  model, vis_processors, _ = load_model_and_preprocess(name="mocle", model_type="mocle", is_eval=True, device=device)

  # prepare the image
  image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
  ```
- Generate
  ```python
  response = model.generate({"image": image, "prompt": ["Your query about this image"]})
  print(response)
  ```
Coming soon.
- LAVIS: The implementation of MoCLE is built upon LAVIS.
- PEFT: Our implementation of the mixture of LoRA experts is based on PEFT.
If you're using MoCLE in your research or applications, please cite using this BibTeX:
@article{gou2023mixture,
title={Mixture of cluster-conditional lora experts for vision-language instruction tuning},
author={Gou, Yunhao and Liu, Zhili and Chen, Kai and Hong, Lanqing and Xu, Hang and Li, Aoxue and Yeung, Dit-Yan and Kwok, James T and Zhang, Yu},
journal={arXiv preprint arXiv:2312.12379},
year={2023}
}