mPLUG-Owl3
- 2024.11.27: 🔥🔥🔥 We have released the latest version, mPLUG-Owl3-7B-241101. Its performance in video and multi-image scenarios is significantly improved, and it achieves top-1 performance on LVBench. 🎉🎉🎉
- 2024.10.15: We have released small-sized mPLUG-Owl3 models built on Qwen2-0.5B and Qwen2-1.5B. Checkpoints are available on ModelScope and HuggingFace. You can now experience Owl3's ultra-long visual content comprehension on edge devices.
- 2024.09.23: Thanks to ms-swift, finetuning of mPLUG-Owl3 is now supported. Refer to the document Finetuning of mPLUG-Owl3.
- 2024.09.23: We have released the evaluation pipeline, which can be found at Evaluation. Please refer to the README for more details.
- 2024.08.12: We have released mPLUG-Owl3. The source code and weights are available on HuggingFace.
mPLUG-Owl3 can learn knowledge from a retrieval system.
mPLUG-Owl3 can also chat with users over an interleaved image-text context.
mPLUG-Owl3 can watch long videos such as movies and remember their details.
- Evaluation with the HuggingFace model.
- Training data release. All training data are sourced from public datasets. We are preparing to release a compact version to facilitate easy training. Until then, you can manually organize the training data yourself.
- Training pipeline.
Video Benchmarks

| Model | NextQA | MVBench | VideoMME w/o sub | LongVideoBench-val | MLVU | LVBench |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 78.6 | 54.5 | 53.5 | 52.1 | 63.7 | - |
| mPLUG-Owl3-7B-241101 | 82.3 | 59.5 | 59.3 | 59.7 | 70.0 | 43.5 |
Multi-image Benchmarks

| Model | NLVR2 | Mantis-Eval | MathVerse-mv | SciVerse-mv | BLINK | Q-Bench2 |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 90.8 | 63.1 | 65.0 | 86.2 | 50.3 | 74.0 |
| mPLUG-Owl3-7B-241101 | 92.7 | 67.3 | 65.1 | 82.7 | 53.8 | 77.7 |
Visual Question Answering

| Model | VQAv2 | OK-VQA | GQA | VizWizQA | TextVQA |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 82.1 | 60.1 | 65.0 | 63.5 | 69.0 |
| mPLUG-Owl3-7B-241101 | 83.2 | 61.4 | 64.7 | 62.9 | 71.4 |
Multimodal LLM Benchmarks

| Model | MMB-EN | MMB-CN | MM-Vet | POPE | AI2D |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 77.6 | 74.3 | 40.1 | 88.2 | 73.8 |
| mPLUG-Owl3-7B-241101 | 80.4 | 79.1 | 39.8 | 88.1 | 77.8 |
To evaluate on the above benchmarks, first download the datasets from their official sites or from HuggingFace: ai2d, gqa, LLaVA-NeXT-Interleave-Bench, LongVideoBench, mmbench, mmvet, mvbench, nextqa, NLVR2, okvqa, qbench2, textvqa, videomme, vizwiz, vqav2.
Then organize them as follows in ./evaluation/dataset.
We provide the JSON files of some datasets here to help reproduce the evaluation results in our paper.
├── ai2d
│ ├── data
│ └── README.md
├── gqa
│ └── testdev_balanced.jsonl
├── LLaVA-NeXT-Interleave-Bench
│ ├── eval_images_fix
│ └── multi_image_out_domain.json
├── LongVideoBench
│ ├── lvb_val.json
│ └── videos
├── mmbench
│ ├── mmbench_test_en_20231003.jsonl
│ └── mmbench_test_en_20231003.tsv
├── mmvet
│ └── mm-vet.json
├── mvbench
│ ├── json
│ ├── README.md
│ └── videos
├── nextqa
│ ├── MC
│ ├── NExTVideo
│ └── README.md
├── NLVR2
│ ├── data
│ └── README.md
├── okvqa
│ ├── okvqa_val.json
│ ├── mscoco_val2014_annotations.json
│ └── OpenEnded_mscoco_val2014_questions.json
├── pope
│ ├── ImageQA_POPE_adversarial.jsonl
│ ├── ImageQA_POPE_popular.jsonl
│ └── ImageQA_POPE_random.jsonl
├── qbench2
│ ├── data
│ └── README.md
├── textvqa
│ ├── textvqa_val_annotations.json
│ ├── textvqa_val.json
│ └── textvqa_val_questions_ocr.json
├── videomme
│ ├── data
│ └── test-00000-of-00001.parquet
├── vizwiz
│ └── vizwiz_test.jsonl
└── vqav2
├── v2_mscoco_val2014_annotations.json
├── v2_OpenEnded_mscoco_test2015_questions.json
└── vqav2_test.json
Download the images of the datasets and organize them as follows in ./evaluation/images.
├── gqa
│ └── images
├── mmbench_test_cn_20231003
│ └── images
├── mmbench_test_en_20231003
│ └── images
├── mmvet
│ └── images
├── mscoco
│ └── images
│ ├── test2015
│ └── val2014
├── textvqa
│ └── text_vqa
└── vizwiz
└── test
Once the data is ready, run ./evaluation/eval.sh for evaluation.
The dataset configuration can be modified in ./evaluation/tasks/plans/all.yaml.
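Before launching the evaluation, it can help to confirm that the directory layout matches the trees above. The snippet below is a minimal, hypothetical sanity check of our own (the helper and the folder lists are not part of the repo); it only verifies that the expected top-level folders exist under ./evaluation/dataset and ./evaluation/images.

```python
from pathlib import Path

# Hypothetical sanity check (not part of the repo): confirm that the folders
# described above exist before running ./evaluation/eval.sh.
EXPECTED_DATASETS = [
    "ai2d", "gqa", "LLaVA-NeXT-Interleave-Bench", "LongVideoBench", "mmbench",
    "mmvet", "mvbench", "nextqa", "NLVR2", "okvqa", "pope", "qbench2",
    "textvqa", "videomme", "vizwiz", "vqav2",
]
EXPECTED_IMAGES = [
    "gqa", "mmbench_test_cn_20231003", "mmbench_test_en_20231003",
    "mmvet", "mscoco", "textvqa", "vizwiz",
]

def check_layout(root, expected):
    # Report which expected folders are missing under `root`.
    missing = [name for name in expected if not (Path(root) / name).is_dir()]
    print(f"{root}: " + ("OK" if not missing else f"missing {missing}"))

check_layout("./evaluation/dataset", EXPECTED_DATASETS)
check_layout("./evaluation/images", EXPECTED_IMAGES)
```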
| Model Size | ModelScope | HuggingFace |
|---|---|---|
| 1B | mPLUG-Owl3-1B-241014 | mPLUG-Owl3-1B-241014 |
| 2B | mPLUG-Owl3-2B-241014 | mPLUG-Owl3-2B-241014 |
| 7B | mPLUG-Owl3-7B-240728 | mPLUG-Owl3-7B-240728 |
| 7B | - | mPLUG-Owl3-7B-241101 |
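Optionally, a checkpoint can be fetched ahead of time instead of letting from_pretrained download it lazily. The snippet below is a convenience sketch: the cache directory is only an example, and we assume the ModelScope IDs follow the iic/ namespace used in the quickstart below; huggingface_hub offers an analogous snapshot_download for the HuggingFace checkpoints.

```python
# Optional pre-download of a checkpoint. The cache_dir is an example; adjust the
# model ID to the checkpoint you picked from the table above.
from modelscope import snapshot_download

local_dir = snapshot_download('iic/mPLUG-Owl3-7B-240728', cache_dir='./checkpoints')
print('Checkpoint downloaded to:', local_dir)
```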
Install the dependencies.
pip install -r requirements.txt
Execute the demo.
python gradio_demo.py
Load mPLUG-Owl3 from ModelScope. We currently only support attn_implementation in ['sdpa', 'flash_attention_2'].
import torch
from modelscope import AutoConfig, AutoModel
model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
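If flash-attn is not installed in your environment, the model can be loaded with the other supported backend, 'sdpa'. This is the same call as above with a different attn_implementation, shown here for convenience (it reuses model_path from the block above).

```python
# Alternative: load with PyTorch's SDPA attention when flash-attn is not installed.
# 'sdpa' and 'flash_attention_2' are the two supported attn_implementation values.
model = AutoModel.from_pretrained(
    model_path,
    attn_implementation='sdpa',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
_ = model.eval().cuda()
```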
Chat with images.
from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)
image = Image.new('RGB', (500, 500), color='red')
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]
inputs = processor(messages, images=[image], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)
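Chatting over an interleaved, multi-image context follows the same pattern. The example below is our own sketch, under the assumption that each <|image|> placeholder in the prompt is matched, in order, to an entry of the images list; it reuses model, tokenizer, and processor from above, and the two dummy images are placeholders for your own.

```python
from PIL import Image

# Sketch: multi-image / interleaved input. Assumption: one <|image|> placeholder
# per image, matched to the `images` list in order.
image1 = Image.new('RGB', (500, 500), color='red')
image2 = Image.new('RGB', (500, 500), color='blue')

messages = [
    {"role": "user", "content": """<|image|>
<|image|>
What is the difference between these two images?"""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image1, image2], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)
```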
Chat with a video.
from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu # pip install decord
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]
videos = ['/nas-mmu-data/examples/car_room.mp4']
MAX_NUM_FRAMES=16
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)
inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)
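If a long video still yields too many frames after 1-fps sampling, or you want tighter control over memory, you can instead sample a fixed number of frames spread uniformly across the whole clip. The helper below is our own small variant of encode_video, written as an illustration; it reuses MAX_NUM_FRAMES, videos, messages, and processor from above.

```python
# Our variant of encode_video: always sample exactly `num_frames` frames, spread
# uniformly over the whole clip, regardless of its fps or duration.
def encode_video_fixed(video_path, num_frames=MAX_NUM_FRAMES):
    vr = VideoReader(video_path, ctx=cpu(0))
    num_frames = min(num_frames, len(vr))
    gap = len(vr) / num_frames
    frame_idx = [int(i * gap + gap / 2) for i in range(num_frames)]
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

video_frames = [encode_video_fixed(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)
```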
mPLUG-Owl3 is based on Qwen2, so it can be optimized with Liger-Kernel to reduce memory usage.
pip install liger-kernel
def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model=None,
) -> None:
    """
    Apply Liger kernels to replace the original implementation in the HuggingFace Qwen2 model.

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The model instance to apply Liger kernels to, if the model has already been
            loaded. Default is None.
    """
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    base_model = model.language_model.model

    if rms_norm:
        _patch_rms_norm_module(base_model.norm)

    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")
import torch
from modelscope import AutoConfig, AutoModel
model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
apply_liger_kernel_to_mplug_owl3(model=model)
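To see what the kernels save on your setup, a rough before/after comparison of peak GPU memory around a generation call can be done with PyTorch's built-in counters. The sketch below is our own illustrative measurement, not part of the repo; run it once with and once without calling apply_liger_kernel_to_mplug_owl3, reusing model, model_path, and device from the block above.

```python
import torch
from PIL import Image
from modelscope import AutoTokenizer

# Illustrative measurement (ours): peak GPU memory around one generation call.
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')
messages = [
    {"role": "user", "content": "<|image|>\nDescribe this image."},
    {"role": "assistant", "content": ""}
]

torch.cuda.reset_peak_memory_stats()
inputs = processor(messages, images=[image], videos=None)
inputs.to(device)
inputs.update({'tokenizer': tokenizer, 'max_new_tokens': 100, 'decode_text': True})
_ = model.generate(**inputs)

peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak GPU memory during generation: {peak_gib:.2f} GiB")
```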
When you have more than one GPU, you can set device_map='auto' to split mPLUG-Owl3 across multiple GPUs. Note that this will slow down inference.
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
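The device looked up from hf_device_map above is the one hosting the model's first layers, so that is where the processed inputs should go. In the chat examples, simply send the inputs there instead of hard-coding 'cuda'; a minimal sketch, reusing processor, messages, image, and tokenizer from the image-chat example:

```python
# With device_map="auto", move the inputs to the device hosting the first layers
# instead of hard-coding 'cuda'.
inputs = processor(messages, images=[image], videos=None)
inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)
```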
Load mPLUG-Owl3 from HuggingFace. We currently only support attn_implementation in ['sdpa', 'flash_attention_2'].
import torch
from transformers import AutoConfig, AutoModel
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# model = mPLUGOwl3Model(config).cuda().half()
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
Chat with images.
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from decord import VideoReader, cpu # pip install decord
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)
image = Image.new('RGB', (500, 500), color='red')
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]
inputs = processor(messages, images=[image], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)
Chat with a video.
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from decord import VideoReader, cpu # pip install decord
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]
videos = ['/nas-mmu-data/examples/car_room.mp4']
MAX_NUM_FRAMES=16
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)
Please use ms-swift to finetune mPLUG-Owl3; see the instructions at Finetuning of mPLUG-Owl3.
For mPLUG-Owl3-7B-241101 and newer versions, you should set the model_type to mplug-owl3v-7b-chat instead.
If you find mPLUG-Owl3 useful for your research and applications, please cite using this BibTeX:
@misc{ye2024mplugowl3longimagesequenceunderstanding,
title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
year={2024},
eprint={2408.04840},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.04840},
}
- LLaVA: the codebase we built upon. Thanks to the authors of LLaVA for providing the framework.
- LLaMA. An open-source collection of state-of-the-art large pre-trained language models.
- LLaVA. A visual instruction-tuned vision-language model that achieves GPT-4-level capabilities.
- mPLUG. A vision-language foundation model for both cross-modal understanding and generation.
- mPLUG-2. A multimodal model with a modular design, which inspired our project.