mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Tongyi, Alibaba Group

mPLUG-Owl3

Performance and Efficiency

News and Updates

  • 2024.11.27 🔥🔥🔥 We have released the latest version, mPLUG-Owl3-7B-241101. Performance in video and multi-image scenarios is significantly improved, and it achieves top-1 performance on LVBench🎉🎉🎉.
  • 2024.10.15 We have released small-sized mPLUG-Owl3 models based on Qwen2 0.5B and 1.5B. Checkpoints are available on ModelScope and HuggingFace. Now you can experience Owl3's ultra-long visual content comprehension on edge devices.
  • 2024.09.23 Thanks to ms-swift, finetuning of mPLUG-Owl3 is now supported. Refer to Finetuning of mPLUG-Owl3 for details.
  • 2024.09.23 We have released the evaluation pipeline, which can be found under Evaluation. Please refer to the README for more details.
  • 2024.08.12 We released mPLUG-Owl3. The source code and weights are available on HuggingFace.

Cases

mPLUG-Owl3 can learn from knowledge provided by a retrieval system.

RAG ability

mPLUG-Owl3 can also chat with users in an interleaved image-text context.

Interleaved image-text Dialogue

mPLUG-Owl3 can watch long videos such as movies and remember their details.

Long video understanding

TODO List

  • Evaluation with the HuggingFace model.
  • Training data release. All training data are sourced from public datasets. We are preparing a compact version to facilitate easy training; until it is released, you can manually organize the training data.
  • Training pipeline.

Performance

Benchmark results: Visual Question Answering (VQA), multimodal LLM benchmarks, video benchmarks, and multi-image benchmarks (MI-Bench).

The comparison between mPLUG-Owl3-7B-240728 and mPLUG-Owl3-7B-241101

| Model | NextQA | MVBench | VideoMME w/o sub | LongVideoBench-val | MLVU | LVBench |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 78.6 | 54.5 | 53.5 | 52.1 | 63.7 | - |
| mPLUG-Owl3-7B-241101 | 82.3 | 59.5 | 59.3 | 59.7 | 70.0 | 43.5 |

| Model | NLVR2 | Mantis-Eval | MathVerse-mv | SciVerse-mv | BLINK | Q-Bench2 |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 90.8 | 63.1 | 65.0 | 86.2 | 50.3 | 74.0 |
| mPLUG-Owl3-7B-241101 | 92.7 | 67.3 | 65.1 | 82.7 | 53.8 | 77.7 |

| Model | VQAv2 | OK-VQA | GQA | VizWizQA | TextVQA |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 82.1 | 60.1 | 65.0 | 63.5 | 69.0 |
| mPLUG-Owl3-7B-241101 | 83.2 | 61.4 | 64.7 | 62.9 | 71.4 |

| Model | MMB-EN | MMB-CN | MM-Vet | POPE | AI2D |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 77.6 | 74.3 | 40.1 | 88.2 | 73.8 |
| mPLUG-Owl3-7B-241101 | 80.4 | 79.1 | 39.8 | 88.1 | 77.8 |

Evaluation

To perform evaluation on the above benchmarks, first download the datasets from their official or HuggingFace sites: ai2d, gqa, LLaVA-NeXT-Interleave-Bench, LongVideoBench, mmbench, mmvet, mvbench, nextqa, NLVR2, okvqa, qbench2, textvqa, videomme, vizwiz, vqav2. Then organize them as follows in ./evaluation/dataset.

We provide the JSON files of some datasets here to help reproduce the evaluation results in our paper.

click to unfold
├── ai2d
│   ├── data
│   └── README.md
├── gqa
│   └── testdev_balanced.jsonl
├── LLaVA-NeXT-Interleave-Bench
│   ├── eval_images_fix
│   └── multi_image_out_domain.json
├── LongVideoBench
│   ├── lvb_val.json
│   └── videos
├── mmbench
│   ├── mmbench_test_en_20231003.jsonl
│   └── mmbench_test_en_20231003.tsv
├── mmvet
│   └── mm-vet.json
├── mvbench
│   ├── json
│   ├── README.md
│   └── videos
├── nextqa
│   ├── MC
│   ├── NExTVideo
│   └── README.md
├── NLVR2
│   ├── data
│   └── README.md
├── okvqa
│   ├── okvqa_val.json
│   ├── mscoco_val2014_annotations.json
│   └── OpenEnded_mscoco_val2014_questions.json
├── pope
│   ├── ImageQA_POPE_adversarial.jsonl
│   ├── ImageQA_POPE_popular.jsonl
│   └── ImageQA_POPE_random.jsonl
├── qbench2
│   ├── data
│   └── README.md
├── textvqa
│   ├── textvqa_val_annotations.json
│   ├── textvqa_val.json
│   └── textvqa_val_questions_ocr.json
├── videomme
│   ├── data
│   └── test-00000-of-00001.parquet
├── vizwiz
│   └── vizwiz_test.jsonl
└── vqav2
    ├── v2_mscoco_val2014_annotations.json
    ├── v2_OpenEnded_mscoco_test2015_questions.json
    └── vqav2_test.json

Download the images of the datasets and organize them as follows in ./evaluation/images.

click to unfold
├── gqa
│   └── images
├── mmbench_test_cn_20231003
│   └── images
├── mmbench_test_en_20231003
│   └── images
├── mmvet
│   └── images
├── mscoco
│   └── images
│       ├── test2015
│       └── val2014
├── textvqa
│   └── text_vqa
└── vizwiz
    └── test

Once the data is ready, run ./evaluation/eval.sh for evaluation. The dataset configuration can be modified in ./evaluation/tasks/plans/all.yaml.
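As a quick sanity check before launching eval.sh, a minimal sketch along the lines below (a hypothetical helper, not shipped with this repository; the directory names simply mirror the layouts listed above) can confirm that the expected folders are in place.

# sanity_check_eval_data.py -- hypothetical helper, not part of the repository.
# Verifies that the dataset and image folders described above exist before
# ./evaluation/eval.sh is launched.
from pathlib import Path

EXPECTED_DATASETS = [
    "ai2d", "gqa", "LLaVA-NeXT-Interleave-Bench", "LongVideoBench", "mmbench",
    "mmvet", "mvbench", "nextqa", "NLVR2", "okvqa", "pope", "qbench2",
    "textvqa", "videomme", "vizwiz", "vqav2",
]
EXPECTED_IMAGE_DIRS = [
    "gqa/images", "mmbench_test_cn_20231003/images", "mmbench_test_en_20231003/images",
    "mmvet/images", "mscoco/images/test2015", "mscoco/images/val2014",
    "textvqa/text_vqa", "vizwiz/test",
]

def missing_entries(root, names):
    # Return the entries under `root` that do not exist yet.
    base = Path(root)
    return [name for name in names if not (base / name).exists()]

missing = missing_entries("./evaluation/dataset", EXPECTED_DATASETS) \
        + missing_entries("./evaluation/images", EXPECTED_IMAGE_DIRS)
if missing:
    print("Missing entries:", ", ".join(missing))
else:
    print("Layout looks complete; run ./evaluation/eval.sh")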

Checkpoints

| Model Size | ModelScope | HuggingFace |
|---|---|---|
| 1B | mPLUG-Owl3-1B-241014 | mPLUG-Owl3-1B-241014 |
| 2B | mPLUG-Owl3-2B-241014 | mPLUG-Owl3-2B-241014 |
| 7B | mPLUG-Owl3-7B-240728 | mPLUG-Owl3-7B-240728 |
| 7B | - | mPLUG-Owl3-7B-241101 |
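If you prefer to pre-download a checkpoint instead of letting from_pretrained fetch it lazily, a minimal sketch using the standard hub utilities looks like the following (the repository IDs are the ones used in the Quickstart below; adjust them to the checkpoint you need).

# Optional: pre-download weights with the standard hub utilities.
# Repository IDs are the ones used in the Quickstart; pick the hub you prefer.
from modelscope import snapshot_download
local_dir = snapshot_download('iic/mPLUG-Owl3-2B-241101')
print('ModelScope checkpoint at:', local_dir)

# For the HuggingFace-hosted checkpoints:
from huggingface_hub import snapshot_download as hf_snapshot_download
local_dir = hf_snapshot_download(repo_id='mPLUG/mPLUG-Owl3-7B-240728')
print('HuggingFace checkpoint at:', local_dir)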

Usage

Gradio Demo

Install the dependencies.

pip install -r requirements.txt

Execute the demo.

python gradio_demo.py

Quickstart

Models from 241101 onward

Load mPLUG-Owl3. Currently, only attn_implementation values in ['sdpa', 'flash_attention_2'] are supported.

import torch
from modelscope import AutoConfig, AutoModel
model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"

Chat with images.

from PIL import Image

from modelscope import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord; only needed for the video example below
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')

messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens':100,
    'decode_text':True,
})


g = model.generate(**inputs)
print(g)

Chat with a video.

from PIL import Image

from modelscope import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)


messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES=16

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens':100,
    'decode_text':True,
})


g = model.generate(**inputs)
print(g)

Save memory with Liger-Kernel

mPLUG-Owl3 is based on Qwen2, which can be optimized with Liger-Kernel to reduce memory usage.

pip install liger-kernel

def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model = None,
) -> None:
    """
    Apply Liger kernels to replace original implementation in HuggingFace Qwen2 models

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The model instance to apply Liger kernels to, if the model
            has already been loaded. Default is None.
    """
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    base_model = model.language_model.model

    if rms_norm:
        _patch_rms_norm_module(base_model.norm)

    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")

import torch
from modelscope import AutoConfig, AutoModel
model_path = 'iic/mPLUG-Owl3-2B-241101'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
apply_liger_kernel_to_mplug_owl3(model=model)

Save memory by setting device_map

When you have more than one GPU, you can set device_map='auto' to split mPLUG-Owl3 across multiple GPUs. However, this will slow down inference.

model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
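After that, generation works exactly as in the Quickstart. A minimal follow-up sketch (reusing processor, tokenizer, messages, and image from the image example above) only differs in sending the inputs to the device of the first layer:

# Assumes `processor`, `tokenizer`, `messages`, and `image` are defined as in the
# image example above; only the target device changes when device_map='auto' is used.
inputs = processor(messages, images=[image], videos=None)
inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)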

Models before 241101

Load mPLUG-Owl3. Currently, only attn_implementation values in ['sdpa', 'flash_attention_2'] are supported.

import torch
from transformers import AutoConfig, AutoModel
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# model = mPLUGOwl3Model(config).cuda().half()
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()

Chat with images.

from PIL import Image

from transformers import AutoTokenizer, AutoProcessor
from decord import VideoReader, cpu    # pip install decord; only needed for the video example below
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')

messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=image, videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens':100,
    'decode_text':True,
})


g = model.generate(**inputs)
print(g)

Chat with a video.

from PIL import Image

from transformers import AutoTokenizer, AutoProcessor
from decord import VideoReader, cpu    # pip install decord
model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)


messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES=16

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens':100,
    'decode_text':True,
})


g = model.generate(**inputs)
print(g)

Finetuning

Please use ms-swift to finetune mPLUG-Owl3. Instructions are available here.

For mPLUG-Owl3-7B-241101 and newer versions, you should set the model_type to mplug-owl3v-7b-chat instead.

Citation

If you find mPLUG-Owl3 useful for your research and applications, please cite using this BibTeX:

@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}

Acknowledgement

  • LLaVA: the codebase we built upon. Thanks to the authors of LLaVA for providing the framework.

Related Projects

  • LLaMA. An open-source collection of state-of-the-art large pre-trained language models.
  • LLaVA. A visual-instruction-tuned vision-language model that achieves GPT-4-level capabilities.
  • mPLUG. A vision-language foundation model for both cross-modal understanding and generation.
  • mPLUG-2. A multimodal model with a modular design, which inspired our project.