support cogvlm2-video (#1318)
Jintao-Huang authored Jul 8, 2024
1 parent 1e820fd commit 0c7a29d
Showing 12 changed files with 394 additions and 12 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -47,6 +47,7 @@ SWIFT has rich documentations for users, please check [here](https://github.com/
SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary), please feel free to try!

## 🎉 News
- 2024.07.08: Support cogvlm2-video-13b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/cogvlm2-video-best-practice.md).
- 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md).
- 2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to [llava-video best practice](docs/source_en/Multi-Modal/llava-video-best-practice.md) for more information.
- 2024.07.06: Support internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
@@ -558,7 +559,7 @@ The complete list of supported models and datasets can be found at [Supported Mo
| XComposer2<br>XComposer2.5 | [Pujiang AI Lab InternLM vision model](https://github.com/InternLM/InternLM-XComposer) | Chinese<br>English | 7B | chat model |
| DeepSeek-VL | [DeepSeek series vision models](https://github.com/deepseek-ai) | Chinese<br>English | 1.3B-7B | chat model |
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM vision model](https://github.com/OpenBMB/MiniCPM) | Chinese<br>English | 3B-9B | chat model |
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| CogVLM<br>CogAgent<br>CogVLM2<br>CogVLM2-Video<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
3 changes: 2 additions & 1 deletion README_CN.md
@@ -48,6 +48,7 @@ SWIFT has a rich documentation system; if you have any usage questions, please check [here](https:
You can try the SWIFT web-ui on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary).

## 🎉 News
- 2024.07.08: Support cogvlm2-video-13b-chat. You can check the best practice [here](docs/source/Multi-Modal/cogvlm2-video最佳实践.md).
- 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md).
- 2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to the [llava-video best practice](docs/source/Multi-Modal/llava-video最佳实践.md) for more information.
- 2024.07.06: Support the internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
@@ -555,7 +556,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
| XComposer2<br>XComposer2.5 | [Pujiang AI Lab InternLM vision model](https://github.com/InternLM/InternLM-XComposer) | Chinese<br>English | 7B | chat model |
| DeepSeek-VL | [DeepSeek series vision models](https://github.com/deepseek-ai) | Chinese<br>English | 1.3B-7B | chat model |
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM vision model](https://github.com/OpenBMB/MiniCPM) | Chinese<br>English | 3B-9B | chat model |
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| CogVLM<br>CogAgent<br>CogVLM2<br>CogVLM2-Video<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
3 changes: 2 additions & 1 deletion docs/source/LLM/支持的模型和数据集.md
@@ -348,7 +348,7 @@
|yi-vl-34b-chat|[01ai/Yi-VL-34B](https://modelscope.cn/models/01ai/Yi-VL-34B/summary)|q_proj, k_proj, v_proj|yi-vl|&#x2714;|&#x2718;|transformers>=4.34|vision|[01-ai/Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)|
|llava-llama-3-8b-v1_1|[AI-ModelScope/llava-llama-3-8b-v1_1-transformers](https://modelscope.cn/models/AI-ModelScope/llava-llama-3-8b-v1_1-transformers/summary)|q_proj, k_proj, v_proj|llava-llama-instruct|&#x2714;|&#x2718;|transformers>=4.36|vision|[xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers)|
|internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|wqkv|internlm-xcomposer2|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)|
|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|wqkv|internlm-xcomposer2_5|&#x2714;|&#x2718;||vision, video|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)|
|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|wqkv|internlm-xcomposer2_5|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)|
|internvl-chat-v1_5|[AI-ModelScope/InternVL-Chat-V1-5](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)|
|internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)|
|mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)|
@@ -373,6 +373,7 @@
|cogvlm-17b-chat|[ZhipuAI/cogvlm-chat](https://modelscope.cn/models/ZhipuAI/cogvlm-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;|transformers<4.42|vision|[THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf)|
|cogvlm2-19b-chat|[ZhipuAI/cogvlm2-llama3-chinese-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chinese-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;|transformers<4.42|vision|[THUDM/cogvlm2-llama3-chinese-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chinese-chat-19B)|
|cogvlm2-en-19b-chat|[ZhipuAI/cogvlm2-llama3-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;|transformers<4.42|vision|[THUDM/cogvlm2-llama3-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B)|
|cogvlm2-video-13b-chat|[ZhipuAI/cogvlm2-video-llama3-chat](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm2-video|&#x2718;|&#x2718;|transformers<4.42, decord, pytorchvideo|vision, video|[THUDM/cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat)|
|cogagent-18b-chat|[ZhipuAI/cogagent-chat](https://modelscope.cn/models/ZhipuAI/cogagent-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense, query, key_value, dense|cogagent-chat|&#x2718;|&#x2718;|timm|vision|[THUDM/cogagent-chat-hf](https://huggingface.co/THUDM/cogagent-chat-hf)|
|cogagent-18b-instruct|[ZhipuAI/cogagent-vqa](https://modelscope.cn/models/ZhipuAI/cogagent-vqa/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense, query, key_value, dense|cogagent-instruct|&#x2718;|&#x2718;|timm|vision|[THUDM/cogagent-vqa-hf](https://huggingface.co/THUDM/cogagent-vqa-hf)|

9 changes: 6 additions & 3 deletions docs/source/LLM/自定义与拓展.md
@@ -7,17 +7,20 @@
## Custom Datasets
We support three methods for **custom datasets**.

1. [Recommended] Pass arguments directly on the command line, specifying `--dataset xxx.json yyy.jsonl zzz.csv`. **This is the more convenient way to support custom datasets**: it supports five dataset formats (i.e. it uses `SmartPreprocessor`; the supported formats are listed below) and supports both `dataset_id` and `dataset_path`. The `dataset_info.json` file does not need to be modified.
1. [Recommended] Pass arguments directly on the command line, specifying `--dataset xxx.json yyy.jsonl zzz.csv`. **This is the more convenient way to support custom datasets**: it supports five dataset formats (i.e. it uses `SmartPreprocessor`; the supported formats are listed below) and supports both `dataset_id` and `dataset_path`. The `dataset_info.json` file does not need to be modified. This method suits users who are new to ms-swift; the next two methods suit developers who want to extend ms-swift.
2. Add the dataset to `dataset_info.json`. This is more flexible than the first method but more cumbersome; it supports applying two preprocessors to the dataset and specifying their arguments: `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (`SmartPreprocessor` is used by default). You can either modify swift's built-in `dataset_info.json` directly, or pass an external json file via `--custom_dataset_info xxx.json` (convenient for users who pip install rather than git clone and want to extend datasets).
3. **Register the dataset**: more flexible than methods 1 and 2 but also more cumbersome; it supports preprocessing the dataset with a function. Methods 1 and 2 are implemented on top of method 3. You can extend swift by modifying the source code directly, or pass a file via `--custom_register_path xxx.py`; the script will parse the py file (convenient for pip install users).

### 📌 [Recommended] Passing Arguments Directly on the Command Line
You can directly pass in a custom **dataset_id** (compatible with both MS and HF) and **dataset_path**, as well as multiple custom datasets with their corresponding sample counts at the same time; the script will automatically preprocess and concatenate them. If a `dataset_id` is passed, the 'default' subset of that dataset_id is used by default and the split is set to 'train'. If the dataset_id has already been registered, the subsets, split, and preprocessing function provided at registration time are used. If a `dataset_path` is passed, it can be a relative or absolute path, where a relative path is relative to the current working directory.

Each dataset is specified in the following format: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. In the simplest case, only dataset_name, dataset_id, or dataset_path needs to be given.

```bash
--dataset {dataset_id} {dataset_path}
# The modelscope dataset_id is used by default; huggingface dataset_ids are also supported
--dataset {dataset_id} {dataset_path} HF::{dataset_id}

# Dataset mixing: the following takes the subset1 and subset2 subsets of dataset_id and samples 20000 rows
# Dataset mixing: the following takes the subset1 and subset2 subsets of dataset_id and samples 20000 rows. If `#{dataset_sample}` is not used, all samples in the dataset are used
--dataset {dataset_name}#20000 {dataset_id}:{subset1}/{subset2}#20000 {dataset_path}#10000
```
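As a minimal sketch of method 1 (the file name `my_data.jsonl` and the `{model_type}`/`{dataset_id}` placeholders are assumptions for illustration), the following writes a small query/response jsonl file and mixes it with a sampled dataset_id:

```bash
# Create a small custom dataset in query/response jsonl format (assumed file name)
cat > my_data.jsonl <<'EOF'
{"query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb"}
{"query": "AAAAA", "response": "BBBBB"}
EOF

# Mix the local file with a dataset_id, sampling 2000 rows from the latter
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type {model_type} \
    --dataset my_data.jsonl {dataset_id}#2000
```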

144 changes: 144 additions & 0 deletions docs/source/Multi-Modal/cogvlm2-video最佳实践.md
@@ -0,0 +1,144 @@

# CogVLM2 Video Best Practice

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)


## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

# https://github.com/facebookresearch/pytorchvideo/issues/258
# https://github.com/dmlc/decord/issues/177
pip install decord pytorchvideo
```
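The supported-models table in this commit lists `transformers<4.42, decord, pytorchvideo` as the requirements for cogvlm2-video-13b-chat, so a pin along the lines of the sketch below may be needed if a newer transformers is already installed:

```shell
# cogvlm2-video-13b-chat requires transformers<4.42 per the supported-models table
pip install 'transformers<4.42'
```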

Model link:
- cogvlm2-video-13b-chat: [https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat)


## Inference

Inference with cogvlm2-video-13b-chat:
```shell
# Experimental environment: A100
# 28GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type cogvlm2-video-13b-chat
```

Output: (passing a local path or URL is supported)
```python
"""
<<< 描述这段视频
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4
In the video, a young child is seen sitting on a bed and reading a book. The child is wearing glasses and is dressed in a light blue top and pink pants. The room appears to be a bedroom with a crib in the background. The child is engrossed in the book, and the scene is captured in a series of frames showing the child's interaction with the book.
--------------------------------------------------
<<< clear
<<< Describe this video.
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/fire.mp4
In the video, a person is seen lighting a fire in a backyard setting. They start by holding a piece of food and then proceed to light a match to the food. The fire is then ignited, and the person continues to light more pieces of food, including a bag of chips and a piece of wood. The fire is seen burning brightly, and the person is seen standing over the fire, possibly enjoying the warmth. The video captures the process of starting a fire and the person's interaction with the flames, creating a cozy and inviting atmosphere.
--------------------------------------------------
<<< clear
<<< who are you
Input a video path or URL <<<
I am a person named John.
"""
```

**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.cogvlm2_video_13b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

videos = ['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']
query = '描述这段视频'
response, history = inference(model, template, query, videos=videos)
print(f'query: {query}')
print(f'response: {response}')

# streaming
query = 'Describe this video.'
videos = ['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/fire.mp4']
gen = inference_stream(model, template, query, history, videos=videos)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, _ in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

"""
query: 描述这段视频
response: The video depicts a young child sitting on a bed and reading a book. The child is wearing glasses and is seen in various positions, such as sitting on the bed, sitting on a couch, and sitting on a bed with a blanket. The child's attire changes from a light blue top and pink pants to a light blue top and pink leggings. The room has a cozy and warm atmosphere with soft lighting, and there are personal items scattered around, such as a crib, a television, and a white garment.
query: Describe this video.
response: The video shows a person lighting a fire in a backyard setting. The person is seen holding a piece of food and a lighter, and then lighting the food on fire. The fire is then used to light other pieces of wood, and the person is seen standing over the fire, holding a bag of food. The video captures the process of starting a fire and the person's interaction with the fire.
"""
```


## Fine-tuning
Fine-tuning of multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:

(By default, LoRA fine-tuning is applied to the LLM's qkv. If you want to fine-tune all linear layers, you can specify `--lora_target_modules ALL`; a variant is sketched after the command below.)
```shell
# Experimental environment: A100
# 40GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type cogvlm2-video-13b-chat \
    --dataset video-chatgpt
```
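A variant of the demo above, sketched here for the `--lora_target_modules ALL` option mentioned before the command:

```shell
# Sketch: same demo, but LoRA on all linear layers instead of only the LLM qkv
# (GPU memory usage will likely be higher than the qkv-only default)
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type cogvlm2-video-13b-chat \
    --dataset video-chatgpt \
    --lora_target_modules ALL
```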

[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support json and jsonl formats. Here is an example of a custom dataset:

(Multi-turn conversations are supported, but the conversation as a whole can contain only one video; passing a local path or URL is supported.)

```jsonl
{"query": "55555", "response": "66666", "videos": ["video_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "videos": ["video_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "videos": ["video_path"]}
```
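A file in this format can be passed straight to `swift sft` via `--dataset` (a minimal sketch, assuming a local file named `my_video_data.jsonl`):

```shell
# Sketch: fine-tune on a local custom video dataset instead of the video-chatgpt demo dataset
# (my_video_data.jsonl is an assumed file name in the jsonl format shown above)
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type cogvlm2-video-13b-chat \
    --dataset my_video_data.jsonl
```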


## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/cogvlm2-video-13b-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```

**merge-lora** and inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/cogvlm2-video-13b-chat/vx-xxx/checkpoint-xxx \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/cogvlm2-video-13b-chat/vx-xxx/checkpoint-xxx-merged \
    --load_dataset_config true
```
2 changes: 1 addition & 1 deletion docs/source/Multi-Modal/index.md
@@ -22,6 +22,6 @@
4. [florence best practice](florence最佳实践.md)

The whole conversation revolves around a single image (it may be possible to include no image):
1. [CogVLM best practice](cogvlm最佳实践.md), [CogVLM2 best practice](cogvlm2最佳实践.md), [glm4v best practice](glm4v最佳实践.md)
1. [CogVLM best practice](cogvlm最佳实践.md), [CogVLM2 best practice](cogvlm2最佳实践.md), [glm4v best practice](glm4v最佳实践.md), [CogVLM2-Video best practice](cogvlm2-video最佳实践.md)
2. [MiniCPM-V best practice](minicpm-v最佳实践.md), [MiniCPM-V-2 best practice](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5 best practice](minicpm-v-2.5最佳实践.md)
3. [InternVL-Chat-V1.5 best practice](internvl最佳实践.md)