support cogvlm2-video (#1318)
Jintao-Huang authored Jul 8, 2024
1 parent 1e820fd commit 0c7a29d
Showing 12 changed files with 394 additions and 12 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -47,6 +47,7 @@ SWIFT has rich documentations for users, please check [here](https://github.com/
SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary), please feel free to try!

## 🎉 News
- 2024.07.08: Support cogvlm2-video-13b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/cogvlm2-video-best-practice.md).
- 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md).
- 2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to [llava-video best practice](docs/source_en/Multi-Modal/llava-video-best-practice.md) for more information.
- 2024.07.06: Support internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
@@ -558,7 +559,7 @@ The complete list of supported models and datasets can be found at [Supported Mo
| XComposer2<br>XComposer2.5 | [Pujiang AI Lab InternLM vision model](https://github.com/InternLM/InternLM-XComposer) | Chinese<br>English | 7B | chat model |
| DeepSeek-VL | [DeepSeek series vision models](https://github.com/deepseek-ai) | Chinese<br>English | 1.3B-7B | chat model |
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM vision model](https://github.com/OpenBMB/MiniCPM) | Chinese<br>English | 3B-9B | chat model |
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| CogVLM<br>CogAgent<br>CogVLM2<br>CogVLM2-Video<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
3 changes: 2 additions & 1 deletion README_CN.md
@@ -48,6 +48,7 @@ SWIFT has a rich documentation system; if you have any usage questions, please check [here](https:
You can try the SWIFT web-ui on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary).

## 🎉 News
- 2024.07.08: Support cogvlm2-video-13b-chat. You can check the best practice [here](docs/source/Multi-Modal/cogvlm2-video最佳实践.md).
- 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md).
- 2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to the [llava-video best practice](docs/source/Multi-Modal/llava-video最佳实践.md) for more information.
- 2024.07.06: Support the internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
@@ -555,7 +556,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
| XComposer2<br>XComposer2.5 | [Pujiang AI Lab InternLM vision model](https://github.com/InternLM/InternLM-XComposer) | Chinese<br>English | 7B | chat model |
| DeepSeek-VL | [DeepSeek series vision models](https://github.com/deepseek-ai) | Chinese<br>English | 1.3B-7B | chat model |
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM vision model](https://github.com/OpenBMB/MiniCPM) | Chinese<br>English | 3B-9B | chat model |
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| CogVLM<br>CogAgent<br>CogVLM2<br>CogVLM2-Video<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
3 changes: 2 additions & 1 deletion docs/source/LLM/支持的模型和数据集.md
@@ -348,7 +348,7 @@
|yi-vl-34b-chat|[01ai/Yi-VL-34B](https://modelscope.cn/models/01ai/Yi-VL-34B/summary)|q_proj, k_proj, v_proj|yi-vl|&#x2714;|&#x2718;|transformers>=4.34|vision|[01-ai/Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)|
|llava-llama-3-8b-v1_1|[AI-ModelScope/llava-llama-3-8b-v1_1-transformers](https://modelscope.cn/models/AI-ModelScope/llava-llama-3-8b-v1_1-transformers/summary)|q_proj, k_proj, v_proj|llava-llama-instruct|&#x2714;|&#x2718;|transformers>=4.36|vision|[xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers)|
|internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|wqkv|internlm-xcomposer2|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)|
|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|wqkv|internlm-xcomposer2_5|&#x2714;|&#x2718;||vision, video|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)|
|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|wqkv|internlm-xcomposer2_5|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)|
|internvl-chat-v1_5|[AI-ModelScope/InternVL-Chat-V1-5](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)|
|internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)|
|mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)|
@@ -373,6 +373,7 @@
|cogvlm-17b-chat|[ZhipuAI/cogvlm-chat](https://modelscope.cn/models/ZhipuAI/cogvlm-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;|transformers<4.42|vision|[THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf)|
|cogvlm2-19b-chat|[ZhipuAI/cogvlm2-llama3-chinese-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chinese-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;|transformers<4.42|vision|[THUDM/cogvlm2-llama3-chinese-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chinese-chat-19B)|
|cogvlm2-en-19b-chat|[ZhipuAI/cogvlm2-llama3-chat-19B](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-chat-19B/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm|&#x2718;|&#x2718;|transformers<4.42|vision|[THUDM/cogvlm2-llama3-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B)|
|cogvlm2-video-13b-chat|[ZhipuAI/cogvlm2-video-llama3-chat](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense|cogvlm2-video|&#x2718;|&#x2718;|transformers<4.42, decord, pytorchvideo|vision, video|[THUDM/cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat)|
|cogagent-18b-chat|[ZhipuAI/cogagent-chat](https://modelscope.cn/models/ZhipuAI/cogagent-chat/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense, query, key_value, dense|cogagent-chat|&#x2718;|&#x2718;|timm|vision|[THUDM/cogagent-chat-hf](https://huggingface.co/THUDM/cogagent-chat-hf)|
|cogagent-18b-instruct|[ZhipuAI/cogagent-vqa](https://modelscope.cn/models/ZhipuAI/cogagent-vqa/summary)|vision_expert_query_key_value, vision_expert_dense, language_expert_query_key_value, language_expert_dense, query, key_value, dense|cogagent-instruct|&#x2718;|&#x2718;|timm|vision|[THUDM/cogagent-vqa-hf](https://huggingface.co/THUDM/cogagent-vqa-hf)|

9 changes: 6 additions & 3 deletions docs/source/LLM/自定义与拓展.md
@@ -7,17 +7,20 @@
## Custom Datasets
We support three methods for **custom datasets**.

1. [Recommended] Pass arguments directly on the command line, specifying `--dataset xxx.json yyy.jsonl zzz.csv`. **This is the more convenient way to support custom datasets**: it supports five dataset formats (i.e. it uses `SmartPreprocessor`; the supported formats are listed below) and supports both `dataset_id` and `dataset_path`. The `dataset_info.json` file does not need to be modified.
1. [Recommended] Pass arguments directly on the command line, specifying `--dataset xxx.json yyy.jsonl zzz.csv`. **This is the more convenient way to support custom datasets**: it supports five dataset formats (i.e. it uses `SmartPreprocessor`; the supported formats are listed below) and supports both `dataset_id` and `dataset_path`. The `dataset_info.json` file does not need to be modified. This method suits users who are new to ms-swift; the next two methods suit developers who want to extend ms-swift.
2. Add the dataset to `dataset_info.json`. This is more flexible than the first method but more cumbersome; it supports applying two preprocessors to the dataset and specifying their arguments: `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (`SmartPreprocessor` is used by default). You can either modify swift's built-in `dataset_info.json` directly, or pass an external json file via `--custom_dataset_info xxx.json` (convenient for users who pip install rather than git clone and want to extend datasets).
3. **Register the dataset**: more flexible than methods 1 and 2 but also more cumbersome; it supports preprocessing the dataset with a function. Methods 1 and 2 are implemented on top of method 3. You can extend swift by modifying the source code directly, or pass a file via `--custom_register_path xxx.py`; the script will parse the py file (convenient for pip install users).

### 📌 [Recommended] Passing Arguments Directly on the Command Line
You can directly pass in a custom **dataset_id** (compatible with both MS and HF) and **dataset_path**, as well as multiple custom datasets with their corresponding sample counts at the same time; the script will automatically preprocess and concatenate them. If a `dataset_id` is passed, the 'default' subset of that dataset_id is used by default and the split is set to 'train'. If the dataset_id has already been registered, the subsets, split, and preprocessing function provided at registration time are used. If a `dataset_path` is passed, it can be a relative or absolute path, where a relative path is relative to the current working directory.

Each dataset is specified in the following format: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. In the simplest case, only dataset_name, dataset_id, or dataset_path needs to be given.

```bash
--dataset {dataset_id} {dataset_path}
# The modelscope dataset_id is used by default; huggingface dataset_ids are also supported
--dataset {dataset_id} {dataset_path} HF::{dataset_id}

# Dataset mixing: the following takes the subset1 and subset2 subsets of dataset_id and samples 20000 rows
# Dataset mixing: the following takes the subset1 and subset2 subsets of dataset_id and samples 20000 rows. If `#{dataset_sample}` is not used, all samples in the dataset are used
--dataset {dataset_name}#20000 {dataset_id}:{subset1}/{subset2}#20000 {dataset_path}#10000
```
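As a minimal sketch of method 1 (the file name `my_data.jsonl` and the `{model_type}`/`{dataset_id}` placeholders are assumptions for illustration), the following writes a small query/response jsonl file and mixes it with a sampled dataset_id:

```bash
# Create a small custom dataset in query/response jsonl format (assumed file name)
cat > my_data.jsonl <<'EOF'
{"query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb"}
{"query": "AAAAA", "response": "BBBBB"}
EOF

# Mix the local file with a dataset_id, sampling 2000 rows from the latter
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type {model_type} \
    --dataset my_data.jsonl {dataset_id}#2000
```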

144 changes: 144 additions & 0 deletions docs/source/Multi-Modal/cogvlm2-video最佳实践.md
@@ -0,0 +1,144 @@

# CogVLM2 Video Best Practice

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)


## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

# https://github.com/facebookresearch/pytorchvideo/issues/258
# https://github.com/dmlc/decord/issues/177
pip install decord pytorchvideo
```
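The supported-models table in this commit lists `transformers<4.42, decord, pytorchvideo` as the requirements for cogvlm2-video-13b-chat, so a pin along the lines of the sketch below may be needed if a newer transformers is already installed:

```shell
# cogvlm2-video-13b-chat requires transformers<4.42 per the supported-models table
pip install 'transformers<4.42'
```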

Model link:
- cogvlm2-video-13b-chat: [https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat)


## Inference

Inference with cogvlm2-video-13b-chat:
```shell
# Experimental environment: A100
# 28GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type cogvlm2-video-13b-chat
```

Output: (passing a local path or URL is supported)
```python
"""
<<< 描述这段视频
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4
In the video, a young child is seen sitting on a bed and reading a book. The child is wearing glasses and is dressed in a light blue top and pink pants. The room appears to be a bedroom with a crib in the background. The child is engrossed in the book, and the scene is captured in a series of frames showing the child's interaction with the book.
--------------------------------------------------
<<< clear
<<< Describe this video.
Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/fire.mp4
In the video, a person is seen lighting a fire in a backyard setting. They start by holding a piece of food and then proceed to light a match to the food. The fire is then ignited, and the person continues to light more pieces of food, including a bag of chips and a piece of wood. The fire is seen burning brightly, and the person is seen standing over the fire, possibly enjoying the warmth. The video captures the process of starting a fire and the person's interaction with the flames, creating a cozy and inviting atmosphere.
--------------------------------------------------
<<< clear
<<< who are you
Input a video path or URL <<<
I am a person named John.
"""
```

**Single-sample inference**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.cogvlm2_video_13b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

videos = ['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4']
query = '描述这段视频'
response, history = inference(model, template, query, videos=videos)
print(f'query: {query}')
print(f'response: {response}')

# streaming
query = 'Describe this video.'
videos = ['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/fire.mp4']
gen = inference_stream(model, template, query, history, videos=videos)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, _ in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

"""
query: 描述这段视频
response: The video depicts a young child sitting on a bed and reading a book. The child is wearing glasses and is seen in various positions, such as sitting on the bed, sitting on a couch, and sitting on a bed with a blanket. The child's attire changes from a light blue top and pink pants to a light blue top and pink leggings. The room has a cozy and warm atmosphere with soft lighting, and there are personal items scattered around, such as a crib, a television, and a white garment.
query: Describe this video.
response: The video shows a person lighting a fire in a backyard setting. The person is seen holding a piece of food and a lighter, and then lighting the food on fire. The fire is then used to light other pieces of wood, and the person is seen standing over the fire, holding a bag of food. The video captures the process of starting a fire and the person's interaction with the fire.
"""
```


## Fine-tuning
Fine-tuning of multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:

(By default, LoRA fine-tuning is applied to the LLM's qkv. If you want to fine-tune all linear layers, you can specify `--lora_target_modules ALL`; a variant is sketched after the command below.)
```shell
# Experimental environment: A100
# 40GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type cogvlm2-video-13b-chat \
    --dataset video-chatgpt
```
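A variant of the demo above, sketched here for the `--lora_target_modules ALL` option mentioned before the command:

```shell
# Sketch: same demo, but LoRA on all linear layers instead of only the LLM qkv
# (GPU memory usage will likely be higher than the qkv-only default)
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type cogvlm2-video-13b-chat \
    --dataset video-chatgpt \
    --lora_target_modules ALL
```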

[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support json and jsonl formats. Here is an example of a custom dataset:

(Multi-turn conversations are supported, but the conversation as a whole can contain only one video; passing a local path or URL is supported.)

```jsonl
{"query": "55555", "response": "66666", "videos": ["video_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "videos": ["video_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "videos": ["video_path"]}
```
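A file in this format can be passed straight to `swift sft` via `--dataset` (a minimal sketch, assuming a local file named `my_video_data.jsonl`):

```shell
# Sketch: fine-tune on a local custom video dataset instead of the video-chatgpt demo dataset
# (my_video_data.jsonl is an assumed file name in the jsonl format shown above)
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type cogvlm2-video-13b-chat \
    --dataset my_video_data.jsonl
```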


## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/cogvlm2-video-13b-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```

**merge-lora** and inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/cogvlm2-video-13b-chat/vx-xxx/checkpoint-xxx \
    --merge_lora true

CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/cogvlm2-video-13b-chat/vx-xxx/checkpoint-xxx-merged \
    --load_dataset_config true
```
2 changes: 1 addition & 1 deletion docs/source/Multi-Modal/index.md
@@ -22,6 +22,6 @@
4. [florence best practice](florence最佳实践.md)

The whole conversation revolves around a single image (it may be possible to include no image):
1. [CogVLM best practice](cogvlm最佳实践.md), [CogVLM2 best practice](cogvlm2最佳实践.md), [glm4v best practice](glm4v最佳实践.md)
1. [CogVLM best practice](cogvlm最佳实践.md), [CogVLM2 best practice](cogvlm2最佳实践.md), [glm4v best practice](glm4v最佳实践.md), [CogVLM2-Video best practice](cogvlm2-video最佳实践.md)
2. [MiniCPM-V best practice](minicpm-v最佳实践.md), [MiniCPM-V-2 best practice](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5 best practice](minicpm-v-2.5最佳实践.md)
3. [InternVL-Chat-V1.5 best practice](internvl最佳实践.md)