Merge pull request #175 from zchoi/main

update MMEvol codebase

tnlin authored Nov 26, 2024
2 parents 216a79f + 37ef5e8 commit 9406bd7
Showing 31 changed files with 139 additions and 153 deletions.
Binary file modified .DS_Store
Binary file not shown.
31 changes: 17 additions & 14 deletions mmevol/README.md
@@ -1,6 +1,9 @@
<!-- # MMEvol -->

# MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
<p align="center">
<img src="dataengine/assets/mmevol_logo.png" width="50%" height="50%">
</p>

<div align="center">
<br>
@@ -23,8 +26,6 @@
<a>Jingkuan Song<sup><span>4🌟</span></sup>,
<br>



\* Equal contribution 🌟 Corresponding author

<sup>1</sup> Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences<br>
@@ -38,12 +39,9 @@

</div>

<p align="center">
<img src="mmevol_sft_data/assets/mmevol.jpg" width="100%" height="100%">
</p>

<font size=5><div align='center' > [[📖 arXiv Paper](https://arxiv.org/pdf/2409.05840)] [[📊 Dataset](https://huggingface.co/datasets/Tongyi-ConvAI/MMEvol)] [[🏆 Models](https://huggingface.co/models/Tongyi-ConvAI/MMEvol)] </div></font>
MMEvol is the first method that successfully introduces Evol-Instruct into multimodal domain to improve the diversity and complexity of multimodal instruction data. Compared with previous methods like vila2, MIMIC-IT, and MMInstruct, it can perform iterative evolution in a very elegant and simple way in a fully automatic way, breaking through human imagination of data complexity and diversity. It has no restrictions on the form of data, the type of task, or complex processing. It can quickly perform self-iterative evolution on limited image instruction data to obtain ultra-high-quality multimodal data, thereby giving multimodal models more powerful capabilities. At the same time, it can be orthogonally combined with other data flow-driven methods such as vila2, MIMIC-IT, and MMInstruct to obtain more powerful data construction effects. Everyone is welcome to experience it now!
MMEvol is the first method to successfully bring Evol-Instruct into the multimodal domain to improve the diversity and complexity of multimodal instruction data. Compared with previous methods such as VILA2, MIMIC-IT, and MMInstruct, it performs iterative evolution in an elegant, simple, and fully automatic way, pushing data complexity and diversity beyond what manual design typically achieves. It places no restrictions on the form of the data or the type of task and requires no complex processing: starting from limited image instruction data, it quickly self-evolves ultra-high-quality multimodal data, giving multimodal models more powerful capabilities. It can also be combined orthogonally with other data-driven methods such as VILA2, MIMIC-IT, and MMInstruct for even stronger data construction. Everyone is welcome to try it now!

## 🔥 Update

@@ -103,8 +101,8 @@ Here are the pretrained weights and instruction tuning weights

| Model | Pretrained Projector | Base LLM | PT Data | IT Data | Download |
| ---------------- | -------------------- | --------- | ------------------------------------------------------------ | ------- | -------- |
| MMEvol-Qwen2-7B | [mm_projector]() | Qwen2-7B | [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | MMEvol | [ckpt]() |
| MMEvol-LLaMA3-8B | [mm_projector]() | LLaMA3-8B | [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | MMEvol | [ckpt]() |
| MMEvol-Qwen2-7B | [mm_projector](https://huggingface.co/models/Tongyi-ConvAI/MMEvol) | Qwen2-7B | [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | MMEvol | [ckpt](https://huggingface.co/models/Tongyi-ConvAI/MMEvol) |
| MMEvol-LLaMA3-8B | [mm_projector](https://huggingface.co/models/Tongyi-ConvAI/MMEvol) | LLaMA3-8B | [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | MMEvol | [ckpt](https://huggingface.co/models/Tongyi-ConvAI/MMEvol) |
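
To fetch the released projector and checkpoint weights programmatically, something along these lines should work; the repo id is inferred from the links above (adjust it if the Hub path differs) and the local directory is arbitrary:

```python
# Hedged sketch — downloads the released weights with huggingface_hub.
# Repo id inferred from the model links above; local_dir is a placeholder.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Tongyi-ConvAI/MMEvol", local_dir="checkpoints/MMEvol")
```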

### Performance

@@ -255,9 +253,10 @@ bash scripts/v1_6/train/llama3/finetune.sh
bash scripts/v1_6/train/qwen2/finetune.sh
```


## 📈 Evaluation

#### Ensure that your `api_base` and `key` are correctly configured before evaluation.

## OpenCompass

First, enter the `vlmevalkit` directory and install all dependencies:
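
The install commands themselves are collapsed in this view; a typical sequence, assuming `vlmevalkit` ships a standard `setup.py`/`requirements.txt` layout, is:

```bash
# Hedged sketch — assumes a standard Python package layout inside vlmevalkit.
cd vlmevalkit
pip install -e .          # or: pip install -r requirements.txt
```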
@@ -313,6 +312,8 @@ While scoring on each benchmark directly, set `MODE=all`. If only inference resu
./script/run_inference.sh MMEvol-Llama3-V-1_6 MathVista_MINI all
.....

# NOTE: run llava/eval/blink_eval.py separately for BLINK evaluation.
python llava/eval/blink_eval.py
```

<br />
@@ -335,22 +336,24 @@ python llava/eval/mminst_eval.py

<br />



## 👀 Visualization

Tongyi-ConvAI generates this dataset for multimodal supervised fine-tuning. It was used to train **Evol-Llama3-8B-Instruct** and **Evol-Qwen2-7B**, reported in [our paper](https://arxiv.org/pdf/2409.05840). To create it, we first select a 163K seed instruction-tuning dataset (SEED-163K) for Evol-Instruct, and then enhance data quality through an iterative process that combines fine-grained perception, cognitive reasoning, and interaction evolution. This yields a more complex and diverse image-text instruction dataset, which in turn equips MLLMs with stronger capabilities. Below we show the detailed data distribution of SEED-163K, which is prepared for the multi-round evolution described above. More details can be found in our paper.

<div align=center>
<img width="90%" src="mmevol_sft_data/assets/mmevol.jpg"/>
<img width="90%" src="dataengine/assets/mmevol_seed_dis.jpg"/>
</div>

<div align='center' >
<details>
<summary> Click to expand more examples</summary>
<p align="center">
<img src="mmevol_sft_data/assets/mmevol.jpg" width="60%" height="60%">
<img src="mmevol_sft_data/assets/mmevol.jpg" width="60%" height="60%">
<img src="mmevol_sft_data/assets/mmevol.jpg" width="60%" height="60%">
<img src="mmevol_sft_data/assets/mmevol.jpg" width="60%" height="60%">
<img src="dataengine/assets/mmevol_pai.png" width="90%" height="90%">
<img src="dataengine/assets/mmevol_dis_cam.png" width="90%" height="90%">
<img src="dataengine/assets/mmevol_long_tail.png" width="90%" height="90%">
<img src="dataengine/assets/mmevol_performance.png" width="90%" height="90%">
</details>
</div>

79 changes: 79 additions & 0 deletions mmevol/dataengine/README.md
@@ -0,0 +1,79 @@
# Data construction pipeline for MMEvol-480k

<p align="center">
<img src="assets/mmevol_logo.png" width="50%" height="50%">
</p>

<div align="center">
<br>
<a href="https://scholar.google.com/citations?user=phg8yxoAAAAJ&hl=zh-CN&oi=ao">Run Luo</a><sup><span>1,2*</span></sup>,
<a>Haonan Zhang</a><sup><span>3*</span></sup>,
<a>Longze Chen</a><sup><span>1,2*</span></sup>,
<a>Ting-En Lin</a><sup><span>3*</span></sup>,
<a>Xiong Liu</a><sup><span>3</span></sup>,
<a>Yuchuan Wu</a><sup><span>3</span></sup>,
<a>Min Yang</a><sup><span>1,2🌟</span></sup>,
<a>Yongbin Li</a><sup><span>3🌟</span></sup>,
<br>
<a>Minzheng Wang<sup><span>2</span></sup>,
<a>Pengpeng Zeng<sup><span>4</span></sup>,
<a>Lianli Gao<sup><span>5</span></sup>,
<a>Heng Tao Shen<sup><span>4</span></sup>,
<a>Yunshui Li<sup><span>1,2</span></sup>,
<a>Xiaobo Xia<sup><span>6</span></sup>,
<a>Fei Huang<sup><span>3</span></sup>,
<a>Jingkuan Song<sup><span>4🌟</span></sup>,
<br>

\* Equal contribution 🌟 Corresponding author

<sup>1</sup> Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences<br>
<sup>2</sup> University of Chinese Academy of Sciences<br>
<sup>3</sup> Alibaba Group<br>
<sup>4</sup> Tongji University<br>
<sup>5</sup> Independent Researcher<br>
<sup>6</sup> The University of Sydney<br>

![Multi-Modal](https://img.shields.io/badge/Task-Multi--Modal-red) <a href='https://arxiv.org/pdf/2409.05840'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/models/Tongyi-ConvAI/MMEvol'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/datasets/Tongyi-ConvAI/MMEvol'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'> <a href='https://mmevol.github.io/'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Project-Page-green'></a></a>

</div>


<font size=5><div align='center' > [[📖 arXiv Paper](https://arxiv.org/pdf/2409.05840)] [[📊 Dataset](https://huggingface.co/datasets/Tongyi-ConvAI/MMEvol)] [[🏆 Models](https://huggingface.co/models/Tongyi-ConvAI/MMEvol)] </div></font>

Follow the instructions below to generate MMEvol-480k.

1. Download the SEED-163k JSON file (`mm_seed_no_evo_163k.json`) from [🤗 huggingface](https://huggingface.co/datasets/Tongyi-ConvAI/MMEvol/tree/main/jsons) and place it under the `./dataengine/datasets` path.
2. Execute the preprocessing code under the `dataengine/datasets` path to split each sample into the `meta_data` folder:
```bash
python dataengine/datasets/process.py
```
3. Prepare the data storage folder by following the format of `./dataengine/evolution/folder_template`: simply copy `folder_template` and rename it to match your data name, _e.g._, `mmevol_1k_evo.json`.
4. Ensure that your `api_base` and `key` are correctly configured before starting generation. Set your `key` and `api_base` in both of the following places (a minimal sketch follows these steps):

- lines 129-130 in `dataengine/multi_round.py`
- lines 126-127 in `dataengine/score_process/difficulty_scoring_v123.py`
5. Run the following command to start the three-round data evolution:
```bash
python dataengine/multi_round.py
```
Three rounds of evolution are performed on SEED-163k, with data filtering at the end of each round. The final evolved data is stored under the `./datasets` path.
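
For reference, a minimal sketch of the fields to fill in. The attribute and model names follow the `BaseAPI`-style snippets later in this diff; the class name, endpoint, and key below are placeholders, not the project's actual values:

```python
# Hedged sketch of the API configuration set inside the data-engine wrappers
# (lines 129-130 of dataengine/multi_round.py and lines 126-127 of
#  dataengine/score_process/difficulty_scoring_v123.py in the committed code).
class APIConfigSketch:  # hypothetical name, for illustration only
    def __init__(self):
        self.api_base = "https://your-endpoint.example.com/api/ask"  # placeholder endpoint
        self.key = "sk-your-api-key"                                  # placeholder key
        self.model = "gpt-4o-mini"                                    # model used in the committed code
```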

**License**: Please follow [Meta Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) and [Gemma License](https://www.kaggle.com/models/google/gemma/license/).

## 📚 Citation

```bibtex
@article{luo2024mmevol,
title={Mmevol: Empowering multimodal large language models with evol-instruct},
author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
journal={arXiv preprint arXiv:2409.05840},
year={2024}
}
```

**Contact**:

- Run Luo — [email protected]

- Haonan Zhang — [email protected]
Binary file added mmevol/dataengine/assets/mmevol_dis_cam.png
Binary file not shown.
Binary file added mmevol/dataengine/assets/mmevol_logo.png
Binary file not shown.
Binary file added mmevol/dataengine/assets/mmevol_long_tail.png
Binary file not shown.
Binary file added mmevol/dataengine/assets/mmevol_pai.png
Binary file not shown.
Binary file added mmevol/dataengine/assets/mmevol_performance.png
Binary file not shown.
File renamed without changes
File renamed without changes.
24 changes: 24 additions & 0 deletions mmevol/dataengine/datasets/process.py
@@ -0,0 +1,24 @@
import json
import os
import os.path as osp
from tqdm import tqdm
import shutil

# Construct a hash_id as a unique index, because both the id and image keys contain duplicate values
datasets_path = "/mnt/data/haonan/code/dataengine/datasets"

a = json.load(open(osp.join(datasets_path, "seed_data_1k_demo.json"), "r"))
for index, i in enumerate(a):
i["hash_id"] = str(index) + "_" + i["image"].replace("/", "_")

json.dump(a, open("/mnt/data/haonan/code/dataengine/datasets/seed_data_1k_demo.json", "w"), indent=4)

# Assuming the data format is already well organized, rebuild the meta_data folder and store each sample as a separate JSON file (named by hash_id)
if os.path.exists(osp.join(datasets_path, "meta_data")):
shutil.rmtree(osp.join(datasets_path, "meta_data"))
os.mkdir(osp.join(datasets_path, "meta_data"))

data = json.load(open(osp.join(datasets_path, "seed_data_1k_demo.json"), "r"))

for index, d in enumerate(tqdm(data)):
json.dump(d, open(osp.join(datasets_path, "meta_data", "{}.json".format(d["hash_id"])), "w"), indent=4)
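
As committed, the script hardcodes an absolute `datasets_path` and the 1k demo JSON; to reproduce the full pipeline you would presumably point these at your local checkout and at the SEED-163k file from step 1 of the README. A hedged sketch, with placeholder paths and a hypothetical `seed_json` name:

```python
import os.path as osp

# Placeholder paths — adjust to your own checkout and downloaded data.
datasets_path = "./dataengine/datasets"                           # instead of the hardcoded /mnt/... path
seed_json = osp.join(datasets_path, "mm_seed_no_evo_163k.json")   # file downloaded in step 1
```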
@@ -1,6 +1,6 @@
import os
import sys
sys.path.append("/mnt/data/haonan/code/mmevol_sft_data")
sys.path.append("/mnt/data/haonan/code/dataengine")
from base import BaseAPI
import numpy as np
from tqdm import tqdm
@@ -466,13 +466,13 @@ def filter_round3(meta_data, conversation_v3_path):

if __name__=='__main__':

final_save_path = "/mnt/data/haonan/code/mmevol_sft_data/datasets/seed_data_1k_demo_evo.json"
root_path = '/mnt/data/haonan/code/mmevol_sft_data/evolution/multi_round_single_imgs_1k_mini'
final_save_path = "/mnt/data/haonan/code/dataengine/datasets/seed_data_1k_demo_evo.json"
root_path = '/mnt/data/haonan/code/dataengine/evolution/multi_round_single_imgs_1k_mini'
img_path = '/mnt/workspace/lr/datasets'

for round_n in [1,2,3]:
if round_n == 1:
seed_data_path = "/mnt/data/haonan/code/mmevol_sft_data/datasets/meta_data"
seed_data_path = "/mnt/data/haonan/code/dataengine/datasets/meta_data"
else:
seed_data_path = osp.join(root_path, "round{}".format(round_n-1), "filtered_qa")

@@ -534,4 +534,4 @@ def filter_round3(meta_data, conversation_v3_path):
merged_data.append(data)

json.dump(merged_data, open(final_save_path, "w"), indent=4)
print("Saveing file to {}".format(final_save_path))
print("Saveing file to {}".format(final_save_path))
File renamed without changes.
File renamed without changes.
@@ -124,12 +124,9 @@ def __init__(self,
print('Unknown API Base. ')
sys.exit(-1)

self.api_base="http://47.88.8.18:8088/api/ask"
# self.api_base = "http://47.88.8.18:8088/api/ask?tenant=gpt-4o-mini"
# self.key = "eyJ0eXAiOiJqd3QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VybmFtZSI6IjI1ODczMCIsInBhc3N3b3JkIjoiMjU4NzMwMTIzIiwiZXhwIjoyMDE5NTUwNzAxfQ.JuqnTa7yauGkSzWkBiEig1K_rxvfAYTXS9F9_m-h4q8"
# self.key = "eyJ0eXAiOiJqd3QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VybmFtZSI6IjI3NDM2OCIsInBhc3N3b3JkIjoiMjc0MzY4MTIzIiwiZXhwIjoyMDEyNjEzNjA4fQ.7OUpHs-AFPaFHuUy_p7XxXyNYhca2_-7F5GBtaahfe4"
self.key = "eyJhbGciOiJIUzI1NiIsInR5cCI6Imp3dCJ9.eyJ1c2VybmFtZSI6IjQ0MzQ1NSIsInBhc3N3b3JkIjoiNDQzNDU1MTIzIiwiZXhwIjoyMDMxNzA1NTA3fQ.7g4a6t9dKcRXVRa7MwQb5m2oirFu1OxjXhWbNM0w50s"
# self.key = "eyJhbGciOiJIUzI1NiIsInR5cCI6Imp3dCJ9.eyJ1c2VybmFtZSI6IjQzOTg2OSIsInBhc3N3b3JkIjoiNDM5ODY5MTIzIiwiZXhwIjoyMDMxNzA3NjkzfQ.ly9XNzVW7pEeW_bTZxzaqB3jt2kRr14XQIpT0DbCTto"
self.api_base = ""
self.key = ""

# self.model = "gpt-4o-2024-08-06"
self.model = "gpt-4o-mini"

@@ -123,10 +123,9 @@ def __init__(self,
print('Unknown API Base. ')
sys.exit(-1)

self.api_base="http://47.88.8.18:8088/api/ask"
# self.api_base = "http://47.88.8.18:8088/api/ask?tenant=gpt-4o-mini"
# self.key = "eyJ0eXAiOiJqd3QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VybmFtZSI6IjI1ODczMCIsInBhc3N3b3JkIjoiMjU4NzMwMTIzIiwiZXhwIjoyMDE5NTUwNzAxfQ.JuqnTa7yauGkSzWkBiEig1K_rxvfAYTXS9F9_m-h4q8"
self.key = "eyJhbGciOiJIUzI1NiIsInR5cCI6Imp3dCJ9.eyJ1c2VybmFtZSI6IjQ0MzQ1NSIsInBhc3N3b3JkIjoiNDQzNDU1MTIzIiwiZXhwIjoyMDMxNzA1NTA3fQ.7g4a6t9dKcRXVRa7MwQb5m2oirFu1OxjXhWbNM0w50s"
self.api_base = ""
self.key = ""

# self.model="gpt-4o-2024-05-13"
self.model = "gpt-4o-mini"

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
9 changes: 5 additions & 4 deletions mmevol/llava/eval/mmvp_eval.py
@@ -109,11 +109,12 @@ def make_request(meta):
with Pool(processes=50) as pool:
output = list(tqdm(pool.imap(make_request, data), total=len(data)))

print(output)
for i in set(all_types):
# print(output)
# for i in set(all_types):

for j in data:
if j['type']==i
# for j in data:
# if j['type']==i

num_correct, num_total = 0, 0
# Continue with the processing of the JSONL file
index=0
51 changes: 0 additions & 51 deletions mmevol/mmevol_sft_data/README.md

This file was deleted.

Binary file removed mmevol/mmevol_sft_data/assets/mmevol.jpg
Binary file not shown.
65 changes: 0 additions & 65 deletions mmevol/mmevol_sft_data/datasets/process.ipynb

This file was deleted.

Binary file removed mmevol/vlmevalkit/.DS_Store
Binary file not shown.
Binary file removed mmevol/vlmevalkit/vlmeval/.DS_Store
Binary file not shown.