Update to MiniCPM-o 2.6

yiranyyu
2025-01-14 15:33:44 +08:00
parent b75a362dd6
commit 53c0174797
123 changed files with 16848 additions and 2952 deletions

@@ -1,7 +1,7 @@
# MiniCPM-V Finetuning
-We offer the official scripts for easy finetuning of the pretrained **MiniCPM-V-2_6**, **MiniCPM-Llama3-V 2.5** and **MiniCPM-V 2.0** on downstream tasks. Our finetune scripts use transformers Trainer and DeepSpeed by default.
+We offer the official scripts for easy finetuning of the pretrained **MiniCPM-o-2_6**, **MiniCPM-V-2_6**, **MiniCPM-Llama3-V 2.5** and **MiniCPM-V 2.0** on downstream tasks. Our finetune scripts use transformers Trainer and DeepSpeed by default.
### Data preparation
@@ -20,30 +20,30 @@ If your input consists of a single image, you can use a single placeholder **\<i
[
  {
    "id": "0",
    "image": "path/to/image_0.jpg",
    "conversations": [
      {
        "role": "user",
        "content": "<image>\nHow many desserts are on the white plate?"
      },
      {
        "role": "assistant",
        "content": "There are three desserts on the white plate."
      },
      {
        "role": "user",
        "content": "What type of desserts are they?"
      },
      {
        "role": "assistant",
        "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
      },
      {
        "role": "user",
        "content": "What is the setting of the image?"
      },
      {
        "role": "assistant",
        "content": "The image is set on a table top with a plate containing the three desserts."
      }
    ]
  }
]
@@ -91,81 +91,16 @@ If the total token count exceeds `max_length`, truncation will be applied. For m
```
</details>
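Note that the data files above are parsed as standard JSON, so keys and strings must use double quotes; single-quoted files will fail to load. Below is a minimal sanity-check sketch; `check_dataset` is a hypothetical helper for illustration, not part of the official scripts:
```
import json

def check_dataset(path: str) -> None:
    """Hypothetical helper: verify that a finetune JSON file loads and is well-formed."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)  # standard JSON: rejects single quotes and trailing commas
    for sample in samples:
        assert "conversations" in sample, f"sample {sample.get('id')}: missing conversations"
        if isinstance(sample.get("image"), str):
            # A single-image sample should contain exactly one <image> placeholder.
            n = sum(turn["content"].count("<image>") for turn in sample["conversations"])
            assert n == 1, f"sample {sample['id']}: expected one <image>, found {n}"

check_dataset("vl_finetune_data.json")
```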
#### Single Video Example
If your input consists of a single video, you can use a single placeholder **\<video\>** to indicate where the video should be inserted in the conversation.
<details>
<summary>
<b>Single video example (vl_finetune_video.json) with 1 sample.</b>
</summary>
```
[
  {
    "id": "0",
    "video": "path/to/video_0.mp4",
    "conversations": [
      {
        "role": "user",
        "content": "<video>\nHow many desserts are on the white plate?"
      },
      {
        "role": "assistant",
        "content": "There are three desserts on the white plate."
      }
    ]
  }
]
```
</details>
#### Multiple Videos Example
For inputs containing multiple videos, use a dictionary in which each key is a unique placeholder (e.g., **\<video_00\>**, **\<video_01\>**) and the corresponding video path is its value. These placeholders can then be used within the conversation to insert videos at specific positions.
Additionally, to optimize resource management, especially when dealing with large batches of videos during training or inference, consider reducing `video_max_slice_nums` and `max_num_frames`. To minimize the number of tokens used per video, you can set `video_max_slice_nums=1` and `max_num_frames=1`, so that a single video is represented by 64 tokens.
If the total token count exceeds `max_length`, truncation will be applied. For multi-video supervised fine-tuning (SFT), it is recommended to set `MODEL_MAX_LENGTH=4096` in your script for better performance; a rough token-budget sketch follows the example below.
<details>
<summary>
<b>Multiple videos example (vl_finetune_data.json) with 1 sample.</b>
</summary>
```
[
  {
    "id": "0",
    "video": {
      "<video_00>": "path/to/video_0.mp4",
      "<video_01>": "path/to/video_1.avi",
      "<video_02>": "path/to/video_2.mp4",
      "<video_03>": "path/to/video_3.avi"
    },
    "conversations": [
      {
        "role": "user",
        "content": "How to create such text-only videos using CapCut?\n<video_00>\n<video_01>\n<video_02>\n<video_03>\n"
      },
      {
        "role": "assistant",
        "content": "To create a text-only video as shown in the videos, follow these steps in CapCut..."
      }
    ]
  }
]
```
</details>
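As a rough pre-flight check, you can estimate a sample's video token budget before training. The sketch below takes the 64-token baseline from the paragraph above and assumes the cost scales linearly with `max_num_frames` and `video_max_slice_nums`; treat it as an approximation, not the exact tokenizer count:
```
# Rough per-sample video token estimate (approximation, not the exact tokenizer count).
# Baseline from the text above: max_num_frames=1 and video_max_slice_nums=1 give one
# video represented by 64 tokens; linear scaling beyond that is an assumption.
BASE_TOKENS_PER_VIDEO = 64

def estimate_video_tokens(num_videos: int,
                          max_num_frames: int = 1,
                          video_max_slice_nums: int = 1) -> int:
    return num_videos * max_num_frames * video_max_slice_nums * BASE_TOKENS_PER_VIDEO

print(estimate_video_tokens(4))  # the four-video sample above: 256 tokens, well under 4096
```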
### Full-parameter finetuning
Full-parameter finetuning updates all parameters of the LLM throughout training. Please specify the correct MODEL path, DATA path, and LLM_TYPE in the shell scripts.
```shell
MODEL="openbmb/MiniCPM-V-2_6" # or openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
MODEL="MiniCPM-o-2_6" # or "openbmb/MiniCPM-V-2_6", openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
DATA="path/to/trainging_data" # json file
EVAL_DATA="path/to/test_data" # json file
LLM_TYPE="qwen2" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3"
LLM_TYPE="qwen" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3",
# if use openbmb/MiniCPM-o-2_6 or openbmb/MiniCPM-V-2_6, please set LLM_TYPE=qwen
```
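To keep the checkpoint/LLM_TYPE pairing in one place, the comments above can be restated as a small lookup table; `llm_type_for` is a hypothetical helper for illustration:
```
# Checkpoint -> LLM_TYPE lookup, restating the comments in the shell snippet above.
LLM_TYPES = {
    "openbmb/MiniCPM-o-2_6": "qwen",
    "openbmb/MiniCPM-V-2_6": "qwen",
    "openbmb/MiniCPM-Llama3-V-2_5": "llama3",
    "openbmb/MiniCPM-V-2": "minicpm",
}

def llm_type_for(model: str) -> str:
    """Hypothetical helper: return the LLM_TYPE the finetune scripts expect."""
    if model not in LLM_TYPES:
        raise ValueError(f"unknown checkpoint: {model}")
    return LLM_TYPES[model]
```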
To launch your training, run the following script:
@@ -188,7 +123,7 @@ After training, you could load the model with the path to the adapter. We advise
```
from peft import PeftModel
from transformers import AutoModel
-model_type = "openbmb/MiniCPM-V-2_6" # or openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
+model_type = "openbmb/MiniCPM-o-2_6" # or openbmb/MiniCPM-V-2_6, openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
path_to_adapter = "path_to_your_fine_tuned_checkpoint"

# Load the base model, then attach the fine-tuned LoRA adapter.
model = AutoModel.from_pretrained(model_type, trust_remote_code=True)
lora_model = PeftModel.from_pretrained(
    model, path_to_adapter, device_map="auto", trust_remote_code=True
).eval().cuda()
```
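If you prefer shipping a single standalone checkpoint instead of base weights plus an adapter, PEFT can merge the LoRA weights back into the base model; the output path below is illustrative:
```
# Fold the LoRA weights into the base model and save one standalone checkpoint.
merged = lora_model.merge_and_unload()
merged.save_pretrained("path/to/merged_checkpoint")  # illustrative output path
```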