update for multi images sft

2026-02-05 18:29:18 +08:00 · 2024-08-13 13:57:48 +08:00
parent 61e942ec7c
commit 7842ec1228
5 changed files with 143 additions and 65 deletions
--- a/finetune/readme.md
+++ b/finetune/readme.md
@@ -5,13 +5,15 @@ We offer the official scripts for easy finetuning of the pretrained **MiniCPM-V-

 ### Data preparation

-To prepare your finetuning data, you should formulate each sample as a dictionary consisting of an id, an image path list with an image, and a list of conversations. Then save data samples in JSON files.
+To prepare your fine-tuning data, format each sample as a dictionary consisting of an id, an image path (or, for multiple images, a dictionary mapping placeholders to image paths), and a list of conversations. Then save the data samples in JSON files.

-For the vision-language example with image, you are required to provide **\<image\>** to define the position to insert the image embeddings. If you don't provide \<image\>, the image will be placed at the front of the conversation.
+For vision-language tasks, you must provide placeholders such as **\<image\>** or **\<image_XX\>** to define where the image embeddings are inserted within the conversation. If no placeholder is provided, the image is placed at the front of the conversation by default.

+#### Single Image Example
+If your input consists of a single image, you can use a single placeholder **\<image\>** to indicate where the image should be inserted in the conversation.
 <details>
  <summary>
-    <b>vision-language example (vl_finetune_data.json) with 1 samples.</b>
+    <b>Single image example (vl_finetune_data.json) with 1 sample.</b>
  </summary>

 ```
@@ -50,6 +52,38 @@ For the vision-language example with image, you are required to provide **\<imag

 </details>

+#### Multiple Images Example
+For inputs with multiple images, set the "image" field to a dictionary in which each key is a unique placeholder (e.g., **\<image_00\>**, **\<image_01\>**) and the corresponding value is the image path. You can then use these placeholders in the conversation to insert the images at specific positions.
+
+<details>
+  <summary>
+    <b>Multiple images example (vl_finetune_data.json) with 1 sample.</b>
+  </summary>
+
+```
+  [
+    {
+      "id": "0",
+      "image": {
+        "<image_00>": "path/to/image_0.jpg",
+        "<image_01>": "path/to/image_1.jpg",
+        "<image_02>": "path/to/image_2.jpg",
+        "<image_03>": "path/to/image_3.jpg"
+      },
+      "conversations": [
+        {
+          "role": "user", 
+          "content": "How to create such text-only videos using CapCut?\n<image_00>\n<image_01>\n<image_02>\n<image_03>\n"
+        }, 
+        {
+          "role": "assistant", 
+          "content": "To create a text-only video as shown in the images, follow these steps in CapCut..."
+        }
+      ]
+    }
+  ]
+```
+</details>
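
Before training, it can help to check that the placeholders used in the conversations match the entries in the "image" field. The following is a minimal sketch, not part of the official scripts: `check_sample` and `PLACEHOLDER` are hypothetical names, and the sketch assumes that placeholders take the form `<image>` or `<image_XX>` with `XX` being digits. Whether the training script treats a mismatch as an error is not confirmed here, so the function only reports what it finds.

```python
import re

# Assumed placeholder forms: <image> or <image_XX>, where XX is one or more digits.
PLACEHOLDER = re.compile(r"<image(?:_\d+)?>")


def check_sample(sample):
    """Return a list of problems found in one fine-tuning sample (empty if none)."""
    problems = []
    image = sample.get("image")
    # Collect every placeholder that appears anywhere in the conversation turns.
    text = "\n".join(turn["content"] for turn in sample.get("conversations", []))
    used = set(PLACEHOLDER.findall(text))

    if isinstance(image, dict):
        # Multiple images: keys are placeholders, values are image paths.
        declared = set(image)
        for key in sorted(declared - used):
            problems.append(f"{key} is declared but never used in the conversation")
        for key in sorted(used - declared):
            problems.append(f"{key} is used but has no image path")
    elif isinstance(image, str):
        # Single image: only the plain <image> placeholder makes sense.
        for key in sorted(used - {"<image>"}):
            problems.append(f"{key} is used but the sample has a single image")
    else:
        problems.append("'image' must be a path string or a placeholder-to-path dictionary")
    return problems


sample = {
    "id": "0",
    "image": {
        "<image_00>": "path/to/image_0.jpg",
        "<image_01>": "path/to/image_1.jpg",
    },
    "conversations": [
        {"role": "user", "content": "Compare these images.\n<image_00>\n<image_01>\n"},
        {"role": "assistant", "content": "..."},
    ],
}
print(check_sample(sample))  # prints []
```

To check a whole file, load it with `json.load` and call `check_sample` on each element of the resulting list.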

 ### Full-parameter finetuning