#### Single Image Example

If your input consists of a single image, you can use a single placeholder **\<image\>** to indicate where the image should be inserted in the conversation.

<details>
  <summary>
    <b>Single image example with 1 sample.</b>
  </summary>

```
[
  {
    "id": "0",
    "image": "path/to/image_0.jpg",
    "conversations": [
      {
        "role": "user",
        "content": "<image>\nHow many desserts are on the white plate?"
      },
      {
        "role": "assistant",
        "content": "There are three desserts on the white plate."
      },
      {
        "role": "user",
        "content": "What type of desserts are they?"
      },
      {
        "role": "assistant",
        "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
      },
      {
        "role": "user",
        "content": "What is the setting of the image?"
      },
      {
        "role": "assistant",
        "content": "The image is set on a table top with a plate containing the three desserts."
      }
    ]
  }
]
```

</details>
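
If you are converting an existing dataset, it is often easier to generate this JSON file programmatically than to write it by hand. Below is a minimal sketch, assuming you already have (image path, question/answer) pairs; the source data and variable names are illustrative, not part of the repo:

```python
# Illustrative script that writes a finetuning JSON file in the
# single-image format shown above.
import json

# Hypothetical source data: one image with its question/answer turns.
raw_samples = [
    ("path/to/image_0.jpg", [
        ("How many desserts are on the white plate?",
         "There are three desserts on the white plate."),
    ]),
]

samples = []
for idx, (image_path, qa_pairs) in enumerate(raw_samples):
    conversations = []
    for turn, (question, answer) in enumerate(qa_pairs):
        # The <image> placeholder appears once, in the first user turn.
        prefix = "<image>\n" if turn == 0 else ""
        conversations.append({"role": "user", "content": prefix + question})
        conversations.append({"role": "assistant", "content": answer})
    samples.append({"id": str(idx), "image": image_path,
                    "conversations": conversations})

with open("vl_finetune_data.json", "w") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```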

#### Single Video Example

If your input consists of a single video, you can use a single placeholder **\<video\>** to indicate where the video should be inserted in the conversation.

<details>
  <summary>
    <b>Single video example (vl_finetune_video.json) with 1 sample.</b>
  </summary>

```
[
  {
    "id": "0",
    "video": "path/to/video_0.mp4",
    "conversations": [
      {
        "role": "user",
        "content": "<video>\nHow many desserts are on the white plate?"
      },
      {
        "role": "assistant",
        "content": "There are three desserts on the white plate."
      }
    ]
  }
]
```

</details>

#### Multiple Videos Example

For inputs containing multiple videos, use a dictionary in which each key is a unique placeholder (e.g., **\<video_00\>**, **\<video_01\>**) and the corresponding video path is its value. These placeholders can then be used within the conversation to insert videos at specific positions.

Additionally, to keep memory and token usage manageable, especially when dealing with large batches of videos during training or inference, consider reducing `video_max_slice_nums` and `max_num_frames`. To minimize the number of tokens used per video, you can set `video_max_slice_nums=1` and `max_num_frames=1`, so that a single video is represented by 64 tokens.

If the total token count exceeds `max_length`, truncation will be applied. For multi-video supervised fine-tuning (SFT), it's recommended to set `MODEL_MAX_LENGTH=4096` in your script for better performance.
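
As a back-of-the-envelope check before training, you can estimate the visual-token cost of a sample against `MODEL_MAX_LENGTH`. The sketch below is illustrative, not repo code: the only figure taken from this document is 64 tokens per video at the 1-slice, 1-frame setting; treating the cost as scaling linearly with frames and slices is an assumption.

```python
# Hypothetical helper: rough visual-token budget for the videos in one sample.
# Assumes ~64 tokens per frame slice (the 1x1 setting above costs 64 tokens)
# and linear scaling in frames and slices.
def estimate_video_tokens(num_videos: int,
                          max_num_frames: int = 1,
                          video_max_slice_nums: int = 1,
                          tokens_per_slice: int = 64) -> int:
    return num_videos * max_num_frames * video_max_slice_nums * tokens_per_slice

# The four-video sample below costs ~256 visual tokens at the minimal
# settings, leaving most of MODEL_MAX_LENGTH=4096 for the text turns.
print(estimate_video_tokens(num_videos=4))  # 256
```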

<details>
  <summary>
    <b>Multiple videos example (vl_finetune_data.json) with 1 sample.</b>
  </summary>

```
[
  {
    "id": "0",
    "video": {
      "<video_00>": "path/to/video_0.mp4",
      "<video_01>": "path/to/video_1.avi",
      "<video_02>": "path/to/video_2.mp4",
      "<video_03>": "path/to/video_3.avi"
    },
    "conversations": [
      {
        "role": "user",
        "content": "How to create such text-only videos using CapCut?\n<video_00>\n<video_01>\n<video_02>\n<video_03>\n"
      },
      {
        "role": "assistant",
        "content": "To create a text-only video as shown in the videos, follow these steps in CapCut..."
      }
    ]
  }
]
```

</details>
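
Since each placeholder key must appear verbatim in the conversation text (and each placeholder used in the text must have a path), a quick pre-flight check can catch typos before training. The following is a minimal sketch assuming the JSON schema shown above; the file name and helper are illustrative, not part of the repo:

```python
# Hypothetical sanity check: report placeholders that are defined in the
# "image"/"video" dict but never used in the conversation, and vice versa.
import json
import re

with open("vl_finetune_data.json") as f:
    samples = json.load(f)

for sample in samples:
    text = "\n".join(turn["content"] for turn in sample["conversations"])
    for field in ("image", "video"):
        value = sample.get(field)
        if not isinstance(value, dict):  # single-file format uses <image>/<video>
            continue
        defined = set(value)
        used = set(re.findall(rf"<{field}_\d+>", text))
        for placeholder in sorted(defined - used):
            print(f"sample {sample['id']}: {placeholder} defined but never used")
        for placeholder in sorted(used - defined):
            print(f"sample {sample['id']}: {placeholder} used but has no path")
```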

### Full-parameter finetuning

Full-parameter finetuning updates all parameters of the LLM throughout training. Please specify the correct MODEL path, DATA path, and LLM_TYPE in the shell scripts.