Update to MiniCPM-o 2.6

This commit is contained in:
yiranyyu
2025-01-14 15:33:44 +08:00
parent b75a362dd6
commit 53c0174797
123 changed files with 16848 additions and 2952 deletions

1949  README.md
File diff suppressed because it is too large.

File diff suppressed because it is too large.

File diff suppressed because it is too large.

BIN  assets/MiniCPM-o.png  Normal file  (373 KiB)

BIN  assets/discord.png  Normal file  (272 B)

3  assets/logo.html  Normal file

@@ -0,0 +1,3 @@
<span style="color:#56A7DA; font-size: 10em; font-weight: bold;">
MiniCPM-<span>o</span>
</span>

BIN  6 additional image files added (1023 KiB, 1.8 MiB, 785 KiB, 8.6 MiB, 100 KiB, 2.6 MiB); file names and previews not shown

BIN  assets/radar.jpg  Normal file  (842 KiB)

BIN  4 additional binary files changed; previews not shown

BIN  assets/wechat.png  Normal file  (245 B)


@@ -0,0 +1,333 @@
## MiniCPM-Llama3-V 2.5
> Archived at: 2025-01-13

**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
- 🔥 **Leading Performance.**
MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.
- 💪 **Strong OCR Capabilities.**
MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.
- 🏆 **Trustworthy Behavior.**
Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best-level performance within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).
- 🌏 **Multilingual Support.**
Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
- 🚀 **Efficient Deployment.**
MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.
- 💫 **Easy Usage.**
MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/MiniCPM-Llama3-V-2.5-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>OCRBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>Open-Compass</th>
<th>MME</th>
<th>MMB test (en)</th>
<th>MMB test (cn)</th>
<th>MMMU val</th>
<th>Math-Vista</th>
<th>LLaVA Bench</th>
<th>RealWorld QA</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro</td>
<td>-</td>
<td>680</td>
<td>74.6</td>
<td>88.1</td>
<td>62.9</td>
<td>2148.9</td>
<td>73.6</td>
<td>74.3</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>60.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td>
<td>-</td>
<td>645</td>
<td>78.0</td>
<td>88.4</td>
<td>63.5</td>
<td>1771.5</td>
<td>77.0</td>
<td>74.4</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>63.0</td>
<td>86.4</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini</td>
<td>2.2B</td>
<td>-</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>1653.0</td>
<td>-</td>
<td>-</td>
<td>31.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
<td>9.6B</td>
<td>488</td>
<td>61.5</td>
<td>62.6</td>
<td>51.6</td>
<td>1860.0</td>
<td>61.8</td>
<td>56.3</td>
<td>37.0</td>
<td>33.8</td>
<td>67.7</td>
<td>49.3</td>
<td>56.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL-7B</td>
<td>7.3B</td>
<td>435</td>
<td>64.7*</td>
<td>47.0*</td>
<td>54.6</td>
<td>1765.4</td>
<td>73.8</td>
<td>71.4</td>
<td>38.3</td>
<td>36.8</td>
<td>77.8</td>
<td>54.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Yi-VL-34B</td>
<td>34B</td>
<td>290</td>
<td>43.4*</td>
<td>16.9*</td>
<td>52.2</td>
<td><strong>2050.2</strong></td>
<td>72.4</td>
<td>70.7</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>54.8</td>
<td>79.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM-Chat</td>
<td>17.4B</td>
<td>590</td>
<td>70.4</td>
<td>33.3*</td>
<td>54.2</td>
<td>1736.6</td>
<td>65.8</td>
<td>55.9</td>
<td>37.3</td>
<td>34.7</td>
<td>73.9</td>
<td>60.3</td>
<td>73.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">TextMonkey</td>
<td>9.7B</td>
<td>558</td>
<td>64.3</td>
<td>66.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Idefics2</td>
<td>8.0B</td>
<td>-</td>
<td>73.0</td>
<td>74.0</td>
<td>57.2</td>
<td>1847.6</td>
<td>75.7</td>
<td>68.6</td>
<td>45.2</td>
<td>52.2</td>
<td>49.1</td>
<td>60.7</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>54.3</td>
<td>1920.3</td>
<td>77.0</td>
<td>73.9</td>
<td>41.3</td>
<td>31.5</td>
<td>61.2</td>
<td>58.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>78.2</td>
<td>-</td>
<td>1971.5</td>
<td>-</td>
<td>-</td>
<td>41.7</td>
<td>37.5</td>
<td>80.1</td>
<td>60.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Phi-3-vision-128k-instruct</td>
<td>4.2B</td>
<td>639*</td>
<td>70.9</td>
<td>-</td>
<td>-</td>
<td>1537.5*</td>
<td>-</td>
<td>-</td>
<td>40.4</td>
<td>44.5</td>
<td>64.2*</td>
<td>58.8*</td>
<td>-</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 1.0</td>
<td>2.8B</td>
<td>366</td>
<td>60.6</td>
<td>38.2</td>
<td>47.5</td>
<td>1650.2</td>
<td>64.1</td>
<td>62.6</td>
<td>38.3</td>
<td>28.9</td>
<td>51.3</td>
<td>51.2</td>
<td>78.4</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.0</td>
<td>2.8B</td>
<td>605</td>
<td>74.1</td>
<td>71.9</td>
<td>54.5</td>
<td>1808.6</td>
<td>69.1</td>
<td>66.5</td>
<td>38.2</td>
<td>38.7</td>
<td>69.2</td>
<td>55.8</td>
<td>85.5</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td>
<td>8.5B</td>
<td><strong>725</strong></td>
<td><strong>76.6</strong></td>
<td><strong>84.8</strong></td>
<td><strong>65.1</strong></td>
<td>2024.6</td>
<td><strong>77.2</strong></td>
<td><strong>74.2</strong></td>
<td><strong>45.8</strong></td>
<td><strong>54.3</strong></td>
<td><strong>86.7</strong></td>
<td><strong>63.5</strong></td>
<td><strong>89.7</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoint by ourselves.
</details>
<div align="center">
<img src="../assets/llavabench_compare_3.png" width="100%" />
<br>
Evaluation results of multilingual LLaVA Bench
</div>
### Examples <!-- omit in toc -->
<table align="center" >
<p align="center" >
<img src="../assets/minicpmv-llama3-v2.5/cases_all.png" />
</p>
</table>
### Model Zoo
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) &nbsp;&nbsp;[<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
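
For quick reference, the following is a minimal inference sketch for the GPU checkpoint listed above. It mirrors the pattern of the official model cards and of the MiniCPM-V 2.6 examples later in this document; treat the exact `chat()` arguments (e.g., `sampling`, `temperature`) as assumptions to verify against the MiniCPM-Llama3-V 2.5 model card.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load MiniCPM-Llama3-V 2.5 in half precision on a single GPU (about 19 GB, per the table above).
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                  trust_remote_code=True, torch_dtype=torch.float16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')  # placeholder path to any local test image
msgs = [{'role': 'user', 'content': 'What is in the image?'}]

# For this generation of the model, the image is passed separately from the text messages.
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)
```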

294  docs/minicpm_v2.md  Normal file

@@ -0,0 +1,294 @@
## MiniCPM-V 2.0
> Archived at: 2025-01-13

**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.
- 🔥 **State-of-the-art Performance.**
MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc) among models under 7B parameters. It even **outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **comparable performance to Gemini Pro in scene-text understanding**, and **state-of-the-art performance on OCRBench** among open-source models.
- 🏆 **Trustworthy Behavior.**
LMMs are known for suffering from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench.
- 🌟 **High-Resolution Images at Any Aspect Ratio.**
MiniCPM-V 2.0 can accept **1.8 million pixels (e.g., 1344x1344) images at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf).
- ⚡️ **High Efficiency.**
MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler (see the illustrative sketch after this feature list). This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference even when dealing with high-resolution images**.
- 🙌 **Bilingual Support.**
MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
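
To make the perceiver resampler mentioned in the efficiency point above more concrete, here is a purely illustrative PyTorch sketch rather than the actual MiniCPM-V implementation: a small set of learnable queries cross-attends to the vision encoder's patch features and compresses them into a fixed number of visual tokens. The query count of 64 matches the `query_nums=64` default in this repository's fine-tuning code; the hidden size and head count below are placeholders.

```python
import torch
import torch.nn as nn

class ResamplerSketch(nn.Module):
    """Illustrative only: learnable queries cross-attend to patch features,
    returning a small, fixed number of visual tokens regardless of how many
    patches the input image produced."""
    def __init__(self, num_queries=64, dim=1152, num_heads=8):  # dims are placeholders
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats):                    # (batch, num_patches, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.norm(tokens)                       # (batch, num_queries, dim)

patch_feats = torch.randn(1, 1024, 1152)               # dummy features for a high-resolution image
print(ResamplerSketch()(patch_feats).shape)            # torch.Size([1, 64, 1152])
```
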
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/minicpmv-2-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>OCRBench</th>
<th>OpenCompass</th>
<th nowrap="nowrap" >MME</th>
<th>MMB dev(en)</th>
<th>MMB dev(zh)</th>
<th>MMMU val</th>
<th>MathVista</th>
<th>LLaVA Bench</th>
<th nowrap="nowrap">Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="12" align="left"><strong>Proprietary models</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro Vision</td>
<td>- </td>
<td>74.6</td>
<td>88.1</td>
<td>680</td>
<td>63.8</td>
<td>2148.9</td>
<td>75.2</td>
<td>74.0</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>- </td>
<td>78.0</td>
<td>88.4</td>
<td>645</td>
<td>63.2</td>
<td>1771.5</td>
<td>75.1</td>
<td>75.0</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>86.4 / 92.7</td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 6B~34B</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-6B</td>
<td align="right" >6.7B</td>
<td>45.5*</td>
<td>17.1*</td>
<td>290</td>
<td>49.3</td>
<td>1915.1 </td>
<td>68.6 </td>
<td>68.3 </td>
<td>40.3 </td>
<td>28.8 </td>
<td>51.9 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right" >9.6B</td>
<td>61.5</td>
<td>62.6</td>
<td>488 </td>
<td>52.1 </td>
<td>1860.0 </td>
<td>60.6 </td>
<td>56.7 </td>
<td>37.0 </td>
<td>33.8 </td>
<td>67.7 </td>
<td>56.2 / 80.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-34B</td>
<td align="right" >34B</td>
<td>43.4*</td>
<td>16.9*</td>
<td>290</td>
<td>52.6 </td>
<td>2050.2</td>
<td>71.1</td>
<td>71.4</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-7B</td>
<td align="right" >7.3B</td>
<td>64.7*</td>
<td>47.0* </td>
<td>435</td>
<td>55.6 </td>
<td>1765.4 </td>
<td>74.1 </td>
<td>72.8 </td>
<td>38.3 </td>
<td>36.8</td>
<td>77.8 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >TextMonkey</td>
<td align="right" >9.7B</td>
<td>64.3</td>
<td>66.7 </td>
<td>558</td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >CogVLM-Chat</td>
<td align="right" >17.4B</td>
<td>70.4</td>
<td>33.3*</td>
<td>590 </td>
<td>52.5 </td>
<td>1736.6 </td>
<td>63.7 </td>
<td>53.8 </td>
<td>37.3 </td>
<td>34.7 </td>
<td>73.9 </td>
<td>73.6 / 87.4 </td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 1B~3B </strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-1.3B</td>
<td align="right" >1.7B</td>
<td>58.4*</td>
<td>37.9*</td>
<td>413</td>
<td>46.0 </td>
<td>1531.6 </td>
<td>64.0 </td>
<td>61.2 </td>
<td>33.8 </td>
<td>29.4 </td>
<td>51.1 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MobileVLM V2</td>
<td align="right" >3.1B</td>
<td>57.5</td>
<td>19.4*</td>
<td>-</td>
<td>-</td>
<td>1440.5(P) </td>
<td>63.2 </td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Mini-Gemini</td>
<td align="right" >2.2B</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>-</td>
<td>1653.0 </td>
<td>59.8 </td>
<td>- </td>
<td>31.7 </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MiniCPM-V</td>
<td align="right" >2.8B </td>
<td>60.6</td>
<td>38.2 </td>
<td>366</td>
<td>47.6</td>
<td>1650.2 </td>
<td>67.9 </td>
<td>65.3 </td>
<td><strong>38.3</strong></td>
<td>28.9</td>
<td>51.3 </td>
<td>78.4 / 88.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><strong>MiniCPM-V 2.0</strong></td>
<td align="right" >2.8B </td>
<td><strong>74.1</strong></td>
<td><strong>71.9</strong> </td>
<td><strong>605</strong></td>
<td><strong>55.0</strong></td>
<td><strong>1808.6</strong> </td>
<td><strong>69.6</strong> </td>
<td><strong>68.1</strong> </td>
<td>38.2 </td>
<td><strong>38.7</strong></td>
<td><strong>69.2</strong> </td>
<td><strong>85.5 / 92.2 </strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoint by ourselves.
</details>
### Examples <!-- omit in toc -->
<table align="center">
<p align="center">
<img src="../assets/minicpmv2-cases_2.png" width=95%/>
</p>
</table>
We deploy MiniCPM-V 2.0 on end devices. The demo video is an unedited raw screen recording on a Xiaomi 14 Pro.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/station.gif" width=36%/>
<img src="../assets/gif_cases/london_car.gif" width=36%/>
</p>
</table>
### Model Zoo
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
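
As a minimal sketch (not an official snippet), both checkpoints in the table above load through Transformers with `trust_remote_code`; the chosen dtype determines whether the model fits the listed memory budget. The `chat()` interface differs slightly across MiniCPM-V generations, so follow the corresponding model card for the exact call signature.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# MiniCPM-V 2.0 in bf16 fits the roughly 8 GB GPU budget listed above
# (use torch.float16 on GPUs without bfloat16 support).
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)
# Image-text chat then follows this checkpoint's model card; the chat() signature varies by version.
```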

945  docs/minicpm_v2dot6.md  Normal file

@@ -0,0 +1,945 @@
## MiniCPM-V 2.6
> Archived at: 2025-01-13

**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
- 🔥 **Leading Performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.
- 💪 **Strong OCR Capability and Others.**
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/radar_final.png width=66% />
</div>
<details>
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>MME</th>
<th>MMVet</th>
<th>OCRBench</th>
<th>MMMU val</th>
<th>MathVista mini</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>HallusionBench</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="15" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o</td>
<td>-</td>
<td>1088</td>
<td>69.9</td>
<td>2328.7</td>
<td>69.1</td>
<td>736</td>
<td>69.2</td>
<td>61.3</td>
<td>82.2</td>
<td>84.6</td>
<td>-</td>
<td>92.8</td>
<td>55.0</td>
<td>17.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>1920.0</td>
<td>66.0</td>
<td>788</td>
<td>65.9</td>
<td>61.6</td>
<td>78.5</td>
<td>80.2</td>
<td>-</td>
<td>95.2</td>
<td>49.9</td>
<td>13.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>2110.6</td>
<td>64.0</td>
<td>754</td>
<td>60.6</td>
<td>57.7</td>
<td>73.9</td>
<td>79.1</td>
<td>73.5</td>
<td>86.5</td>
<td>45.6</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o mini</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>2003.4</td>
<td>66.9</td>
<td>785</td>
<td>60.0</td>
<td>52.4</td>
<td>76.0</td>
<td>77.8</td>
<td>-</td>
<td>-</td>
<td>46.1</td>
<td>12.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>1088</td>
<td>63.5</td>
<td>2070.2</td>
<td>67.5</td>
<td>656</td>
<td>61.7</td>
<td>54.7</td>
<td>79.8</td>
<td>78.6</td>
<td>78.0</td>
<td>87.2</td>
<td>43.9</td>
<td>14.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Step-1V</td>
<td>-</td>
<td>-</td>
<td>59.5</td>
<td>2206.4</td>
<td>63.3</td>
<td>625</td>
<td>49.9</td>
<td>44.8</td>
<td>78.0</td>
<td>79.2</td>
<td>71.6</td>
<td>-</td>
<td>48.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
<td>-</td>
<td>784</td>
<td>58.3</td>
<td>2281.7</td>
<td>61.8</td>
<td>684</td>
<td>52.0</td>
<td>43.4</td>
<td>74.6</td>
<td>75.7</td>
<td>79.5</td>
<td>93.1</td>
<td>41.2</td>
<td>13.4</td>
</tr>
<tr>
<td colspan="15" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
<td>34B</td>
<td>157</td>
<td>55.0</td>
<td>2006.5</td>
<td>50.7</td>
<td>574</td>
<td>48.8</td>
<td>40.4</td>
<td>77.8</td>
<td>78.9</td>
<td>69.3</td>
<td>-</td>
<td>34.8</td>
<td>12.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
<td>34B</td>
<td>157</td>
<td>-</td>
<td>2141.0</td>
<td>59.3</td>
<td>518</td>
<td>48.0</td>
<td>43.3</td>
<td>-</td>
<td>80.5</td>
<td>74.1</td>
<td>78.9</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td>1820</td>
<td>58.3</td>
<td>2049.9</td>
<td>53.2</td>
<td>591</td>
<td>50.4</td>
<td>50.3</td>
<td>77.8</td>
<td>79.5</td>
<td>76.7</td>
<td>75.5</td>
<td>41.6</td>
<td>14.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>2018.8</td>
<td>58.0</td>
<td>776</td>
<td>46.9</td>
<td>51.1</td>
<td>67.9</td>
<td>71.2</td>
<td>-</td>
<td>-</td>
<td>45.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>706</td>
<td>64.1</td>
<td>2215.1</td>
<td>54.3</td>
<td>794</td>
<td><strong>51.2</strong></td>
<td>58.3</td>
<td><strong>79.4</strong></td>
<td><strong>83.6</strong></td>
<td>77.4</td>
<td><strong>91.6</strong></td>
<td>45.0</td>
<td>21.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
<td>8B</td>
<td>1882</td>
<td>58.8</td>
<td>2024.6</td>
<td>52.8</td>
<td>725</td>
<td>45.8</td>
<td>54.3</td>
<td>72.0</td>
<td>78.4</td>
<td>76.6</td>
<td>84.8</td>
<td>42.4</td>
<td>10.3</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>65.2</strong></td>
<td><strong>2348.4</strong>*</td>
<td><strong>60.0</strong></td>
<td><strong>852</strong>*</td>
<td>49.8*</td>
<td><strong>60.6</strong></td>
<td>78.0</td>
<td>82.1</td>
<td><strong>80.1</strong></td>
<td>90.8</td>
<td><strong>48.1</strong>*</td>
<td><strong>8.2</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
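
As a quick sanity check of this definition, using only numbers quoted in this document (MiniCPM-V 2.6 encodes a 1.8M-pixel image, e.g. 1344x1344, into 640 visual tokens):

```python
# Token density = pixels at maximum resolution / number of visual tokens
pixels = 1344 * 1344                   # about 1.8 million pixels
visual_tokens = 640                    # visual tokens MiniCPM-V 2.6 produces for such an image
print(round(pixels / visual_tokens))   # 2822, matching the Token Density column above
```
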
</details>
<details>
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Mantis Eval</th>
<th>BLINK val</th>
<th>Mathverse mv</th>
<th>Sciverse mv</th>
<th>MIRB</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>60.3</td>
<td>66.9</td>
<td>53.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
<td>14B</td>
<td>66.4</td>
<td>52.6</td>
<td>32.7</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td colspan="7" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Emu2-Chat</td>
<td>37B</td>
<td>37.8</td>
<td>36.2</td>
<td>-</td>
<td>27.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM</td>
<td>17B</td>
<td>45.2</td>
<td>41.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VPG-C</td>
<td>7B</td>
<td>52.4</td>
<td>43.1</td>
<td>24.3</td>
<td>23.1</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA 8B</td>
<td>8B</td>
<td>51.2</td>
<td>39.3</td>
<td>-</td>
<td>36.5</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>53.1*</td>
<td>48.9</td>
<td>32.1*</td>
<td>-</td>
<td>42.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>59.0*</td>
<td>50.9</td>
<td>30.5*</td>
<td>34.4*</td>
<td><strong>56.9*</strong></td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>69.1</strong></td>
<td><strong>53.0</strong></td>
<td><strong>84.9</strong></td>
<td><strong>74.9</strong></td>
<td>53.8</td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoint by ourselves.
</details>
<details>
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th colspan="2">Video-MME</th>
<th colspan="5">Video-ChatGPT</th>
</tr>
<tr>
<th align="left"></th>
<th></th>
<th>w/o subs</th>
<th>w subs</th>
<th>Correctness</th>
<th>Detail</th>
<th>Context</th>
<th>Temporal</th>
<th>Consistency</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="9" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>60.0</td>
<td>62.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>59.9</td>
<td>63.3</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
<td>7B</td>
<td>-</td>
<td>-</td>
<td>3.39</td>
<td>3.29</td>
<td>3.92</td>
<td>2.60</td>
<td>3.12</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
<td>34B</td>
<td>-</td>
<td>-</td>
<td>3.29</td>
<td>3.23</td>
<td>3.83</td>
<td>2.51</td>
<td>3.47</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
<td>12B</td>
<td>-</td>
<td>-</td>
<td>3.49</td>
<td><strong>3.46</strong></td>
<td>3.23</td>
<td><strong>2.98</strong></td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>52.4</td>
<td>54.3</td>
<td>3.05</td>
<td>3.09</td>
<td>3.77</td>
<td>2.44</td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>54.0</td>
<td>56.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>55.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
<td>32B</td>
<td>60.2</td>
<td>63.0</td>
<td>3.48</td>
<td>3.37</td>
<td><strong>3.95</strong></td>
<td>2.64</td>
<td>3.28</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>60.9</strong></td>
<td><strong>63.6</strong></td>
<td><strong>3.59</strong></td>
<td>3.28</td>
<td>3.93</td>
<td>2.73</td>
<td>3.62</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Shot</th>
<th>TextVQA val</th>
<th>VizWiz test-dev</th>
<th>VQAv2 test-dev</th>
<th>OK-VQA val</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>35.0</td>
<td>31.6</td>
<td>56.3</td>
<td>40.6</td>
</tr>
<tr>
<td>4</td>
<td>36.5</td>
<td>39.6</td>
<td>63.1</td>
<td><strong>57.4</strong></td>
</tr>
<tr>
<td>8</td>
<td>37.3</td>
<td>44.8</td>
<td>65.6</td>
<td>57.5</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>30.9</td>
<td>36.0</td>
<td>60.0</td>
<td>45.2</td>
</tr>
<tr>
<td>4</td>
<td>34.3</td>
<td>40.4</td>
<td>63.6</td>
<td>52.4</td>
</tr>
<tr>
<td>8</td>
<td>35.7</td>
<td>46.1</td>
<td>64.8</td>
<td>55.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
<td rowspan="3">7B</td>
<td>0*</td>
<td>43.0</td>
<td>49.8</td>
<td>63.2</td>
<td>45.5</td>
</tr>
<tr>
<td>4</td>
<td>45.4</td>
<td>51.3</td>
<td>64.5</td>
<td>46.5</td>
</tr>
<tr>
<td>8</td>
<td>45.6</td>
<td>52.2</td>
<td>64.7</td>
<td>46.6</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
<td rowspan="3">37B</td>
<td>0</td>
<td>26.4</td>
<td>40.4</td>
<td>33.5</td>
<td>26.7</td>
</tr>
<tr>
<td>4</td>
<td>48.2</td>
<td>54.6</td>
<td>67.0</td>
<td>53.2</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td>67.8</td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
<td rowspan="2">30B</td>
<td>0</td>
<td>26.2</td>
<td>40.4</td>
<td>48.9</td>
<td>26.7</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td><strong>70.9</strong></td>
<td>54.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
<td rowspan="3">8B</td>
<td>0</td>
<td>43.9</td>
<td>33.8</td>
<td>45.4</td>
<td>23.9</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>4</td>
<td>63.6</td>
<td>60.5</td>
<td>65.5</td>
<td>50.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>8</td>
<td><strong>64.6</strong></td>
<td><strong>63.4</strong></td>
<td>68.2</td>
<td>51.4</td>
</tr>
</tbody>
</table>
</div>
* denotes zero image shot and two additional text shots following Flamingo.
<sup>+</sup> We evaluate the pretraining ckpt without SFT.
</details>
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more cases.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
</div>
</details>
We deploy MiniCPM-V 2.6 on end devices. The demo video is an unedited raw screen recording on an iPad Pro.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ai.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/beer.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ticket.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/wfh.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
</p>
</table>
### Multi-turn Conversation
<div align="center">
<img src="../assets/airplane.jpeg" width="500px">
</div>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
torch.manual_seed(0)
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
image = Image.open('./assets/airplane.jpeg').convert('RGB')
# First round chat
question = "Tell me the model of this aircraft."
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
You could get the following output:
```
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
```
#### Multi-image Understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
</details>
#### Few-shot In-Context-Learning
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 few-shot in-context-learning example </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
{'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
</details>
#### Video understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # stride of one average-FPS worth of frames, i.e. sample about 1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_path="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
{'role': 'user', 'content': frames + [question]},
]
# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # 如果cuda OOM且视频分辨率大于448*448可设为1
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer,
**params
)
print(answer)
```
</details>


@@ -1,6 +1,6 @@
  ## OmniLMM-12B
- > OmniLMM-12B was released at an early stage of this project. We recommend you use our [recently released models](./README_en.md) for better performance and efficiency.
+ > OmniLMM-12B was released at an early stage of this project. We recommend you use our [recently released models](./README.md) for better performance and efficiency.
  > Archived at: 2024-05-19


@@ -7,7 +7,6 @@ import re
import random import random
from dataclasses import dataclass, field from dataclasses import dataclass, field
from typing import Dict, List, Optional from typing import Dict, List, Optional
from decord import VideoReader, cpu # pip install decord
import numpy as np import numpy as np
import torch import torch
@@ -21,26 +20,6 @@ logger = logging.getLogger(__name__)
llama3_chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}" llama3_chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}"
MAX_NUM_FRAMES=64
def encode_video(video_path, max_num_frames=64):
max_num_frames = min(max_num_frames, MAX_NUM_FRAMES)
def uniform_sample(l, n):
gap = len(l) / n
idxs = [int(i * gap + gap / 2) for i in range(n)]
return [l[i] for i in idxs]
vr = VideoReader(video_path, ctx=cpu(0))
sample_fps = round(vr.get_avg_fps() / 1) # FPS
frame_idx = [i for i in range(0, len(vr), sample_fps)]
if len(frame_idx) > max_num_frames:
if max_num_frames==1:
frame_idx = [frame_idx[len(frame_idx)//2]]
else:
frame_idx = uniform_sample(frame_idx, max_num_frames)
frames = vr.get_batch(frame_idx).asnumpy()
frames = [Image.fromarray(v.astype('uint8')) for v in frames]
return frames
class SupervisedDataset(Dataset): class SupervisedDataset(Dataset):
"""Dataset for supervised fine-tuning.""" """Dataset for supervised fine-tuning."""
@@ -55,8 +34,6 @@ class SupervisedDataset(Dataset):
query_nums=64, query_nums=64,
batch_vision=False, batch_vision=False,
max_length=2048, max_length=2048,
video_max_slice_nums=2,
max_num_frames=1,
): ):
super(SupervisedDataset, self).__init__() super(SupervisedDataset, self).__init__()
self.raw_data = raw_data self.raw_data = raw_data
@@ -68,58 +45,17 @@ class SupervisedDataset(Dataset):
self.query_nums=query_nums self.query_nums=query_nums
self.batch_vision = batch_vision self.batch_vision = batch_vision
self.max_length = max_length self.max_length = max_length
# video config
self.video_slice_config = copy.deepcopy(slice_config)
self.video_slice_config['max_slice_nums'] = video_max_slice_nums
self.max_num_frames = max_num_frames
def __len__(self): def __len__(self):
return len(self.raw_data) return len(self.raw_data)
def __getitem__(self, i) -> Dict[str, torch.Tensor]: def __getitem__(self, i) -> Dict[str, torch.Tensor]:
try: try:
# default: sft image if isinstance(self.raw_data[i]["image"], str):
use_image_id = True images_dict = { "<image>" : Image.open(self.raw_data[i]["image"]).convert("RGB") }
slice_config = self.slice_config elif isinstance(self.raw_data[i]["image"], Dict):
if "image" in self.raw_data[i]: ### for multi-images input, the template for every image is <image_xx>, such as <image_00>, <image_01>
if isinstance(self.raw_data[i]["image"], str): images_dict = {img_name : Image.open(img_path).convert("RGB") for img_name, img_path in self.raw_data[i]["image"].items()}
images_dict = { "<image>" : Image.open(self.raw_data[i]["image"]).convert("RGB") }
elif isinstance(self.raw_data[i]["image"], Dict):
### for multi-images input, the template for every image is <image_xx>, such as <image_00>, <image_01>
images_dict = {img_name : Image.open(img_path).convert("RGB") for img_name, img_path in self.raw_data[i]["image"].items()}
elif "video" in self.raw_data[i]:
if isinstance(self.raw_data[i]["video"], str):
frames = encode_video(self.raw_data[i]["video"], max_num_frames=self.max_num_frames)
image_names = []
images_dict = {}
for j, frame in enumerate(frames):
image_name = "<image_{:02d}>".format(j)
images_dict[image_name] = frame
image_names.append(image_name)
for j in range(len(self.raw_data[i]["conversations"])):
content = self.raw_data[i]["conversations"][j]['content']
self.raw_data[i]["conversations"][j]['content'] = content.replace("<video>", "".join(image_names))
elif isinstance(self.raw_data[i]["video"], Dict):
videos = self.raw_data[i]["video"]
images_dict = {}
video_names = {}
cnt = 0
for video_name in videos:
video_id = video_name.split("_")[-1].strip(">")
video = videos[video_name]
frames = encode_video(video, max_num_frames=self.max_num_frames)
image_names = []
for j, frame in enumerate(frames):
image_name = "<image_{:02d}>".format(cnt)
cnt += 1
images_dict[image_name] = frame
image_names.append(image_name)
for j in range(len(self.raw_data[i]["conversations"])):
content = self.raw_data[i]["conversations"][j]['content']
self.raw_data[i]["conversations"][j]['content'] = content.replace(video_name, "".join(image_names))
# video: modify config
slice_config = self.video_slice_config
use_image_id = False
ret = preprocess( ret = preprocess(
images_dict, images_dict,
@@ -131,8 +67,7 @@ class SupervisedDataset(Dataset):
llm_type=self.llm_type, llm_type=self.llm_type,
patch_size=self.patch_size, patch_size=self.patch_size,
batch_vision=self.batch_vision, batch_vision=self.batch_vision,
max_length=self.max_length, max_length=self.max_length
use_image_id=use_image_id
) )
ret = dict( ret = dict(
input_ids=ret["input_ids"], input_ids=ret["input_ids"],
@@ -197,7 +132,7 @@ def conversation_to_ids(conversation, tokenizer, llm_type=None, new_schema=False
input_ids, context, raw_msg = conversation_to_ids_llama3( input_ids, context, raw_msg = conversation_to_ids_llama3(
conversation, tokenizer conversation, tokenizer
) )
elif llm_type == "qwen2": elif llm_type == "qwen":
input_ids, context, raw_msg = conversation_to_ids_qwen2( input_ids, context, raw_msg = conversation_to_ids_qwen2(
conversation, tokenizer conversation, tokenizer
) )
@@ -383,7 +318,6 @@ def preprocess(
patch_size=14, patch_size=14,
batch_vision=False, batch_vision=False,
max_length=2048, max_length=2048,
use_image_id=True
): ):
""" """
single(multi) image(s) preprocess, the image(s) will be placed at the top of the conversation single(multi) image(s) preprocess, the image(s) will be placed at the top of the conversation
@@ -402,9 +336,9 @@ def preprocess(
) )
new_schema = False new_schema = False
use_image_id = False use_image_id = False
if llm_type=='qwen2': if llm_type=='qwen':
new_schema = True new_schema = True
use_image_id = use_image_id use_image_id = True
image_placeholder_dict = {} image_placeholder_dict = {}
images = [] images = []
image_id_cnt = 0 image_id_cnt = 0


@@ -14,7 +14,7 @@ from accelerate.utils import DistributedType
from deepspeed import zero from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
from transformers import AutoModel, AutoTokenizer, AutoProcessor from transformers import AutoModel, AutoTokenizer
from transformers.integrations import deepspeed from transformers.integrations import deepspeed
from transformers import AutoModel, AutoTokenizer from transformers import AutoModel, AutoTokenizer
@@ -53,8 +53,6 @@ class TrainingArguments(transformers.TrainingArguments):
llm_type: str = field(default="minicpm") llm_type: str = field(default="minicpm")
use_lora: Optional[bool] = field(default=False) use_lora: Optional[bool] = field(default=False)
max_slice_nums: Optional[int] = field(default=9) max_slice_nums: Optional[int] = field(default=9)
video_max_slice_nums: Optional[int] = field(default=2)
max_num_frames: Optional[int] = field(default=1)
@dataclass @dataclass
@@ -94,8 +92,6 @@ def make_supervised_data_module(
query_nums=64, query_nums=64,
batch_vision=False, batch_vision=False,
max_length=2048, max_length=2048,
video_max_slice_nums=2,
max_num_frames=1,
) -> Dict: ) -> Dict:
"""Make dataset and collator for supervised fine-tuning.""" """Make dataset and collator for supervised fine-tuning."""
dataset_cls = SupervisedDataset dataset_cls = SupervisedDataset
@@ -113,8 +109,6 @@ def make_supervised_data_module(
query_nums=query_nums, query_nums=query_nums,
batch_vision=batch_vision, batch_vision=batch_vision,
max_length=max_length, max_length=max_length,
video_max_slice_nums=video_max_slice_nums,
max_num_frames=max_num_frames,
) )
if data_args.eval_data_path: if data_args.eval_data_path:
@@ -129,8 +123,6 @@ def make_supervised_data_module(
query_nums=query_nums, query_nums=query_nums,
batch_vision=batch_vision, batch_vision=batch_vision,
max_length=max_length, max_length=max_length,
video_max_slice_nums=video_max_slice_nums,
max_num_frames=max_num_frames,
) )
else: else:
eval_dataset = None eval_dataset = None
@@ -210,10 +202,10 @@ def train():
trust_remote_code=True, trust_remote_code=True,
torch_dtype=compute_dtype, torch_dtype=compute_dtype,
device_map=device_map, device_map=device_map,
init_vision=True,
init_audio=False,
init_tts=False,
) )
model.__class__.register_for_auto_class()
model.processor = AutoProcessor.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained( tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, trust_remote_code=True model_args.model_name_or_path, trust_remote_code=True
@@ -287,8 +279,6 @@ def train():
query_nums=model.config.query_num, query_nums=model.config.query_num,
batch_vision=batch_vision, batch_vision=batch_vision,
max_length=training_args.model_max_length, max_length=training_args.model_max_length,
video_max_slice_nums=training_args.video_max_slice_nums,
max_num_frames=training_args.max_num_frames,
) )
training_args.gradient_checkpointing_kwargs={"use_reentrant":False} training_args.gradient_checkpointing_kwargs={"use_reentrant":False}


@@ -5,14 +5,17 @@ NNODES=1
NODE_RANK=0 NODE_RANK=0
MASTER_ADDR=localhost MASTER_ADDR=localhost
MASTER_PORT=6001 MASTER_PORT=6001
MODEL="openbmb/MiniCPM-V-2_6" MODEL="openbmb/MiniCPM-o-2_6"
# or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5 # or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2_6
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information. # See the section for finetuning in README for more information.
DATA="path/to/trainging_data" DATA="path/to/trainging_data"
EVAL_DATA="path/to/test_data" EVAL_DATA="path/to/test_data"
LLM_TYPE="qwen2" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3"
# if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3",
# if use openbmb/MiniCPM-o-2_6 or openbmb/MiniCPM-V-2_6, please set LLM_TYPE=qwen
LLM_TYPE="qwen"
MODEL_MAX_Length=2048 # if conduct multi-images sft, please set MODEL_MAX_Length=4096 MODEL_MAX_Length=2048 # if conduct multi-images sft, please set MODEL_MAX_Length=4096
@@ -38,7 +41,7 @@ torchrun $DISTRIBUTED_ARGS finetune.py \
--do_train \ --do_train \
--do_eval \ --do_eval \
--tune_vision true \ --tune_vision true \
--tune_llm true \ --tune_llm false \
--model_max_length $MODEL_MAX_Length \ --model_max_length $MODEL_MAX_Length \
--max_slice_nums 9 \ --max_slice_nums 9 \
--max_steps 10000 \ --max_steps 10000 \
@@ -60,5 +63,5 @@ torchrun $DISTRIBUTED_ARGS finetune.py \
--lr_scheduler_type "cosine" \ --lr_scheduler_type "cosine" \
--logging_steps 1 \ --logging_steps 1 \
--gradient_checkpointing true \ --gradient_checkpointing true \
--deepspeed ds_config_zero2.json \ --deepspeed ds_config_zero3.json \
--report_to "tensorboard" --report_to "tensorboard"


@@ -5,16 +5,16 @@ NNODES=1
NODE_RANK=0 NODE_RANK=0
MASTER_ADDR=localhost MASTER_ADDR=localhost
MASTER_PORT=6001 MASTER_PORT=6001
MODEL="openbmb/MiniCPM-V-2_6" # or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5 MODEL="openbmb/MiniCPM-o-2_6"
# or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2_6
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations. # ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information. # See the section for finetuning in README for more information.
DATA="path/to/trainging_data" DATA="path/to/trainging_data"
EVAL_DATA="path/to/test_data" EVAL_DATA="path/to/test_data"
LLM_TYPE="qwen2" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3",
# if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm # if use openbmb/MiniCPM-o-2_6 or openbmb/MiniCPM-V-2_6, please set LLM_TYPE=qwen
#if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE=llama3 LLM_TYPE="qwen"
MODEL_MAX_Length=2048 # if conduct multi-images sft, please set MODEL_MAX_Length=4096 MODEL_MAX_Length=2048 # if conduct multi-images sft, please set MODEL_MAX_Length=4096
DISTRIBUTED_ARGS=" DISTRIBUTED_ARGS="
@@ -24,6 +24,7 @@ DISTRIBUTED_ARGS="
--master_addr $MASTER_ADDR \ --master_addr $MASTER_ADDR \
--master_port $MASTER_PORT --master_port $MASTER_PORT
" "
torchrun $DISTRIBUTED_ARGS finetune.py \ torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path $MODEL \ --model_name_or_path $MODEL \
--llm_type $LLM_TYPE \ --llm_type $LLM_TYPE \


@@ -1,7 +1,7 @@
# MiniCPM-V Finetuning
We offer the official scripts for easy finetuning of the pretrained **MiniCPM-o-2_6**, **MiniCPM-V-2_6**, **MiniCPM-Llama3-V 2.5** and **MiniCPM-V 2.0** on downstream tasks. Our finetune scripts use transformers Trainer and DeepSpeed by default.
### Data preparation
@@ -20,30 +20,30 @@ If your input consists of a single image, you can use a single placeholder **\<i
[
    {
        "id": "0",
        "image": "path/to/image_0.jpg",
        "conversations": [
            {
                "role": "user",
                "content": "<image>\nHow many desserts are on the white plate?"
            },
            {
                "role": "assistant",
                "content": "There are three desserts on the white plate."
            },
            {
                "role": "user",
                "content": "What type of desserts are they?"
            },
            {
                "role": "assistant",
                "content": "The desserts are cakes with bananas and pecans on top. They share similarities with donuts, but the presence of bananas and pecans differentiates them."
            },
            {
                "role": "user",
                "content": "What is the setting of the image?"
            },
            {
                "role": "assistant",
                "content": "The image is set on a table top with a plate containing the three desserts."
            }
        ]
    },
@@ -91,81 +91,16 @@ If the total token count exceeds `max_length`, truncation will be applied. For m
```
</details>
#### Single Video Example
If your input consists of a single video, you can use a single placeholder **\<video\>** to indicate where the video should be inserted in the conversation.
<details>
<summary>
<b>Single video example (vl_finetune_video.json) with 1 sample.</b>
</summary>
```
[
{
"id": "0",
"video": "path/to/video_0.mp4",
"conversations": [
{
"role": "user",
"content": "<video>\nHow many desserts are on the white plate?"
},
{
"role": "assistant",
"content": "There are three desserts on the white plate."
}
]
}
]
```
</details>
#### Multiple Videos Example
For inputs containing multiple videos, utilize a dictionary where each key represents a unique placeholder (e.g., **\<video_00\>**, **\<video_01\>**) with the corresponding video path as its value. These placeholders can then be used within the conversation to seamlessly insert videos at specific positions.
Additionally, to optimize resource management, especially when dealing with large batches of videos during training or inference, consider reducing `video_max_slice_nums` and `max_num_frames`. To minimize the number of tokens used per video, you can set `video_max_slice_nums=1` and `max_num_frames=1`, resulting in a single video being represented by 64 tokens.
If the total token count exceeds `max_length`, truncation will be applied. For multi-video supervised fine-tuning (SFT), it's recommended to set `MODEL_MAX_LENGTH=4096` in your script for better performance.
<details>
<summary>
<b>Multiple videos example (vl_finetune_data.json) with 1 sample.</b>
</summary>
```
[
{
"id": "0",
"video": {
"<video_00>": "path/to/video_0.mp4",
"<video_01>": "path/to/video_1.avi",
"<video_02>": "path/to/video_2.mp4",
"<video_03>": "path/to/video_3.avi"
},
"conversations": [
{
"role": "user",
"content": "How to create such text-only videos using CapCut?\n<video_00>\n<image_01>\n<video_01>\n<video_02>\n"
},
{
"role": "assistant",
"content": "To create a text-only video as shown in the videos, follow these steps in CapCut..."
}
]
}
]
```
</details>
### Full-parameter finetuning
Full-parameter finetuning requires updating all parameters of the LLM in the whole training process. Please specify the correct MODEL path, DATA path and LLM_TYPE in the shell scripts.
```shell
MODEL="openbmb/MiniCPM-o-2_6" # or openbmb/MiniCPM-V-2_6, openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
DATA="path/to/training_data" # json file
EVAL_DATA="path/to/test_data" # json file
LLM_TYPE="qwen" # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm, if use openbmb/MiniCPM-Llama3-V-2_5, please set LLM_TYPE="llama3",
# if use openbmb/MiniCPM-o-2_6 or openbmb/MiniCPM-V-2_6, please set LLM_TYPE=qwen
```
To launch your training, run the following script:
@@ -188,7 +123,7 @@ After training, you could load the model with the path to the adapter. We advise
```
from peft import PeftModel
from transformers import AutoModel
model_type = "openbmb/MiniCPM-o-2_6" # or openbmb/MiniCPM-V-2_6, openbmb/MiniCPM-Llama3-V-2_5, openbmb/MiniCPM-V-2
path_to_adapter="path_to_your_fine_tuned_checkpoint"
model = AutoModel.from_pretrained(
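    # A minimal, hedged sketch of how this truncated call is typically completed and how the
    # LoRA adapter is then attached (illustrative only; the exact arguments in the repository
    # may differ):
    model_type,
    trust_remote_code=True,
)
lora_model = PeftModel.from_pretrained(
    model,
    path_to_adapter,
    device_map="auto",
    trust_remote_code=True,
).eval()
```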
finetune/requirements.txt Normal file
@@ -0,0 +1,44 @@
packaging==23.2
addict==2.4.0
editdistance==0.6.2
einops==0.7.0
fairscale==0.4.0
jsonlines==4.0.0
markdown2==2.4.10
matplotlib==3.7.4
more_itertools==10.1.0
nltk==3.8.1
numpy==1.24.4
opencv_python_headless==4.5.5.64
openpyxl==3.1.2
Pillow==10.1.0
sacrebleu==2.3.2
seaborn==0.13.0
shortuuid==1.0.11
spacy==3.7.2
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
timm==0.9.10
tqdm==4.66.1
protobuf==4.25.0
typing_extensions==4.8.0
uvicorn==0.24.0.post1
#xformers==0.0.22.post7
#flash_attn==2.3.4
sentencepiece==0.1.99
accelerate==0.30.1
socksio==1.0.0
gradio==4.41.0
gradio_client
http://thunlp.oss-cn-qingdao.aliyuncs.com/multi_modal/never_delete/modelscope_studio-0.4.0.9-py3-none-any.whl
decord
aiosignal
tensorboard
deepspeed==0.12.3
transformers==4.44.2
librosa==0.9.0
soundfile==0.12.1
vector-quantize-pytorch==1.18.5
vocos==0.1.0
moviepy
@@ -170,7 +170,7 @@ class CPMTrainer(Trainer):
return (loss, logits, labels)
def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
"""
Perform a training step on a batch of inputs.
@@ -245,9 +245,6 @@ class CPMTrainer(Trainer):
if self.tokenizer is not None:
self.tokenizer.save_pretrained(output_dir)
if self.model.processor is not None:
self.model.processor.save_pretrained(output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
requirements_o2.6.txt Normal file
@@ -0,0 +1,18 @@
Pillow==10.1.0
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
transformers==4.44.2
sentencepiece==0.2.0
vector-quantize-pytorch==1.18.5
vocos==0.1.0
accelerate==1.2.1
timm==0.9.10
soundfile==0.12.1
librosa==0.9.0
decord
moviepy
# for web
fastapi
uvicorn
@@ -0,0 +1,935 @@
import base64
import json
import asyncio
import numpy as np
import os, sys, io
import threading
import time
import aiofiles
import librosa
import soundfile
import wave
from typing import Dict, List, Any, Optional
import argparse
import logging
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoProcessor
import uvicorn
from fastapi import FastAPI, Header, Query, Request, HTTPException, WebSocket, WebSocketDisconnect
from fastapi.responses import JSONResponse, StreamingResponse
cur_path = os.path.split(os.path.realpath(__file__))[0]
sys.path.append(os.path.abspath(cur_path))
import vad_utils
def setup_logger():
logger = logging.getLogger("api_logger")
logger.setLevel(logging.DEBUG)
# Create formatter
formatter = logging.Formatter(
'%(asctime)s.%(msecs)03d-%(levelname)s-[%(filename)s:%(lineno)d] - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
# Create handlers for stdout and stderr
stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.setLevel(logging.INFO) # INFO and DEBUG go to stdout
stdout_handler.setFormatter(formatter)
stdout_handler.addFilter(lambda record: record.levelno <= logging.INFO)
stderr_handler = logging.StreamHandler(sys.stderr)
stderr_handler.setLevel(logging.WARNING) # WARNING, ERROR, CRITICAL go to stderr
stderr_handler.setFormatter(formatter)
# Add handlers to logger
logger.addHandler(stdout_handler)
logger.addHandler(stderr_handler)
return logger
app = FastAPI()
logger = setup_logger()
ap = argparse.ArgumentParser()
ap.add_argument('--port', type=int , default=8088)
args = ap.parse_args()
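# StreamManager (defined below) holds the per-session state of the demo: it loads the
# MiniCPM-o model once, buffers the incoming base64 audio/image chunks, feeds them to the
# model via streaming_prefill, uses Silero VAD (vad_utils) to decide when the user has
# finished speaking, and then streams the generated audio/text back through the FastAPI
# routes at the bottom of this file.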
class StreamManager:
def __init__(self):
self.uid = None
self.is_streaming_complete = threading.Event()
self.conversation_started = threading.Event()
self.last_request_time = None
self.last_stream_time = None
self.timeout = 900 # seconds timeout
self.stream_timeout = 3 # seconds no stream
self.num_stream = 0
self.stream_started = False
self.stop_response = False
# VAD settings
self.vad_options = vad_utils.VadOptions()
self.vad_sequence_length = 5
self.vad_sequence = []
self.audio_prefill = []
self.audio_input = []
self.image_prefill = None
self.audio_chunk = 200
# customized options
self.customized_audio = None
self.customized_options = None
# Omni model
self.target_dtype = torch.bfloat16
self.device='cuda:0'
self.minicpmo_model_path = "openbmb/MiniCPM-o-2_6"
self.model_version = "2.6"
with torch.no_grad():
self.minicpmo_model = AutoModel.from_pretrained(self.minicpmo_model_path, trust_remote_code=True, torch_dtype=self.target_dtype, attn_implementation='sdpa')
self.minicpmo_tokenizer = AutoTokenizer.from_pretrained(self.minicpmo_model_path, trust_remote_code=True)
self.minicpmo_model.init_tts()
# self.minicpmo_model.tts.float()
self.minicpmo_model.to(self.device).eval()
self.ref_path_video_default = "assets/ref_audios/video_default.wav"
self.ref_path_default = "assets/ref_audios/default.wav"
self.ref_path_female = "assets/ref_audios/female_example.wav"
self.ref_path_male = "assets/ref_audios/male_example.wav"
self.input_audio_id = 0
self.input_audio_vad_id = 0
self.input_image_id = 0
self.output_audio_id = 0
self.flag_decode = False
self.cnts = None
self.all_start_time = time.time()
self.session_id = 233
self.sys_prompt_flag = False
self.vad_time = 0
self.ls_time = 0
self.msg_type = 1
self.speaking_time_stamp = 0
self.cycle_wait_time = 12800/24000 + 0.15
self.extra_wait_time = 2.5
self.server_wait = True
self.past_session_id = 0
self.sys_prompt_init(0)
self.session_id += 1
def start_conversation(self):
logger.info(f"uid {self.uid}: new conversation started.")
self.conversation_started.set()
self.stop_response = False
def update_last_request_time(self):
self.last_request_time = time.time()
#logger.info(f"update last_request_time {self.last_request_time}")
def update_last_stream_time(self):
self.last_stream_time = time.time()
#logger.info(f"update last_stream_time {self.last_stream_time}")
def move_to_device(self, obj, device):
if isinstance(obj, torch.Tensor):
obj_ = obj.to(device)
if (obj_.dtype == torch.float) or (obj_.dtype == torch.half):
# cast to `torch.bfloat16`
obj_ = obj_.to(self.target_dtype)
return obj_
elif isinstance(obj, dict):
return {key: self.move_to_device(value, device) for key, value in obj.items()}
elif isinstance(obj, list):
return [self.move_to_device(item, device) for item in obj]
elif isinstance(obj, tuple):
return tuple(self.move_to_device(item, device) for item in obj)
elif isinstance(obj, set):
return {self.move_to_device(item, device) for item in obj}
else:
return obj
def reset(self):
logger.info("reset")
self.is_streaming_complete.clear()
self.conversation_started.clear()
self.last_request_time = None
self.last_stream_time = None
self.audio_buffer_raw = bytearray()
self.num_stream = 0
self.stream_started = False
self.stop_response = False
# self.customized_audio = None
# self.customized_options = None
# clear model
self.clear()
def merge_wav_files(self, input_bytes_list, output_file):
with wave.open(io.BytesIO(input_bytes_list[0]), 'rb') as wav:
params = wav.getparams()
n_channels, sampwidth, framerate, n_frames, comptype, compname = params
with wave.open(output_file, 'wb') as output_wav:
output_wav.setnchannels(n_channels)
output_wav.setsampwidth(sampwidth)
output_wav.setframerate(framerate)
output_wav.setcomptype(comptype, compname)
for wav_bytes in input_bytes_list:
with wave.open(io.BytesIO(wav_bytes), 'rb') as wav:
output_wav.writeframes(wav.readframes(wav.getnframes()))
def is_timed_out(self):
if self.last_request_time is not None:
return time.time() - self.last_request_time > self.timeout
return False
def no_active_stream(self):
if self.last_stream_time is not None and self.stream_started:
no_stream_duration = time.time() - self.last_stream_time
if no_stream_duration > self.stream_timeout:
#logger.info(f"no active stream for {no_stream_duration} secs.")
return True
return False
def sys_prompt_init(self, msg_type):
if self.past_session_id == self.session_id:
return
logger.info("### sys_prompt_init ###")
logger.info(f'msg_type is {msg_type}')
if msg_type <= 1: #audio
audio_voice_clone_prompt = "克隆音频提示中的音色以生成语音。"
audio_assistant_prompt = "Your task is to be a helpful assistant using this voice pattern."
ref_path = self.ref_path_default
if self.customized_options is not None:
audio_voice_clone_prompt = self.customized_options['voice_clone_prompt']
audio_assistant_prompt = self.customized_options['assistant_prompt']
if self.customized_options['use_audio_prompt'] == 1:
ref_path = self.ref_path_default
elif self.customized_options['use_audio_prompt'] == 2:
ref_path = self.ref_path_female
elif self.customized_options['use_audio_prompt'] == 3:
ref_path = self.ref_path_male
audio_prompt, sr = librosa.load(ref_path, sr=16000, mono=True)
sys_msg = {'role': 'user', 'content': [audio_voice_clone_prompt + "\n", audio_prompt, "\n" + audio_assistant_prompt]}
elif msg_type == 2: #video
voice_clone_prompt="你是一个AI助手。你能接受视频音频和文本输入并输出语音和文本。模仿输入音频中的声音特征。"
assistant_prompt="作为助手,你将使用这种声音风格说话。"
ref_path = self.ref_path_video_default
if self.customized_options is not None:
voice_clone_prompt = self.customized_options['voice_clone_prompt']
assistant_prompt = self.customized_options['assistant_prompt']
if self.customized_options['use_audio_prompt'] == 1:
ref_path = self.ref_path_default
elif self.customized_options['use_audio_prompt'] == 2:
ref_path = self.ref_path_female
elif self.customized_options['use_audio_prompt'] == 3:
ref_path = self.ref_path_male
audio_prompt, sr = librosa.load(ref_path, sr=16000, mono=True)
sys_msg = {'role': 'user', 'content': [voice_clone_prompt, audio_prompt, assistant_prompt]}
# elif msg_type == 3: #user start
# assistant_prompt="作为助手,你将使用这种声音风格说话。"
# if self.customized_options is not None:
# assistant_prompt = self.customized_options['assistant_prompt']
# sys_msg = {'role': 'user', 'content': [assistant_prompt]}
self.msg_type = msg_type
msgs = [sys_msg]
if self.customized_options is not None:
if self.customized_options['use_audio_prompt'] > 0:
self.minicpmo_model.streaming_prefill(
session_id=str(self.session_id),
msgs=msgs,
tokenizer=self.minicpmo_tokenizer,
)
if msg_type == 0:
self.minicpmo_model.streaming_prefill(
session_id=str(self.session_id),
msgs=msgs,
tokenizer=self.minicpmo_tokenizer,
)
self.savedir = os.path.join(f"./log_data/{args.port}/", str(time.time()))
if not os.path.exists(self.savedir):
os.makedirs(self.savedir)
if not os.path.exists(self.savedir + "/input_audio_log"):
os.makedirs(self.savedir + "/input_audio_log")
if not os.path.exists(self.savedir + "/input_audio_vad_log"):
os.makedirs(self.savedir + "/input_audio_vad_log")
if not os.path.exists(self.savedir + "/input_image_log"):
os.makedirs(self.savedir + "/input_image_log")
if not os.path.exists(self.savedir + "/output_audio_log"):
os.makedirs(self.savedir + "/output_audio_log")
if not os.path.exists(self.savedir + "/feedback_log"):
os.makedirs(self.savedir + "/feedback_log")
if not os.path.exists(self.savedir + "/input_audio"):
os.makedirs(self.savedir + "/input_audio")
self.past_session_id = self.session_id
self.audio_prefill = []
self.audio_input = []
def clear(self):
try:
self.flag_decode = False
self.stream_started = False
self.cnts = None
self.vad_sequence = []
self.audio_prefill = []
self.audio_input = []
self.image_prefill = None
if self.minicpmo_model.llm_past_key_values[0][0].shape[2]>8192:
self.session_id += 1 # to clear all kv cache
self.sys_prompt_flag = False
self.vad_time = 0
self.ls_time = 0
self.msg_type = 1
except Exception as e:
raise ValueError(f"Clear error: {str(e)}")
def process_message(self, message: Dict[str, Any]):
try:
# Process content items
audio_data = None
image_data = None
for content_item in message["content"]:
if content_item["type"] == "stop_response":
logger.info("process_message: received request to stop_response")
self.stop_response = True
return "stop"
elif content_item["type"] == "input_audio":
audio_data = content_item["input_audio"]["data"]
audio_timestamp = content_item["input_audio"].get("timestamp", "")
elif content_item["type"] == "image_data":
image_data = content_item["image_data"]["data"]
if audio_data is None:
return "empty audio"
if self.conversation_started.is_set() and self.is_streaming_complete.is_set():
logger.info("conversation not started or still in generation, skip stream message.")
return "skip"
if self.flag_decode:
return "skip"
try:
audio_bytes = base64.b64decode(audio_data)
image = None
if image_data is not None:
if len(image_data) > 0:
image_bytes = base64.b64decode(image_data)
image_buffer = io.BytesIO(image_bytes)
image_buffer.seek(0)
image = Image.open(image_buffer)
# logger.info("read image")
if self.sys_prompt_flag is False:
self.all_start_time = time.time()
self.sys_prompt_flag = True
if image_data is not None:
self.sys_prompt_init(2)
else:
self.sys_prompt_init(1)
self.prefill(audio_bytes, image, False)
self.vad_sequence.append(audio_bytes)
if len(self.vad_sequence) < self.vad_sequence_length:
# logger.info('length of vad_sequence is {}, insufficient'.format(self.vad_sequence_length))
return "done"
elif len(self.vad_sequence) > self.vad_sequence_length:
# logger.info('length of vad_sequence exceeds {}'.format(self.vad_sequence_length))
self.vad_sequence.pop(0)
self.vad_check_audio_bytes(audio_bytes, image, 16000)
return "done"
except Exception as e:
raise ValueError(f"Audio processing error: {str(e)}")
except Exception as e:
raise ValueError(f"Message processing error: {str(e)}")
def resample_audio(self, input_path, src_sr, tar_sr, output_path):
audio_data, _ = librosa.load(input_path, sr=src_sr)
audio_new = librosa.resample(audio_data, orig_sr=src_sr, target_sr=tar_sr)
soundfile.write(output_path, audio_new, tar_sr)
def calculate_rms(self, input_path, sr):
audio_data, _ = librosa.load(input_path, sr=sr)
return (np.sqrt(np.mean(audio_data**2)) > 0.002)
def vad_check_audio_bytes(self, audio, image, sr):
try:
input_audio_vad_path = self.savedir + f"/input_audio_vad_log/vad_{self.input_audio_vad_id}.wav"
self.input_audio_vad_id += 1
self.merge_wav_files(self.vad_sequence, input_audio_vad_path)
with open(input_audio_vad_path,"rb") as f:
temp_audio = f.read()
dur_vad, vad_audio_bytes, time_vad = vad_utils.run_vad(temp_audio, sr, self.vad_options)
if self.customized_options is not None:
vad_threshold = 1 - self.customized_options['vad_threshold']
else:
vad_threshold = 0.2
if self.calculate_rms(input_audio_vad_path, sr) and dur_vad > 0.4:
if self.stream_started == False:
self.vad_time = time.time()
self.stream_started = True
elif dur_vad < vad_threshold:
if self.stream_started:
self.stream_started = False
if (time.time() - self.vad_time >= 0.6):
self.prefill(audio, image, True)
self.is_streaming_complete.set()
# self.ls_time = time.time()
except Exception as e:
logger.error(f"VAD error: {e}")
raise
return
def prefill(self, audio, image, is_end):
if self.server_wait:
now = time.time()
await_time = self.speaking_time_stamp - now + self.extra_wait_time
if await_time > 0:
return False
if self.flag_decode:
return False
if image is not None:
self.image_prefill = image
try:
if is_end == False:
self.audio_prefill.append(audio)
self.audio_input.append(audio)
slice_nums = 1
if is_end and self.customized_options is not None:
if self.customized_options['hd_video']:
slice_nums = 6
else:
return True
if (len(self.audio_prefill) == (1000/self.audio_chunk)) or (is_end and len(self.audio_prefill)>0):
time_prefill = time.time()
input_audio_path = self.savedir + f"/input_audio_log/input_audio_{self.input_audio_id}.wav"
self.merge_wav_files(self.audio_prefill, input_audio_path)
with open(input_audio_path,"rb") as wav_io:
signal, sr = soundfile.read(wav_io, dtype='float32')
soundfile.write(input_audio_path, signal, 16000)
audio_np, sr = librosa.load(input_audio_path, sr=16000, mono=True)
self.audio_prefill = []
if len(audio_np) > 16000:
audio_np = audio_np[:16000]
with torch.no_grad():
if self.image_prefill is not None:
input_image_path = self.savedir + f'/input_image_log/input_image_{self.input_audio_id}.png'
self.image_prefill.save(input_image_path, 'PNG')
self.image_prefill = self.image_prefill.convert("RGB")
cnts = None
if self.image_prefill is not None:
cnts = ["<unit>", self.image_prefill, audio_np]
else:
cnts = [audio_np]
if cnts is not None:
msg = {"role":"user", "content": cnts}
msgs = [msg]
res = self.minicpmo_model.streaming_prefill(
session_id=str(self.session_id),
msgs=msgs,
tokenizer=self.minicpmo_tokenizer,
max_slice_nums=slice_nums,
)
self.input_audio_id += 1
return True
except Exception as e:
logger.error(f"prefill error: {e}")
import traceback
traceback.print_exc()
raise
def generate_end(self):
self.input_audio_id += 10
self.output_audio_id += 10
self.flag_decode = False
self.reset()
return
async def generate(self):
""" return audio bytes and response text (optional) """
if self.stop_response:
self.generate_end()
return
self.flag_decode = True
try:
with torch.no_grad():
logger.info("=== model gen start ===")
time_gen = time.time()
input_audio_path = self.savedir + f"/input_audio/all_input_audio_{self.input_audio_id}.wav"
self.merge_wav_files(self.audio_input, input_audio_path)
audio_stream = None
try:
with open(input_audio_path, 'rb') as wav_file:
audio_stream = wav_file.read()
except FileNotFoundError:
print(f"File {input_audio_path} not found.")
yield base64.b64encode(audio_stream).decode('utf-8'), "assistant:\n"
print('=== gen start: ', time.time() - time_gen)
first_time = True
temp_time = time.time()
temp_time1 = time.time()
with torch.inference_mode():
if self.stop_response:
self.generate_end()
return
self.minicpmo_model.config.stream_input=True
msg = {"role":"user", "content": self.cnts}
msgs = [msg]
text = ''
self.speaking_time_stamp = time.time()
try:
for r in self.minicpmo_model.streaming_generate(
session_id=str(self.session_id),
tokenizer=self.minicpmo_tokenizer,
use_tts=True,
# enable_regenerate=True,
):
if self.stop_response:
self.generate_end()
return
audio_np, sr, text = r
output_audio_path = self.savedir + f'/output_audio_log/output_audio_{self.output_audio_id}.wav'
self.output_audio_id += 1
soundfile.write(output_audio_path, audio_np, samplerate=sr)
audio_stream = None
try:
with open(output_audio_path, 'rb') as wav_file:
audio_stream = wav_file.read()
except FileNotFoundError:
print(f"File {output_audio_path} not found.")
temp_time1 = time.time()
print('text: ', text)
yield base64.b64encode(audio_stream).decode('utf-8'), text
self.speaking_time_stamp += self.cycle_wait_time
except Exception as e:
logger.error(f"Error happened during generation: {str(e)}")
yield None, '\n<end>'
except Exception as e:
logger.error(f"发生异常:{e}")
import traceback
traceback.print_exc()
raise
finally:
logger.info(f"uid {self.uid}: generation finished!")
self.generate_end()
async def check_activity(self):
while True:
# Check for overall inactivity (30 minutes)
if self.is_timed_out():
self.reset()
if self.no_active_stream() and not self.is_streaming_complete.is_set():
self.is_streaming_complete.set()
await asyncio.sleep(1) # Check every second
def upload_customized_audio(self, audio_data, audio_fmt):
self.customized_audio = None
try:
if audio_data is not None and len(audio_data) > 0:
# if audio_fmt == "mp3" or audio_fmt == "wav":
audio_bytes = base64.b64decode(audio_data)
fio = io.BytesIO(audio_bytes)
fio.seek(0)
audio_np, sr = librosa.load(fio, sr=16000, mono=True)
if audio_np is not None and len(audio_np) > 1000:
output_audio_path = self.savedir + f'/customized_audio.wav'
soundfile.write(output_audio_path, audio_np, sr)
self.customized_audio = output_audio_path
logger.info(f"processed customized {audio_fmt} audio")
print(audio_np.shape, type(audio_np), sr)
else:
logger.info(f"empty customized audio, use default value instead.")
self.customized_audio = None
except Exception as e:
raise ValueError(f"Process customized audio error: {str(e)}")
def update_customized_options(self, uid, options):
self.customized_options = None
if options is None:
raise ValueError("Invalid None type for options, expected dict type")
self.customized_options = options
logger.info(f"uid: {uid} set customized_options to {options}")
stream_manager = StreamManager()
@app.on_event("startup")
async def startup_event():
logger.info("Starting application and activity checker")
asyncio.create_task(stream_manager.check_activity())
@app.on_event("shutdown")
async def shutdown_event():
logger.info("Shutting down application")
@app.post("/stream")
@app.post("/api/v1/stream")
async def stream(request: Request, uid: Optional[str] = Header(None)):
global stream_manager
stream_manager.update_last_request_time()
stream_manager.update_last_stream_time()
if not uid:
raise HTTPException(status_code=400, detail="Missing uid in headers")
if stream_manager.uid is not None and stream_manager.uid != uid:
logger.error(f"uid changed during steram: previous uid {stream_manager.uid}, new uid {uid}")
raise HTTPException(status_code=400, detail="uid changed in stream")
try:
# Parse JSON request
data = await request.json()
# Validate basic structure
if not isinstance(data, dict) or "messages" not in data:
raise HTTPException(status_code=400, detail="Invalid request format")
# Process messages
reason = ""
for message in data["messages"]:
if not isinstance(message, dict) or "role" not in message or "content" not in message:
raise HTTPException(status_code=400, detail="Invalid message format")
reason = stream_manager.process_message(message)
# Return response using uid from header
response = {
"id": uid,
"choices": {
"role": "assistant",
"content": "success",
"finish_reason": reason
}
}
return JSONResponse(content=response, status_code=200)
except json.JSONDecodeError:
raise HTTPException(status_code=400, detail="Invalid JSON")
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.websocket("/ws/stream")
@app.websocket("/ws/api/v1/stream")
async def websocket_stream(websocket: WebSocket,
uid: Optional[str] = Query(None)):
global stream_manager
if not uid:
await websocket.close(code=400, reason="Missing uid in request")
return
# Accept the WebSocket connection
await websocket.accept()
#if stream_manager.uid is not None and stream_manager.uid != uid:
# logger.error(f"uid changed during steram: previous uid {stream_manager.uid}, new uid {uid}")
# await websocket.close(code=400, reason="Uid changed in stream.")
# return
try:
while True:
# Continuously listen for incoming messages from the client
data = await websocket.receive_text()
# Parse JSON request
try:
request_data = json.loads(data)
except json.JSONDecodeError:
await websocket.send_json({"error": "Invalid JSON"})
continue
stream_manager.update_last_request_time()
stream_manager.update_last_stream_time()
if stream_manager.uid is not None and stream_manager.uid != uid:
logger.error(f"uid changed during stream: previous uid {stream_manager.uid}, new uid {uid}")
await websocket.send_json({"error": "UID changed in stream"})
continue
# Validate basic structure
if not isinstance(request_data, dict) or "messages" not in request_data:
await websocket.send_json({"error": "Invalid request format"})
continue
# Process messages
try:
reason = ""
for message in request_data["messages"]:
if not isinstance(message, dict) or "role" not in message or "content" not in message:
await websocket.send_json({"error": "Invalid message format"})
continue
reason = stream_manager.process_message(message)
# Respond with success message
response = {
"id": uid,
"choices": {
"role": "assistant",
"content": "success",
"finish_reason": reason,
},
}
await websocket.send_json(response)
except WebSocketDisconnect:
# Handle WebSocket disconnection
break
except Exception as e:
logger.error(f"process message error: {str(e)}")
await websocket.close(code=1011, reason =f"Internal server error: {str(e)}")
except WebSocketDisconnect:
# Handle WebSocket disconnection
return
except Exception as e:
logger.error(f"ws_stream error: {str(e)}")
await websocket.close(code=1011, reason =f"Unexpected error: {str(e)}")
async def generate_sse_response(request: Request, uid: Optional[str] = Header(None)):
global stream_manager
print(f"uid: {uid}")
try:
# Wait for streaming to complete or timeout
while not stream_manager.is_streaming_complete.is_set():
# if stream_manager.is_timed_out():
# yield f"data: {json.dumps({'error': 'Stream timeout'})}\n\n"
# return
# print(f"{uid} whille not stream_manager.is_streaming_complete.is_set(), asyncio.sleep(0.1)")
await asyncio.sleep(0.1)
logger.info("streaming complete\n")
# Generate response
try:
yield f"event: message\n"
async for audio, text in stream_manager.generate():
if text == "stop":
break
res = {
"id": stream_manager.uid,
"response_id": stream_manager.output_audio_id,
"choices": [
{
"role": "assistant",
"audio": audio,
"text": text,
"finish_reason": "processing"
}
]
}
# logger.info("generate_sse_response yield response")
yield f"data: {json.dumps(res)}\n\n"
await asyncio.sleep(0)
except Exception as e:
logger.error(f"Error while generation: {str(e)}")
yield f'data:{{"error": "{str(exc)}"}}\n\n'
except Exception as e:
yield f'data:{{"error": "{str(e)}"}}\n\n'
@app.post("/completions")
@app.post("/api/v1/completions")
async def completions(request: Request, uid: Optional[str] = Header(None)):
global stream_manager
if not uid:
raise HTTPException(status_code=400, detail="Missing uid in headers")
try:
# if stream_manager.uid is not None and stream_manager.uid != uid:
if stream_manager.uid != uid:
# stream_manager.stop_response = True
# logger.info(f"uid changed, reset model: previous uid {stream_manager.uid}, new uid {uid}")
stream_manager.session_id += 1
stream_manager.sys_prompt_flag = False
stream_manager.reset()
# raise HTTPException(
# status_code=409,
# detail="User id changed, reset context."
# )
stream_manager.speaking_time_stamp = 0
stream_manager.update_last_request_time()
stream_manager.uid = uid
stream_manager.start_conversation()
data = await request.json()
return StreamingResponse(
generate_sse_response(request, uid),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"Transfer-Encoding": "chunked"
}
)
except asyncio.TimeoutError:
raise HTTPException(
status_code=503,
detail="Server busy, please try again later"
)
except Exception as e:
logger.error(f"Error processing request for user {uid}: {str(e)}")
raise HTTPException(status_code=500, detail=str(e))
@app.post("/stop")
@app.post("/api/v1/stop")
async def stop_response(request: Request, uid: Optional[str] = Header(None)):
if not uid:
raise HTTPException(status_code=400, detail="Missing uid in headers")
global stream_manager
# stream_manager.session_id += 1
logger.info(f"uid {uid}: received stop_response")
stream_manager.stop_response = True
response = {
"id": uid,
"choices": {
"role": "assistant",
"content": "success",
"finish_reason": "stop"
}
}
return JSONResponse(content=response, status_code=200)
@app.post("/feedback")
@app.post("/api/v1/feedback")
async def feedback(request: Request, uid: Optional[str] = Header(None)):
global stream_manager
# Validate the 'uid' header
if not uid:
raise HTTPException(status_code=400, detail="Missing 'uid' header")
try:
data = await request.json()
if "response_id" not in data or "rating" not in data:
raise HTTPException(status_code=400, detail="Invalid request: must have response_id and rating")
response_id = data.get("response_id", "")
rating = data.get("rating", "")
comment = data.get("comment", "")
# Validate the rating
if rating not in ["like", "dislike"]:
raise HTTPException(status_code=400, detail=f"Invalid rating value: {rating}")
# Define the log file path
log_file_path = f"{stream_manager.savedir}/feedback_log/{response_id}.{rating}"
# Write the feedback to the file asynchronously
async with aiofiles.open(log_file_path, mode="a") as file:
await file.write(f"model: {stream_manager.minicpmo_model_path}\nuid {uid}: {comment}\n")
response = {
"id": uid,
"choices": {
"role": "assistant",
"content": "success",
"finish_reason": "done"
}
}
return JSONResponse(content=response, status_code=200)
except Exception as e:
logger.error(f"Error processing feedback for user {uid}: {str(e)}")
raise HTTPException(status_code=500, detail=str(e))
@app.post("/init_options")
@app.post("/api/v1/init_options")
async def init_options(request: Request, uid: Optional[str] = Header(None)):
global stream_manager
stream_manager.update_last_request_time()
if not uid:
raise HTTPException(status_code=400, detail="Missing uid in headers")
try:
# Parse JSON request
data = await request.json()
# Validate basic structure
if not isinstance(data, dict) or "messages" not in data:
raise HTTPException(status_code=400, detail="Invalid request format")
messages = data.get("messages", [])
for message in messages:
if not isinstance(message, dict) or "role" not in message or "content" not in message:
raise HTTPException(status_code=400, detail="Invalid message format")
for content in message.get("content", []):
if content["type"] == "input_audio":
audio_data = content["input_audio"].get("data", "")
audio_fmt = content["input_audio"].get("format", "")
stream_manager.upload_customized_audio(audio_data, audio_fmt)
elif content["type"] == "options":
stream_manager.update_customized_options(uid, content["options"])
else:
ctype = content["type"]
raise HTTPException(status_code=400, detail=f"Invalid content type: {ctype}")
version = stream_manager.model_version
print(version)
response = {
"id": uid,
"choices": {
"role": "assistant",
"content": version,
"finish_reason": "done"
}
}
return JSONResponse(content=response, status_code=200)
except Exception as e:
raise HTTPException(status_code=400, detail=f"init options error: {str(e)}")
@app.get('/health')
@app.get('/api/v1/health')
async def health_check():
return {"status": "OK"}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=args.port)
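The server above exposes a uid-keyed streaming API (`/api/v1/stream` for pushing audio/image chunks, `/api/v1/completions` for the SSE reply, plus `/stop`, `/feedback`, `/init_options` and `/health`). Below is a minimal client sketch, assuming a local server on the default port 8088 and a hypothetical 16 kHz mono file `chunk.wav`; the uid value, file name and single-chunk flow are illustrative assumptions, not part of the repository.
```python
import base64
import json

import requests

BASE = "http://127.0.0.1:8088/api/v1"  # default --port of the server above (assumes a local run)
HEADERS = {"uid": "demo-user-001"}     # every endpoint requires a uid header

# Push one base64-encoded audio chunk (the real client streams ~200 ms chunks repeatedly).
with open("chunk.wav", "rb") as f:     # hypothetical 16 kHz mono recording
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [{
        "role": "user",
        "content": [{"type": "input_audio",
                     "input_audio": {"data": audio_b64, "timestamp": ""}}],
    }]
}
print(requests.post(f"{BASE}/stream", json=payload, headers=HEADERS).json())

# Read the SSE reply; each data line carries base64 audio plus incremental text.
with requests.post(f"{BASE}/completions", json={}, headers=HEADERS, stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):])
            print(chunk["choices"][0]["text"])
```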
Binary file not shown.
@@ -0,0 +1,301 @@
import functools
import numpy as np
import librosa
import os
import time
import traceback
import warnings
from typing import List, NamedTuple, Optional
class VadOptions(NamedTuple):
"""VAD options.
Attributes:
threshold: Speech threshold. Silero VAD outputs speech probabilities for each audio chunk,
probabilities ABOVE this value are considered as SPEECH. It is better to tune this
parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
min_speech_duration_ms: Final speech chunks shorter min_speech_duration_ms are thrown out.
max_speech_duration_s: Maximum duration of speech chunks in seconds. Chunks longer
than max_speech_duration_s will be split at the timestamp of the last silence that
lasts more than 100ms (if any), to prevent aggressive cutting. Otherwise, they will be
split aggressively just before max_speech_duration_s.
min_silence_duration_ms: In the end of each speech chunk wait for min_silence_duration_ms
before separating it
window_size_samples: Audio chunks of window_size_samples size are fed to the silero VAD model.
WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate.
Values other than these may affect model performance!!
speech_pad_ms: Final speech chunks are padded by speech_pad_ms each side
"""
# threshold: float = 0.3 # rep 0.5
# min_speech_duration_ms: int = 250
# max_speech_duration_s: float = float("inf")
# min_silence_duration_ms: int = 2000
# window_size_samples: int = 1024
# speech_pad_ms: int = 600 # rep 400
threshold: float = 0.7 # gw: 0.3 # rep 0.5
min_speech_duration_ms: int = 128 # original & gw: 250
max_speech_duration_s: float = float("inf")
min_silence_duration_ms: int = 500 # original & gw: 2000
window_size_samples: int = 1024
speech_pad_ms: int = 30 # gw: 600 # rep 400
class SileroVADModel:
def __init__(self, path):
try:
import onnxruntime
except ImportError as e:
raise RuntimeError(
"Applying the VAD filter requires the onnxruntime package"
) from e
opts = onnxruntime.SessionOptions()
opts.inter_op_num_threads = 1
opts.intra_op_num_threads = 1
opts.log_severity_level = 4
self.session = onnxruntime.InferenceSession(
path,
providers=["CPUExecutionProvider"],
sess_options=opts,
)
def get_initial_state(self, batch_size: int):
h = np.zeros((2, batch_size, 64), dtype=np.float32)
c = np.zeros((2, batch_size, 64), dtype=np.float32)
return h, c
def __call__(self, x, state, sr: int):
if len(x.shape) == 1:
x = np.expand_dims(x, 0)
if len(x.shape) > 2:
raise ValueError(
f"Too many dimensions for input audio chunk {len(x.shape)}"
)
if sr / x.shape[1] > 31.25:
raise ValueError("Input audio chunk is too short")
h, c = state
ort_inputs = {
"input": x,
#"state": np.concatenate((h, c), axis=0),
"h": h,
"c": c,
"sr": np.array(sr, dtype="int64"),
}
out, h, c = self.session.run(None, ort_inputs)
#out = self.session.run(None, ort_inputs)
state = (h, c)
return out, state
@functools.lru_cache
def get_vad_model():
"""Returns the VAD model instance."""
path = os.path.join(os.path.dirname(__file__), "silero_vad.onnx")
return SileroVADModel(path)
def get_speech_timestamps(
audio: np.ndarray,
vad_options: Optional[VadOptions] = None,
**kwargs,
) -> List[dict]:
"""This method is used for splitting long audios into speech chunks using silero VAD.
Args:
audio: One dimensional float array.
vad_options: Options for VAD processing.
kwargs: VAD options passed as keyword arguments for backward compatibility.
Returns:
List of dicts containing begin and end samples of each speech chunk.
"""
if vad_options is None:
vad_options = VadOptions(**kwargs)
threshold = vad_options.threshold
min_speech_duration_ms = vad_options.min_speech_duration_ms
max_speech_duration_s = vad_options.max_speech_duration_s
min_silence_duration_ms = vad_options.min_silence_duration_ms
window_size_samples = vad_options.window_size_samples
speech_pad_ms = vad_options.speech_pad_ms
if window_size_samples not in [512, 1024, 1536]:
warnings.warn(
"Unusual window_size_samples! Supported window_size_samples:\n"
" - [512, 1024, 1536] for 16000 sampling_rate"
)
sampling_rate = 16000
min_speech_samples = sampling_rate * min_speech_duration_ms / 1000  # speech chunks shorter than this are dropped
speech_pad_samples = sampling_rate * speech_pad_ms / 1000
max_speech_samples = (
sampling_rate * max_speech_duration_s
- window_size_samples
- 2 * speech_pad_samples
)
min_silence_samples = sampling_rate * min_silence_duration_ms / 1000  # a silence must last min_silence_duration_ms before the current speech chunk is closed
min_silence_samples_at_max_speech = sampling_rate * 98 / 1000 # 0.098s # need to adjust
audio_length_samples = len(audio)
# import pdb
# pdb.set_trace()
model = get_vad_model()
state = model.get_initial_state(batch_size=1)
speech_probs = []
#print("audio_length_samples ", audio_length_samples, ", window_size_samples ", window_size_samples)
for current_start_sample in range(0, audio_length_samples, window_size_samples):
chunk = audio[current_start_sample : current_start_sample + window_size_samples]
if len(chunk) < window_size_samples:
chunk = np.pad(chunk, (0, int(window_size_samples - len(chunk))))
speech_prob, state = model(chunk, state, sampling_rate)
speech_probs.append(speech_prob)
triggered = False
speeches = []
current_speech = {}
neg_threshold = threshold - 0.15
# to save potential segment end (and tolerate some silence)
temp_end = 0
# to save potential segment limits in case of maximum segment size reached
prev_end = next_start = 0
# Roughly: scan the audio for contiguous speech. When silence is hit, temp_end records a potential segment end; if speech resumes before the minimum silence length, temp_end is reset. Segment start/end pairs are emitted once the silence is long enough, or when the maximum segment length is exceeded (which cannot happen here since it is inf).
for i, speech_prob in enumerate(speech_probs):
if (speech_prob >= threshold) and temp_end:
temp_end = 0
if next_start < prev_end:
next_start = window_size_samples * i
if (speech_prob >= threshold) and not triggered:
triggered = True
current_speech["start"] = window_size_samples * i
continue
if (
triggered
and (window_size_samples * i) - current_speech["start"] > max_speech_samples
):
if prev_end:
current_speech["end"] = prev_end
speeches.append(current_speech)
current_speech = {}
# previously reached silence (< neg_thres) and is still not speech (< thres)
if next_start < prev_end:
triggered = False
else:
current_speech["start"] = next_start
prev_end = next_start = temp_end = 0
else:
current_speech["end"] = window_size_samples * i
speeches.append(current_speech)
current_speech = {}
prev_end = next_start = temp_end = 0
triggered = False
continue
if (speech_prob < neg_threshold) and triggered:
if not temp_end:
temp_end = window_size_samples * i
# condition to avoid cutting in very short silence
if (window_size_samples * i) - temp_end > min_silence_samples_at_max_speech:
prev_end = temp_end
if (window_size_samples * i) - temp_end < min_silence_samples:
continue
else:
current_speech["end"] = temp_end
if (
current_speech["end"] - current_speech["start"]
) > min_speech_samples:
speeches.append(current_speech)
current_speech = {}
prev_end = next_start = temp_end = 0
triggered = False
continue
if (
current_speech
and (audio_length_samples - current_speech["start"]) > min_speech_samples
):
current_speech["end"] = audio_length_samples
speeches.append(current_speech)
# pad each chunk by speech_pad_ms on both sides; if the gap to the next chunk is too small, split the available silence evenly between the two chunks
for i, speech in enumerate(speeches):
if i == 0:
speech["start"] = int(max(0, speech["start"] - speech_pad_samples))
if i != len(speeches) - 1:
silence_duration = speeches[i + 1]["start"] - speech["end"]
if silence_duration < 2 * speech_pad_samples:
speech["end"] += int(silence_duration // 2)
speeches[i + 1]["start"] = int(
max(0, speeches[i + 1]["start"] - silence_duration // 2)
)
else:
speech["end"] = int(
min(audio_length_samples, speech["end"] + speech_pad_samples)
)
speeches[i + 1]["start"] = int(
max(0, speeches[i + 1]["start"] - speech_pad_samples)
)
else:
speech["end"] = int(
min(audio_length_samples, speech["end"] + speech_pad_samples)
)
return speeches
def collect_chunks(audio: np.ndarray, chunks: List[dict]) -> np.ndarray:
"""Collects and concatenates audio chunks."""
if not chunks:
return np.array([], dtype=np.float32)
return np.concatenate([audio[chunk["start"] : chunk["end"]] for chunk in chunks])
def run_vad(ori_audio, sr, vad_options=None):
_st = time.time()
try:
audio = np.frombuffer(ori_audio, dtype=np.int16)
audio = audio.astype(np.float32) / 32768.0
sampling_rate = 16000
if sr != sampling_rate:
audio = librosa.resample(audio, orig_sr=sr, target_sr=sampling_rate)
# print('audio.encode.shape: {}'.format(audio.shape))
if vad_options is None:
vad_options = VadOptions()
# make sure a VadOptions instance is passed to get_speech_timestamps
speech_chunks = get_speech_timestamps(audio, vad_options=vad_options)
# print(speech_chunks)
audio = collect_chunks(audio, speech_chunks)
# print(audio.shape)
duration_after_vad = audio.shape[0] / sampling_rate
# print('audio.decode.shape: {}'.format(audio.shape))
if sr != sampling_rate:
# resample to original sampling rate
vad_audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=sr)
else:
vad_audio = audio
vad_audio = np.round(vad_audio * 32768.0).astype(np.int16)
# rounding back to int16 introduces a small quantization error
vad_audio_bytes = vad_audio.tobytes()
return duration_after_vad, vad_audio_bytes, round(time.time() - _st, 4)
except Exception as e:
msg = f"[asr vad error] audio_len: {len(ori_audio)/(sr*2):.3f} s, trace: {traceback.format_exc()}"
print(msg)
return -1, ori_audio, round(time.time() - _st, 4)
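For reference, a small usage sketch of the helpers above, assuming a hypothetical 16 kHz mono file `speech.wav` and that `silero_vad.onnx` sits next to this module (as `get_vad_model` expects):
```python
import soundfile as sf

import vad_utils

# run_vad expects raw 16-bit PCM bytes plus the sample rate of the recording.
samples, sr = sf.read("speech.wav", dtype="int16")  # hypothetical mono input file
dur_vad, vad_bytes, elapsed = vad_utils.run_vad(samples.tobytes(), sr, vad_utils.VadOptions())

print(f"speech kept after VAD: {dur_vad:.2f}s (VAD took {elapsed:.3f}s)")
```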
@@ -0,0 +1,359 @@
{
"globals": {
"Component": true,
"ComponentPublicInstance": true,
"ComputedRef": true,
"EffectScope": true,
"ExtractDefaultPropTypes": true,
"ExtractPropTypes": true,
"ExtractPublicPropTypes": true,
"InjectionKey": true,
"LegalTypeEnum": true,
"LoginTypeEnum": true,
"PropType": true,
"Ref": true,
"VNode": true,
"WritableComputedRef": true,
"acceptHMRUpdate": true,
"ajaxHeader": true,
"asyncComputed": true,
"authLogin": true,
"autoResetRef": true,
"computed": true,
"computedAsync": true,
"computedEager": true,
"computedInject": true,
"computedWithControl": true,
"controlledComputed": true,
"controlledRef": true,
"createApp": true,
"createEventHook": true,
"createGlobalState": true,
"createInjectionState": true,
"createPinia": true,
"createReactiveFn": true,
"createReusableTemplate": true,
"createSharedComposable": true,
"createTemplatePromise": true,
"createUnrefFn": true,
"customRef": true,
"debouncedRef": true,
"debouncedWatch": true,
"defineAsyncComponent": true,
"defineComponent": true,
"defineStore": true,
"eagerComputed": true,
"effectScope": true,
"extendRef": true,
"fetchSmsVerifyCode": true,
"getActivePinia": true,
"getCurrentInstance": true,
"getCurrentScope": true,
"getHomeInfo": true,
"h": true,
"ignorableWatch": true,
"inject": true,
"injectLocal": true,
"isDefined": true,
"isProxy": true,
"isReactive": true,
"isReadonly": true,
"isRef": true,
"loginSuccess": true,
"makeDestructurable": true,
"mapActions": true,
"mapGetters": true,
"mapState": true,
"mapStores": true,
"mapWritableState": true,
"markRaw": true,
"nextTick": true,
"onActivated": true,
"onBeforeMount": true,
"onBeforeRouteLeave": true,
"onBeforeRouteUpdate": true,
"onBeforeUnmount": true,
"onBeforeUpdate": true,
"onClickOutside": true,
"onDeactivated": true,
"onErrorCaptured": true,
"onKeyStroke": true,
"onLongPress": true,
"onMounted": true,
"onRenderTracked": true,
"onRenderTriggered": true,
"onScopeDispose": true,
"onServerPrefetch": true,
"onStartTyping": true,
"onUnmounted": true,
"onUpdated": true,
"pausableWatch": true,
"provide": true,
"provideLocal": true,
"reactify": true,
"reactifyObject": true,
"reactive": true,
"reactiveComputed": true,
"reactiveOmit": true,
"reactivePick": true,
"readonly": true,
"ref": true,
"refAutoReset": true,
"refDebounced": true,
"refDefault": true,
"refThrottled": true,
"refWithControl": true,
"resolveComponent": true,
"resolveRef": true,
"resolveUnref": true,
"setActivePinia": true,
"setMapStoreSuffix": true,
"setupStore": true,
"shallowReactive": true,
"shallowReadonly": true,
"shallowRef": true,
"store": true,
"storeToRefs": true,
"submitFeedback": true,
"syncRef": true,
"syncRefs": true,
"templateRef": true,
"throttledRef": true,
"throttledWatch": true,
"toRaw": true,
"toReactive": true,
"toRef": true,
"toRefs": true,
"toValue": true,
"triggerRef": true,
"tryOnBeforeMount": true,
"tryOnBeforeUnmount": true,
"tryOnMounted": true,
"tryOnScopeDispose": true,
"tryOnUnmounted": true,
"unref": true,
"unrefElement": true,
"until": true,
"useActiveElement": true,
"useAnimate": true,
"useArrayDifference": true,
"useArrayEvery": true,
"useArrayFilter": true,
"useArrayFind": true,
"useArrayFindIndex": true,
"useArrayFindLast": true,
"useArrayIncludes": true,
"useArrayJoin": true,
"useArrayMap": true,
"useArrayReduce": true,
"useArraySome": true,
"useArrayUnique": true,
"useAsyncQueue": true,
"useAsyncState": true,
"useAttrs": true,
"useBase64": true,
"useBattery": true,
"useBluetooth": true,
"useBreakpoints": true,
"useBroadcastChannel": true,
"useBrowserLocation": true,
"useCached": true,
"useClearLocalCache": true,
"useClipboard": true,
"useClipboardItems": true,
"useCloned": true,
"useColorMode": true,
"useConfirmDialog": true,
"useCounter": true,
"useCssModule": true,
"useCssVar": true,
"useCssVars": true,
"useCurrentElement": true,
"useCycleList": true,
"useDark": true,
"useDateFormat": true,
"useDebounce": true,
"useDebounceFn": true,
"useDebouncedRefHistory": true,
"useDeviceMotion": true,
"useDeviceOrientation": true,
"useDevicePixelRatio": true,
"useDevicesList": true,
"useDisplayMedia": true,
"useDocumentVisibility": true,
"useDraggable": true,
"useDropZone": true,
"useElementBounding": true,
"useElementByPoint": true,
"useElementHover": true,
"useElementSize": true,
"useElementVisibility": true,
"useEventBus": true,
"useEventListener": true,
"useEventSource": true,
"useEyeDropper": true,
"useFavicon": true,
"useFetch": true,
"useFetchLogin": true,
"useFileDialog": true,
"useFileSystemAccess": true,
"useFocus": true,
"useFocusWithin": true,
"useFps": true,
"useFullscreen": true,
"useGamepad": true,
"useGeolocation": true,
"useGetLocalCache": true,
"useHttp": true,
"useIdle": true,
"useImage": true,
"useInfiniteScroll": true,
"useIntersectionObserver": true,
"useInterval": true,
"useIntervalFn": true,
"useKeyModifier": true,
"useLastChanged": true,
"useLegal": true,
"useLink": true,
"useLocalStorage": true,
"useLogin": true,
"useMagicKeys": true,
"useManualRefHistory": true,
"useMediaControls": true,
"useMediaQuery": true,
"useMemoize": true,
"useMemory": true,
"useMounted": true,
"useMouse": true,
"useMouseInElement": true,
"useMousePressed": true,
"useMutationObserver": true,
"useNavigatorLanguage": true,
"useNetwork": true,
"useNow": true,
"useObjectUrl": true,
"useOffsetPagination": true,
"useOnline": true,
"usePageLeave": true,
"useParallax": true,
"useParentElement": true,
"usePerformanceObserver": true,
"usePermission": true,
"usePointer": true,
"usePointerLock": true,
"usePointerSwipe": true,
"usePreferredColorScheme": true,
"usePreferredContrast": true,
"usePreferredDark": true,
"usePreferredLanguages": true,
"usePreferredReducedMotion": true,
"usePrevious": true,
"useRafFn": true,
"useRefHistory": true,
"useResizeObserver": true,
"useRoute": true,
"useRouter": true,
"useScreenOrientation": true,
"useScreenSafeArea": true,
"useScriptTag": true,
"useScroll": true,
"useScrollLock": true,
"useSessionStorage": true,
"useSetLocalCache": true,
"useShare": true,
"useSlots": true,
"useSorted": true,
"useSpeechRecognition": true,
"useSpeechSynthesis": true,
"useStepper": true,
"useStorage": true,
"useStorageAsync": true,
"useStyleTag": true,
"useSupported": true,
"useSwipe": true,
"useTemplateRefsList": true,
"useTextDirection": true,
"useTextSelection": true,
"useTextareaAutosize": true,
"useThrottle": true,
"useThrottleFn": true,
"useThrottledRefHistory": true,
"useTimeAgo": true,
"useTimeout": true,
"useTimeoutFn": true,
"useTimeoutPoll": true,
"useTimestamp": true,
"useTitle": true,
"useToNumber": true,
"useToString": true,
"useToggle": true,
"useTransition": true,
"useUrlSearchParams": true,
"useUserMedia": true,
"useUserStore": true,
"useUserStoreWithOut": true,
"useVModel": true,
"useVModels": true,
"useVibrate": true,
"useVirtualList": true,
"useWakeLock": true,
"useWebNotification": true,
"useWebSocket": true,
"useWebWorker": true,
"useWebWorkerFn": true,
"useWindowFocus": true,
"useWindowScroll": true,
"useWindowSize": true,
"watch": true,
"watchArray": true,
"watchAtMost": true,
"watchDebounced": true,
"watchDeep": true,
"watchEffect": true,
"watchIgnorable": true,
"watchImmediate": true,
"watchOnce": true,
"watchPausable": true,
"watchPostEffect": true,
"watchSyncEffect": true,
"watchThrottled": true,
"watchTriggerable": true,
"watchWithFilter": true,
"whenever": true,
"ElMessage": true,
"ElLoading": true,
"deleteHistoryBatch": true,
"deleteHistoryItem": true,
"getHistory": true,
"createConv": true,
"fetchHistoryList": true,
"stopChat": true,
"useChatStore": true,
"useChatStoreWithOut": true,
"useChatExchangeStore": true,
"useChatExchangeStoreWithOut": true,
"useExchangeStore": true,
"useExchangeStoreWithOut": true,
"delMessage": true,
"sendRating": true,
"getInitialActions": true,
"sendFeedback": true,
"md": true,
"useMarkdown": true,
"connectService": true,
"sendMessage": true,
"Audio": true,
"SoundRecording": true,
"getVolume": true,
"ElMessageBox": true,
"encodeWav": true,
"encodeWAV": true,
"stopMessage": true,
"TaskQueue": true,
"getNewUserId": true,
"setNewUserId": true,
"uploadFile": true,
"feedback": true,
"uploadConfig": true
}
}
@@ -0,0 +1,26 @@
/* eslint-env node */
require('@rushstack/eslint-patch/modern-module-resolution');
module.exports = {
root: true,
extends: [
'plugin:vue/vue3-essential',
'eslint:recommended',
'@vue/eslint-config-prettier/skip-formatting',
'./.eslintrc-auto-import.json',
],
parserOptions: {
ecmaVersion: 'latest',
},
rules: {
'no-console': process.env.NODE_ENV === 'production' ? 'off' : 'warn',
'no-debugger': process.env.NODE_ENV === 'production' ? 'error' : 'warn',
'no-var': process.env.NODE_ENV === 'production' ? 'off' : 'warn',
'no-undef': process.env.NODE_ENV === 'production' ? 'error' : 'warn',
'vue/multi-word-component-names': 'off', // do not enforce multi-word component names
'no-empty': 0, // allow empty code blocks
'vue/no-unused-components': 'warn',
'no-unused-vars': 'warn',
'prettier/prettier': 'off', // do not let eslint report errors for code that violates prettier formatting
},
};
@@ -0,0 +1,32 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*
node_modules
.DS_Store
dist
dist-ssr
coverage
*.local
/cypress/videos/
/cypress/screenshots/
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
*.tsbuildinfo
.VSCodeCounter
.history
@@ -0,0 +1,10 @@
#!/usr/bin/env sh
. "$(dirname -- "$0")/_/husky.sh"
echo "---format start---"
pnpm run format
echo "---format end---"
echo "---eslint start---"
pnpm run lint
echo "---eslint end---"
@@ -0,0 +1,19 @@
{
"$schema": "https://json.schemastore.org/prettierrc",
"semi": true,
"trailingComma": "none",
"singleQuote": true,
"printWidth": 120,
"tabWidth": 4,
"useTabs": false,
"quoteProps": "as-needed",
"bracketSpacing": true,
"jsxBracketSameLine": false,
"arrowParens": "avoid",
"endOfLine": "auto",
"htmlWhitespaceSensitivity": "css",
"cssDeclarationSortOrder": "alphabetical",
"tableContentIndentation": "align",
"vueIndentScriptAndStyle": true,
"proseWrap": "preserve"
}
@@ -0,0 +1,3 @@
{
"recommendations": ["Vue.volar", "dbaeumer.vscode-eslint", "esbenp.prettier-vscode"]
}
@@ -0,0 +1,21 @@
# FROM a Node base image; name this build stage "build-stage"
FROM modelbest-registry-vpc.cn-beijing.cr.aliyuncs.com/modelbest/playground:20.10.0 as build-stage
# set the working directory to /build, isolated from system files
WORKDIR /build
COPY . /build
# install dependencies inside the container
RUN npm config set registry https://registry.npmmirror.com/
# or use the mirror https://registry.npm.taobao.org
RUN npm i pnpm -g
RUN pnpm config set registry https://registry.npmmirror.com/
RUN pnpm install
# build the production bundle
RUN pnpm run build
# production stage
FROM modelbest-registry-vpc.cn-beijing.cr.aliyuncs.com/modelbest/playground:alpine as production-stage
COPY --from=build-stage /build/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/
EXPOSE 3000
@@ -0,0 +1,74 @@
## Language
- [English](#english)
- [中文](#中文)
---
# English
## Important
This project depends on Node and PNPM. If they are not installed, please install them.
## Project Setup
```sh
pnpm install
```
## Compile and Hot-Reload for Development
```sh
pnpm run dev
```
## Compile and Minify for Production
```sh
pnpm run build
```
### Tips
If you want to use your own backend in the development environment, please modify the proxy object in <font color="red">vite.config.js</font> located in the root directory.
### Recommended IDE Setup
[VSCode](https://code.visualstudio.com/)
---
# 中文
## 重要
这个项目依赖于node、pnpm环境如果你的PC上没有请先安装。
## 安装依赖
```sh
pnpm install
```
## 运行在本地开发模式下(可热更新)
```sh
pnpm run dev
```
## 编译代码(用于生产环境)
```sh
pnpm run build
```
### 注意
如果你想在本地开发模式下运行项目,并且调用自己的后端服务,请修改项目根目录下的<font color="red">vite.config.js</font>文件中的proxy配置。
### 推荐IDE
[VSCode](https://code.visualstudio.com/)
@@ -0,0 +1,31 @@
/* eslint-disable */
/* prettier-ignore */
// @ts-nocheck
// Generated by unplugin-vue-components
// Read more: https://github.com/vuejs/core/pull/3399
export {}
declare module 'vue' {
export interface GlobalComponents {
ElButton: typeof import('element-plus/es')['ElButton']
ElCheckbox: typeof import('element-plus/es')['ElCheckbox']
ElCheckboxGroup: typeof import('element-plus/es')['ElCheckboxGroup']
ElDialog: typeof import('element-plus/es')['ElDialog']
ElDropdown: typeof import('element-plus/es')['ElDropdown']
ElDropdownItem: typeof import('element-plus/es')['ElDropdownItem']
ElDropdownMenu: typeof import('element-plus/es')['ElDropdownMenu']
ElForm: typeof import('element-plus/es')['ElForm']
ElFormItem: typeof import('element-plus/es')['ElFormItem']
ElIcon: typeof import('element-plus/es')['ElIcon']
ElInput: typeof import('element-plus/es')['ElInput']
ElTooltip: typeof import('element-plus/es')['ElTooltip']
Lottie: typeof import('./src/components/Lottie/index.vue')['default']
RouterLink: typeof import('vue-router')['RouterLink']
RouterView: typeof import('vue-router')['RouterView']
SiderMenu: typeof import('./src/components/SiderMenu/index.vue')['default']
Toast: typeof import('./src/components/Toast/index.vue')['default']
}
export interface ComponentCustomProperties {
vInfiniteScroll: typeof import('element-plus/es')['ElInfiniteScroll']
}
}
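The `components.d.ts` above is emitted by `unplugin-vue-components`, and the `.eslintrc-auto-import.json` referenced in `.eslintrc.cjs` comes from `unplugin-auto-import`; together they let components and APIs such as `ref`, `watch`, and `ElMessage` be used without explicit imports. A minimal sketch of the Vite plugin wiring that produces these files — the exact options are assumptions, since the project's real `vite.config.js` is not shown in this excerpt:
```js
// vite.config.js (sketch) — assumed plugin setup behind the generated files above.
import { defineConfig } from 'vite';
import vue from '@vitejs/plugin-vue';
import AutoImport from 'unplugin-auto-import/vite';
import Components from 'unplugin-vue-components/vite';
import { ElementPlusResolver } from 'unplugin-vue-components/resolvers';

export default defineConfig({
    plugins: [
        vue(),
        // Auto-import Vue / Vue Router APIs and Element Plus helpers,
        // and emit the globals file consumed by ESLint.
        AutoImport({
            imports: ['vue', 'vue-router'],
            resolvers: [ElementPlusResolver()],
            eslintrc: { enabled: true } // writes .eslintrc-auto-import.json
        }),
        // Auto-register Element Plus and local components, emitting components.d.ts.
        Components({
            resolvers: [ElementPlusResolver()]
        })
    ]
});
```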

View File

@@ -0,0 +1,13 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<link rel="icon" href="/favicon.svg" />
<meta name="viewport" content="viewport-fit=cover,maximum-scale=1" />
<title>MiniCPM-omni</title>
</head>
<body>
<div id="app"></div>
<script type="module" src="/src/main.js"></script>
</body>
</html>

View File

@@ -0,0 +1,110 @@
user root;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
worker_connections 768;
# multi_accept on;
}
http {
##
# Basic Settings
##
client_max_body_size 20M;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
# server_tokens off;
# server_names_hash_bucket_size 64;
# server_name_in_redirect off;
include /etc/nginx/mime.types;
default_type application/octet-stream;
##
# SSL Settings
##
ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3; # Dropping SSLv3, ref: POODLE
ssl_prefer_server_ciphers on;
##
# Logging Settings
##
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log;
##
# Gzip Settings
##
gzip on;
# gzip_vary on;
# gzip_proxied any;
# gzip_comp_level 6;
# gzip_buffers 16 8k;
# gzip_http_version 1.1;
# gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
##
# Virtual Host Configs
##
server {
# listen 8080;
server_name localhost;
add_header Access-Control-Allow-Origin *;
add_header Access-Control-Allow-Headers X-Requested-With;
add_header Access-Control-Allow-Methods GET,POST,OPTIONS;
# Backend API requests
location /api/v1 {
proxy_pass http://127.0.0.1:32550;
proxy_set_header Host $host;
proxy_set_header Connection "";
chunked_transfer_encoding off;
proxy_set_header X-Accel-Buffering off; # set the X-Accel-Buffering header here
add_header X-Accel-Buffering off; # expose the X-Accel-Buffering header in the response
proxy_http_version 1.1;
# Disable Nginx caching
proxy_buffering off;
proxy_cache off;
# Disable Nginx's default buffering behavior
sendfile off;
tcp_nodelay on;
}
location /ws {
proxy_pass http://127.0.0.1:32550;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;
}
location / {
root /usr/share/nginx/html;
index index.html index.htm;
try_files $uri $uri/ /index.html;
}
location @router {
rewrite ^.*$ /index.html last;
}
location =/robots.txt {
index robots.txt;
}
}
}
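The `/api/v1` location above turns off proxy buffering so streamed responses reach the browser chunk by chunk. On the client side, `@microsoft/fetch-event-source` (listed in the `package.json` later in this commit) is the usual way to consume such a stream; below is a hedged sketch — the endpoint matches the API helpers in this commit, but the payload and event format are assumptions about the backend:
```js
// streamClient.js (sketch) — consuming the unbuffered /api/v1/stream endpoint.
import { fetchEventSource } from '@microsoft/fetch-event-source';

export function streamChat(payload, { onDelta, onDone }) {
    const controller = new AbortController();
    fetchEventSource('/api/v1/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload),
        signal: controller.signal,
        onmessage(event) {
            // Each SSE message is assumed to carry one incremental chunk.
            onDelta(JSON.parse(event.data));
        },
        onclose() {
            onDone?.();
        },
        onerror(err) {
            controller.abort();
            throw err; // rethrow to stop the library's automatic retries
        }
    });
    // The caller can use the returned function to cancel the stream.
    return () => controller.abort();
}
```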

View File

@@ -0,0 +1,45 @@
{
"name": "web",
"version": "0.0.0",
"private": true,
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"preview": "vite preview",
"lint": "eslint . --ext .vue,.js,.jsx,.cjs,.mjs --fix --ignore-path .gitignore",
"format": "prettier --write src/",
"prepare": "husky install"
},
"dependencies": {
"@element-plus/icons-vue": "^2.3.1",
"@microsoft/fetch-event-source": "^2.0.1",
"@ricky0123/vad-web": "^0.0.22",
"@vueuse/core": "^11.0.3",
"axios": "^1.7.7",
"clipboard": "^2.0.11",
"el-table-infinite-scroll": "^3.0.6",
"element-plus": "^2.8.1",
"pinia": "^2.1.7",
"unplugin-icons": "^0.19.3",
"vue": "^3.4.29",
"vue-i18n": "^11.0.1",
"vue-router": "^4.3.3"
},
"devDependencies": {
"@iconify-json/fluent": "^1.2.1",
"@iconify-json/material-symbols": "^1.2.1",
"@rushstack/eslint-patch": "^1.8.0",
"@vitejs/plugin-vue": "^5.0.5",
"@vue/eslint-config-prettier": "^9.0.0",
"eslint": "^8.57.0",
"eslint-plugin-vue": "^9.23.0",
"husky": "^9.1.5",
"less": "^4.2.0",
"prettier": "^3.2.5",
"unplugin-auto-import": "^0.18.2",
"unplugin-vue-components": "^0.27.4",
"vite": "^5.3.1",
"vite-plugin-vue-devtools": "^7.3.1"
}
}

File diff suppressed because it is too large Load Diff

Binary file not shown.

After

Width:  |  Height:  |  Size: 4.2 KiB

View File

@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg width="39px" height="40px" viewBox="0 0 39 40" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<title>形状结合</title>
<g id="封面/目录" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
<g id="编组-9" transform="translate(-573, -4)" fill="#EF1C2F" fill-rule="nonzero">
<path d="M576.881892,21.235765 L580.450462,24.8041433 L580.38268,24.87313 C577.235834,28.1237009 577.267944,33.3099197 580.479012,36.5209876 C583.7129,39.7548751 588.950111,39.7644621 592.195876,36.5497487 L595.764033,40.1177144 L595.635716,40.2441837 C590.410115,45.3030383 582.072776,45.2514173 576.910679,40.0893208 L576.755816,39.9319282 C571.756877,34.7682174 571.748077,26.5660415 576.729414,21.3916323 L576.881892,21.235765 Z M592.417879,13.3160236 L604.512414,4 L599.920819,17.5607789 L607.492343,16.6473294 L602.663827,23.0830718 L611.570065,25.829445 L602.638402,29.682258 L606.245418,35.3702608 L600.389683,35.3702553 L597.78469,37.9753136 L594.216265,34.4068885 L597.546837,31.0764355 L595.209819,27.390919 L597.0017,26.6178775 L594.322938,25.7918752 L596.362671,23.0730359 L592.57191,23.5303217 L594.387004,18.1691054 L590.921987,20.8381842 L588.636635,16.8638388 L585.916869,19.3631275 L585.910577,19.36078 L582.540401,22.7310252 L578.972081,19.1627052 L581.472017,16.6628227 L581.47204,12.2468077 L584.806996,13.5296032 L589.867048,8.87978168 L592.417879,13.3160236 Z" id="形状结合"></path>
</g>
</g>
</svg>

After

Width:  |  Height:  |  Size: 1.5 KiB

View File

@@ -0,0 +1,7 @@
<template>
<RouterView />
</template>
<script setup></script>
<style lang="less" scoped></style>

View File

@@ -0,0 +1,21 @@
// Send messages on a timer
export const sendMessage = data => {
return useHttp.post('/api/v1/stream', data);
};
// Skip the current response
export const stopMessage = () => {
return useHttp.post('/api/v1/stop');
};
// Upload a voice timbre (reference audio) file
export const uploadFile = data => {
return useHttp.post('/api/v1/upload_audio', data);
};
// Submit feedback
export const feedback = data => {
return useHttp.post('/api/v1/feedback', data);
};
// Upload configuration
export const uploadConfig = data => {
return useHttp.post('/api/v1/init_options', data);
// return useHttp.post('/api/v1/upload_audio', data);
};
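These helpers all go through a shared `useHttp` client that is auto-imported elsewhere in the project, and callers destructure a `{ code, message }` result (see the feedback component later in this commit). A minimal call-site sketch — the field names sent to `init_options` are assumptions inferred from the config component, not a documented backend contract:
```js
// Example call site (sketch). Field names are illustrative assumptions.
import { uploadConfig, stopMessage } from '@/apis';

export async function applySessionOptions(configData) {
    const { code, message } = await uploadConfig({
        vad_threshold: configData.vadThreshold,
        assistant_prompt: configData.assistantPrompt,
        use_audio_prompt: configData.useAudioPrompt
    });
    if (code !== 0) {
        throw new Error(message || 'init_options failed');
    }
}

// Skip the response that is currently being generated.
export async function skipCurrent() {
    await stopMessage();
}
```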

Binary file not shown.

After

Width:  |  Height:  |  Size: 221 B

Binary file not shown.

After

Width:  |  Height:  |  Size: 284 B

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.5 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 6.2 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.6 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.0 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 391 B

Binary file not shown.

After

Width:  |  Height:  |  Size: 279 B

View File

@@ -0,0 +1 @@
<svg data-v-d2e47025="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1024 1024"><path fill="currentColor" d="M600.704 64a32 32 0 0 1 30.464 22.208l35.2 109.376c14.784 7.232 28.928 15.36 42.432 24.512l112.384-24.192a32 32 0 0 1 34.432 15.36L944.32 364.8a32 32 0 0 1-4.032 37.504l-77.12 85.12a357.12 357.12 0 0 1 0 49.024l77.12 85.248a32 32 0 0 1 4.032 37.504l-88.704 153.6a32 32 0 0 1-34.432 15.296L708.8 803.904c-13.44 9.088-27.648 17.28-42.368 24.512l-35.264 109.376A32 32 0 0 1 600.704 960H423.296a32 32 0 0 1-30.464-22.208L357.696 828.48a351.616 351.616 0 0 1-42.56-24.64l-112.32 24.256a32 32 0 0 1-34.432-15.36L79.68 659.2a32 32 0 0 1 4.032-37.504l77.12-85.248a357.12 357.12 0 0 1 0-48.896l-77.12-85.248A32 32 0 0 1 79.68 364.8l88.704-153.6a32 32 0 0 1 34.432-15.296l112.32 24.256c13.568-9.152 27.776-17.408 42.56-24.64l35.2-109.312A32 32 0 0 1 423.232 64H600.64zm-23.424 64H446.72l-36.352 113.088-24.512 11.968a294.113 294.113 0 0 0-34.816 20.096l-22.656 15.36-116.224-25.088-65.28 113.152 79.68 88.192-1.92 27.136a293.12 293.12 0 0 0 0 40.192l1.92 27.136-79.808 88.192 65.344 113.152 116.224-25.024 22.656 15.296a294.113 294.113 0 0 0 34.816 20.096l24.512 11.968L446.72 896h130.688l36.48-113.152 24.448-11.904a288.282 288.282 0 0 0 34.752-20.096l22.592-15.296 116.288 25.024 65.28-113.152-79.744-88.192 1.92-27.136a293.12 293.12 0 0 0 0-40.256l-1.92-27.136 79.808-88.128-65.344-113.152-116.288 24.96-22.592-15.232a287.616 287.616 0 0 0-34.752-20.096l-24.448-11.904L577.344 128zM512 320a192 192 0 1 1 0 384 192 192 0 0 1 0-384m0 64a128 128 0 1 0 0 256 128 128 0 0 0 0-256"></path></svg>

After

Width:  |  Height:  |  Size: 1.6 KiB

View File

@@ -0,0 +1 @@
<svg data-v-d2e47025="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1024 1024"><path fill="currentColor" d="M832 384H576V128H192v768h640zm-26.496-64L640 154.496V320zM160 64h480l256 256v608a32 32 0 0 1-32 32H160a32 32 0 0 1-32-32V96a32 32 0 0 1 32-32m160 448h384v64H320zm0-192h160v64H320zm0 384h384v64H320z"></path></svg>

After

Width:  |  Height:  |  Size: 324 B

View File

@@ -0,0 +1,5 @@
<svg width="20" height="20" viewBox="0 0 20 20" fill="none" xmlns="http://www.w3.org/2000/svg">
<g id="Icon/Utility Icon/line/error">
<path id="Union" fill-rule="evenodd" clip-rule="evenodd" d="M9.99997 20C4.48608 20 0 15.5139 0 10C0 4.48607 4.48606 0 9.99997 0C15.5139 0 19.9999 4.48609 19.9999 10C19.9999 15.5139 15.5139 20 9.99997 20ZM9.99997 1.875C5.52001 1.875 1.875 5.52002 1.875 10C1.875 14.48 5.52001 18.125 9.99997 18.125C14.4799 18.125 18.125 14.48 18.125 10C18.125 5.52002 14.4799 1.875 9.99997 1.875ZM13.7878 7.53784L11.3257 9.99999L13.7878 12.4621C14.154 12.8283 14.154 13.4216 13.7878 13.7878C13.6047 13.9709 13.3655 14.0625 13.125 14.0625C12.8845 14.0625 12.6452 13.9709 12.4621 13.7878L9.99998 11.3257L7.53784 13.7878C7.35473 13.9709 7.11548 14.0625 6.875 14.0625C6.63451 14.0625 6.39526 13.9709 6.21216 13.7878C5.84595 13.4216 5.84595 12.8283 6.21216 12.4621L8.6743 9.99999L6.21216 7.53784C5.84595 7.17163 5.84595 6.57837 6.21216 6.21216C6.57836 5.84595 7.17163 5.84595 7.53784 6.21216L10 8.67431L12.4621 6.21216C12.8283 5.84595 13.4216 5.84595 13.7878 6.21216C14.154 6.57837 14.154 7.17163 13.7878 7.53784Z" fill="#E72B00"/>
</g>
</svg>

After

Width:  |  Height:  |  Size: 1.1 KiB

View File

@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg viewBox="0 0 2199 258" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<title>编组 5</title>
<defs>
<linearGradient x1="45.9111958%" y1="57.6904311%" x2="4.78458419e-14%" y2="70.534914%" id="linearGradient-1">
<stop stop-color="#373ED8" offset="0%"></stop>
<stop stop-color="#497DFF" offset="100%"></stop>
</linearGradient>
<path d="M1812.80909,215.823442 L1812.80909,252.015442 L1952.00909,252.015442 L1952.00909,211.995442 L1870.22909,211.995442 L1930.08509,134.391442 C1937.27682,125.111446 1942.72882,116.063446 1946.44109,107.247442 C1950.15309,98.4314389 1952.00909,88.8034425 1952.00909,78.3634425 L1952.00909,72.4474425 C1952.00909,60.6154425 1949.16709,49.8274425 1943.48309,40.0834425 C1937.79935,30.3394425 1929.67935,22.5674425 1919.12309,16.7674425 C1908.56709,10.9674354 1896.32909,8.06744248 1882.40909,8.06744248 C1868.02509,8.06744248 1855.49709,10.8514425 1844.82509,16.4194425 C1834.15309,21.9874425 1825.97509,29.6434425 1820.29109,39.3874425 C1814.6071,49.1314425 1811.76509,60.2674425 1811.76509,72.7954425 L1811.76509,81.1474425 L1855.96109,81.1474425 L1855.96109,75.5794425 C1855.96109,66.5314425 1858.16509,59.4554496 1862.57309,54.3514425 C1866.98109,49.2474354 1873.01309,46.6954425 1880.66909,46.6954425 C1888.32509,46.6954425 1894.41509,49.1314425 1898.93909,54.0034425 C1903.46309,58.8754425 1905.72509,65.4874425 1905.72509,73.8394425 L1905.72509,78.7114425 C1905.72509,89.3834389 1901.31709,100.635442 1892.50109,112.467442 L1812.80909,215.823442 Z M1976.89309,202.599442 L1976.89309,252.015442 L2025.26509,252.015442 L2025.26509,202.599442 L1976.89309,202.599442 Z M2069.81109,237.051442 C2082.91909,249.579442 2101.07309,255.843442 2124.27309,255.843442 C2146.54509,255.843442 2164.46709,249.463444 2178.03909,236.703442 C2191.61109,223.943441 2198.39709,206.195446 2198.39709,183.459442 L2198.39709,172.323442 C2198.39709,151.675439 2192.77109,135.377446 2181.51909,123.429442 C2170.26709,111.481439 2155.24509,105.507442 2136.45309,105.507442 C2129.95709,105.507442 2124.15709,106.667446 2119.05309,108.987442 L2168.81709,11.8954425 L2120.79309,11.8954425 L2065.80909,118.731442 C2060.70509,128.939446 2056.81909,138.335446 2054.15109,146.919442 C2051.48309,155.503439 2050.14909,164.551444 2050.14909,174.063442 L2050.14909,184.851442 C2050.14909,207.123442 2056.70309,224.523442 2069.81109,237.051442 Z M2145.15309,208.863442 C2140.04909,214.431442 2133.08909,217.215442 2124.27309,217.215442 C2115.45709,217.215442 2108.55509,214.431442 2103.56709,208.863442 C2098.57909,203.295442 2096.08509,195.639442 2096.08509,185.895442 L2096.08509,174.411442 C2096.08509,164.667448 2098.57909,157.06945 2103.56709,151.617442 C2108.55507,146.165446 2115.45707,143.439442 2124.27309,143.439442 C2133.08909,143.439442 2140.04909,146.165446 2145.15309,151.617442 C2150.25709,157.069439 2152.80909,164.783441 2152.80909,174.759442 L2152.80909,185.547442 C2152.80909,195.523444 2150.25709,203.295442 2145.15309,208.863442 Z" id="path-2"></path>
</defs>
<g id="页面-1" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
<g id="画板备份-14" transform="translate(-1928, -1764)" fill-rule="nonzero">
<g id="编组-5" transform="translate(1928, 1764.9846)">
<path d="M760.177408,6.08426104 C780.959767,6.08426104 798.639412,11.1994393 813.310847,21.4099653 L814.826937,22.4868416 C827.871266,31.9421726 838.44267,44.5805385 846.551989,60.4441981 L846.854209,61.0497535 L805.164914,79.6080391 L804.700192,78.6693484 C800.249247,69.9367001 794.263119,63.1340103 786.753168,58.3199387 C778.149995,52.8050845 768.233984,50.0506369 757.083763,50.0506369 C744.833865,50.0506369 733.831738,53.5163071 724.177049,60.4281868 C714.581392,67.2978043 707.134706,76.9187061 701.846243,89.2224762 C696.604371,101.417853 693.991732,115.174193 693.991732,130.467076 C693.991732,146.161685 696.600155,160.26846 701.833896,172.765351 C707.114285,185.373628 714.549996,195.251726 724.136357,202.332561 C733.797042,209.468294 744.814803,213.049067 757.083763,213.049067 C767.641468,213.049067 777.212416,210.227573 785.712813,204.59744 L787.029919,203.697896 C793.997044,198.793482 800.045118,192.1801 805.176059,183.885785 L805.493229,183.359111 L847.470223,202.046016 L847.190902,202.601562 C838.534575,219.439902 826.824502,232.527091 812.037303,241.920218 C796.196991,251.982303 778.007922,257.015442 757.393128,257.015442 C735.537395,257.015442 716.157915,251.671101 699.179392,240.984619 C682.183395,230.287139 668.940168,215.394755 659.416171,196.246509 C649.861213,177.036014 645.075526,155.122604 645.075526,130.467076 C645.075526,106.236416 649.907085,84.6957154 659.55475,65.8023698 C669.170256,46.9720039 682.655972,32.3375052 700.053275,21.8391328 C717.453143,11.3392121 737.472005,6.08426104 760.177408,6.08426104 Z M472.804069,70.4320631 C490.215347,70.4320631 503.551514,75.9687387 513.08592,87.0440588 L513.922858,88.0433234 C522.993334,99.1752071 527.579887,114.927363 527.579887,135.416907 L527.579887,252.681061 L482.065262,252.681061 L482.066689,147.48212 C482.066689,137.230753 479.444996,128.966722 474.122797,122.834623 C468.710698,116.598944 461.272898,113.470346 452.076652,113.470346 C441.497963,113.470346 432.82714,116.858283 426.285393,123.629565 L425.517793,124.451905 C419.503248,131.122279 416.518055,139.98993 416.518055,150.885129 L416.517248,252.681061 L371.003248,252.681061 L371.003248,74.7611551 L416.517248,74.7611551 L416.518055,98.8935909 L423.358335,98.8935909 L424.345965,97.2602023 C429.414322,88.8779197 436.064565,82.3247602 444.331449,77.5591448 C452.566403,72.8119363 462.037327,70.4320631 472.804069,70.4320631 Z M605.481416,74.7611551 L605.481416,252.681061 L559.9708,252.681061 L559.9708,74.7611551 L605.481416,74.7611551 Z M335.78549,74.7611551 L335.78549,252.681061 L290.27149,252.681061 L290.27149,74.7611551 L335.78549,74.7611551 Z M0,10.4147082 L43.0533273,10.4147082 L122.24903,130.758127 L127.754059,130.758127 L206.947055,10.4147082 L250.000382,10.4147082 L250.000382,252.681061 L204.178374,252.681061 L204.180526,104.498777 L197.139835,104.498777 L143.308009,184.313596 L106.66868,184.313596 L52.5323692,105.736235 L45.5131981,105.736235 L45.5106162,252.681061 L0,252.681061 L0,10.4147082 Z M961.869908,10.4147082 C981.192886,10.4147082 997.923497,13.7715036 1012.08946,20.4554447 C1026.1438,27.0867173 1036.87643,36.5393059 1044.3624,48.8517555 C1051.8649,61.1913978 1055.62564,75.6900619 1055.62564,92.415251 C1055.62564,109.151297 1051.96203,123.607707 1044.65738,135.847946 C1037.36788,148.062782 1026.98507,157.46144 1013.43892,164.086199 C999.800233,170.756213 983.652345,174.105774 964.963553,174.105774 L922.598939,174.105774 L922.596926,252.681061 L875.228112,252.681061 L875.228112,10.4147082 L961.869908,10.4147082 Z M953.826433,52.8349168 
L922.598939,52.8349168 L922.598939,131.686221 L953.826433,131.686221 C969.953126,131.686221 982.80491,128.309905 992.310812,121.456813 C1002.06659,114.423581 1007.0188,104.634314 1007.0188,92.415251 C1007.0188,80.2034681 1002.0737,70.3707714 992.331646,63.2341476 C982.822362,56.2680445 969.961757,52.8349168 953.826433,52.8349168 Z M335.78549,0 L335.78549,45.5106162 L290.27149,45.5106162 L290.27149,0 L335.78549,0 Z M605.119253,0 L605.119253,45.5106162 L559.605252,45.5106162 L559.605252,0 L605.119253,0 Z" id="形状" fill="#111111"></path>
<g id="M-V" transform="translate(1084.9431, 11.7574)" fill="#000111">
<polygon id="路径" points="44.394 239.184 0 239.184 0 0 41.676 0 119.894 123.216 121.706 123.216 200.226 0 241.902 0 241.902 239.184 197.508 239.184 197.508 85.466 195.696 85.466 137.41 176.368 104.492 176.368 46.206 86.372 44.394 86.372"></polygon>
<polygon id="路径" points="274.216 96.942 374.48 96.942 374.48 138.014 274.216 138.014"></polygon>
</g>
<g id="o" transform="translate(1501.3431, 42.9174)" fill="#000111">
<path d="M95.4,213.12 C75.96,213.12 59.04,208.8 44.64,200.16 C30.24,191.52 19.2,179.16 11.52,163.08 C3.84,147 0,128.16 0,106.56 C0,84.96 3.84,66.12 11.52,50.04 C19.2,33.96 30.24,21.6 44.64,12.96 C59.04,4.32 75.96,0 95.4,0 C114.84,0 131.76,4.32 146.16,12.96 C160.56,21.6 171.66,33.96 179.46,50.04 C187.26,66.12 191.16,84.96 191.16,106.56 C191.16,128.16 187.26,147 179.46,163.08 C171.66,179.16 160.56,191.52 146.16,200.16 C131.76,208.8 114.84,213.12 95.4,213.12 Z M95.4,169.92 C110.52,169.92 122.46,164.22 131.22,152.82 C139.98,141.42 144.36,126 144.36,106.56 C144.36,86.88 139.98,71.4 131.22,60.12 C122.46,48.84 110.52,43.2 95.4,43.2 C80.52,43.2 68.7,48.84 59.94,60.12 C51.18,71.4 46.8,86.88 46.8,106.56 C46.8,126 51.18,141.42 59.94,152.82 C68.7,164.22 80.52,169.92 95.4,169.92 Z" id="形状"></path>
</g>
<g id="形状结合">
<use fill="#000111" xlink:href="#path-2"></use>
<use fill="url(#linearGradient-1)" xlink:href="#path-2"></use>
</g>
</g>
</g>
</g>
</svg>

After

Width:  |  Height:  |  Size: 8.9 KiB

View File

@@ -0,0 +1,5 @@
<svg width="18" height="18" viewBox="0 0 18 18" fill="none" xmlns="http://www.w3.org/2000/svg">
<g id="Pause">
<path id="Vector" fill-rule="evenodd" clip-rule="evenodd" d="M4.875 2.2522H7.125C7.5375 2.2522 7.875 2.5897 7.875 3.0022V15.0022C7.875 15.4147 7.5375 15.7522 7.125 15.7522H4.875C4.4625 15.7522 4.125 15.4147 4.125 15.0022V3.0022C4.125 2.5897 4.4625 2.2522 4.875 2.2522ZM10.875 2.2522H13.125C13.5375 2.2522 13.875 2.5897 13.875 3.0022V15.0022C13.875 15.4147 13.5375 15.7522 13.125 15.7522H10.875C10.4625 15.7522 10.125 15.4147 10.125 15.0022V3.0022C10.125 2.5897 10.4625 2.2522 10.875 2.2522Z" fill="currentColor" />
</g>
</svg>

After

Width:  |  Height:  |  Size: 638 B

View File

@@ -0,0 +1,10 @@
<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none">
<g clip-path="url(#clip0_7781_19663)">
<path d="M21.7786 18.4946C22.7599 18.6754 23.7615 17.9845 23.7955 16.9048C23.827 15.9053 23.7672 14.6519 23.4533 13.4613C23.141 12.2768 22.5546 11.0725 21.4695 10.2892C20.3647 9.49176 18.7205 8.97497 17.1207 8.64947C15.4984 8.31938 13.8179 8.16607 12.5642 8.15054C10.9332 8.13034 8.67094 8.26243 6.60622 8.68941C5.57392 8.90289 4.56701 9.19489 3.70489 9.59193C2.85192 9.98474 2.07652 10.5096 1.58739 11.2247C0.257894 13.1683 0.172116 15.4886 0.325588 16.9453C0.436943 18.0022 1.45742 18.5535 2.36025 18.353C3.07081 18.1951 3.71743 18.0593 4.36845 17.9225C5.30139 17.7265 6.24339 17.5286 7.3955 17.2614C7.46587 17.2451 7.53161 17.2194 7.59169 17.1859C7.85982 17.0768 8.05173 16.8168 8.05917 16.509C8.09666 14.957 8.40578 14.0228 8.95698 13.4586C9.50108 12.9017 10.4369 12.5484 12.1227 12.5476C13.8976 12.5468 14.8691 12.862 15.4225 13.3997C15.9698 13.9314 16.2828 14.8523 16.2836 16.5335C16.2836 16.5634 16.2854 16.5928 16.2888 16.6217C16.279 16.6521 16.2711 16.6836 16.2651 16.7159C16.19 17.1233 16.4594 17.5144 16.8668 17.5894L21.7786 18.4946Z" fill="currentColor" />
</g>
<defs>
<clipPath id="clip0_7781_19663">
<rect width="24" height="24" fill="white"/>
</clipPath>
</defs>
</svg>

After

Width:  |  Height:  |  Size: 1.3 KiB

View File

@@ -0,0 +1 @@
<svg t="1736675176012" class="icon" viewBox="0 0 1024 1024" version="1.1" xmlns="http://www.w3.org/2000/svg" p-id="4244" xmlns:xlink="http://www.w3.org/1999/xlink"><path d="M512 106.667A405.333 405.333 0 1 1 106.667 512 405.333 405.333 0 0 1 512 106.667m0-64A469.333 469.333 0 1 0 981.333 512 469.333 469.333 0 0 0 512 42.667z" p-id="4245"></path><path d="M501.333 664.533a32 32 0 1 0 32 32 32 32 0 0 0-32-32z m-0.426-27.093a32 32 0 0 1-32-32c0-80.213 50.56-111.787 91.306-136.96 32-19.84 51.84-33.28 59.094-60.16a85.333 85.333 0 0 0-12.587-69.547 91.52 91.52 0 0 0-76.8-29.226 123.52 123.52 0 0 0-92.16 29.866 82.56 82.56 0 0 0-21.333 52.907 32 32 0 1 1-64 2.56 144 144 0 0 1 39.466-99.84c31.574-32.853 78.08-49.493 138.24-49.493 70.827 0 108.587 29.44 128 54.186a149.333 149.333 0 0 1 23.894 125.014c-14.08 52.693-54.614 77.866-87.04 98.133-40.32 24.747-61.654 39.68-61.654 82.56a32 32 0 0 1-32.426 32z" p-id="4246"></path></svg>

After

Width:  |  Height:  |  Size: 931 B

View File

@@ -0,0 +1,3 @@
<svg data-v-d2e47025="" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1024 1024">
<path fill="currentColor" d="M771.776 794.88A384 384 0 0 1 128 512h64a320 320 0 0 0 555.712 216.448H654.72a32 32 0 1 1 0-64h149.056a32 32 0 0 1 32 32v148.928a32 32 0 1 1-64 0v-50.56zM276.288 295.616h92.992a32 32 0 0 1 0 64H220.16a32 32 0 0 1-32-32V178.56a32 32 0 0 1 64 0v50.56A384 384 0 0 1 896.128 512h-64a320 320 0 0 0-555.776-216.384z"></path>
</svg>

After

Width:  |  Height:  |  Size: 442 B

View File

@@ -0,0 +1,7 @@
<svg width="24" height="24" viewBox="0 0 24 24" fill="none" xmlns="http://www.w3.org/2000/svg">
<g id="&#228;&#184;&#139;&#232;&#189;&#189;">
<rect width="24" height="24" rx="7" fill="#EAEFFF"/>
<path id="Vector" d="M12.2816 16.1003C11.9134 16.1003 11.615 15.8019 11.615 15.4337V6.2513L8.28168 9.50983C8.11137 9.6765 7.86502 9.73978 7.63559 9.67571C7.40617 9.61165 7.22831 9.42989 7.16894 9.1989C7.10956 8.96817 7.17805 8.72313 7.34836 8.55646L11.8163 4.18987C12.0785 3.93363 12.4985 3.93727 12.7563 4.19794L17.088 8.56323C17.3399 8.82599 17.3344 9.24187 17.0763 9.49811C16.818 9.7541 16.4018 9.75592 16.1414 9.50202L12.9483 6.28489V15.4337C12.9483 15.8019 12.6498 16.1003 12.2816 16.1003Z" fill="#424EC5"/>
<path id="Vector_2" d="M4.66666 13.6001C5.03488 13.6001 5.33331 13.8985 5.33331 14.2668V17.4667C5.33331 17.8349 5.63174 18.1334 5.99997 18.1334H17.9998C18.368 18.1334 18.6664 17.8349 18.6664 17.4667V14.2668C18.6664 13.8985 18.9648 13.6001 19.3331 13.6001C19.7013 13.6001 19.9997 13.8985 19.9997 14.2668V17.4667C19.9997 18.5714 19.1044 19.4667 17.9998 19.4667H5.99997C4.8953 19.4667 4 18.5714 4 17.4667V14.2668C4 13.8985 4.29843 13.6001 4.66666 13.6001Z" fill="#424EC5"/>
</g>
</svg>

After

Width:  |  Height:  |  Size: 1.2 KiB

View File

@@ -0,0 +1,41 @@
<svg xmlns="http://www.w3.org/2000/svg" width="195" height="45" viewBox="0 0 195 45" fill="none">
<rect x="16" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="11" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="6" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="0.907227" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="71" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="91" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="111" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="131" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="66" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="86" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="106" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="126" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="61" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="81" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="101" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="121" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="56" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="76" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="96" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="116" y="18" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="21" y="13.3407" width="3.14815" height="18.3186" rx="1.57407" fill="#F3F3F3"/>
<rect width="3.14815" height="18.3186" rx="1.57407" transform="matrix(-1 0 0 1 54 13.3849)" fill="#F3F3F3"/>
<rect x="26" y="8.45581" width="3.14815" height="28.0885" rx="1.57407" fill="#F3F3F3"/>
<rect width="3.14815" height="28.0885" rx="1.57407" transform="matrix(-1 0 0 1 49 8.5)" fill="#F3F3F3"/>
<rect x="31" y="9.9823" width="3.14815" height="25.0354" rx="1.57407" fill="#F3F3F3"/>
<rect width="3.14815" height="25.0354" rx="1.57407" transform="matrix(-1 0 0 1 44 10.0265)" fill="#F3F3F3"/>
<rect x="36" y="5.09729" width="3.14815" height="34.8053" rx="1.57407" fill="#F3F3F3"/>
<rect x="151" y="15.4779" width="3.14815" height="14.0442" rx="1.57407" fill="#F3F3F3"/>
<rect x="156" y="5.70801" width="3.14815" height="33.5841" rx="1.57407" fill="#F3F3F3"/>
<rect x="161" y="7.53979" width="3.14815" height="29.9204" rx="1.57407" fill="#F3F3F3"/>
<rect x="166" y="15.4779" width="3.14815" height="14.0442" rx="1.57407" fill="#F3F3F3"/>
<rect x="171" y="10.8982" width="3.14815" height="23.2035" rx="1.57407" fill="#F3F3F3"/>
<rect width="3.14815" height="29.9204" rx="1.57407" transform="matrix(-1 0 0 1 149 7.5)" fill="#F3F3F3"/>
<rect width="3.14815" height="14.0442" rx="1.57407" transform="matrix(-1 0 0 1 144 15.4381)" fill="#F3F3F3"/>
<rect width="3.14815" height="23.2035" rx="1.57407" transform="matrix(-1 0 0 1 139 10.8584)" fill="#F3F3F3"/>
<rect x="176" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="181" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="186" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
<rect x="191" y="18.2257" width="3.14815" height="8.54867" rx="1.57407" fill="#F3F3F3"/>
</svg>

After

Width:  |  Height:  |  Size: 3.6 KiB

View File

@@ -0,0 +1,5 @@
<svg width="20" height="20" viewBox="0 0 20 20" fill="none" xmlns="http://www.w3.org/2000/svg">
<g id="Icon/Utility Icon/line/warning">
<path id="Union" fill-rule="evenodd" clip-rule="evenodd" d="M11.8265 2.57765L19.6724 15.1369C20.0876 15.8014 20.1095 16.6395 19.7298 17.3248C19.3513 18.0101 18.6285 18.4352 17.8459 18.4352H2.15406C1.37142 18.4352 0.648615 18.0101 0.270115 17.3248C-0.109605 16.6395 -0.0876187 15.8014 0.327499 15.1369L8.17341 2.57765C8.569 1.94364 9.25275 1.56494 9.99998 1.56494C10.7472 1.56494 11.431 1.94364 11.8265 2.57765ZM17.8459 16.5589C17.9887 16.5589 18.0608 16.4685 18.0901 16.4148C18.1194 16.361 18.1585 16.2535 18.0828 16.1314L10.2369 3.57211C10.1661 3.4585 10.0574 3.44141 10 3.44141C9.94262 3.44141 9.83395 3.4585 9.76313 3.57211L1.91722 16.1314C1.84151 16.2535 1.88058 16.361 1.90988 16.4148C1.93918 16.4685 2.01122 16.5589 2.15407 16.5589H17.8459ZM9.99995 12.1893C10.5176 12.1893 10.9377 11.769 10.9377 11.2511V7.69991C10.9377 7.18195 10.5176 6.76172 9.99995 6.76172C9.48226 6.76172 9.06225 7.18195 9.06225 7.69991V11.2511C9.06225 11.7691 9.48226 12.1893 9.99995 12.1893ZM9.99996 15.6293C9.30946 15.6293 8.7497 15.0692 8.7497 14.3784C8.7497 13.6875 9.30946 13.1274 9.99996 13.1274C10.6905 13.1274 11.2502 13.6875 11.2502 14.3784C11.2502 15.0692 10.6905 15.6293 9.99996 15.6293Z" fill="#F9AC2A"/>
</g>
</svg>

After

Width:  |  Height:  |  Size: 1.3 KiB

View File

@@ -0,0 +1,3 @@
<template>
<div class="call-header"></div>
</template>

View File

@@ -0,0 +1,82 @@
<template>
<div class="time">
<div class="time-minute">{{ minute || '00' }}</div>
<div class="time-colon">:</div>
<div class="time-second">{{ second || '00' }}</div>
</div>
</template>
<script setup>
import { limitTime, tipsRemainingTime } from '@/enums';
const start = defineModel();
const emits = defineEmits(['timeUp']);
const remainingTime = ref();
const minute = ref();
const second = ref();
const timeInterval = ref(null);
const startCount = () => {
remainingTime.value = limitTime;
updateCountDown();
timeInterval.value = setInterval(() => {
updateCountDown();
}, 1000);
};
const updateCountDown = () => {
let minutes = Math.floor(remainingTime.value / 60);
let seconds = remainingTime.value % 60;
// Format minutes and seconds as two-digit strings
minute.value = minutes < 10 ? '0' + minutes : minutes;
second.value = seconds < 10 ? '0' + seconds : seconds;
// Remind the user when the remaining time reaches the warning threshold
if (remainingTime.value === tipsRemainingTime) {
ElMessage({
type: 'warning',
message: `This call will disconnect in ${tipsRemainingTime} seconds.`,
duration: 3000,
customClass: 'time-warning'
});
}
// Prevent the countdown from going negative
if (remainingTime.value > 0) {
remainingTime.value--;
} else {
clearInterval(timeInterval.value);
emits('timeUp');
}
};
watch(
() => start.value,
newVal => {
timeInterval.value && clearInterval(timeInterval.value);
if (newVal) {
startCount();
}
},
{ immediate: true }
);
</script>
<style lang="less" scoped>
.time {
display: flex;
align-items: center;
.time-minute,
.time-second {
width: 26px;
height: 26px;
display: flex;
justify-content: center;
align-items: center;
border-radius: 3.848px;
background: rgba(47, 47, 47, 0.5);
}
.time-colon {
margin: 0 3px;
}
}
</style>
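The countdown above reads `limitTime` and `tipsRemainingTime` from `@/enums`, which is not shown in this excerpt. A sketch of what that module presumably contains — the concrete values are assumptions (the warning message suggests the threshold is expressed in seconds):
```js
// @/enums (sketch) — values are assumptions, not the project's real constants.
export const limitTime = 10 * 60; // maximum call duration, in seconds
export const tipsRemainingTime = 60; // warn when this many seconds remain
```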

View File

@@ -0,0 +1,23 @@
<template>
<div class="delay-tips">
<span>当前发生延迟目前延迟{{ delayTimestamp }}ms积压{{ delayCount * 200 }}ms未发</span>
</div>
</template>
<script setup>
defineProps({
delayTimestamp: {
type: Number,
default: 0
},
delayCount: {
type: Number,
default: 0
}
});
</script>
<style lang="less" scoped>
.delay-tips {
font-size: 12px;
color: #dc3545;
}
</style>

View File

@@ -0,0 +1,36 @@
<template>
<div class="extra-info">
<div class="model-version" v-if="modelVersion">模型版本: {{ modelVersion }}</div>
<div class="web-version">前端版本: {{ webVersion }}</div>
</div>
</template>
<script setup>
defineProps({
modelVersion: {
type: String,
default: ''
},
webVersion: {
type: String,
default: ''
}
});
</script>
<style lang="less" scoped>
.extra-info {
position: fixed;
top: 62px;
left: 4vw;
display: flex;
.model-version,
.web-version {
font-size: 12px;
color: red;
}
.model-version {
margin-right: 16px;
}
}
</style>

View File

@@ -0,0 +1,67 @@
<template>
<div class="ideas">
<div class="ideas-title">
<img src="@/assets/images/ideas-icon.png " />
<span>Convsersation ideas</span>
</div>
<div class="ideas-content">
<div class="ideas-content-item" v-for="(item, index) in ideasList" :key="index">{{ item }}</div>
</div>
</div>
</template>
<script setup>
defineProps({
ideasList: {
type: Array,
default: () => []
}
});
</script>
<style lang="less" scoped>
.ideas {
margin-top: 16px;
box-shadow: 0 0 0 0.5px #e0e0e0;
border-radius: 12px;
padding: 18px 28px;
&-title {
font-size: 20px;
font-weight: 500;
margin-bottom: 20px;
display: flex;
align-items: center;
img {
width: 24px;
height: 24px;
margin-right: 10px;
}
span {
color: #171717;
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 500;
line-height: normal;
}
}
&-content {
display: grid;
grid-template-columns: repeat(3, 1fr);
gap: 8px;
&-item {
display: flex;
align-items: center;
border-radius: 10px;
background: #eaefff;
padding: 10px 24px;
color: #7579eb;
font-family: PingFang SC;
font-size: 14px;
font-style: normal;
font-weight: 400;
line-height: normal;
}
}
}
</style>

View File

@@ -0,0 +1,110 @@
<template>
<div class="like-box">
<div class="like-btn" @click="selectFeedbackStatus('like')">
<img v-if="feedbackStatus === '' || feedbackStatus === 'dislike'" src="@/assets/images/zan.png" />
<img v-else src="@/assets/images/zan-active.png" />
</div>
<div class="dislike-btn" @click="selectFeedbackStatus('dislike')">
<img v-if="feedbackStatus === '' || feedbackStatus === 'like'" src="@/assets/images/cai.png" />
<img v-else src="@/assets/images/cai-active.png" />
</div>
</div>
<el-dialog
v-model="dialogVisible"
:title="t('feedbackDialogTitle')"
width="400"
:align-center="true"
@close="cancelFeedback"
>
<el-input type="textarea" :rows="4" v-model="comment" />
<div class="operate-btn">
<el-button type="primary" :loading="submitLoading" @click="submitFeedback">确定</el-button>
<el-button @click="cancelFeedback">取消</el-button>
</div>
</el-dialog>
</template>
<script setup>
import { feedback } from '@/apis';
import { useI18n } from 'vue-i18n';
const { t } = useI18n();
const feedbackStatus = defineModel('feedbackStatus');
const curResponseId = defineModel('curResponseId');
const dialogVisible = ref(false);
const comment = ref('');
const submitLoading = ref(false);
const selectFeedbackStatus = val => {
if (!curResponseId.value) {
return;
}
feedbackStatus.value = val;
dialogVisible.value = true;
};
// Submit feedback
const submitFeedback = async () => {
submitLoading.value = true;
const { code, message } = await feedback({
response_id: curResponseId.value,
rating: feedbackStatus.value,
comment: comment.value
});
submitLoading.value = false;
if (code !== 0) {
ElMessage({
type: 'error',
message: message,
duration: 3000,
customClass: 'system-error'
});
return;
}
ElMessage.success('反馈成功');
dialogVisible.value = false;
setTimeout(() => {
feedbackStatus.value = '';
}, 2000);
};
const cancelFeedback = () => {
dialogVisible.value = false;
feedbackStatus.value = '';
};
</script>
<style lang="less" scoped>
.like-box {
display: flex;
margin: 0 16px;
.like-btn,
.dislike-btn {
width: 26px;
height: 26px;
background: #f3f3f3;
display: flex;
align-items: center;
justify-content: center;
border-radius: 8px;
cursor: pointer;
&:hover {
background: #d1d1d1;
}
img {
width: 16px;
height: 16px;
}
}
.dislike-btn {
margin-left: 16px;
}
}
.operate-btn {
margin-top: 20px;
display: flex;
justify-content: flex-end;
.el-button--primary {
background: #647fff;
border-color: #647fff;
&:hover {
border-color: #647fff;
}
}
}
</style>

View File

@@ -0,0 +1,404 @@
<template>
<div class="user-config">
<div class="user-config-title">模型配置</div>
<div class="config-item">
<div class="config-item-label">语音打断</div>
<div class="config-item-content">
<el-switch
v-model="configData.canStopByVoice"
inline-prompt
active-text=""
inactive-text=""
size="small"
:disabled="isCalling"
/>
</div>
</div>
<div class="config-item">
<div class="config-item-label">视频画质</div>
<div class="config-item-content">
<el-radio-group v-model="configData.videoQuality" :disabled="isCalling">
<el-radio :value="true">高清</el-radio>
<el-radio :value="false">低清</el-radio>
</el-radio-group>
</div>
</div>
<div class="config-item">
<div class="config-item-label">VAD阈值</div>
<div class="config-item-content vad-slider">
<el-slider
v-model="configData.vadThreshold"
:min="0.5"
:max="1"
:step="0.1"
size="small"
:disabled="isCalling"
/>
</div>
</div>
<!-- <div class="timbre-model">
<div class="timbre-model-label">音色人物</div>
<div class="timbre-model-content">
<el-select
v-model="configData.timbreId"
style="width: 100%"
@change="handleChangePeople"
clearable
placeholder="请选择"
>
<el-option v-for="item in peopleList" :key="item.id" :value="item.id" :label="item.name">
{{ item.name }}
</el-option>
</el-select>
</div>
</div> -->
<div class="prompt-item">
<div class="prompt-item-label">Assistant_prompt</div>
<div class="prompt-item-content">
<el-input
type="textarea"
:rows="3"
v-model="configData.assistantPrompt"
resize="none"
:disabled="isCalling"
/>
</div>
</div>
<div class="config-item">
<div class="config-item-label">使用语音prompt</div>
<div class="config-item-content">
<el-switch
v-model="configData.useAudioPrompt"
inline-prompt
active-text=""
inactive-text=""
size="small"
:disabled="isCalling"
@change="handleSelectUseAudioPrompt"
/>
</div>
</div>
<div class="voice-prompt-box">
<div class="prompt-item" v-if="configData.useAudioPrompt">
<div class="prompt-item-label">Voice_clone_prompt</div>
<div class="prompt-item-content">
<el-input
type="textarea"
:rows="8"
v-model="configData.voiceClonePrompt"
resize="none"
:disabled="isCalling"
/>
</div>
</div>
<div class="timbre-config" v-if="configData.useAudioPrompt">
<div class="timbre-config-label">音色选择</div>
<div class="timbre-config-content">
<el-checkbox-group v-model="configData.timbre" @change="handleSelectTimbre" :disabled="isCalling">
<el-checkbox :value="1" label="Default Audio"></el-checkbox>
<el-upload
v-model:file-list="fileList"
action=""
:multiple="false"
:on-change="handleChangeFile"
:auto-upload="false"
:show-file-list="false"
:disabled="isCalling"
accept="audio/*"
>
<el-checkbox :value="2">
<!-- <span>Customization: Upload Audio</span> -->
<span>Customization</span>
<SvgIcon name="upload" className="checkbox-icon" />
</el-checkbox>
</el-upload>
</el-checkbox-group>
</div>
</div>
<div class="file-content" v-if="fileName">
<SvgIcon name="document" class="document-icon" />
<span class="file-name">{{ fileName }}</span>
</div>
</div>
</div>
</template>
<script setup>
const isCalling = defineModel('isCalling');
const type = defineModel('type');
let defaultVoiceClonePrompt =
'你是一个AI助手。你能接受视频音频和文本输入并输出语音和文本。模仿输入音频中的声音特征。';
let defaultAssistantPrompt = '作为助手,你将使用这种声音风格说话。';
const fileList = ref([]);
const fileName = ref('');
const configData = ref({
canStopByVoice: false,
videoQuality: false,
useAudioPrompt: true,
vadThreshold: 0.8,
voiceClonePrompt: defaultVoiceClonePrompt,
assistantPrompt: defaultAssistantPrompt,
timbre: [1],
audioFormat: 'mp3',
base64Str: '',
timbreId: ''
});
const peopleList = [
{
id: 1,
name: 'Trump',
voiceClonePrompt: '',
assistantPrompt: ''
},
{
id: 2,
name: '说相声',
voiceClonePrompt: '克隆音频提示中的音色以生成语音',
assistantPrompt: '请角色扮演这段音频,请以相声演员的口吻说话'
},
{
id: 3,
name: '默认',
voiceClonePrompt: defaultVoiceClonePrompt,
assistantPrompt: defaultAssistantPrompt
}
];
watch(
() => type.value,
val => {
if (val === 'video') {
console.log('val: ', val);
defaultVoiceClonePrompt =
'你是一个AI助手。你能接受视频音频和文本输入并输出语音和文本。模仿输入音频中的声音特征。';
defaultAssistantPrompt = '作为助手,你将使用这种声音风格说话。';
} else {
defaultVoiceClonePrompt = '克隆音频提示中的音色以生成语音。';
defaultAssistantPrompt = 'Your task is to be a helpful assistant using this voice pattern.';
}
configData.value.voiceClonePrompt = defaultVoiceClonePrompt;
configData.value.assistantPrompt = defaultAssistantPrompt;
},
{ immediate: true }
);
onMounted(() => {
handleSetStorage();
});
const handleSelectTimbre = e => {
if (e.length > 1) {
const val = e[e.length - 1];
configData.value.timbre = [val];
// Default timbre selected
if (val === 1) {
configData.value.audioFormat = 'mp3';
configData.value.base64Str = '';
fileList.value = [];
fileName.value = '';
}
}
};
const handleChangeFile = file => {
if (isAudio(file) && sizeNotExceed(file)) {
fileList.value = [file];
fileName.value = file.name;
configData.value.timbre = [2];
handleUpload();
} else {
ElMessage.error('Please upload an audio file no larger than 10MB');
}
};
const isAudio = file => {
return file.raw.type.includes('audio');
};
const sizeNotExceed = file => {
return file.size / 1024 / 1024 <= 10;
};
const handleUpload = async () => {
const file = fileList.value[0].raw;
if (file) {
const reader = new FileReader();
reader.onload = e => {
const base64String = e.target.result.split(',')[1];
configData.value.audioFormat = file.name.split('.')[1];
configData.value.base64Str = base64String;
};
reader.readAsDataURL(file);
}
};
const handleSelectUseAudioPrompt = val => {
if (val) {
configData.value.voiceClonePrompt = defaultVoiceClonePrompt;
configData.value.assistantPrompt = defaultAssistantPrompt;
}
};
// Persist config changes to localStorage
watch(configData.value, () => {
handleSetStorage();
});
const handleSetStorage = () => {
const { timbre, canStopByVoice, ...others } = configData.value;
const defaultConfigData = {
canStopByVoice,
...others
};
localStorage.setItem('configData', JSON.stringify(defaultConfigData));
localStorage.setItem('canStopByVoice', canStopByVoice);
};
const handleChangePeople = val => {
console.log('val: ', val);
if (!val) {
return;
}
const index = peopleList.findIndex(item => item.id === val);
configData.value.voiceClonePrompt = peopleList[index].voiceClonePrompt;
configData.value.assistantPrompt = peopleList[index].assistantPrompt;
configData.value.timbre = [1];
};
</script>
<style lang="less">
.user-config {
&-title {
height: 61px;
padding: 18px 18px 0;
color: rgba(23, 23, 23, 0.9);
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 500;
line-height: normal;
}
.config-item {
display: flex;
align-items: center;
width: 100%;
padding: 0 0 0 18px;
margin-bottom: 12px;
&-label {
width: 120px;
flex-shrink: 0;
}
&-content {
flex: 1;
margin-left: 16px;
.el-radio-group {
.el-radio {
width: 50px;
}
}
}
&-content.vad-slider {
width: 80%;
padding-left: 7px;
margin-right: 20px;
.el-slider__button {
width: 14px;
height: 14px;
}
}
}
.timbre-config {
padding: 0 0 0 18px;
&-label {
margin-bottom: 12px;
}
&-content {
display: flex;
align-items: center;
.el-checkbox-group {
display: flex;
flex-wrap: wrap;
flex: 1;
> .el-checkbox {
margin-right: 12px;
}
}
.el-checkbox {
padding: 8px 16px;
border-radius: 10px;
background: #eaefff;
margin-bottom: 12px;
height: 40px;
.el-checkbox__input {
.el-checkbox__inner {
border: 1px solid #4dc100;
}
}
.el-checkbox__input.is-checked {
.el-checkbox__inner {
background: #4dc100;
}
}
.el-checkbox__input.is-checked.is-disabled {
.el-checkbox__inner::after {
border-color: #ffffff;
}
}
}
.el-checkbox__label {
color: #7579eb !important;
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 400;
line-height: normal;
display: flex;
align-items: center;
.checkbox-icon {
margin-left: 4px;
}
}
.el-checkbox + .el-checkbox {
margin-left: 12px;
}
}
}
.prompt-item {
// padding: 0 0 0 18px;
margin-bottom: 12px;
&-label {
// margin-bottom: 16px;
}
}
.file-content {
padding: 0 0 0 18px;
font-size: 14px;
display: flex;
align-items: center;
.document-icon {
width: 16px;
height: 16px;
margin-right: 4px;
}
.file-name {
flex: 1;
overflow: hidden;
white-space: nowrap;
text-overflow: ellipsis;
}
}
.timbre-model {
padding: 0 0 0 18px;
margin-bottom: 12px;
display: flex;
align-items: center;
&-label {
width: 120px;
flex-shrink: 0;
}
&-content {
flex: 1;
margin-left: 16px;
}
}
.voice-prompt-box {
border: 1px solid #eaefff;
margin-left: 18px;
padding: 12px;
width: 50%;
}
}
</style>

View File

@@ -0,0 +1,456 @@
<template>
<div :class="`user-config ${t('modelConfigTitle') === '模型配置' ? '' : 'en-user-config'}`">
<div class="user-config-title">{{ t('modelConfigTitle') }}</div>
<div class="config-item">
<div class="config-item-label">
<span>{{ t('audioInterruptionBtn') }}</span>
<el-tooltip class="box-item" effect="dark" :content="t('audioInterruptionTips')" placement="top">
<SvgIcon name="question" class="question-icon" /> </el-tooltip
>:
</div>
<div class="config-item-content">
<el-switch
v-model="configData.canStopByVoice"
inline-prompt
:active-text="t('yes')"
:inactive-text="t('no')"
size="small"
:disabled="isCalling"
/>
</div>
</div>
<div class="config-item" v-if="type === 'video'">
<div class="config-item-label">
<span>{{ t('videoQualityBtn') }}</span>
<el-tooltip class="box-item" effect="dark" :content="t('videoQualityTips')" placement="top">
<SvgIcon name="question" class="question-icon" /> </el-tooltip
>:
</div>
<div class="config-item-content">
<el-switch
v-model="configData.videoQuality"
inline-prompt
:active-text="t('yes')"
:inactive-text="t('no')"
size="small"
:disabled="isCalling"
/>
</div>
</div>
<div class="config-item">
<div class="config-item-label">
<span>{{ t('vadThresholdBtn') }}</span>
<el-tooltip class="box-item" effect="dark" :content="t('vadThresholdTips')" placement="top">
<SvgIcon name="question" class="question-icon" /> </el-tooltip
>:
</div>
<div class="config-item-content vad-slider">
<el-slider
v-model="configData.vadThreshold"
:min="0.5"
:max="1"
:step="0.1"
size="small"
:disabled="isCalling"
/>
</div>
</div>
<div class="prompt-item" v-if="type === 'voice'">
<div class="prompt-item-label">
<span>{{ t('assistantPromptBtn') }}</span>
<el-tooltip class="box-item" effect="dark" :content="t('assistantPromptTips')" placement="top">
<SvgIcon name="question" class="question-icon" /> </el-tooltip
>:
</div>
<div class="prompt-item-content">
<el-input
type="textarea"
:rows="3"
v-model="configData.assistantPrompt"
resize="none"
:disabled="isCalling"
/>
</div>
</div>
<!-- <div class="config-item">
<div class="config-item-label">{{ t('useVoicePromptBtn') }}:</div>
<div class="config-item-content">
<el-switch
v-model="configData.useAudioPrompt"
inline-prompt
:active-text="t('yes')"
:inactive-text="t('no')"
size="small"
:disabled="isCalling"
@change="handleSelectUseAudioPrompt"
/>
</div>
</div> -->
<div class="timbre-model">
<div class="timbre-model-label">
<span>{{ t('toneColorOptions') }}</span>
<el-tooltip class="box-item" effect="dark" :content="t('toneColorOptionsTips')" placement="top">
<SvgIcon name="question" class="question-icon" /> </el-tooltip
>:
</div>
<div class="timbre-model-content">
<el-select
v-model="configData.useAudioPrompt"
style="width: 100%"
@change="handleChangePeople"
placeholder="请选择"
:disabled="isCalling"
>
<el-option :value="0" :label="t('nullOption')">{{ t('nullOption') }}</el-option>
<el-option :value="1" :label="t('defaultOption')">{{ t('defaultOption') }}</el-option>
<el-option :value="2" :label="t('femaleOption')">{{ t('femaleOption') }}</el-option>
<el-option :value="3" :label="t('maleOption')">{{ t('maleOption') }}</el-option>
</el-select>
</div>
</div>
<!-- <div class="prompt-item">
<div class="prompt-item-label">
<span>{{ t('voiceClonePromptInput') }}</span>
<el-tooltip class="box-item" effect="dark" :content="t('voiceClonePromptTips')" placement="top">
<SvgIcon name="question" class="question-icon" /> </el-tooltip
>:
</div>
<div class="prompt-item-content">
<el-input
type="textarea"
:rows="3"
v-model="configData.voiceClonePrompt"
resize="none"
:disabled="true"
/>
</div>
</div> -->
<!-- <div class="timbre-config" v-if="configData.useAudioPrompt">
<div class="timbre-config-label">{{ t('audioChoiceBtn') }}:</div>
<div class="timbre-config-content">
<el-checkbox-group v-model="configData.timbre" @change="handleSelectTimbre" :disabled="isCalling">
<el-checkbox :value="1" :label="t('defaultAudioBtn')"></el-checkbox>
<el-upload
v-model:file-list="fileList"
action=""
:multiple="false"
:on-change="handleChangeFile"
:auto-upload="false"
:show-file-list="false"
:disabled="isCalling"
accept="audio/*"
>
<el-checkbox :value="2">
<span>{{ t('customizationBtn') }}</span>
<SvgIcon name="upload" className="checkbox-icon" />
</el-checkbox>
</el-upload>
</el-checkbox-group>
</div>
</div>
<div class="file-content" v-if="fileName">
<SvgIcon name="document" class="document-icon" />
<span class="file-name">{{ fileName }}</span>
</div> -->
</div>
</template>
<script setup>
const isCalling = defineModel('isCalling');
const type = defineModel('type');
import { useI18n } from 'vue-i18n';
const { t, locale } = useI18n();
let defaultVoiceClonePrompt =
'你是一个AI助手。你能接受视频音频和文本输入并输出语音和文本。模仿输入音频中的声音特征。';
let defaultAssistantPrompt = '';
const fileList = ref([]);
const fileName = ref('');
const configData = ref({
canStopByVoice: false,
videoQuality: false,
useAudioPrompt: 1,
vadThreshold: 0.8,
voiceClonePrompt: defaultVoiceClonePrompt,
assistantPrompt: defaultAssistantPrompt,
timbre: [1],
audioFormat: 'mp3',
base64Str: ''
});
// let peopleList = [];
// watch(
// () => type.value,
// val => {
// console.log('val: ', val);
// if (val === 'video') {
// defaultVoiceClonePrompt =
// '你是一个AI助手。你能接受视频音频和文本输入并输出语音和文本。模仿输入音频中的声音特征。';
// defaultAssistantPrompt = '作为助手,你将使用这种声音风格说话。';
// } else {
// defaultVoiceClonePrompt = '克隆音频提示中的音色以生成语音。';
// defaultAssistantPrompt = 'Your task is to be a helpful assistant using this voice pattern.';
// }
// configData.value.voiceClonePrompt = defaultVoiceClonePrompt;
// configData.value.assistantPrompt = defaultAssistantPrompt;
// },
// { immediate: true }
// );
watch(
locale,
(newLocale, oldLocale) => {
console.log(`Language switched from ${oldLocale} to ${newLocale}`);
if (newLocale === 'zh' && type.value === 'video') {
defaultAssistantPrompt = '作为助手,你将使用这种声音风格说话。';
} else if (newLocale === 'zh' && type.value === 'voice') {
defaultAssistantPrompt = '作为助手,你将使用这种声音风格说话。';
} else if (newLocale === 'en' && type.value === 'video') {
defaultAssistantPrompt = 'As an assistant, you will speak using this voice style.';
} else {
defaultAssistantPrompt = 'As an assistant, you will speak using this voice style.';
}
configData.value.assistantPrompt = defaultAssistantPrompt;
},
{ immediate: true }
);
onMounted(() => {
handleSetStorage();
});
const handleSelectTimbre = e => {
if (e.length > 1) {
const val = e[e.length - 1];
configData.value.timbre = [val];
// Default timbre selected
if (val === 1) {
configData.value.audioFormat = 'mp3';
configData.value.base64Str = '';
fileList.value = [];
fileName.value = '';
}
}
};
const handleChangeFile = file => {
if (isAudio(file) && sizeNotExceed(file)) {
fileList.value = [file];
fileName.value = file.name;
configData.value.timbre = [2];
handleUpload();
} else {
ElMessage.error('Please upload an audio file no larger than 10MB');
}
};
const isAudio = file => {
return file.raw.type.includes('audio');
};
const sizeNotExceed = file => {
return file.size / 1024 / 1024 <= 10;
};
const handleUpload = async () => {
const file = fileList.value[0].raw;
if (file) {
const reader = new FileReader();
reader.onload = e => {
const base64String = e.target.result.split(',')[1];
configData.value.audioFormat = file.name.split('.')[1];
configData.value.base64Str = base64String;
};
reader.readAsDataURL(file);
}
};
const handleSelectUseAudioPrompt = val => {
if (val) {
configData.value.voiceClonePrompt = defaultVoiceClonePrompt;
configData.value.assistantPrompt = defaultAssistantPrompt;
}
};
// Persist config changes to localStorage
watch(configData.value, () => {
handleSetStorage();
});
const handleSetStorage = () => {
const { timbre, canStopByVoice, ...others } = configData.value;
const defaultConfigData = {
canStopByVoice,
...others
};
localStorage.setItem('configData', JSON.stringify(defaultConfigData));
localStorage.setItem('canStopByVoice', canStopByVoice);
};
const handleChangePeople = val => {
console.log('val: ', val);
// const index = peopleList.findIndex(item => item.id === val);
configData.value.voiceClonePrompt = defaultVoiceClonePrompt;
configData.value.assistantPrompt = defaultAssistantPrompt;
configData.value.timbre = [1];
};
</script>
<style lang="less" scoped>
.user-config {
&-title {
height: 61px;
padding: 18px 18px 0;
color: rgba(23, 23, 23, 0.9);
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 500;
line-height: normal;
}
.config-item {
display: flex;
align-items: center;
width: 100%;
padding: 0 0 0 18px;
margin-bottom: 20px;
&-label {
width: 120px;
flex-shrink: 0;
display: flex;
align-items: center;
}
&-content {
flex: 1;
margin-left: 16px;
.el-radio-group {
.el-radio {
width: 50px;
}
}
}
&-content.vad-slider {
width: 80%;
padding-left: 7px;
margin-right: 20px;
.el-slider__button {
width: 14px;
height: 14px;
}
}
}
.timbre-config {
padding: 0 0 0 18px;
&-label {
margin-bottom: 20px;
display: flex;
align-items: center;
}
&-content {
display: flex;
align-items: center;
.el-checkbox-group {
display: flex;
flex-wrap: wrap;
flex: 1;
> .el-checkbox {
margin-right: 12px;
}
}
.el-checkbox {
padding: 8px 16px;
border-radius: 10px;
background: #eaefff;
margin-bottom: 12px;
height: 40px;
.el-checkbox__input {
.el-checkbox__inner {
border: 1px solid #4dc100;
}
}
.el-checkbox__input.is-checked {
.el-checkbox__inner {
background: #4dc100;
}
}
.el-checkbox__input.is-checked.is-disabled {
.el-checkbox__inner::after {
border-color: #ffffff;
}
}
}
.el-checkbox__label {
color: #7579eb !important;
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 400;
line-height: normal;
display: flex;
align-items: center;
.checkbox-icon {
margin-left: 4px;
}
}
.el-checkbox + .el-checkbox {
margin-left: 12px;
}
}
}
.prompt-item {
padding: 0 0 0 18px;
margin-bottom: 20px;
&-label {
// margin-bottom: 16px;
display: flex;
align-items: center;
}
}
.file-content {
padding: 0 0 0 18px;
font-size: 14px;
display: flex;
align-items: center;
.document-icon {
width: 16px;
height: 16px;
margin-right: 4px;
}
.file-name {
flex: 1;
overflow: hidden;
white-space: nowrap;
text-overflow: ellipsis;
}
}
.timbre-model {
padding: 0 0 0 18px;
margin-bottom: 20px;
display: flex;
align-items: center;
&-label {
width: 120px;
flex-shrink: 0;
display: flex;
align-items: center;
}
&-content {
flex: 1;
margin-left: 16px;
}
}
}
.en-user-config {
.config-item-label {
width: 160px;
}
.timbre-model-label {
width: 160px;
}
}
.question-icon {
width: 14px;
height: 14px;
cursor: pointer;
margin-left: 6px;
}
</style>
<style lang="less">
.el-switch--small .el-switch__core {
min-width: 50px;
}
.el-popper.is-dark {
max-width: 300px;
}
</style>

View File

@@ -0,0 +1,91 @@
<template>
<div class="output-area">
<div
:class="`output-area-item ${item.type === 'USER' ? 'user-item' : 'bot-item'}`"
:key="index"
v-for="(item, index) in outputData"
>
<div v-if="item.type === 'USER'" class="user-input">
<audio v-if="item.audio" :src="item.audio" controls></audio>
</div>
<div v-else class="bot-output">
<div class="output-item">{{ item.text }}</div>
<audio v-if="item.audio" :src="item.audio" controls></audio>
</div>
</div>
</div>
</template>
<script setup>
const props = defineProps({
outputData: {
type: Array,
default: () => []
},
containerClass: {
type: String,
default: ''
}
});
watch(
() => props.outputData,
newVal => {
nextTick(() => {
if (newVal && props.containerClass) {
let dom = document.querySelector(`.${props.containerClass}`);
if (dom) {
dom.scrollTop = dom.scrollHeight;
}
}
});
},
{ deep: true }
);
</script>
<style lang="less" scoped>
.output-area {
display: flex;
flex-direction: column;
&-item {
width: fit-content;
}
&-item + &-item {
margin-top: 16px;
}
&-item.user-item {
align-self: flex-end;
.user-input {
}
}
&-item.bot-item {
align-self: flex-start;
width: 100%;
.bot-output {
width: 100%;
display: flex;
flex-direction: column;
.output-item {
padding: 8px 24px;
border-radius: 10px;
color: #202224;
background: #f3f3f3;
max-width: 90%;
width: fit-content;
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 400;
line-height: normal;
word-break: break-all;
word-wrap: break-word;
white-space: pre-wrap;
display: inline-block;
}
.output-item + audio {
margin-top: 16px;
}
}
}
}
</style>

View File

@@ -0,0 +1,122 @@
<template>
<div class="select-timbre">
<el-checkbox-group v-model="timbre" @change="handleSelectTimbre" :disabled="disabled">
<el-checkbox :value="1" label="Default Audio"></el-checkbox>
<!-- <el-upload
v-model:file-list="fileList"
action=""
:multiple="false"
:on-change="handleChangeFile"
:auto-upload="false"
:show-file-list="false"
:disabled="disabled"
accept="audio/*"
>
<el-checkbox :value="2">
<span>Customization: Upload Audio</span>
<SvgIcon name="upload" className="checkbox-icon" />
</el-checkbox>
</el-upload> -->
</el-checkbox-group>
</div>
</template>
<script setup>
const timbre = defineModel('timbre');
const audioData = defineModel('audioData');
const disabled = defineModel('disabled');
const fileList = ref([]);
const handleSelectTimbre = e => {
if (e.length > 1) {
const val = e[e.length - 1];
timbre.value = [val];
// Default timbre: clear any uploaded audio
if (val === 1) {
audioData.value = {
base64Str: '',
type: 'mp3'
};
}
}
};
const handleChangeFile = file => {
if (isAudio(file) && sizeNotExceed(file)) {
fileList.value = [file];
timbre.value = [2];
handleUpload();
} else {
ElMessage.error('Please upload an audio file no larger than 1MB');
}
};
const isAudio = file => {
return file.name.endsWith('.mp3') || file.name.endsWith('.wav');
};
const sizeNotExceed = file => {
return file.size / 1024 / 1024 <= 1;
};
const handleUpload = async () => {
const file = fileList.value[0].raw;
if (file) {
const reader = new FileReader();
reader.onload = e => {
const base64String = e.target.result.split(',')[1];
audioData.value = {
base64Str: base64String,
type: file.name.split('.').pop() // use the last segment as the extension
};
};
reader.readAsDataURL(file);
}
};
</script>
<style lang="less">
.select-timbre {
display: flex;
align-items: center;
.el-checkbox-group {
display: flex;
> .el-checkbox {
margin-right: 12px;
}
}
.el-checkbox {
padding: 8px 16px;
border-radius: 10px;
background: #eaefff;
margin-right: 0;
height: 40px;
.el-checkbox__input {
.el-checkbox__inner {
border: 1px solid #4dc100;
}
}
.el-checkbox__input.is-checked {
.el-checkbox__inner {
background: #4dc100;
}
}
.el-checkbox__input.is-checked.is-disabled {
.el-checkbox__inner::after {
border-color: #ffffff;
}
}
}
.el-checkbox__label {
color: #7579eb !important;
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 400;
line-height: normal;
display: flex;
align-items: center;
.checkbox-icon {
margin-left: 4px;
}
}
.el-checkbox + .el-checkbox {
margin-left: 12px;
}
}
</style>

View File

@@ -0,0 +1,67 @@
<template>
<div :class="`skip-btn ${disabled ? 'disabled-btn' : ''}`">
<div class="pause-icon">
<SvgIcon name="pause" className="pause-svg" />
</div>
<span class="btn-text">{{ t('skipMessageBtn') }}</span>
</div>
</template>
<script setup>
import { useI18n } from 'vue-i18n';
const { t } = useI18n();
defineProps({
disabled: {
type: Boolean,
default: false
}
});
</script>
<style lang="less">
.skip-btn {
flex-shrink: 0;
display: flex;
align-items: center;
padding: 8px 14px 8px 10px;
border-radius: 90px;
background: #5865f2;
cursor: pointer;
user-select: none;
.pause-icon {
display: flex;
justify-content: center;
align-items: center;
width: 32px;
height: 32px;
background: #ffffff;
border-radius: 50%;
margin-right: 8px;
.pause-svg {
width: 18px;
height: 18px;
color: #5865f2;
}
}
.btn-text {
color: #fff;
font-family: PingFang SC;
font-size: 16px;
font-style: normal;
font-weight: 400;
line-height: normal;
}
}
.disabled-btn {
cursor: not-allowed;
background: #f3f3f3;
.pause-icon {
background: #d1d1d1;
.pause-svg {
color: #ffffff;
}
}
.btn-text {
color: #d1d1d1;
}
}
</style>

View File

@@ -0,0 +1,39 @@
<template>
<svg :class="iconClass" v-html="content"></svg>
</template>
<script setup>
const props = defineProps({
name: {
type: String,
required: true
},
className: {
type: String,
default: ''
}
});
const content = ref('');
const iconClass = computed(() => ['svg-icon', props.className]);
onMounted(() => {
import(`@/assets/svg/${props.name}.svg`)
.then(module => {
fetch(module.default)
.then(response => response.text())
.then(svg => {
content.value = svg;
});
})
.catch(error => {
console.error(`Error loading SVG icon: ${props.name}`, error);
});
});
</script>
<style lang="less" scoped>
.svg-icon {
width: 24px;
height: 24px;
}
</style>

View File

@@ -0,0 +1,138 @@
<template>
<div class="bars" id="bars" :style="boxStyle">
<!-- Bars -->
<div class="bar" v-for="(item, index) in defaultList" :key="index" :style="itemAttr(item)"></div>
</div>
</template>
<script setup>
const props = defineProps({
analyser: {
type: Object
},
dataArray: {
type: [Array, Uint8Array]
},
isCalling: {
type: Boolean,
default: false
},
isPlaying: {
type: Boolean,
default: false
},
// Container height
boxStyle: {
type: Object,
default: () => {
return {
height: '80px'
};
}
},
// Bar width
itemStyle: {
type: Object,
default: () => {
return {
width: '6px',
margin: '0 2px',
borderRadius: '5px'
};
}
},
configList: {
type: Array,
default: () => []
}
});
const animationFrameId = ref();
const defaultList = ref([]);
const bgColor = ref('#4c5cf8');
const itemAttr = computed(() => item => {
return {
height: item + 'px',
...props.itemStyle
};
});
watch(
() => props.dataArray,
newVal => {
if (newVal && props.isCalling) {
console.log('draw');
drawBars();
} else {
console.log('stop');
stopDraw();
}
}
);
watch(
() => props.configList,
newVal => {
if (newVal.length > 0) {
defaultList.value = newVal;
}
},
{ immediate: true }
);
watch(
() => props.isPlaying,
newVal => {
if (newVal) {
// Green while the model is playing
bgColor.value = '#4dc100';
} else {
// Blue otherwise
bgColor.value = '#4c5cf8';
}
}
);
function drawBars() {
const bars = document.querySelectorAll('.bar');
if (bars.length === 0) {
cancelAnimationFrame(animationFrameId.value);
return;
}
const maxHeight = document.querySelector('.bars').clientHeight; // max height equals the container height
const averageVolume = props.dataArray.reduce((sum, value) => sum + value, 0) / props.dataArray.length;
const normalizedVolume = props.isPlaying ? Math.random() : averageVolume / 128; // normalize the volume data to the 0–1 range
bars.forEach((bar, index) => {
const minHeight = defaultList.value[index];
const randomFactor = Math.random() * 1.5 + 0.5; // random factor
const newHeight = Math.min(
maxHeight,
minHeight + (maxHeight - minHeight) * normalizedVolume * randomFactor
); // scale the height by volume
bar.style.height = `${newHeight}px`; // apply the new height
bar.style.backgroundColor = bgColor.value;
});
animationFrameId.value = requestAnimationFrame(drawBars);
}
const stopDraw = () => {
if (animationFrameId.value) {
cancelAnimationFrame(animationFrameId.value);
}
};
</script>
<style lang="less" scoped>
.bars {
display: flex;
justify-content: center;
align-items: center;
}
.bar {
// width: 6px;
// margin: 0 2px;
background-color: #4c5cf8;
transition:
height 0.1s,
background-color 0.1s;
border-radius: 5px; /* rounded corners */
}
</style>
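For reference, a minimal parent-side sketch (not part of this diff) of how the `analyser` and `dataArray` props above could be fed from a microphone stream; `startMicVisualizer` and its `onFrame` callback are hypothetical names, and a new Uint8Array is created per frame so the component's `dataArray` watcher actually fires:

// Hypothetical wiring for the bars component (sketch; no teardown shown).
async function startMicVisualizer(onFrame) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioCtx = new AudioContext();
  const analyser = audioCtx.createAnalyser();
  analyser.fftSize = 256;
  audioCtx.createMediaStreamSource(stream).connect(analyser);
  const tick = () => {
    // New array each frame: the component's watch on `dataArray` compares references.
    const dataArray = new Uint8Array(analyser.frequencyBinCount);
    analyser.getByteFrequencyData(dataArray);
    onFrame({ analyser, dataArray }); // the parent assigns these to the component's props
    requestAnimationFrame(tick);
  };
  tick();
}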

View File

@@ -0,0 +1,8 @@
/**
* Configure and register global directives
*/
import ElTableInfiniteScroll from 'el-table-infinite-scroll';
export function setupGlobDirectives(app) {
app.use(ElTableInfiniteScroll);
}

View File

@@ -0,0 +1,18 @@
export const voiceIdeasList = ['TBD', 'TBD', 'TBD'];
export const videoIdeasList = ['TBD', 'TBD', 'TBD'];
export const limitTime = 10 * 60; // limit a single session to at most 10 minutes
export const tipsRemainingTime = 30; // remind the user when 30 seconds remain
// Initial audio waveform (voice call)
export const voiceConfigList = [
16, 16, 16, 16, 36, 58, 50, 70, 50, 58, 36, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 46, 28,
60, 28, 68, 60, 28, 46, 16, 16, 16, 16, 16, 16, 16, 16, 36, 58, 50, 70, 50, 58, 36, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 16, 46, 28, 60, 28, 68, 60, 28, 46, 16, 16, 16, 16
];
// Initial audio waveform shown during video calls
export const videoConfigList = [
8, 8, 8, 8, 18, 28, 26, 36, 26, 28, 18, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 24, 14, 30, 14, 34, 30, 14,
24, 8, 8, 8, 8, 8, 8, 8, 8, 18, 28, 26, 36, 26, 28, 18, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 24, 14, 30,
14, 34, 30, 14, 24, 8, 8, 8, 8, 8, 8, 8, 8, 18, 28, 26, 36, 26, 28, 18, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 24, 14, 30, 14, 34, 30, 14, 24, 8, 8, 8, 8
];
export const showIdeasList = false;

View File

@@ -0,0 +1,61 @@
import axios from 'axios';
import { setNewUserId, getNewUserId } from './useRandomId';
// Configure defaults when creating the instance
const service = axios.create({
baseURL: '/',
timeout: 30000,
responseType: 'json'
});
// Request interceptor
service.interceptors.request.use(config => {
if (config.url.includes('stream')) {
config.timeout = 3000;
}
if (window.location.search) {
config.url += window.location.search;
}
Object.assign(config.headers, ajaxHeader());
return config;
});
// Response interceptor
service.interceptors.response.use(
response => {
let res = response.data;
if (response?.status === 200) {
return Promise.resolve({
code: 0,
message: '',
data: res
});
}
return Promise.resolve({ code: -1, message: '网络异常,请稍后再试', data: null });
},
error => {
const res = { code: -1, message: error?.response?.data?.detail || '网络异常,请稍后再试', data: null };
return Promise.resolve(res);
}
);
export const ajaxHeader = () => {
if (!localStorage.getItem('uid')) {
setNewUserId();
}
return {
'Content-Type': 'application/json;charset=UTF-8',
Accept: 'application/json',
service: 'minicpmo-server',
uid: getNewUserId()
};
};
export default {
get(url, params, config = {}) {
return service.get(url, { params, ...config });
},
post(url, data, config = {}) {
return service.post(url, data, { ...config });
}
};
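A small caller-side sketch (assumption, not part of this diff) showing how the normalized `{ code, message, data }` shape produced by the interceptors above is consumed; the import path and the `/api/config` endpoint are hypothetical:

// Hypothetical caller (sketch): errors are already normalized, so no try/catch is needed.
import http from '@/apis/service'; // import path is an assumption
async function fetchConfig() {
  const { code, message, data } = await http.get('/api/config'); // hypothetical endpoint
  if (code !== 0) {
    console.error(message);
    return null;
  }
  return data;
}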

View File

@@ -0,0 +1,95 @@
export class TaskQueue {
constructor() {
this.tasks = [];
this.isRunning = false;
this.isPaused = false;
this.currentTask = null;
}
// Add a task to the queue
addTask(task) {
this.tasks.push(task);
if (!this.isRunning) {
this.start();
}
}
// Remove a task
removeTask(taskToRemove) {
this.tasks = this.tasks.filter(task => task !== taskToRemove);
}
// Clear the task queue
clearQueue() {
this.tasks = [];
}
// Pause task execution
pause() {
this.isPaused = true;
}
// Resume task execution
resume() {
if (this.isPaused) {
this.isPaused = false;
if (!this.isRunning) {
this.start();
}
}
}
// Internal start loop
async start() {
this.isRunning = true;
while (this.tasks.length > 0 && !this.isPaused) {
this.currentTask = this.tasks.shift();
await this.currentTask();
// Stop if paused or the queue has been emptied
if (this.isPaused || this.tasks.length === 0) {
this.isRunning = false;
break;
}
}
this.isRunning = false;
}
}
// Example task factory
function exampleTask(id) {
return () =>
new Promise(resolve => {
console.log(`Executing task ${id}`);
setTimeout(() => {
console.log(`Task ${id} completed`);
resolve();
}, 1000); // each task takes 1 second
});
}
// Demo / test example
const queue = new TaskQueue();
// Add tasks to the queue
for (let i = 1; i <= 5; i++) {
queue.addTask(exampleTask(i));
}
// Pause the queue after 2.5 seconds
setTimeout(() => {
console.log('Pausing queue...');
queue.pause();
}, 2500);
// Resume the queue after 4.5 seconds
setTimeout(() => {
console.log('Resuming queue...');
queue.resume();
}, 4500);
// Clear the queue after 3 seconds
setTimeout(() => {
console.log('Clearing queue...');
queue.clearQueue();
}, 3000);

View File

@@ -0,0 +1,9 @@
const uid = 'uid';
export const setNewUserId = () => {
const randomId = Math.random().toString(36).slice(2).toUpperCase();
localStorage.setItem(uid, randomId);
return randomId;
};
export const getNewUserId = () => {
return localStorage.getItem(uid);
};

View File

@@ -0,0 +1,38 @@
const writeString = (view, offset, string) => {
for (let i = 0; i < string.length; i++) {
view.setUint8(offset + i, string.charCodeAt(i));
}
};
const floatTo16BitPCM = (output, offset, input) => {
for (let i = 0; i < input.length; i++, offset += 2) {
const s = Math.max(-1, Math.min(1, input[i]));
output.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
}
};
// Encode Float32 audio samples as a WAV Blob; a 44-byte RIFF header precedes the 16-bit PCM data
export const encodeWAV = (samples, sampleRate) => {
const buffer = new ArrayBuffer(44 + samples.length * 2);
const view = new DataView(buffer);
const numChannels = 1;
const bitsPerSample = 16;
/* WAV header */
writeString(view, 0, 'RIFF');
view.setUint32(4, 36 + samples.length * 2, true);
writeString(view, 8, 'WAVE');
writeString(view, 12, 'fmt ');
view.setUint32(16, 16, true);
view.setUint16(20, 1, true);
view.setUint16(22, numChannels, true);
view.setUint32(24, sampleRate, true);
view.setUint32(28, (sampleRate * numChannels * bitsPerSample) / 8, true);
view.setUint16(32, (numChannels * bitsPerSample) / 8, true);
view.setUint16(34, bitsPerSample, true);
writeString(view, 36, 'data');
view.setUint32(40, samples.length * 2, true);
/* PCM data */
floatTo16BitPCM(view, 44, samples);
return new Blob([view], { type: 'audio/wav' });
};
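A brief usage sketch (assumption, not part of this diff) for `encodeWAV`: encode mono Float32 samples, e.g. the first channel of an AudioBuffer, and play the result; `playAsWav` is a hypothetical helper:

// Hypothetical usage (sketch): wrap samples in a WAV Blob and play it.
function playAsWav(audioBuffer) {
  const samples = audioBuffer.getChannelData(0); // Float32Array in [-1, 1]
  const blob = encodeWAV(samples, audioBuffer.sampleRate); // 44-byte header + 16-bit PCM
  const audio = new Audio(URL.createObjectURL(blob));
  return audio.play();
}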

View File

@@ -0,0 +1,36 @@
{
"menuTabVideo": "Realtime Video Call",
"menuTabAudio": "Realtime Voice Call",
"menuTabChatbot": "Chatbot",
"videoCallBtn": "Call MiniCPM-omni",
"audioCallBtn": "Call MiniCPM-omni",
"hangUpBtn": "Hang Up",
"notReadyBtn": "Not ready yet, please wait",
"skipMessageBtn": "Skip this message",
"feedbackDialogTitle": "Feedback issue",
"modelConfigTitle": "Model Config",
"audioInterruptionBtn": "Speech Interruption",
"audioInterruptionTips": "When the \"voice interruption\" mode is enabled, it allows users to interrupt the model while it is speaking. The model will immediately terminate the previous round of generation and respond to the user's latest question.",
"yes": "Yes",
"no": "No",
"videoQualityBtn": "HD Mode",
"videoQualityTips": "When the \"high resulation\" mode is enabled, the model will perform high resolution encoding on the last frame, allowing the model to see more detailed parts.",
"high": "High",
"low": "Low",
"vadThresholdBtn": "VAD Threshold",
"vadThresholdTips": "The VAD threshold indicates how long the sound needs to be silent before triggering inference. If the VAD threshold is too low, it may trigger accidentally during speech pauses, while if it's too high, it will result in slower initial response.",
"assistantPromptBtn": "Task Prompt",
"assistantPromptTips": "Model task instructions are used to support different task objectives.",
"useVoicePromptBtn": "Tone Color Prompt",
"voiceClonePromptInput": "Tone Color Prompt",
"voiceClonePromptTips": "Tone Color Prompt tips",
"audioChoiceBtn": "Audio Choice",
"defaultAudioBtn": "Default Audio",
"customizationBtn": "Customization: Upload Audio",
"toneColorOptions": "Voice Options",
"toneColorOptionsTips": "We have provided a selection of sample tone colors, and you also have the option to choose \"none\" and instruct the model to create a new tone color.",
"nullOption": "Null",
"defaultOption": "Female 1(Default)",
"femaleOption": "Female 2",
"maleOption": "Male 1"
}

View File

@@ -0,0 +1,36 @@
{
"menuTabVideo": "实时视频通话",
"menuTabAudio": "实时语音通话",
"menuTabChatbot": "聊天机器人",
"videoCallBtn": "视频通话",
"audioCallBtn": "语音通话",
"hangUpBtn": "挂断",
"notReadyBtn": "服务繁忙,请稍后",
"skipMessageBtn": "跳过当前对话",
"feedbackDialogTitle": "请输入反馈意见",
"modelConfigTitle": "模型配置",
"audioInterruptionBtn": "语音打断",
"audioInterruptionTips": "开启\"语音打断\"功能,支持在模型说话时打断模型,模型会立刻结束上一轮的生成,并支持用户最新的问题。",
"yes": "是",
"no": "否",
"videoQualityBtn": "高清模式",
"videoQualityTips": "开启高清模式,模型会在最后一帧对图片进行高清编码,可以使得模型看得清更细节的部分。",
"high": "高清",
"low": "低清",
"vadThresholdBtn": "VAD阈值",
"vadThresholdTips": "vad阈值表示声音静音多久才开始触发推理vad阈值过低会导致说话气口误触过高会导致首响更慢。",
"assistantPromptBtn": "任务指令",
"assistantPromptTips": "模型的任务指令,用于支持不同的任务目标",
"useVoicePromptBtn": "音色指令",
"voiceClonePromptInput": "音色指令",
"voiceClonePromptTips": "我们的模型具有端到端的音色克隆能力,提供一段 5-7 秒的音频模型在一定程度上可以用这种音色来说话。但基于法律考虑我们的demo并不开启这个能力的试用。社区可以参照我们的开源代码自行适配。",
"audioChoiceBtn": "音色选择",
"defaultAudioBtn": "默认音色",
"customizationBtn": "自定义:上传音频",
"toneColorOptions": "语音选项",
"toneColorOptionsTips": "我们提供了一些示例音色,也可以选择“无”并通过指令让模型创建音色。",
"nullOption": "无",
"defaultOption": "女一号(默认)",
"femaleOption": "女二号",
"maleOption": "男一号"
}

View File

@@ -0,0 +1,40 @@
import './styles/main.css';
import { router, setupRouter } from '@/router';
import { setupRouterGuard } from '@/router/guard';
import SvgIcon from '@/components/SvgIcon/index.vue';
import { createI18n } from 'vue-i18n';
import App from './App.vue';
import en from './i18n/en.json';
import zh from './i18n/zh.json';
const savedLanguage = localStorage.getItem('language') || 'zh';
const i18n = createI18n({
locale: savedLanguage, // default language
messages: {
en,
zh
}
});
const app = createApp(App);
// Configure routing
setupRouter(app);
// Router guard
setupRouterGuard(router);
// Register global directives
// setupGlobDirectives(app);
app.component('SvgIcon', SvgIcon);
app.use(i18n);
app.mount('#app');

View File

@@ -0,0 +1,5 @@
import { createStateGuard } from './stateGuard';
export function setupRouterGuard(router) {
createStateGuard(router);
}

View File

@@ -0,0 +1 @@
export function createStateGuard() {}

View File

@@ -0,0 +1,16 @@
import { createRouter, createWebHistory } from 'vue-router';
import { basicRoutes } from './menu';
// Create a router instance for use by the Vue application
export const router = createRouter({
// Use HTML5 (web) history mode.
history: createWebHistory(import.meta.env.BASE_URL),
// Route list.
routes: basicRoutes
});
// Configure the router
export function setupRouter(app) {
app.use(router);
}

View File

@@ -0,0 +1,10 @@
export const basicRoutes = [
{
path: '/',
component: () => import('@/views/home/index.vue')
},
{
path: '/:port',
component: () => import('@/views/home/index.vue')
}
];

Some files were not shown because too many files have changed in this diff