update minicpm-o 4.5 (#1052)

Co-authored-by: wangchongyi <>
Authored by YuzaChongyi on 2026-02-04 01:55:48 +08:00, committed by GitHub
parent 74aa48ebeb · commit 28632248d5
15 changed files with 7380 additions and 3514 deletions

README.md: 4272 lines changed (file diff suppressed because it is too large).
A second file diff was also suppressed because it is too large.

Binary files changed (not shown): 9 in total, including 8 newly added images of 359 KiB, 1.1 MiB, 304 KiB, 3.0 MiB, 4.4 MiB, 7.5 MiB, 2.9 MiB, and 1.2 MiB.
docs/minicpm_o2dot6_en.md (new file, 964 lines)
## MiniCPM-o 2.6
> Archived at: 2026-02-02
**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- 🔥 **Leading Visual Capability.**
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
- 💪 **Strong OCR Capability and Others.**
Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPads.
- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
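For local usage (options 1-4 above), a minimal Hugging Face Transformers sketch is shown below. It assumes the `openbmb/MiniCPM-o-2_6` checkpoint id and the remote-code `chat()` interface shipped with the model; argument names may differ between releases, so check the model card for the current signature.
```python
# Minimal single-image chat sketch for MiniCPM-o 2.6 via Hugging Face Transformers.
# Assumptions: the "openbmb/MiniCPM-o-2_6" repo id and its remote-code chat() API;
# consult the model card for the exact, current arguments.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"  # assumed repo id; int4/GGUF variants are linked above
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image in one sentence."]}]

# chat() is provided by the model's remote code; it runs vision + LLM end to end.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```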
**Model Architecture.**
- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. The model is trained in a fully end-to-end manner with only CE loss.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (a conceptual sketch follows this list).
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
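As a conceptual illustration of the TDM mechanism above (not the model's actual code), the sketch below chunks parallel per-modality streams into fixed time slices and serializes them into one sequence for the LLM backbone:
```python
# Conceptual illustration of time-division multiplexing (TDM) for omni-modal streaming.
# This is NOT the model's implementation; it only shows how parallel modality streams
# can be chunked into small periodic time slices and serialized into one sequence.
from typing import Dict, List

def tdm_interleave(streams: Dict[str, List[str]], slice_len: int) -> List[str]:
    """Interleave per-modality token chunks slice by slice, in a fixed modality order."""
    order = list(streams.keys())
    num_slices = max((len(s) + slice_len - 1) // slice_len for s in streams.values())
    sequence: List[str] = []
    for t in range(num_slices):
        for modality in order:
            chunk = streams[modality][t * slice_len:(t + 1) * slice_len]
            sequence.extend(f"{modality}:{tok}" for tok in chunk)
    return sequence

# Toy example: video and audio tokens arriving in parallel, sliced two tokens at a time.
streams = {
    "video": ["v0", "v1", "v2", "v3", "v4", "v5"],
    "audio": ["a0", "a1", "a2", "a3", "a4", "a5"],
}
print(tdm_interleave(streams, slice_len=2))
# ['video:v0', 'video:v1', 'audio:a0', 'audio:a1', 'video:v2', 'video:v3', ...]
```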
<div align="center">
<img src="./assets/minicpm-o-26-framework-v2.png" , width=80%>
</div>
### Evaluation <!-- omit in toc -->
<div align="center">
<img src="./assets/radar.jpg", width=80%>
</div>
<details>
<summary>Click to view visual understanding results.</summary>
**Image Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista mini</th>
<th>ChartQA</th>
<th>MMVet</th>
<th>MMStar</th>
<th>MME</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>MMMU val</th>
<th>HallusionBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>MathVerse mini</th>
<th>MathVision</th>
<th>MMHal Score</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="19" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>1088</td>
<td><u>69.9</u></td>
<td>736</td>
<td>61.3</td>
<td>85.7</td>
<td><strong>69.1</strong></td>
<td>63.9</td>
<td>2328.7</td>
<td>82.2</td>
<td>84.6</td>
<td><strong>69.2</strong></td>
<td><strong>55.0</strong></td>
<td>-</td>
<td>92.8</td>
<td><strong>50.2</strong></td>
<td><strong>30.4</strong></td>
<td><u>3.6</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>788</td>
<td>61.6</td>
<td><strong>90.8</strong></td>
<td>66.0</td>
<td>62.2</td>
<td>1920.0</td>
<td>78.5</td>
<td>80.2</td>
<td><u>65.9</u></td>
<td>49.9</td>
<td>-</td>
<td><strong>95.2</strong></td>
<td>-</td>
<td>-</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>754</td>
<td>57.7</td>
<td>81.3</td>
<td>64.0</td>
<td>59.1</td>
<td>2110.6</td>
<td>73.9</td>
<td>79.1</td>
<td>60.6</td>
<td>45.6</td>
<td>73.5</td>
<td>86.5</td>
<td>-</td>
<td>19.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>785</td>
<td>52.4</td>
<td>-</td>
<td>66.9</td>
<td>54.8</td>
<td>2003.4</td>
<td>76.0</td>
<td>77.8</td>
<td>60.0</td>
<td>46.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>3.3</td>
</tr>
<tr>
<td colspan="19" align="left"><strong>Open Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td><u>1820</u></td>
<td>58.3</td>
<td>591</td>
<td>50.3</td>
<td>75.6</td>
<td>53.2</td>
<td>54.2</td>
<td>2049.9</td>
<td>77.8</td>
<td>79.5</td>
<td>50.4</td>
<td>41.6</td>
<td>76.7</td>
<td>75.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>776</td>
<td>51.1</td>
<td>-</td>
<td>58.0</td>
<td>54.8</td>
<td>2018.8</td>
<td>67.9</td>
<td>71.2</td>
<td>46.9</td>
<td>45.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Pixtral-12B</td>
<td>12B</td>
<td>256</td>
<td>61.0</td>
<td>685</td>
<td>56.9</td>
<td>81.8</td>
<td>58.5</td>
<td>54.5</td>
<td>-</td>
<td>72.7</td>
<td>79.0</td>
<td>51.1</td>
<td>47.0</td>
<td>75.7</td>
<td>90.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>784</td>
<td>63.3</td>
<td>741</td>
<td>66.2</td>
<td>-</td>
<td>52.7</td>
<td>60.2</td>
<td>2328.1</td>
<td>76.8</td>
<td>79.2</td>
<td>52.6</td>
<td>44.6</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
<td>27B</td>
<td>672</td>
<td>66.4</td>
<td>809</td>
<td>63.9</td>
<td>86.0</td>
<td>60.0</td>
<td>61.9</td>
<td>2253.0</td>
<td>81.2</td>
<td>83.8</td>
<td>54.0</td>
<td>45.3</td>
<td><u>84.2</u></td>
<td>93.3</td>
<td>-</td>
<td>-</td>
<td>3.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>784</td>
<td>67.1</td>
<td><u>866</u></td>
<td>58.2</td>
<td>83.0</td>
<td>62.0</td>
<td>60.7</td>
<td>2326.0</td>
<td>81.8</td>
<td>83.0</td>
<td>54.1</td>
<td>50.6</td>
<td><strong>84.3</strong></td>
<td><u>94.5</u></td>
<td>31.9</td>
<td>16.3</td>
<td>3.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>182</td>
<td>68.1</td>
<td>741</td>
<td>67.5</td>
<td>83.7</td>
<td>60.6</td>
<td><strong>65.8</strong></td>
<td>2261.0</td>
<td><strong>85.0</strong></td>
<td><u>85.6</u></td>
<td>56.8</td>
<td>49.0</td>
<td>80.5</td>
<td>91.3</td>
<td>39.1</td>
<td>-</td>
<td>3.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>706</td>
<td>68.3</td>
<td>822</td>
<td><u>64.4</u></td>
<td>84.8</td>
<td>62.8</td>
<td>62.8</td>
<td>2344.0</td>
<td><u>83.6</u></td>
<td>84.5</td>
<td>56.0</td>
<td>50.1</td>
<td>79.1</td>
<td>93.0</td>
<td>39.5</td>
<td>19.7</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td>65.2</td>
<td>852*</td>
<td>60.6</td>
<td>79.4</td>
<td>60.0</td>
<td>57.5</td>
<td><u>2348.4*</u></td>
<td>78.0</td>
<td>82.1</td>
<td>49.8*</td>
<td>48.1*</td>
<td>80.1</td>
<td>90.8</td>
<td>25.7</td>
<td>18.3</td>
<td>3.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>70.2</strong></td>
<td><strong>897*</strong></td>
<td><strong>71.9*</strong></td>
<td><u>86.9*</u></td>
<td><u>67.5</u></td>
<td><u>64.0</u></td>
<td><strong>2372.0*</strong></td>
<td>80.5</td>
<td><strong>85.8</strong></td>
<td>50.4*</td>
<td><u>51.9</u></td>
<td>82.0</td>
<td>93.5</td>
<td><u>41.4*</u></td>
<td><u>23.1*</u></td>
<td><strong>3.8</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate these benchmarks using chain-of-thought prompting. Specifically, for MME, we use this technique only for the Cognition set.
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
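As a back-of-the-envelope check of this definition, using only the numbers quoted above (illustrative arithmetic, not an official evaluation script):
```python
# Token density = (# pixels at maximum resolution) / (# visual tokens).
# MiniCPM-o 2.6 encodes a 1344x1344 image (~1.8M pixels) into 640 visual tokens.
max_pixels = 1344 * 1344      # 1,806,336 pixels
visual_tokens = 640
print(round(max_pixels / visual_tokens))  # ~2822, matching the Token Density column above
```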
**Multi-image and Video Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>BLINK val</th>
<th>Mantis Eval</th>
<th>MIRB</th>
<th>Video-MME (wo / w subs)</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td><strong>68.0</strong></td>
<td>-</td>
<td>-</td>
<td><strong>71.9/77.2</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT4V</td>
<td>-</td>
<td>54.6</td>
<td>62.7</td>
<td>53.1</td>
<td>59.9/63.3</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>45.0</td>
<td>-</td>
<td>-</td>
<td>56.1/58.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
<td>14B</td>
<td>52.6</td>
<td>66.4</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>55.4</td>
<td><strong>77.6</strong></td>
<td>-</td>
<td><u>66.2/69.5</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MANTIS 8B</td>
<td>8B</td>
<td>49.1</td>
<td>59.5</td>
<td>34.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>53.2</td>
<td>69.6*</td>
<td><strong>67.6*</strong></td>
<td>63.3/69.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>54.8</td>
<td>67.7</td>
<td>52.5</td>
<td>64.2/66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>53.0</td>
<td>69.1</td>
<td>53.8</td>
<td>60.9/63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>56.7</u></td>
<td><u>71.9</u></td>
<td><u>58.6</u></td>
<td>63.9/67.9</td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoints ourselves.
</details>
<details>
<summary>Click to view audio understanding and speech conversation results.</summary>
**Audio Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="3">ASR (zh)</th>
<th colspan="3">ASR (en)</th>
<th colspan="2">AST</th>
<th>Emotion</th>
</tr>
<tr>
<th align="left">Metric</th>
<td></td>
<th colspan="3">CER↓</th>
<th colspan="3">WER↓</th>
<th colspan="2">BLEU↑</th>
<th>ACC↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<td></td>
<th>AISHELL-1</th>
<th>Fleurs zh</th>
<th>WenetSpeech test-net</th>
<th>LibriSpeech test-clean</th>
<th>GigaSpeech</th>
<th>TED-LIUM</th>
<th>CoVoST en2zh</th>
<th>CoVoST zh2en</th>
<th>MELD emotion</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td>-</td>
<td>7.3*</td>
<td><u>5.4*</u></td>
<td>28.9*</td>
<td>2.6*</td>
<td>12.9*</td>
<td>4.8*</td>
<td>37.1*</td>
<td>15.7*</td>
<td>33.2*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>4.5*</td>
<td>5.9*</td>
<td>14.3*</td>
<td>2.9*</td>
<td>10.6*</td>
<td><strong>3.0*</strong></td>
<td><u>47.3*</u></td>
<td>22.6*</td>
<td>48.4*</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
<td>8B</td>
<td>-</td>
<td>7.5</td>
<td>-</td>
<td><strong>1.6</strong></td>
<td>-</td>
<td>-</td>
<td>45.2</td>
<td><u>24.4</u></td>
<td><strong>55.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
<td>8B</td>
<td>2.6*</td>
<td>6.9*</td>
<td><u>10.3*</u></td>
<td>3.1*</td>
<td><u>9.7</u>*</td>
<td>5.9*</td>
<td>39.5*</td>
<td>22.9*</td>
<td>17.4*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>2.16</td>
<td>-</td>
<td>8.4</td>
<td>3.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
<td>9B</td>
<td><u>2.5</u></td>
<td>-</td>
<td>-</td>
<td>2.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>1.6</strong></td>
<td><strong>4.4</strong></td>
<td><strong>6.9</strong></td>
<td><u>1.7</u></td>
<td><strong>8.7</strong></td>
<td><strong>3.0</strong></td>
<td><strong>48.2</strong></td>
<td><strong>27.2</strong></td>
<td><u>52.4</u></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoints ourselves.<br><br>
**Speech Generation**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="9">SpeechQA</th>
</tr>
<tr>
<th align="left">Metric</th>
<th></th>
<th colspan="3">ACC↑</th>
<th>G-Eval (10 point)↑</th>
<th>Semantic ELO score↑</th>
<th>Acoustic ELO score↑</th>
<th>Overall ELO score↑</th>
<th>UTMOS↑</th>
<th>ASR-WER↓</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th></th>
<th>Speech Llama Q.</th>
<th>Speech Web Q.</th>
<th>Speech Trivia QA</th>
<th>Speech AlpacaEval</th>
<th colspan="5">AudioArena</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td></td>
<td><strong>71.7</strong></td>
<td><strong>51.6</strong></td>
<td><strong>69.7</strong></td>
<td><strong>7.4</strong></td>
<td><strong>1157</strong></td>
<td><strong>1203</strong></td>
<td><strong>1200</strong></td>
<td><strong>4.2</strong></td>
<td><strong>2.3</strong></td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice</td>
<td>9B</td>
<td>50.0</td>
<td>32.0</td>
<td>36.4</td>
<td><u>5.1</u></td>
<td>999</td>
<td>1147</td>
<td>1035</td>
<td><u>4.1</u></td>
<td><u>11.7</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Llama-Omni</td>
<td>8B</td>
<td>45.3</td>
<td>22.9</td>
<td>10.7</td>
<td>3.9</td>
<td>960</td>
<td>878</td>
<td>897</td>
<td>3.2</td>
<td>24.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>46.7</td>
<td>28.1</td>
<td>23.3</td>
<td>2.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Moshi</td>
<td>7B</td>
<td>43.7</td>
<td>23.8</td>
<td>16.7</td>
<td>2.4</td>
<td>871</td>
<td>808</td>
<td>875</td>
<td>2.8</td>
<td>8.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Omni</td>
<td>1B</td>
<td>22.0</td>
<td>12.8</td>
<td>6.9</td>
<td>2.5</td>
<td>926</td>
<td>803</td>
<td>865</td>
<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>
<td><u>40.0</u></td>
<td><u>40.2</u></td>
<td><u>5.1</u></td>
<td><u>1088</u></td>
<td><u>1163</u></td>
<td><u>1131</u></td>
<td><strong>4.2</strong></td>
<td>9.8</td>
</tr>
</tbody>
</table>
</div>
All results are from AudioEvals, and the evaluation methods along with further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
**End-to-end Voice Cloning**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th colspan="2">Voice cloning</th>
</tr>
<tr>
<th align="left">Metric</th>
<th>SIMO↑</th>
<th>SIMO↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th>Seed-TTS test-zh</th>
<th>Seed-TTS test-en</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">F5-TTS</td>
<td><strong>76</strong></td>
<td><strong>67</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CosyVoice</td>
<td><u>75</u></td>
<td><u>64</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">FireRedTTS</td>
<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multimodal live streaming results.</summary>
**Multimodal Live Streaming**: results on StreamingBench
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Real-Time Video Understanding</th>
<th>Omni-Source Understanding</th>
<th>Contextual Understanding</th>
<th>Overall</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td><u>77.4</u></td>
<td><strong>67.8</strong></td>
<td><strong>51.1</strong></td>
<td><strong>70.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-202408</td>
<td>-</td>
<td>74.5</td>
<td>51.0</td>
<td><u>48.0</u></td>
<td>64.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
<td>-</td>
<td>74.0</td>
<td>41.4</td>
<td>37.8</td>
<td>59.7</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA-1.5</td>
<td>8B</td>
<td>61.5</td>
<td>37.5</td>
<td>26.7</td>
<td>49.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>63.1</td>
<td>35.9</td>
<td>30.2</td>
<td>50.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
<td>34B</td>
<td>69.8</td>
<td>41.7</td>
<td>34.3</td>
<td>56.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>71.2</td>
<td>40.7</td>
<td>33.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>70.1</td>
<td>42.7</td>
<td>34.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>70.9</td>
<td>40.8</td>
<td>35.8</td>
<td>57.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
<td>8B</td>
<td>74.3</td>
<td>40.8</td>
<td>31.0</td>
<td>58.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
<td>8B</td>
<td>75.4</td>
<td>46.2</td>
<td>33.6</td>
<td>60.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>72.4</td>
<td>40.2</td>
<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>
<td><u>53.4</u></td>
<td>38.5</td>
<td><u>66.0</u></td>
</tr>
</tbody>
</table>
</details>
### Examples <!-- omit in toc -->
We deploy MiniCPM-o 2.6 on end devices. The demo videos below are raw-speed recordings of an iPad Pro deployment and of the web demo.
<div align="center">
<a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="./assets/minicpmo2_6/2dot6_o_demo_video_img.png", width=70%></a>
</div>
<br>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>

docs/minicpm_o2dot6_zh.md (new file, 927 lines)
## MiniCPM-o 2.6
> Archived at: 2026-02-02
MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters, trained and run in an end-to-end fashion. It exhibits a significant performance improvement over MiniCPM-V 2.6, and adds new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- 🔥 **Leading Visual Capability.**
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass (a comprehensive evaluation over 8 popular multimodal benchmarks). **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** in single-image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 **State-of-the-art Speech Capability.**
MiniCPM-o 2.6 **supports bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text translation, and shows **the best speech generation performance among open-source models** in both semantic and acoustic evaluations of speech conversation. It also supports advanced features such as emotion/speed/style control, voice cloning, and role play.
- 🎬 **Strong Multimodal Live Streaming Capability.**
As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams and interact with users through real-time speech**. On StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding, MiniCPM-o 2.6 achieves the best level in the open-source community and **surpasses GPT-4o-202408 and Claude 3.5 Sonnet**.
- 💪 **Strong OCR Capability and Others.**
MiniCPM-o 2.6 further improves the popular visual capabilities of MiniCPM-V 2.6. It can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), and achieves **the best performance on OCRBench among models under 25B, surpassing proprietary models such as GPT-4o-202405**. Based on the latest [RLHF-V](https://rlhf-v.github.io/), [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy multimodal behaviors**, outperforming GPT-4o and Claude 3.5 on MMHal-Bench, and supports **more than 30 languages**, including English, Chinese, German, French, Italian, and Korean.
- 🚀 **Superior Efficiency.**
In addition to its user-friendly size, MiniCPM-o 2.6 also shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8M-pixel image, which is 75% fewer than most models.** This improves the model's inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can support efficient **multimodal live streaming** on end-side devices such as iPads.
- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#基于-llamacppollamavllm-的高效推理) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick local WebUI demo with [Gradio](#本地-webui-demo-), and (6) an online [demo](https://minicpm-omni-webdemo-us.modelbest.cn/) deployed on a server.
**Model Architecture.**
- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. The model is trained fully end-to-end with only CE loss.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone, which splits and reorganizes the parallel modality streams into sequential information within periodic time slices.
- **Configurable Speech Modeling Design.** We devise a new multimodal system prompt, including a traditional text system prompt and **an audio system prompt that determines the assistant voice**. The model can flexibly control its voice style via text or audio samples at inference time, and supports advanced capabilities such as end-to-end voice cloning and voice creation.
<div align="center">
<img src="./assets/minicpm-o-26-framework-v2.png" , width=80%>
</div>
<br>
### Evaluation <!-- omit in toc -->
<div align="center">
<img src="./assets/radar.jpg" width="80%">
</div>
<details>
<summary>Click to view visual understanding results.</summary>
**Image Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista mini</th>
<th>ChartQA</th>
<th>MMVet</th>
<th>MMStar</th>
<th>MME</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>MMMU val</th>
<th>HallusionBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>MathVerse mini</th>
<th>MathVision</th>
<th>MMHal Score</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="19" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>1088</td>
<td><u>69.9</u></td>
<td>736</td>
<td>61.3</td>
<td>85.7</td>
<td><strong>69.1</strong></td>
<td>63.9</td>
<td>2328.7</td>
<td>82.2</td>
<td>84.6</td>
<td><strong>69.2</strong></td>
<td><strong>55.0</strong></td>
<td>-</td>
<td>92.8</td>
<td><strong>50.2</strong></td>
<td><strong>30.4</strong></td>
<td><u>3.6</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>788</td>
<td>61.6</td>
<td><strong>90.8</strong></td>
<td>66.0</td>
<td>62.2</td>
<td>1920.0</td>
<td>78.5</td>
<td>80.2</td>
<td><u>65.9</u></td>
<td>49.9</td>
<td>-</td>
<td><strong>95.2</strong></td>
<td>-</td>
<td>-</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>754</td>
<td>57.7</td>
<td>81.3</td>
<td>64.0</td>
<td>59.1</td>
<td>2110.6</td>
<td>73.9</td>
<td>79.1</td>
<td>60.6</td>
<td>45.6</td>
<td>73.5</td>
<td>86.5</td>
<td>-</td>
<td>19.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>785</td>
<td>52.4</td>
<td>-</td>
<td>66.9</td>
<td>54.8</td>
<td>2003.4</td>
<td>76.0</td>
<td>77.8</td>
<td>60.0</td>
<td>46.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>3.3</td>
</tr>
<tr>
<td colspan="19" align="left"><strong>Open Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td><u>1820</u></td>
<td>58.3</td>
<td>591</td>
<td>50.3</td>
<td>75.6</td>
<td>53.2</td>
<td>54.2</td>
<td>2049.9</td>
<td>77.8</td>
<td>79.5</td>
<td>50.4</td>
<td>41.6</td>
<td>76.7</td>
<td>75.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>776</td>
<td>51.1</td>
<td>-</td>
<td>58.0</td>
<td>54.8</td>
<td>2018.8</td>
<td>67.9</td>
<td>71.2</td>
<td>46.9</td>
<td>45.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Pixtral-12B</td>
<td>12B</td>
<td>256</td>
<td>61.0</td>
<td>685</td>
<td>56.9</td>
<td>81.8</td>
<td>58.5</td>
<td>54.5</td>
<td>-</td>
<td>72.7</td>
<td>79.0</td>
<td>51.1</td>
<td>47.0</td>
<td>75.7</td>
<td>90.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
<td>27B</td>
<td>672</td>
<td>66.4</td>
<td>809</td>
<td>63.9</td>
<td>86.0</td>
<td>60.0</td>
<td>61.9</td>
<td>2253.0</td>
<td>81.2</td>
<td>83.8</td>
<td>54.0</td>
<td>45.3</td>
<td><u>84.2</u></td>
<td>93.3</td>
<td>-</td>
<td>-</td>
<td>3.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>784</td>
<td>67.1</td>
<td><u>866</u></td>
<td>58.2</td>
<td>83.0</td>
<td>62.0</td>
<td>60.7</td>
<td>2326.0</td>
<td>81.8</td>
<td>83.0</td>
<td>54.1</td>
<td>50.6</td>
<td><strong>84.3</strong></td>
<td><u>94.5</u></td>
<td>31.9</td>
<td>16.3</td>
<td>3.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>182</td>
<td>68.1</td>
<td>741</td>
<td>67.5</td>
<td>83.7</td>
<td>60.6</td>
<td><strong>65.8</strong></td>
<td>2261.0</td>
<td><strong>85.0</strong></td>
<td><u>85.6</u></td>
<td>56.8</td>
<td>49.0</td>
<td>80.5</td>
<td>91.3</td>
<td>39.1</td>
<td>-</td>
<td>3.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>706</td>
<td>68.3</td>
<td>822</td>
<td><u>64.4</u></td>
<td>84.8</td>
<td>62.8</td>
<td>62.8</td>
<td>2344.0</td>
<td><u>83.6</u></td>
<td>84.5</td>
<td>56.0</td>
<td>50.1</td>
<td>79.1</td>
<td>93.0</td>
<td>39.5</td>
<td>19.7</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td>65.2</td>
<td>852*</td>
<td>60.6</td>
<td>79.4</td>
<td>60.0</td>
<td>57.5</td>
<td><u>2348.4*</u></td>
<td>78.0</td>
<td>82.1</td>
<td>49.8*</td>
<td>48.1*</td>
<td>80.1</td>
<td>90.8</td>
<td>25.7</td>
<td>18.3</td>
<td>3.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>70.2</strong></td>
<td><strong>897*</strong></td>
<td><strong>71.9*</strong></td>
<td><u>86.9*</u></td>
<td><u>67.5</u></td>
<td><u>64.0</u></td>
<td><strong>2372.0*</strong></td>
<td>80.5</td>
<td><strong>85.8</strong></td>
<td>50.4*</td>
<td><u>51.9</u></td>
<td>82.0</td>
<td>93.5</td>
<td><u>41.4*</u></td>
<td><u>23.1*</u></td>
<td><strong>3.8</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate these benchmarks using chain-of-thought prompting; for MME, chain-of-thought is used only for the Cognition set.
<sup>+</sup> Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, Token Density is estimated from the API charging strategy.
**Multi-image and Video Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>BLINK val</th>
<th>Mantis Eval</th>
<th>MIRB</th>
<th>Video-MME (wo / w subs)</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td><strong>68</strong></td>
<td>-</td>
<td>-</td>
<td><strong>71.9/77.2</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT4V</td>
<td>-</td>
<td>54.6</td>
<td>62.7</td>
<td>53.1</td>
<td>59.9/63.3</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
<td>14B</td>
<td>52.6</td>
<td>66.4</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>55.4</td>
<td><strong>77.6</strong></td>
<td>-</td>
<td><u>66.2/69.5</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MANTIS 8B</td>
<td>8B</td>
<td>49.1</td>
<td>59.5</td>
<td>34.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>53.2</td>
<td>69.6*</td>
<td><strong>67.6*</strong></td>
<td>63.3/69.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>54.8</td>
<td>67.7</td>
<td>52.5</td>
<td>64.2/66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>53</td>
<td>69.1</td>
<td>53.8</td>
<td>60.9/63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>56.7</u></td>
<td><u>71.9</u></td>
<td><u>58.6</u></td>
<td>63.9/67.9</td>
</tr>
</tbody>
</table>
</div>
* Results evaluated on the officially released model weights.
</details>
<details>
<summary>Click to view audio understanding and speech generation results.</summary>
**Audio Understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="3">ASR (zh)</th>
<th colspan="3">ASR (en)</th>
<th colspan="2">AST</th>
<th>Emotion</th>
</tr>
<tr>
<th align="left">Metric</th>
<td></td>
<th colspan="3">CER↓</th>
<th colspan="3">WER↓</th>
<th colspan="2">BLEU↑</th>
<th>ACC↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<td></td>
<th>AISHELL-1</th>
<th>Fleurs zh</th>
<th>WenetSpeech test-net</th>
<th>LibriSpeech test-clean</th>
<th>GigaSpeech</th>
<th>TED-LIUM</th>
<th>CoVoST en2zh</th>
<th>CoVoST zh2en</th>
<th>MELD emotion</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td>-</td>
<td>7.3*</td>
<td><u>5.4*</u></td>
<td>28.9*</td>
<td>2.6*</td>
<td>12.9*</td>
<td>4.8*</td>
<td>37.1*</td>
<td>15.7*</td>
<td>33.2*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>4.5*</td>
<td>5.9*</td>
<td>14.3*</td>
<td>2.9*</td>
<td>10.6*</td>
<td><strong>3.0*</strong></td>
<td><u>47.3*</u></td>
<td>22.6*</td>
<td>48.4*</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
<td>8B</td>
<td>-</td>
<td>7.5</td>
<td>-</td>
<td><strong>1.6</strong></td>
<td>-</td>
<td>-</td>
<td>45.2</td>
<td><u>24.4</u></td>
<td><strong>55.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
<td>8B</td>
<td>2.6*</td>
<td>6.9*</td>
<td><u>10.3*</u></td>
<td>3.1*</td>
<td><u>9.7</u>*</td>
<td>5.9*</td>
<td>39.5*</td>
<td>22.9*</td>
<td>17.4*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
<td>9B</td>
<td><u>2.5</u></td>
<td>-</td>
<td>-</td>
<td>2.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>1.6</strong></td>
<td><strong>4.4</strong></td>
<td><strong>6.9</strong></td>
<td><u>1.7</u></td>
<td><strong>8.7</strong></td>
<td><strong>3.0</strong></td>
<td><strong>48.2</strong></td>
<td><strong>27.2</strong></td>
<td><u>52.4</u></td>
</tr>
</tbody>
</table>
</div>
* Results evaluated on the officially released model weights.<br><br>
**Speech Generation**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="9">SpeechQA</th>
</tr>
<tr>
<th align="left">Metric</th>
<th></th>
<th colspan="3">ACC↑</th>
<th>G-Eval (10 point)↑</th>
<th>Semantic ELO score↑</th>
<th>Acoustic ELO score↑</th>
<th>Overall ELO score↑</th>
<th>UTMOS↑</th>
<th>ASR-WER↓</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th></th>
<th>Speech Llama Q.</th>
<th>Speech Web Q.</th>
<th>Speech Trivia QA</th>
<th>Speech AlpacaEval</th>
<th colspan="5">AudioArena</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td></td>
<td><strong>71.7</strong></td>
<td><strong>51.6</strong></td>
<td><strong>69.7</strong></td>
<td><strong>7.4</strong></td>
<td><strong>1157</strong></td>
<td><strong>1203</strong></td>
<td><strong>1200</strong></td>
<td><strong>4.2</strong></td>
<td><strong>2.3</strong></td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice</td>
<td>9B</td>
<td>50.0</td>
<td>32.0</td>
<td>36.4</td>
<td><u>5.1</u></td>
<td>999</td>
<td>1147</td>
<td>1035</td>
<td><u>4.1</u></td>
<td><u>11.7</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Llama-Omni</td>
<td>8B</td>
<td>45.3</td>
<td>22.9</td>
<td>10.7</td>
<td>3.9</td>
<td>960</td>
<td>878</td>
<td>897</td>
<td>3.2</td>
<td>24.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>46.7</td>
<td>28.1</td>
<td>23.3</td>
<td>2.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Moshi</td>
<td>7B</td>
<td>43.7</td>
<td>23.8</td>
<td>16.7</td>
<td>2.4</td>
<td>871</td>
<td>808</td>
<td>875</td>
<td>2.8</td>
<td>8.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Omni</td>
<td>1B</td>
<td>22.0</td>
<td>12.8</td>
<td>6.9</td>
<td>2.5</td>
<td>926</td>
<td>803</td>
<td>865</td>
<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>
<td><u>40.0</u></td>
<td><u>40.2</u></td>
<td><u>5.1</u></td>
<td><u>1088</u></td>
<td><u>1163</u></td>
<td><u>1131</u></td>
<td><strong>4.2</strong></td>
<td>9.8</td>
</tr>
</tbody>
</table>
</div>
All results are from <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
**End-to-end Voice Cloning**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th colspan="2">TTS</th>
</tr>
<tr>
<th align="left">Metric</th>
<th>SIMO↑</th>
<th>SIMO↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th>Seed-TTS test-zh</th>
<th>Seed-TTS test-en</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">F5-TTS</td>
<td><strong>76</strong></td>
<td><strong>67</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CosyVoice</td>
<td><u>75</u></td>
<td><u>64</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">FireRedTTS</td>
<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multimodal live streaming results.</summary>
**Multimodal Live Streaming**: results on StreamingBench
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Real-Time Video Understanding</th>
<th>Omni-Source Understanding</th>
<th>Contextual Understanding</th>
<th>Overall</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td><u>77.4</u></td>
<td><strong>67.8</strong></td>
<td><strong>51.1</strong></td>
<td><strong>70.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-202408</td>
<td>-</td>
<td>74.5</td>
<td>51.0</td>
<td><u>48.0</u></td>
<td>64.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
<td>-</td>
<td>74.0</td>
<td>41.4</td>
<td>37.8</td>
<td>59.7</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA-1.5</td>
<td>8B</td>
<td>61.5</td>
<td>37.5</td>
<td>26.7</td>
<td>49.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>63.1</td>
<td>35.9</td>
<td>30.2</td>
<td>50.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
<td>34B</td>
<td>69.8</td>
<td>41.7</td>
<td>34.3</td>
<td>56.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>71.2</td>
<td>40.7</td>
<td>33.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>70.1</td>
<td>42.7</td>
<td>34.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>70.9</td>
<td>40.8</td>
<td>35.8</td>
<td>57.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
<td>8B</td>
<td>74.3</td>
<td>40.8</td>
<td>31.0</td>
<td>58.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
<td>8B</td>
<td>75.4</td>
<td>46.2</td>
<td>33.6</td>
<td>60.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>72.4</td>
<td>40.2</td>
<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>
<td><u>53.4</u></td>
<td>38.5</td>
<td><u>66.0</u></td>
</tr>
</tbody>
</table>
</details>
### Examples <!-- omit in toc -->
The following are demos of MiniCPM-o 2.6 running on an iPad Pro and in the web demo:
<div align="center">
<a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="./assets/minicpmo2_6/2dot6_o_demo_video_img.png", width=70%></a>
</div>
<br>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>

docs/minicpm_v4dot5_en.md (new file, 158 lines)
## MiniCPM-V 4.5
> Archived at: 2026-02-03
**MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces new useful features. Notable features of MiniCPM-V 4.5 include:
- 🔥 **State-of-the-art Vision-Language Capability.**
MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.
- 🎬 **Efficient High-FPS and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve a 96x compression rate for video tokens, where six 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means the model can perceive significantly more video frames without increasing the LLM inference cost, bringing efficient, state-of-the-art high-FPS (up to 10FPS) and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc. A rough token-budget sketch follows this list.
- ⚙️ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking for efficient frequent usage with competitive performance, and deep thinking for more complex problem solving. To cover efficiency and performance trade-offs in different user scenarios, this fast/deep thinking mode can be switched in a highly controlled fashion.
- 💪 **Strong OCR, Document Parsing and Others.**
Based on [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
- 💫 **Easy Usage.**
MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usage!
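A rough token-budget sketch based only on the numbers quoted above (groups of six 448x448 frames compressed into 64 tokens, sampling up to 10FPS); this is illustrative arithmetic, not the model's actual preprocessing code:
```python
# Back-of-the-envelope video token budget using the figures quoted above:
# groups of 6 frames at 448x448 are jointly compressed into 64 video tokens,
# whereas most MLLMs would spend roughly 1,536 tokens on the same 6 frames.
import math

def video_tokens(duration_s: float, fps: float,
                 frames_per_group: int = 6, tokens_per_group: int = 64) -> int:
    """Approximate LLM-side video tokens for a clip under the grouped compression."""
    frames = math.ceil(duration_s * fps)
    groups = math.ceil(frames / frames_per_group)
    return groups * tokens_per_group

# One minute sampled at 10 FPS (the high-FPS setting mentioned above): 600 frames.
print(video_tokens(60, 10))            # 100 groups * 64 tokens = 6,400 tokens
print(math.ceil(600 / 6) * 1536)       # typical baseline budget ~= 153,600 tokens
```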
### Key Techniques <!-- omit in toc -->
<div align="center">
<img src="../assets/minicpm-v-4dot5-framework.png" , width=100%>
</div>
- **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96x compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
- **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents through isolated training approaches. We observe that the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead. A toy sketch of this corruption idea follows this list.
- **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.
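A toy sketch of the text-corruption idea behind the unified OCR / document-knowledge pre-training described above; the noise model, box format, and parameter names here are illustrative assumptions, not the actual training pipeline:
```python
# Toy illustration only: corrupt document text regions with noise of varying strength,
# while always keeping the original text as the reconstruction target. Low noise means
# the model must transcribe (OCR); heavy noise forces it to rely on multimodal context
# and knowledge to reconstruct the text.
import numpy as np

def corrupt_text_regions(image: np.ndarray, boxes, max_noise: float = 1.0, seed: int = 0):
    """Add per-box Gaussian noise of random strength; return the noisy image and noise levels."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32).copy()
    levels = []
    for (x0, y0, x1, y1) in boxes:
        level = rng.uniform(0.0, max_noise)   # 0 = fully visible text, 1 = heavily obscured
        noisy[y0:y1, x0:x1] += rng.normal(0.0, 255.0 * level, noisy[y0:y1, x0:x1].shape)
        levels.append(level)
    return np.clip(noisy, 0, 255).astype(np.uint8), levels

# Blank "document" with two hypothetical text boxes (x0, y0, x1, y1).
doc = np.full((64, 64), 255, dtype=np.uint8)
noisy_doc, levels = corrupt_text_regions(doc, boxes=[(4, 4, 60, 20), (4, 30, 60, 46)])
print([round(lv, 2) for lv in levels])
```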
### Evaluation <!-- omit in toc -->
<div align="center">
<img src="./assets/radar_minicpm_v45.png", width=60%>
</div>
<div align="center">
<img src="./assets/minicpmv_4_5_evaluation_result.png" , width=80%>
</div>
### Inference Efficiency <!-- omit in toc -->
**OpenCompass**
<div align="left">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Avg Score ↑</th>
<th>Total Inference Time ↓</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
<td>10.3B</td>
<td>76.6</td>
<td>17.5h</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiMo-VL-7B-RL</td>
<td>8.3B</td>
<td>76.4</td>
<td>11h</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
<td>8.7B</td>
<td><b>77.0</b></td>
<td><b>7.5h</b></td>
</tr>
</tbody>
</table>
</div>
**Video-MME**
<div align="left">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Avg Score ↑</th>
<th>Total Inference Time ↓</th>
<th>GPU Mem ↓</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>71.6</td>
<td>3h</td>
<td>60G</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
<td>10.3B</td>
<td><b>73.6</b></td>
<td>2.63h</td>
<td>32G</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
<td>8.7B</td>
<td>73.5</td>
<td><b>0.26h</b></td>
<td><b>28G</b></td>
</tr>
</tbody>
</table>
</div>
Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes full model-side computation, and excludes the external cost of video frame extraction (dependent on specific frame extraction tools) for fair comparison.
### Examples <!-- omit in toc -->
<div align="center">
<a href="https://www.youtube.com/watch?v=Cn23FujYMMU"><img src="../assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg", width=70%></a>
</div>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4_5/en_case1.png" alt="en_case1" style="margin-bottom: 5px;">
<img src="../assets/minicpmv4_5/en_case2.png" alt="en_case2" style="margin-bottom: 5px;">
<img src="../assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
</div>
<details>
<summary>Click to view more cases.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4_5/zh_extra.jpeg" alt="zh_extra" style="margin-bottom: 5px;">
</div>
</details>
We deploy MiniCPM-V 4.5 on an iPad M4 with the [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo videos are raw screen recordings without any editing.
<table align="center">
<p align="center">
<img src="../assets/minicpmv4_5/v45_en_handwriting.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4_5/v45_en_cot.gif" width=45%/>
</p>
<p align="center">
<img src="../assets/minicpmv4_5/v45_cn_handwriting.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4_5/v45_cn_travel.gif" width=45%/>
</p>
</table>

docs/minicpm_v4dot5_zh.md (new file, 156 lines)
## MiniCPM-V 4.5
> Archived at: 2026-02-03
**MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces a set of new useful features. Its highlights include:
- 🔥 **State-of-the-art Vision-Language Capability.**
MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, as well as strong open-source models such as Qwen2.5-VL 72B**, making it the most performant multimodal LLM under 30B parameters.
- 🎬 **Efficient High-FPS and Long Video Understanding.**
Powered by a new unified image-video 3D-Resampler, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448x448 video frames are jointly compressed into 64 tokens (most MLLMs need about 1,536 tokens). This means the model can perceive significantly more video frames without increasing LLM inference cost, enabling state-of-the-art high-FPS (up to 10FPS) and long video understanding, with strong and efficient results on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc.
- ⚙️ **Controllable Hybrid Fast/Deep Thinking.**
MiniCPM-V 4.5 supports both fast thinking (for efficient, frequent use with competitive performance) and deep thinking (for complex problem solving). Users can freely switch between the two modes according to the efficiency-performance trade-off of their scenario, enabling a highly controllable reasoning process.
- 💪 **Strong OCR, Document Parsing and Multilingual Capability.**
Based on the [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using only 1/4 of the visual tokens required by most MLLMs. It surpasses proprietary models such as GPT-4o-latest and Gemini 2.5 on OCRBench, and shows state-of-the-art PDF document parsing capability on OmniDocBench. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features trustworthy behaviors, outperforming GPT-4o-latest on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.
- 💫 **Easy Usage.**
MiniCPM-V 4.5 can be used in various convenient ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/master/docs/multimodal/minicpmo4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient local CPU inference; (2) quantized models in 16 sizes in [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) formats; (3) compatibility with SGLang and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm); (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md); (5) quick local [WebUI demo](#chat-with-our-demo-on-gradio); (6) an optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) for iPhone and iPad; and (7) an online [web demo](http://101.126.42.235:30910/). See the [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for more usage.
### Key Techniques <!-- omit in toc -->
- **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. It compresses up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), achieving a 96x compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for single images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
- **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs usually train OCR capability and document knowledge separately in different training stages. We observe that the essential difference between the two lies in the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively switch between accurate text recognition (when text is clear) and multimodal context-based knowledge reasoning (when text is heavily obscured). This frees MiniCPM-V from reliance on error-prone document parsers when learning knowledge from documents, avoids hallucinations caused by over-augmented OCR data, and achieves top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
- **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced experience through two switchable reasoning modes: fast thinking for efficient daily use and deep thinking for complex tasks. With a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly improving fast-mode performance without compromising deep-mode capability. Combined with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), the model generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.
<div align="center">
<img src="../assets/minicpm-v-4dot5-framework.png" , width=80%>
</div>
### Evaluation <!-- omit in toc -->
<div align="center">
<img src="../assets/radar_minicpm_v45.png" width="80%">
</div>
<div align="center">
<img src="../assets/minicpmv_4_5_evaluation_result.png" width="80%">
</div>
### Inference Efficiency <!-- omit in toc -->
**OpenCompass**
<div align="left">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Avg Score ↑</th>
<th>Total Inference Time ↓</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
<td>10.3B</td>
<td>76.6</td>
<td>17.5h</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiMo-VL-7B-RL</td>
<td>8.3B</td>
<td>76.4</td>
<td>11h</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
<td>8.7B</td>
<td><b>77.0</b></td>
<td><b>7.5h</b></td>
</tr>
</tbody>
</table>
</div>
**Video-MME**
<div align="left">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Avg Score ↑</th>
<th>Total Inference Time ↓</th>
<th>GPU Mem ↓</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>71.6</td>
<td>3h</td>
<td>60G</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
<td>10.3B</td>
<td><b>73.6</b></td>
<td>2.63h</td>
<td>32G</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
<td>8.7B</td>
<td>73.5</td>
<td><b>0.26h</b></td>
<td><b>28G</b></td>
</tr>
</tbody>
</table>
</div>
Both OpenCompass and Video-MME were evaluated using 8×A100 GPUs for inference; the reported Video-MME inference time does not include video frame extraction time.
### Examples <!-- omit in toc -->
<div align="center">
<a href="https://www.youtube.com/watch?v=Cn23FujYMMU"><img src="../assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg", width=70%></a>
</div>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4_5/zh_case1.jpeg" alt="zh_case1" style="margin-bottom: 5px;">
<img src="../assets/minicpmv4_5/zh_case2.jpeg" alt="zh_case2" style="margin-bottom: 5px;">
</div>
<details>
<summary>Click to view more cases.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4_5/en_extra.jpg" alt="en_extra" style="margin-bottom: 5px;">
<img src="../assets/minicpmv4_5/en_case3.jpeg" alt="en_extra" style="margin-bottom: 5px;">
</div>
</details>
We deploy MiniCPM-V 4.5 on an iPad M4 with the [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS) and recorded the demo videos below; the videos are unedited screen recordings.
<table align="center">
<p align="center">
<img src="../assets/minicpmv4_5/v45_en_handwriting.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4_5/v45_en_cot.gif" width=45%/>
</p>
<p align="center">
<img src="../assets/minicpmv4_5/v45_cn_handwriting.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4_5/v45_cn_travel.gif" width=45%/>
</p>
</table>