MiniCPM-o/docs/minicpm_o2dot6_zh.md
YuzaChongyi 28632248d5 update minicpm-o 4.5 (#1052)
Co-authored-by: wangchongyi <>
2026-02-04 01:55:48 +08:00

## MiniCPM-o 2.6
> Archived at: 2026-02-02
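MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, with 8B parameters in total, and is trained and run end to end. Compared with MiniCPM-V 2.6, it delivers a significant performance improvement and adds new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- 🔥 **Leading visual capability.**
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular multimodal benchmarks. **With only 8B parameters, it surpasses widely used proprietary multimodal models such as GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.** It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows strong in-context learning capability.
- 🎙 **State-of-the-art speech capability.**
MiniCPM-o 2.6 **supports bilingual (Chinese and English) real-time speech conversation with configurable voices**. It **outperforms GPT-4o-realtime** on audio understanding tasks such as ASR and STT, and shows **the highest speech generation performance among open-source models** in both semantic and acoustic evaluations of speech conversation. It also supports advanced capabilities such as emotion/speed/style control, voice cloning, and role play.
- 🎬 **Strong multimodal live streaming capability.**
As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams and interact with users through real-time speech**. On StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video and audio) understanding, and multimodal contextual understanding, it achieves the best results in the open-source community and **outperforms GPT-4o-202408 and Claude 3.5 Sonnet**.
- 💪 **Strong OCR capability and more.**
MiniCPM-o 2.6 further improves on many of the visual understanding capabilities of MiniCPM-V 2.6. It can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). On OCRBench it achieves **the best result among models under 25B, surpassing proprietary models such as GPT-4o-202405**. Based on the latest [RLHF-V](https://rlhf-v.github.io/), [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/), and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy multimodal behavior**, outperforming GPT-4o and Claude 3.5 on MMHal-Bench, and supports **more than 30 languages**, including English, Chinese, German, French, Italian, and Korean.
- 🚀 **Superior efficiency.**
In addition to its user-friendly size, MiniCPM-o 2.6 shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8-million-pixel image, 75% fewer than most models.** This improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can support efficient **multimodal live streaming** on end-side devices such as iPads.
- 💫 **Easy to use.**
MiniCPM-o 2.6 can be used in several ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) quantized models in [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) formats in 16 sizes, (3) [vLLM](#基于-llamacppollamavllm-的高效推理) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with the [LLaMA-Factory](./docs/llamafactory_train_and_infer.md) framework, (5) a quick local WebUI demo with [Gradio](#本地-webui-demo-), and (6) an online [demo](https://minicpm-omni-webdemo-us.modelbest.cn/) hosted on a server.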
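**Model architecture.**
- **End-to-end omni-modal architecture.** The encoders and decoders of the different modalities are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. The model is trained end to end using only the CE (cross-entropy) loss.
- **Omni-modal live streaming mechanism.** (1) We turn the offline encoders/decoders of the different modalities into online modules suited to **streaming input/output**. (2) We design a **time-division multiplexing (TDM) mechanism for omni-modal stream processing** in the LLM backbone, which splits the parallel streams of different modalities and regroups them into a sequence of periodic time slices.
- **Configurable speech modeling design.** We design a new multimodal system prompt that contains the traditional text system prompt and **an audio system prompt that specifies the model's voice**. The voice style can be flexibly controlled at inference time through text or a speech sample, and advanced capabilities such as end-to-end voice cloning and voice creation are also supported.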
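
The time-division multiplexing idea described above can be illustrated with a small sketch. This is not the model's actual implementation: the `Chunk` type, the one-second slice length, and the video-before-audio ordering inside each slice are all assumptions made for illustration. It only shows how parallel per-modality streams might be split and regrouped into a single sequence of periodic time slices.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for one encoded piece of a modality stream."""
    modality: str   # e.g. "video" or "audio"
    start: float    # start time in seconds
    payload: str    # placeholder for encoded features


def time_division_multiplex(
    streams: Dict[str, List[Chunk]],
    slice_seconds: float = 1.0,                    # assumed slice length
    order: Sequence[str] = ("video", "audio"),     # assumed in-slice ordering
) -> List[Chunk]:
    """Regroup parallel streams into one sequence of periodic time slices."""
    if slice_seconds <= 0:
        raise ValueError("slice_seconds must be positive")

    # Bucket every chunk by the index of the time slice it starts in.
    slices: Dict[int, Dict[str, List[Chunk]]] = {}
    for modality, chunks in streams.items():
        for chunk in chunks:
            index = int(chunk.start // slice_seconds)
            slices.setdefault(index, {}).setdefault(modality, []).append(chunk)

    # Emit slices in time order; inside a slice, follow the modality order,
    # then any modality not named in `order`, sorted by name.
    sequence: List[Chunk] = []
    for index in sorted(slices):
        present = slices[index]
        ordered = [m for m in order if m in present]
        ordered += sorted(m for m in present if m not in order)
        for modality in ordered:
            sequence.extend(sorted(present[modality], key=lambda c: c.start))
    return sequence


if __name__ == "__main__":
    demo = {
        "video": [Chunk("video", 0.0, "v0"), Chunk("video", 1.0, "v1")],
        "audio": [Chunk("audio", 0.0, "a0"), Chunk("audio", 0.5, "a1"),
                  Chunk("audio", 1.0, "a2")],
    }
    print([c.payload for c in time_division_multiplex(demo)])
```

In a sketch like this, the language model would see one interleaved sequence rather than several parallel ones, which is what allows a single backbone to consume video and audio as they arrive.
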
<div align="center">
<img src="./assets/minicpm-o-26-framework-v2.png" width="80%">
</div>
<br>
### Evaluation <!-- omit in toc -->
<div align="center">
<img src="./assets/radar.jpg" width="80%">
</div>
<details>
<summary>Click to view detailed visual understanding results.</summary>
**Image understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista mini</th>
<th>ChartQA</th>
<th>MMVet</th>
<th>MMStar</th>
<th>MME</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>MMMU val</th>
<th>HallusionBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>MathVerse mini</th>
<th>MathVision</th>
<th>MMHal Score</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="19" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>1088</td>
<td><u>69.9</u></td>
<td>736</td>
<td>61.3</td>
<td>85.7</td>
<td><strong>69.1</strong></td>
<td>63.9</td>
<td>2328.7</td>
<td>82.2</td>
<td>84.6</td>
<td><strong>69.2</strong></td>
<td><strong>55.0</strong></td>
<td>-</td>
<td>92.8</td>
<td><strong>50.2</strong></td>
<td><strong>30.4</strong></td>
<td><u>3.6</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>788</td>
<td>61.6</td>
<td><strong>90.8</strong></td>
<td>66.0</td>
<td>62.2</td>
<td>1920.0</td>
<td>78.5</td>
<td>80.2</td>
<td><u>65.9</u></td>
<td>49.9</td>
<td>-</td>
<td><strong>95.2</strong></td>
<td>-</td>
<td>-</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>754</td>
<td>57.7</td>
<td>81.3</td>
<td>64.0</td>
<td>59.1</td>
<td>2110.6</td>
<td>73.9</td>
<td>79.1</td>
<td>60.6</td>
<td>45.6</td>
<td>73.5</td>
<td>86.5</td>
<td>-</td>
<td>19.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>785</td>
<td>52.4</td>
<td>-</td>
<td>66.9</td>
<td>54.8</td>
<td>2003.4</td>
<td>76.0</td>
<td>77.8</td>
<td>60.0</td>
<td>46.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>3.3</td>
</tr>
<tr>
<td colspan="19" align="left"><strong>Open Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td><u>1820</u></td>
<td>58.3</td>
<td>591</td>
<td>50.3</td>
<td>75.6</td>
<td>53.2</td>
<td>54.2</td>
<td>2049.9</td>
<td>77.8</td>
<td>79.5</td>
<td>50.4</td>
<td>41.6</td>
<td>76.7</td>
<td>75.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>776</td>
<td>51.1</td>
<td>-</td>
<td>58.0</td>
<td>54.8</td>
<td>2018.8</td>
<td>67.9</td>
<td>71.2</td>
<td>46.9</td>
<td>45.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Pixtral-12B</td>
<td>12B</td>
<td>256</td>
<td>61.0</td>
<td>685</td>
<td>56.9</td>
<td>81.8</td>
<td>58.5</td>
<td>54.5</td>
<td>-</td>
<td>72.7</td>
<td>79.0</td>
<td>51.1</td>
<td>47.0</td>
<td>75.7</td>
<td>90.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
<td>27B</td>
<td>672</td>
<td>66.4</td>
<td>809</td>
<td>63.9</td>
<td>86.0</td>
<td>60.0</td>
<td>61.9</td>
<td>2253.0</td>
<td>81.2</td>
<td>83.8</td>
<td>54.0</td>
<td>45.3</td>
<td><u>84.2</u></td>
<td>93.3</td>
<td>-</td>
<td>-</td>
<td>3.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>784</td>
<td>67.1</td>
<td><u>866</u></td>
<td>58.2</td>
<td>83.0</td>
<td>62.0</td>
<td>60.7</td>
<td>2326.0</td>
<td>81.8</td>
<td>83.0</td>
<td>54.1</td>
<td>50.6</td>
<td><strong>84.3</strong></td>
<td><u>94.5</u></td>
<td>31.9</td>
<td>16.3</td>
<td>3.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>182</td>
<td>68.1</td>
<td>741</td>
<td>67.5</td>
<td>83.7</td>
<td>60.6</td>
<td><strong>65.8</strong></td>
<td>2261.0</td>
<td><strong>85.0</strong></td>
<td><u>85.6</u></td>
<td>56.8</td>
<td>49.0</td>
<td>80.5</td>
<td>91.3</td>
<td>39.1</td>
<td>-</td>
<td>3.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>706</td>
<td>68.3</td>
<td>822</td>
<td><u>64.4</u></td>
<td>84.8</td>
<td>62.8</td>
<td>62.8</td>
<td>2344.0</td>
<td><u>83.6</u></td>
<td>84.5</td>
<td>56.0</td>
<td>50.1</td>
<td>79.1</td>
<td>93.0</td>
<td>39.5</td>
<td>19.7</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td>65.2</td>
<td>852*</td>
<td>60.6</td>
<td>79.4</td>
<td>60.0</td>
<td>57.5</td>
<td><u>2348.4*</u></td>
<td>78.0</td>
<td>82.1</td>
<td>49.8*</td>
<td>48.1*</td>
<td>80.1</td>
<td>90.8</td>
<td>25.7</td>
<td>18.3</td>
<td>3.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>70.2</strong></td>
<td><strong>897*</strong></td>
<td><strong>71.9*</strong></td>
<td><u>86.9*</u></td>
<td><u>67.5</u></td>
<td><u>64.0</u></td>
<td><strong>2372.0*</strong></td>
<td>80.5</td>
<td><strong>85.8</strong></td>
<td>50.4*</td>
<td><u>51.9</u></td>
<td>82.0</td>
<td>93.5</td>
<td><u>41.4*</u></td>
<td><u>23.1*</u></td>
<td><strong>3.8</strong></td>
</tr>
</tbody>
</table>
</div>
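* We evaluate these benchmarks using chain-of-thought prompting; for MME, chain-of-thought is used only on the Cognition subset.
+ Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., the number of pixels at maximum resolution divided by the number of visual tokens.
Note: for proprietary models, Token Density is estimated from the API pricing scheme.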
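
The Token Density definition above is simple arithmetic, and a short worked example may help. The resolution (1344x1344) and token count (640) come from this document; the function names are illustrative, and the comparison against a density of 784 simply reuses one value from the table rather than any official baseline.

```python
def token_density(width: int, height: int, visual_tokens: int) -> float:
    """Pixels encoded per visual token at a given resolution."""
    if visual_tokens <= 0:
        raise ValueError("visual_tokens must be positive")
    return (width * height) / visual_tokens


def tokens_needed(pixels: int, density: float) -> float:
    """Visual tokens a model with the given density needs for `pixels`."""
    if density <= 0:
        raise ValueError("density must be positive")
    return pixels / density


pixels = 1344 * 1344                        # 1,806,336 pixels (about 1.8M)
density = token_density(1344, 1344, 640)    # 2822.4, reported as 2822

# For comparison only: a model with density 784 (a value from the table)
# would need 2304 tokens for the same image.
baseline_tokens = tokens_needed(pixels, 784)
reduction = 1 - 640 / baseline_tokens       # roughly 0.72

print(f"density={density:.1f}, baseline_tokens={baseline_tokens:.0f}, "
      f"reduction={reduction:.0%}")
```

Against that particular baseline the saving is about 72%; the "75% fewer" figure quoted in the feature list corresponds to a baseline of 2560 tokens for the same image.
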
**Multi-image and video understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>BLINK val</th>
<th>Mantis Eval</th>
<th>MIRB</th>
<th>Video-MME (wo / w subs)</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td><strong>68</strong></td>
<td>-</td>
<td>-</td>
<td><strong>71.9/77.2</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT4V</td>
<td>-</td>
<td>54.6</td>
<td>62.7</td>
<td>53.1</td>
<td>59.9/63.3</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
<td>14B</td>
<td>52.6</td>
<td>66.4</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>55.4</td>
<td><strong>77.6</strong></td>
<td>-</td>
<td><u>66.2/69.5</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MANTIS 8B</td>
<td>8B</td>
<td>49.1</td>
<td>59.5</td>
<td>34.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>53.2</td>
<td>69.6*</td>
<td><strong>67.6*</strong></td>
<td>63.3/69.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>54.8</td>
<td>67.7</td>
<td>52.5</td>
<td>64.2/66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>53</td>
<td>69.1</td>
<td>53.8</td>
<td>60.9/63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>56.7</u></td>
<td><u>71.9</u></td>
<td><u>58.6</u></td>
<td>63.9/67.9</td>
</tr>
</tbody>
</table>
</div>
* Evaluation results of the officially released model weights.
</details>
<details>
<summary>Click to view detailed results on audio understanding and speech generation.</summary>
**Audio understanding**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="3">ASR (zh)</th>
<th colspan="3">ASR (en)</th>
<th colspan="2">AST</th>
<th>Emotion</th>
</tr>
<tr>
<th align="left">Metric</th>
<td></td>
<th colspan="3">CER↓</th>
<th colspan="3">WER↓</th>
<th colspan="2">BLEU↑</th>
<th>ACC↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<td></td>
<th>AISHELL-1</th>
<th>Fleurs zh</th>
<th>WenetSpeech test-net</th>
<th>LibriSpeech test-clean</th>
<th>GigaSpeech</th>
<th>TED-LIUM</th>
<th>CoVoST en2zh</th>
<th>CoVoST zh2en</th>
<th>MELD emotion</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td>-</td>
<td>7.3*</td>
<td><u>5.4*</u></td>
<td>28.9*</td>
<td>2.6*</td>
<td>12.9*</td>
<td>4.8*</td>
<td>37.1*</td>
<td>15.7*</td>
<td>33.2*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>4.5*</td>
<td>5.9*</td>
<td>14.3*</td>
<td>2.9*</td>
<td>10.6*</td>
<td><strong>3.0*</strong></td>
<td><u>47.3*</u></td>
<td>22.6*</td>
<td>48.4*</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
<td>8B</td>
<td>-</td>
<td>7.5</td>
<td>-</td>
<td><strong>1.6</strong></td>
<td>-</td>
<td>-</td>
<td>45.2</td>
<td><u>24.4</u></td>
<td><strong>55.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
<td>8B</td>
<td>2.6*</td>
<td>6.9*</td>
<td><u>10.3*</u></td>
<td>3.1*</td>
<td><u>9.7*</u></td>
<td>5.9*</td>
<td>39.5*</td>
<td>22.9*</td>
<td>17.4*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
<td>9B</td>
<td><u>2.5</u></td>
<td>-</td>
<td>-</td>
<td>2.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>1.6</strong></td>
<td><strong>4.4</strong></td>
<td><strong>6.9</strong></td>
<td><u>1.7</u></td>
<td><strong>8.7</strong></td>
<td><strong>3.0</strong></td>
<td><strong>48.2</strong></td>
<td><strong>27.2</strong></td>
<td><u>52.4</u></td>
</tr>
</tbody>
</table>
</div>
* Evaluation results of the officially released model weights.<br><br>
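
For readers unfamiliar with the CER and WER columns above, the sketch below shows how these error rates are conventionally computed: the edit distance between the reference and the hypothesis, divided by the reference length, over words for WER and over characters for CER. The text normalization used for the reported numbers is not described here, so the whitespace handling below is an assumption.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (an assumed normalization)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substitution and one deletion against a six-word reference.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    # One substituted character against a six-character reference.
    print(cer("今天天气很好", "今天天汽很好"))
```

The table reports these rates as percentages, so a value of 1.6 corresponds to an error rate of 0.016 in the units used by this sketch.
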
**Speech generation**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="9">SpeechQA</th>
</tr>
<tr>
<th align="left">Metric</th>
<th></th>
<th colspan="3">ACC↑</th>
<th>G-Eval (10 point)↑</th>
<th>Semantic ELO score↑</th>
<th>Acoustic ELO score↑</th>
<th>Overall ELO score↑</th>
<th>UTMOS↑</th>
<th>ASR-WER↓</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th></th>
<th>Speech Llama Q.</th>
<th>Speech Web Q.</th>
<th>Speech Trivia QA</th>
<th>Speech AlpacaEval</th>
<th colspan="5">AudioArena</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td></td>
<td><strong>71.7</strong></td>
<td><strong>51.6</strong></td>
<td><strong>69.7</strong></td>
<td><strong>7.4</strong></td>
<td><strong>1157</strong></td>
<td><strong>1203</strong></td>
<td><strong>1200</strong></td>
<td><strong>4.2</strong></td>
<td><strong>2.3</strong></td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice</td>
<td>9B</td>
<td>50.0</td>
<td>32.0</td>
<td>36.4</td>
<td><u>5.1</u></td>
<td>999</td>
<td>1147</td>
<td>1035</td>
<td><u>4.1</u></td>
<td><u>11.7</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Llama-Omni</td>
<td>8B</td>
<td>45.3</td>
<td>22.9</td>
<td>10.7</td>
<td>3.9</td>
<td>960</td>
<td>878</td>
<td>897</td>
<td>3.2</td>
<td>24.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>46.7</td>
<td>28.1</td>
<td>23.3</td>
<td>2.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Moshi</td>
<td>7B</td>
<td>43.7</td>
<td>23.8</td>
<td>16.7</td>
<td>2.4</td>
<td>871</td>
<td>808</td>
<td>875</td>
<td>2.8</td>
<td>8.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Omni</td>
<td>1B</td>
<td>22.0</td>
<td>12.8</td>
<td>6.9</td>
<td>2.5</td>
<td>926</td>
<td>803</td>
<td>865</td>
<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>
<td><u>40.0</u></td>
<td><u>40.2</u></td>
<td><u>5.1</u></td>
<td><u>1088</u></td>
<td><u>1163</u></td>
<td><u>1131</u></td>
<td><strong>4.2</strong></td>
<td>9.8</td>
</tr>
</tbody>
</table>
</div>
All results are obtained with <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
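
The ELO columns above are pairwise-preference ratings. How AudioArena computes them is not described in this document, so the following is only the textbook Elo update, shown to help interpret rating gaps; the 400-point scale and the `k` factor of 32 are conventional defaults, not values taken from the benchmark.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,      # 1.0 if A wins, 0.5 for a tie, 0.0 if A loses
    k: float = 32.0,     # assumed update step size
) -> Tuple[float, float]:
    """Return both ratings after a single pairwise comparison."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated systems; A is preferred in one comparison.
    print(update_elo(1000.0, 1000.0, 1.0))
    # Reading a rating gap: overall scores of 1131 vs. 1200 from the table.
    print(expected_score(1131.0, 1200.0))
```

Under these standard assumptions, a gap of 69 points, such as the overall scores of 1131 and 1200 in the table, would correspond to the lower-rated system being preferred in roughly 40% of head-to-head comparisons.
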
**End-to-end voice cloning**
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th colspan="2">TTS</th>
</tr>
<tr>
<th align="left">Metric</th>
<th>SIMO↑</th>
<th>SIMO↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th>Seed-TTS test-zh</th>
<th>Seed-TTS test-en</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">F5-TTS</td>
<td><strong>76</strong></td>
<td><strong>67</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CosyVoice</td>
<td><u>75</u></td>
<td><u>64</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">FireRedTTS</td>
<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>
</tr>
</tbody>
</table>
</div>
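
The SIMO metric in the table above is commonly understood as speaker similarity between the generated speech and the original reference speech, usually measured as the cosine similarity of speaker embeddings and reported on a 0 to 100 scale. The document does not say which embedding model was used, so the sketch below takes plain vectors as input; the toy vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical speaker embeddings for a reference clip and a cloned clip.
    reference_embedding = [3.0, 4.0]
    cloned_embedding = [4.0, 3.0]
    similarity = cosine_similarity(reference_embedding, cloned_embedding)
    print(f"similarity x 100 = {100 * similarity:.0f}")
```
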
</details>
<details>
<summary>Click to view detailed results on multimodal live streaming.</summary>
**Multimodal live streaming**: results on StreamingBench
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Real-Time Video Understanding</th>
<th>Omni-Source Understanding</th>
<th>Contextual Understanding</th>
<th>Overall</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td><u>77.4</u></td>
<td><strong>67.8</strong></td>
<td><strong>51.1</strong></td>
<td><strong>70.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-202408</td>
<td>-</td>
<td>74.5</td>
<td>51.0</td>
<td><u>48.0</u></td>
<td>64.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
<td>-</td>
<td>74.0</td>
<td>41.4</td>
<td>37.8</td>
<td>59.7</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA-1.5</td>
<td>8B</td>
<td>61.5</td>
<td>37.5</td>
<td>26.7</td>
<td>49.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>63.1</td>
<td>35.9</td>
<td>30.2</td>
<td>50.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
<td>34B</td>
<td>69.8</td>
<td>41.7</td>
<td>34.3</td>
<td>56.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>71.2</td>
<td>40.7</td>
<td>33.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>70.1</td>
<td>42.7</td>
<td>34.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>70.9</td>
<td>40.8</td>
<td>35.8</td>
<td>57.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
<td>8B</td>
<td>74.3</td>
<td>40.8</td>
<td>31.0</td>
<td>58.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
<td>8B</td>
<td>75.4</td>
<td>46.2</td>
<td>33.6</td>
<td>60.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>72.4</td>
<td>40.2</td>
<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>
<td><u>53.4</u></td>
<td>38.5</td>
<td><u>66.0</u></td>
</tr>
</tbody>
</table>
</details>
### Examples <!-- omit in toc -->
The following are demos of MiniCPM-o 2.6 running on an iPad Pro and in the web demo:
<div align="center">
<a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="./assets/minicpmo2_6/2dot6_o_demo_video_img.png" width="70%"></a>
</div>
<br>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>