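## MiniCPM-o 2.6

> Archived on: 2026-02-02

MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. It is built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, with 8B parameters in total, and is trained and run end to end. Compared with MiniCPM-V 2.6, it delivers significantly better performance and adds support for real-time speech conversation and multimodal live streaming. The main features of MiniCPM-o 2.6 are:
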
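- 🔥 **Leading Visual Capability.**
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular multimodal benchmarks. **With only 8B parameters, it surpasses widely used proprietary multimodal models such as GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image understanding.** It also **outperforms GPT-4V and Claude 3.5 Sonnet** on multi-image and video understanding, and shows strong in-context learning ability.

- 🎙 **Outstanding Speech Capability.**
  MiniCPM-o 2.6 **supports bilingual (Chinese and English) real-time speech conversation with configurable voices**. It **outperforms GPT-4o-realtime** on speech understanding tasks such as ASR and speech-to-text (STT) translation, and shows **the best speech generation performance among open-source models** in both semantic and acoustic evaluations of speech conversation. It also supports advanced capabilities such as emotion/speed/style control, voice cloning, and role play.

- 🎬 **Strong Multimodal Live Streaming Capability.**
  As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams and interact with the user by voice in real time**. On StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video and audio) understanding, and multimodal contextual understanding, MiniCPM-o 2.6 achieves the best result among open-source models and **outperforms GPT-4o-202408 and Claude 3.5 Sonnet**.

- 💪 **Strong OCR Capability and Others.**
  MiniCPM-o 2.6 further improves the visual understanding capabilities of MiniCPM-V 2.6. It can process images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344). On OCRBench it achieves the **best result among models under 25B, surpassing proprietary models such as GPT-4o-202405**. Building on the latest [RLHF-V](https://rlhf-v.github.io/), [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/), and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it shows **trustworthy multimodal behavior**, outperforming GPT-4o and Claude 3.5 on MMHal-Bench, and supports **more than 30 languages**, including English, Chinese, German, French, Italian, and Korean.

- 🚀 **Superior Efficiency.**
  Beyond a model size that is friendly to individual users, MiniCPM-o 2.6 shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). It **needs only 640 tokens to process a 1.8-million-pixel image, 75% fewer than most models**. This improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can support efficient **multimodal live streaming** on end-side devices such as the iPad.


- 💫 **Easy Usage.**
  MiniCPM-o 2.6 can be used in several ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) quantized models in [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) formats, in 16 sizes, (3) [vLLM](#基于-llamacppollamavllm-的高效推理) support for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) a quick local WebUI demo with [Gradio](#本地-webui-demo-), and (6) an online [demo](https://minicpm-omni-webdemo-us.modelbest.cn/) hosted on a server.

**Model Architecture.**

- **End-to-end Omni-modal Architecture.** The encoders and decoders of different modalities are connected and trained **end to end** to make full use of rich multimodal knowledge. The model is trained entirely end to end with cross-entropy (CE) loss.
- **Omni-modal Live Streaming Mechanism.** (1) We turn the offline encoders and decoders of each modality into online modules that support **streaming input and output**. (2) We design a **time-division multiplexing mechanism for omni-modal streaming** in the LLM backbone, which splits the parallel streams of different modalities and regroups them into a sequence of periodic time slices.
- **Configurable Voice Design.** We design a new multimodal system prompt that contains the traditional text system prompt and **a speech system prompt that specifies the model's voice**. At inference time, the voice style can be controlled flexibly through text or a speech sample, which also enables advanced capabilities such as end-to-end voice cloning and new voice creation.
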
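To make the time-division multiplexing idea above more concrete, here is a minimal sketch in Python. It is not the model's actual implementation: the function name `multiplex`, the chunk labels, and the assumption that every stream is already cut into equal-length time slices are all hypothetical and chosen only for illustration.

```python
from itertools import zip_longest


def multiplex(streams, order):
    """Interleave per-modality chunk lists into one sequence of time slices.

    streams: dict mapping a modality name to a list of chunks, where chunk i
             covers the i-th time slice of that stream (assumed equal length).
    order:   modality names in the order they should appear inside a slice.
    """
    sequence = []
    columns = [streams.get(name, []) for name in order]
    for t, chunks in enumerate(zip_longest(*columns)):
        for name, chunk in zip(order, chunks):
            if chunk is not None:  # a stream may end earlier than the others
                sequence.append((t, name, chunk))
    return sequence


# Hypothetical parallel streams: three video chunks and two audio chunks.
video = ["v0", "v1", "v2"]
audio = ["a0", "a1"]
slices = multiplex({"video": video, "audio": audio}, order=["video", "audio"])
for item in slices:
    print(item)
```

Each tuple carries the slice index, the modality, and the chunk, so a single sequential consumer sees the parallel streams as one periodic sequence: video and audio for slice 0, then video and audio for slice 1, and so on.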
<div align="center">
<img src="./assets/minicpm-o-26-framework-v2.png" width="80%">
</div>

<br>


### Evaluation <!-- omit in toc -->

<div align="center">
<img src="./assets/radar.jpg" width="80%">
</div>

<details>
<summary>Click to view detailed visual understanding results.</summary>

**Image Understanding**

<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista mini</th>
<th>ChartQA</th>
<th>MMVet</th>
<th>MMStar</th>
<th>MME</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>MMMU val</th>
<th>HallusionBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>MathVerse mini</th>
<th>MathVision</th>
<th>MMHal Score</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="19" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>1088</td>
<td><u>69.9</u></td>
<td>736</td>
<td>61.3</td>
<td>85.7</td>
<td><strong>69.1</strong></td>
<td>63.9</td>
<td>2328.7</td>
<td>82.2</td>
<td>84.6</td>
<td><strong>69.2</strong></td>
<td><strong>55.0</strong></td>
<td>-</td>
<td>92.8</td>
<td><strong>50.2</strong></td>
<td><strong>30.4</strong></td>
<td><u>3.6</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>788</td>
<td>61.6</td>
<td><strong>90.8</strong></td>
<td>66.0</td>
<td>62.2</td>
<td>1920.0</td>
<td>78.5</td>
<td>80.2</td>
<td><u>65.9</u></td>
<td>49.9</td>
<td>-</td>
<td><strong>95.2</strong></td>
<td>-</td>
<td>-</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>754</td>
<td>57.7</td>
<td>81.3</td>
<td>64.0</td>
<td>59.1</td>
<td>2110.6</td>
<td>73.9</td>
<td>79.1</td>
<td>60.6</td>
<td>45.6</td>
<td>73.5</td>
<td>86.5</td>
<td>-</td>
<td>19.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>785</td>
<td>52.4</td>
<td>-</td>
<td>66.9</td>
<td>54.8</td>
<td>2003.4</td>
<td>76.0</td>
<td>77.8</td>
<td>60.0</td>
<td>46.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>3.3</td>
</tr>
<tr>
<td colspan="19" align="left"><strong>Open Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td><u>1820</u></td>
<td>58.3</td>
<td>591</td>
<td>50.3</td>
<td>75.6</td>
<td>53.2</td>
<td>54.2</td>
<td>2049.9</td>
<td>77.8</td>
<td>79.5</td>
<td>50.4</td>
<td>41.6</td>
<td>76.7</td>
<td>75.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>776</td>
<td>51.1</td>
<td>-</td>
<td>58.0</td>
<td>54.8</td>
<td>2018.8</td>
<td>67.9</td>
<td>71.2</td>
<td>46.9</td>
<td>45.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Pixtral-12B</td>
<td>12B</td>
<td>256</td>
<td>61.0</td>
<td>685</td>
<td>56.9</td>
<td>81.8</td>
<td>58.5</td>
<td>54.5</td>
<td>-</td>
<td>72.7</td>
<td>79.0</td>
<td>51.1</td>
<td>47.0</td>
<td>75.7</td>
<td>90.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
<td>27B</td>
<td>672</td>
<td>66.4</td>
<td>809</td>
<td>63.9</td>
<td>86.0</td>
<td>60.0</td>
<td>61.9</td>
<td>2253.0</td>
<td>81.2</td>
<td>83.8</td>
<td>54.0</td>
<td>45.3</td>
<td><u>84.2</u></td>
<td>93.3</td>
<td>-</td>
<td>-</td>
<td>3.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>784</td>
<td>67.1</td>
<td><u>866</u></td>
<td>58.2</td>
<td>83.0</td>
<td>62.0</td>
<td>60.7</td>
<td>2326.0</td>
<td>81.8</td>
<td>83.0</td>
<td>54.1</td>
<td>50.6</td>
<td><strong>84.3</strong></td>
<td><u>94.5</u></td>
<td>31.9</td>
<td>16.3</td>
<td>3.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>182</td>
<td>68.1</td>
<td>741</td>
<td>67.5</td>
<td>83.7</td>
<td>60.6</td>
<td><strong>65.8</strong></td>
<td>2261.0</td>
<td><strong>85.0</strong></td>
<td><u>85.6</u></td>
<td>56.8</td>
<td>49.0</td>
<td>80.5</td>
<td>91.3</td>
<td>39.1</td>
<td>-</td>
<td>3.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>706</td>
<td>68.3</td>
<td>822</td>
<td><u>64.4</u></td>
<td>84.8</td>
<td>62.8</td>
<td>62.8</td>
<td>2344.0</td>
<td><u>83.6</u></td>
<td>84.5</td>
<td>56.0</td>
<td>50.1</td>
<td>79.1</td>
<td>93.0</td>
<td>39.5</td>
<td>19.7</td>
<td>3.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td>65.2</td>
<td>852*</td>
<td>60.6</td>
<td>79.4</td>
<td>60.0</td>
<td>57.5</td>
<td><u>2348.4*</u></td>
<td>78.0</td>
<td>82.1</td>
<td>49.8*</td>
<td>48.1*</td>
<td>80.1</td>
<td>90.8</td>
<td>25.7</td>
<td>18.3</td>
<td>3.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>70.2</strong></td>
<td><strong>897*</strong></td>
<td><strong>71.9*</strong></td>
<td><u>86.9*</u></td>
<td><u>67.5</u></td>
<td><u>64.0</u></td>
<td><strong>2372.0*</strong></td>
<td>80.5</td>
<td><strong>85.8</strong></td>
<td>50.4*</td>
<td><u>51.9</u></td>
<td>82.0</td>
<td>93.5</td>
<td><u>41.4*</u></td>
<td><u>23.1*</u></td>
<td><strong>3.8</strong></td>
</tr>
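</tbody>
</table>
</div>
* We evaluate these benchmarks using chain-of-thought prompting. For MME, chain of thought is used only on the Cognition tasks.
+ Token Density: the number of pixels encoded into each visual token at the maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: for proprietary models, Token Density is estimated from how their APIs charge for image inputs.
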
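As a quick check of the Token Density definition above, the following Python sketch reproduces the 2822 value listed for MiniCPM-o 2.6 from the 1344x1344 resolution and the 640-token figure quoted in the feature list. The comparison density of 706 is taken from the InternVL2.5-8B row of the table purely as an example; the helper names are hypothetical, and this is only one way to read the "75% fewer tokens" claim, not the authors' stated derivation.

```python
def token_density(width, height, visual_tokens):
    """Pixels encoded per visual token for a width x height image."""
    return width * height / visual_tokens


def tokens_needed(width, height, density):
    """Visual tokens a model with the given token density would need."""
    return width * height / density


# MiniCPM-o 2.6: a 1344x1344 image (about 1.8 million pixels) in 640 tokens.
density = token_density(1344, 1344, 640)

# Illustrative comparison against a model with a token density of 706.
baseline = tokens_needed(1344, 1344, 706)
saving = 1 - 640 / baseline

print(round(density))       # 2822
print(round(saving * 100))  # 75
```

Under these assumptions, a model with a density of about 706 needs roughly four times as many visual tokens for the same image, which is consistent with the 75% reduction mentioned earlier.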
**Multi-image and Video Understanding**

<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>BLINK val</th>
<th>Mantis Eval</th>
<th>MIRB</th>
<th>Video-MME (wo / w subs)</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td><strong>68</strong></td>
<td>-</td>
<td>-</td>
<td><strong>71.9/77.2</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT4V</td>
<td>-</td>
<td>54.6</td>
<td>62.7</td>
<td>53.1</td>
<td>59.9/63.3</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
<td>14B</td>
<td>52.6</td>
<td>66.4</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
<td>72B</td>
<td>55.4</td>
<td><strong>77.6</strong></td>
<td>-</td>
<td><u>66.2/69.5</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MANTIS 8B</td>
<td>8B</td>
<td>49.1</td>
<td>59.5</td>
<td>34.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>53.2</td>
<td>69.6*</td>
<td><strong>67.6*</strong></td>
<td>63.3/69.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8B</td>
<td>54.8</td>
<td>67.7</td>
<td>52.5</td>
<td>64.2/66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>53</td>
<td>69.1</td>
<td>53.8</td>
<td>60.9/63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>56.7</u></td>
<td><u>71.9</u></td>
<td><u>58.6</u></td>
<td>63.9/67.9</td>
</tr>
</tbody>
</table>

</div>
* Evaluation results of the officially released model weights.

</details>

<details>
<summary>Click to view detailed speech understanding and generation results.</summary>

**Speech Understanding**

<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="3">ASR (zh)</th>
<th colspan="3">ASR (en)</th>
<th colspan="2">AST</th>
<th>Emotion</th>
</tr>
<tr>
<th align="left">Metric</th>
<td></td>
<th colspan="3">CER↓</th>
<th colspan="3">WER↓</th>
<th colspan="2">BLEU↑</th>
<th>ACC↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<td></td>
<th>AISHELL-1</th>
<th>Fleurs zh</th>
<th>WenetSpeech test-net</th>
<th>LibriSpeech test-clean</th>
<th>GigaSpeech</th>
<th>TED-LIUM</th>
<th>CoVoST en2zh</th>
<th>CoVoST zh2en</th>
<th>MELD emotion</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td>-</td>
<td>7.3*</td>
<td><u>5.4*</u></td>
<td>28.9*</td>
<td>2.6*</td>
<td>12.9*</td>
<td>4.8*</td>
<td>37.1*</td>
<td>15.7*</td>
<td>33.2*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>4.5*</td>
<td>5.9*</td>
<td>14.3*</td>
<td>2.9*</td>
<td>10.6*</td>
<td><strong>3.0*</strong></td>
<td><u>47.3*</u></td>
<td>22.6*</td>
<td>48.4*</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
<td>8B</td>
<td>-</td>
<td>7.5</td>
<td>-</td>
<td><strong>1.6</strong></td>
<td>-</td>
<td>-</td>
<td>45.2</td>
<td><u>24.4</u></td>
<td><strong>55.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
<td>8B</td>
<td>2.6*</td>
<td>6.9*</td>
<td><u>10.3*</u></td>
<td>3.1*</td>
<td><u>9.7*</u></td>
<td>5.9*</td>
<td>39.5*</td>
<td>22.9*</td>
<td>17.4*</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
<td>9B</td>
<td><u>2.5</u></td>
<td>-</td>
<td>-</td>
<td>2.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>1.6</strong></td>
<td><strong>4.4</strong></td>
<td><strong>6.9</strong></td>
<td><u>1.7</u></td>
<td><strong>8.7</strong></td>
<td><strong>3.0</strong></td>
<td><strong>48.2</strong></td>
<td><strong>27.2</strong></td>
<td><u>52.4</u></td>
</tr>
</tbody>
</table>
</div>
* Evaluation results of the officially released model weights.<br><br>

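For readers unfamiliar with the CER and WER columns above, here is a minimal Python sketch of how such error rates are commonly computed from an edit distance. It is a generic illustration and not the evaluation code behind the table: real benchmarks usually apply text normalization first, and the example sentences below are made up. The table reports these rates as percentages.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty, whitespace-separated reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference, ignores spaces."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Made-up examples: 2 word errors over 6 words, 1 character error over 6.
print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("今天天气很好", "今天天汽很好"))
```

CER is the usual choice for Chinese because the text has no whitespace word boundaries, while WER is standard for English; lower is better for both.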
**Speech Generation**

<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th>Size</th>
<th colspan="9">SpeechQA</th>
</tr>
<tr>
<th align="left">Metric</th>
<th></th>
<th colspan="3">ACC↑</th>
<th>G-Eval (10 point)↑</th>
<th>Semantic ELO score↑</th>
<th>Acoustic ELO score↑</th>
<th>Overall ELO score↑</th>
<th>UTMOS↑</th>
<th>ASR-WER↓</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th></th>
<th>Speech Llama Q.</th>
<th>Speech Web Q.</th>
<th>Speech Trivia QA</th>
<th>Speech AlpacaEval</th>
<th colspan="5">AudioArena</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
<td></td>
<td><strong>71.7</strong></td>
<td><strong>51.6</strong></td>
<td><strong>69.7</strong></td>
<td><strong>7.4</strong></td>
<td><strong>1157</strong></td>
<td><strong>1203</strong></td>
<td><strong>1200</strong></td>
<td><strong>4.2</strong></td>
<td><strong>2.3</strong></td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-Source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4-Voice</td>
<td>9B</td>
<td>50.0</td>
<td>32.0</td>
<td>36.4</td>
<td><u>5.1</u></td>
<td>999</td>
<td>1147</td>
<td>1035</td>
<td><u>4.1</u></td>
<td><u>11.7</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Llama-Omni</td>
<td>8B</td>
<td>45.3</td>
<td>22.9</td>
<td>10.7</td>
<td>3.9</td>
<td>960</td>
<td>878</td>
<td>897</td>
<td>3.2</td>
<td>24.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>46.7</td>
<td>28.1</td>
<td>23.3</td>
<td>2.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Moshi</td>
<td>7B</td>
<td>43.7</td>
<td>23.8</td>
<td>16.7</td>
<td>2.4</td>
<td>871</td>
<td>808</td>
<td>875</td>
<td>2.8</td>
<td>8.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Omni</td>
<td>1B</td>
<td>22.0</td>
<td>12.8</td>
<td>6.9</td>
<td>2.5</td>
<td>926</td>
<td>803</td>
<td>865</td>
<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>
<td><u>40.0</u></td>
<td><u>40.2</u></td>
<td><u>5.1</u></td>
<td><u>1088</u></td>
<td><u>1163</u></td>
<td><u>1131</u></td>
<td><strong>4.2</strong></td>
<td>9.8</td>
</tr>
</tbody>
</table>
</div>
All results are from <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>

**End-to-end Voice Cloning**

<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Task</th>
<th colspan="2">TTS</th>
</tr>
<tr>
<th align="left">Metric</th>
<th>SIMO↑</th>
<th>SIMO↑</th>
</tr>
<tr>
<th align="left">Dataset</th>
<th>Seed-TTS test-zh</th>
<th>Seed-TTS test-en</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td nowrap="nowrap" align="left">F5-TTS</td>
<td><strong>76</strong></td>
<td><strong>67</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CosyVoice</td>
<td><u>75</u></td>
<td><u>64</u></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">FireRedTTS</td>
<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>
</tr>
</tbody>
</table>
</div>

</details>

<details>
<summary>Click to view detailed multimodal live streaming results.</summary>

**Multimodal Live Streaming**: results on StreamingBench

<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Real-Time Video Understanding</th>
<th>Omni-Source Understanding</th>
<th>Contextual Understanding</th>
<th>Overall</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td><u>77.4</u></td>
<td><strong>67.8</strong></td>
<td><strong>51.1</strong></td>
<td><strong>70.3</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-202408</td>
<td>-</td>
<td>74.5</td>
<td>51.0</td>
<td><u>48.0</u></td>
<td>64.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
<td>-</td>
<td>74.0</td>
<td>41.4</td>
<td>37.8</td>
<td>59.7</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA-1.5</td>
<td>8B</td>
<td>61.5</td>
<td>37.5</td>
<td>26.7</td>
<td>49.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>63.1</td>
<td>35.9</td>
<td>30.2</td>
<td>50.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
<td>34B</td>
<td>69.8</td>
<td>41.7</td>
<td>34.3</td>
<td>56.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
<td>8B</td>
<td>71.2</td>
<td>40.7</td>
<td>33.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>70.1</td>
<td>42.7</td>
<td>34.1</td>
<td>57.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VITA-1.5</td>
<td>8B</td>
<td>70.9</td>
<td>40.8</td>
<td>35.8</td>
<td>57.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
<td>8B</td>
<td>74.3</td>
<td>40.8</td>
<td>31.0</td>
<td>58.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
<td>8B</td>
<td>75.4</td>
<td>46.2</td>
<td>33.6</td>
<td>60.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td>72.4</td>
<td>40.2</td>
<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>
<td><u>53.4</u></td>
<td>38.5</td>
<td><u>66.0</u></td>
</tr>
</tbody>
</table>

</details>


### Examples <!-- omit in toc -->

The following are demos of MiniCPM-o 2.6 running on an iPad Pro and in the web demo:


<div align="center">
<a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="./assets/minicpmo2_6/2dot6_o_demo_video_img.png" width="70%"></a>
</div>
<br>


<div style="display: flex; flex-direction: column; align-items: center;">
<img src="assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
<img src="assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>