## MiniCPM-o 2.6
> Archived at: 2026-02-02
**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- 🔥 **Leading Visual Capability.**
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text (STT) translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
- 💪 **Strong OCR Capability and Others.**
Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPads.
- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
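For orientation, here is a minimal sketch of single-image chat through the Hugging Face `transformers` interface, following the pattern used by the MiniCPM model cards. The `init_vision`/`init_audio`/`init_tts` flags and the `chat` signature are assumptions based on that interface and may differ between releases; `example.jpg` is a placeholder path.

```python
# Minimal sketch (assumed interface): single-image chat with MiniCPM-o 2.6
# via Hugging Face transformers. The model card ships its own remote code,
# which defines `chat`; the init_* flags let you skip unused modules.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,       # load the model's own modeling code
    attn_implementation="sdpa",   # or "flash_attention_2" if installed
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=False,             # vision-only session in this sketch
    init_tts=False,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-o-2_6", trust_remote_code=True
)

image = Image.open("example.jpg").convert("RGB")  # placeholder image
msgs = [{"role": "user", "content": [image, "What is in the image?"]}]
print(model.chat(msgs=msgs, tokenizer=tokenizer))
```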
**Model Architecture.**
- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge, using only the CE loss.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (see the sketch after this list).
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and **a new audio system prompt that determines the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
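To make the TDM idea concrete, below is an illustrative sketch, not the model's actual implementation: parallel per-modality streams are cut into small periodic time slices and serialized into one sequential stream for a single backbone. The `Packet` type, modality names, and one-second slice length are all hypothetical.

```python
# Illustrative sketch of time-division multiplexing (TDM) for omni-modal
# streaming: parallel streams are chopped into periodic time slices and
# interleaved into one sequential stream. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Packet:
    modality: str   # e.g. "audio" or "video"
    t_start: float  # wall-clock start time in seconds
    payload: bytes

def tdm_interleave(streams: dict[str, list[Packet]], slice_s: float = 1.0):
    """Yield packets ordered by time slice, then by modality within a slice."""
    packets = [p for stream in streams.values() for p in stream]
    horizon = max(p.t_start for p in packets) if packets else 0.0
    n_slices = int(horizon // slice_s) + 1
    for k in range(n_slices):
        lo, hi = k * slice_s, (k + 1) * slice_s
        # A fixed modality order per slice gives the backbone a
        # deterministic sequential layout of the parallel streams.
        for modality in sorted(streams):
            for p in streams[modality]:
                if lo <= p.t_start < hi:
                    yield p

# Usage: two parallel streams become one sequential, slice-ordered stream.
streams = {
    "audio": [Packet("audio", t / 10, b"") for t in range(30)],  # 10 Hz
    "video": [Packet("video", t / 2, b"") for t in range(6)],    # 2 Hz
}
for pkt in tdm_interleave(streams):
    pass  # feed pkt to the sequential backbone in arrival order
```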
### Evaluation
**Image Understanding**
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| VITA-1.5 | 8B | 784 | 63.3 | 741 | 66.2 | - | 52.7 | 60.2 | 2328.1 | 76.8 | 79.2 | 52.6 | 44.6 | - | - | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
\* We evaluate this benchmark using chain-of-thought prompting; for MME, we apply this technique only to the Cognition set.
\+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: for proprietary models, token density is calculated based on the image-encoding pricing strategy defined in the official API documentation, which provides an upper-bound estimate.
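As a quick sanity check of this definition, the short computation below reproduces the table's token density for MiniCPM-o 2.6 from the numbers stated in the feature list (1344x1344 maximum resolution, 640 visual tokens):

```python
# Token density = pixels at maximum resolution / number of visual tokens.
# Worked example for MiniCPM-o 2.6, using the figures from the feature list.
max_pixels = 1344 * 1344           # 1,806,336 pixels (~1.8M)
visual_tokens = 640
print(max_pixels / visual_tokens)  # ~2822, matching the table entry
```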
**Multi-image and Video Understanding**
| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| VITA-1.5 | 8B | 45.0 | - | - | 56.1/58.7 |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
\* We evaluate officially released checkpoints ourselves.
**Audio Understanding**
Tasks: ASR (zh) on AISHELL-1 / Fleurs zh / WenetSpeech test-net, ASR (en) on LibriSpeech test-clean / GigaSpeech / TED-LIUM, AST on CoVoST en2zh / zh2en, and emotion recognition on MELD emotion.

| Model | Size | AISHELL-1<br>CER↓ | Fleurs zh<br>CER↓ | WenetSpeech test-net<br>CER↓ | LibriSpeech test-clean<br>WER↓ | GigaSpeech<br>WER↓ | TED-LIUM<br>WER↓ | CoVoST en2zh<br>BLEU↑ | CoVoST zh2en<br>BLEU↑ | MELD emotion<br>ACC↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| **Open-Source** | | | | | | | | | | |
| Qwen2-Audio-7B | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-7B-Instruct | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| VITA-1.5 | 8B | 2.16 | - | 8.4 | 3.4 | - | - | - | - | - |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
\* We evaluate officially released checkpoints ourselves.
**Speech Generation**
All results are on the SpeechQA task.

| Model | Size | Speech Llama Q.<br>ACC↑ | Speech Web Q.<br>ACC↑ | Speech Trivia QA<br>ACC↑ | Speech AlpacaEval<br>G-Eval (10 point)↑ | AudioArena<br>Semantic ELO score↑ | AudioArena<br>Acoustic ELO score↑ | AudioArena<br>Overall ELO score↑ | AudioArena<br>UTMOS↑ | AudioArena<br>ASR-WER↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| **Open-Source** | | | | | | | | | | |
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| VITA-1.5 | 8B | 46.7 | 28.1 | 23.3 | 2.0 | - | - | - | - | - |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
All results are from AudioEvals, where the evaluation methods and further details can be found.
**End-to-end Voice Cloning**
| Model | Seed-TTS test-zh<br>SIMO↑ | Seed-TTS test-en<br>SIMO↑ |
|---|---|---|
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |
**Multimodal Live Streaming**: results on StreamingBench
| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| **Open-source** | | | | | |
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
### Examples
We deploy MiniCPM-o 2.6 on end-side devices. The demo video is a raw, real-speed recording of the web demo running on an iPad Pro.