MiniCPM-o 2.6
Archived at: 2026-02-02
MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- 🔥 Leading Visual Capability. MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single-image understanding. It also outperforms GPT-4V and Claude 3.5 Sonnet in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 State-of-the-art Speech Capability. MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices in English and Chinese. It outperforms GPT-4o-realtime on audio understanding tasks such as ASR and speech-to-text translation, and shows state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
- 🎬 Strong Multimodal Live Streaming Capability. As a new feature, MiniCPM-o 2.6 can accept continuous video and audio streams independent of user queries, and support real-time speech interaction. It outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
- 💪 Strong OCR Capability and Others. Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench among models under 25B, surpassing proprietary models such as GPT-4o-202405. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.
- 🚀 Superior Efficiency. In addition to its friendly size, MiniCPM-o 2.6 also shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M-pixel image, which is 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support multimodal live streaming on end-side devices such as iPads.
- 💫 Easy Usage. MiniCPM-o 2.6 can be easily used in various ways: (1) llama.cpp support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with LLaMA-Factory, (5) a quick local WebUI demo, and (6) an online web demo on server. A minimal inference sketch follows this list.
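As a concrete starting point, the snippet below sketches single-image inference through Hugging Face Transformers. It is a minimal sketch, assuming the `model.chat` interface carried over from the MiniCPM-V series and the model ID `openbmb/MiniCPM-o-2_6`; consult the official model card for the exact loading flags (e.g., the vision/audio/TTS initialization options needed for full omni-modal use).

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = 'openbmb/MiniCPM-o-2_6'  # assumed Hugging Face model ID

# Load the model with its custom chat code; bfloat16 keeps memory modest.
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Single-image question answering via the assumed chat interface.
image = Image.open('example.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]
print(model.chat(msgs=msgs, tokenizer=tokenizer))
```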
Model Architecture
- End-to-end Omni-modal Architecture. Different modality encoders/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge, using only the cross-entropy (CE) loss.
- Omni-modal Live Streaming Mechanism. (1) We change the offline modality encoders/decoders into online ones for streaming inputs/outputs. (2) We devise a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone: it divides the parallel omni-modality streams into sequential information within small periodic time slices (see the sketch after this list).
- Configurable Speech Modeling Design. We devise a multimodal system prompt that combines a traditional text system prompt with a new audio system prompt determining the assistant voice. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation (a hypothetical message-format sketch also follows this list).
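To illustrate the TDM mechanism, the sketch below serializes parallel video and audio chunk streams into one time-ordered sequence, so that all modalities of a slice are consumed before any chunk of the next slice. This is a conceptual illustration under simplified assumptions (the `Chunk` structure and the 0.5 s slicing are ours), not the released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    modality: str   # 'audio' or 'video'
    t_start: float  # start time of this slice in seconds
    tokens: list = field(default_factory=list)  # encoded tokens for the slice

def tdm_interleave(streams):
    """Flatten parallel per-modality chunk streams into one sequential token
    stream, ordered by time slice; modalities within a slice are interleaved."""
    chunks = sorted((c for s in streams for c in s),
                    key=lambda c: (c.t_start, c.modality))
    return [tok for c in chunks for tok in c.tokens]

# One second of a hypothetical stream, sliced every 0.5 s.
video = [Chunk('video', 0.0, ['<v0>']), Chunk('video', 0.5, ['<v1>'])]
audio = [Chunk('audio', 0.0, ['<a0>', '<a1>']), Chunk('audio', 0.5, ['<a2>'])]
print(tdm_interleave([video, audio]))
# ['<a0>', '<a1>', '<v0>', '<a2>', '<v1>'] -- slice 0 fully precedes slice 1
```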
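The configurable speech design can likewise be pictured as a multimodal system prompt pairing text instructions with a reference audio clip that fixes the assistant's voice. The message layout below is a hypothetical illustration (the field structure and `reference_voice.wav` are assumptions, not the released API; the official repository documents the exact format).

```python
import librosa

# Hypothetical reference clip whose timbre the assistant should adopt.
ref_voice, _ = librosa.load('reference_voice.wav', sr=16000, mono=True)

# Multimodal system prompt: a traditional text system prompt plus an audio
# system prompt; the audio part determines the assistant voice at inference.
system_prompt = {
    'role': 'system',
    'content': [
        'You are a helpful voice assistant; speak in the voice provided.',  # text part
        ref_voice,                                                          # audio part (assumed format)
    ],
}
```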
Evaluation
Visual Understanding
Image Understanding
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | ||||||||||||||||||
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| Open-source ||||||||||||||||||
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| VITA-1.5 | 8B | 784 | 63.3 | 741 | 66.2 | - | 52.7 | 60.2 | 2328.1 | 76.8 | 79.2 | 52.6 | 44.6 | - | - | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
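As a sanity check on this formula, the figures reported above for MiniCPM-o 2.6 (a 1344x1344 maximum-resolution image encoded into 640 visual tokens, per the efficiency note) reproduce the table's token density:

```python
# Token density = # pixels at maximum resolution / # visual tokens.
max_pixels = 1344 * 1344     # 1,806,336 pixels (~1.8M)
visual_tokens = 640          # visual tokens produced for that image
print(round(max_pixels / visual_tokens))  # 2822, matching the table entry
```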
Multi-image and Video Understanding
| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| Proprietary | |||||
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT-4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| Open-source | |||||
| VITA-1.5 | 8B | 45.0 | - | - | 56.1/58.7 |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
Audio Understanding and Speech Conversation
Audio Understanding
| Model | Size | AISHELL-1 (zh ASR, CER↓) | Fleurs zh (zh ASR, CER↓) | WenetSpeech test-net (zh ASR, CER↓) | LibriSpeech test-clean (en ASR, WER↓) | GigaSpeech (en ASR, WER↓) | TED-LIUM (en ASR, WER↓) | CoVoST en2zh (AST, BLEU↑) | CoVoST zh2en (AST, BLEU↑) | MELD (emotion, ACC↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | ||||||||||
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| Open-source ||||||||||
| Qwen2-Audio-7B | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-7B-Instruct | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| VITA-1.5 | 8B | 2.16 | - | 8.4 | 3.4 | - | - | - | - | - |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
Speech Generation
| Model | Size | Speech Llama Q. (ACC↑) | Speech Web Q. (ACC↑) | Speech Trivia QA (ACC↑) | Speech AlpacaEval (G-Eval, 10-point↑) | AudioArena Semantic ELO↑ | AudioArena Acoustic ELO↑ | AudioArena Overall ELO↑ | AudioArena UTMOS↑ | AudioArena ASR-WER↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | ||||||||||
| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| Open-source ||||||||||
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| VITA-1.5 | 8B | 46.7 | 28.1 | 23.3 | 2.0 | - | - | - | - | - |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
End-to-end Voice Cloning
| Model | Seed-TTS test-zh (SIMO↑) | Seed-TTS test-en (SIMO↑) |
|---|---|---|
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |
Multimodal Live Streaming
Results on StreamingBench
| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
|---|---|---|---|---|---|
| Proprietary ||||||
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| Open-source ||||||
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
Examples
We deploy MiniCPM-o 2.6 on end-side devices. The demo videos are raw, real-speed recordings from an iPad Pro and a web demo.





