diff --git a/README.md b/README.md
index e8e6c29..a498122 100644
--- a/README.md
+++ b/README.md
@@ -1,43 +1,54 @@
+
-**A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone**
+**A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone**
[中文](./README_zh.md) |
English
-Join our 💬 WeChat | View MiniCPM-V 📖 best practices
+
+
+
+
+ WeChat |
+
+
+
+
+ Discord
+
+
-  MiniCPM-V 2.6 🤗 🤖 | MiniCPM-Llama3-V 2.5 🤗 🤖 |
-  MiniCPM-Llama3-V 2.5 Technical Report
+  MiniCPM-o 2.6 🤗 CN🤖 US🤖 | MiniCPM-V 2.6 🤗 🤖 |
+  Technical Blog Coming Soon
+
+| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| **Proprietary** | | | | | | | | | | | | | | | | | | |
+| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
+| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
+| Gemini-1.5-Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
+| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
+| **Open Source** | | | | | | | | | | | | | | | | | | |
+| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
+| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
+| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
+| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
+| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
+| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
+| InternVL-2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
+| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
+| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
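One way to read the Token Density column above (this formula is our assumption, not stated in the tables): the number of image pixels each visual token encodes at the model's maximum supported resolution, so higher means more visual information packed into every token the LLM must process. The 2822 figure for MiniCPM-o/V 2.6 is consistent with encoding an ~1.8-megapixel image (e.g. 1344×1344) into 640 visual tokens:

```python
def token_density(pixels: int, num_visual_tokens: int) -> int:
    """Pixels encoded per visual token at maximum resolution.

    Assumed formula (illustrative, not from the tables above): total image
    pixels divided by the number of visual tokens the encoder emits.
    """
    return round(pixels / num_visual_tokens)

# A 1344x1344 image (~1.8 MP) encoded into 640 visual tokens:
print(token_density(1344 * 1344, 640))  # -> 2822
```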
+| Model | Size | BLINK-val | Mantis-Eval | MIRB | Video-MME (wo / w subs) |
+|---|---|---|---|---|---|
+| **Proprietary** | | | | | |
+| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
+| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
+| **Open-source** | | | | | |
+| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
+| LLaVA-One-Vision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
+| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
+| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
+| InternVL-2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
+| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
+| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
+| Task | Size | ASR (zh) | | | ASR (en) | | | AST | | Emotion |
+|---|---|---|---|---|---|---|---|---|---|---|
+| Metric | | CER↓ | | | WER↓ | | | BLEU↑ | | ACC↑ |
+| Dataset | | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
+| **Proprietary** | | | | | | | | | | |
+| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
+| Gemini-1.5-Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
+| **Open-Source** | | | | | | | | | | |
+| Qwen2-Audio-Base | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
+| Qwen2-Audio-Instruction | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
+| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
+| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
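The ASR columns above are scored with character/word error rate (CER/WER, lower is better) and the AST columns with BLEU. For reference, a minimal sketch of WER as word-level Levenshtein edit distance over the reference length (CER is the same computation over characters); this is a generic illustration, not the repository's evaluation code:

```python
def edit_distance(ref, hyp):
    # Classic single-row dynamic-programming Levenshtein distance.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: edits / reference word count * 100."""
    ref = reference.split()
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 1))  # -> 16.7
```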
+| Task | Size | SpeechQA | | | | | | | | |
+|---|---|---|---|---|---|---|---|---|---|---|
+| Metric | | ACC↑ | | | G-Eval (10 point)↑ | Semantic ELO score↑ | Acoustic ELO score↑ | Overall ELO score↑ | UTMOS↑ | ASR-WER↓ |
+| Dataset | | Speech Llama Q. | Speech Web Q. | Speech Trivia QA | Speech AlpacaEval | AudioArena | | | | |
+| **Proprietary** | | | | | | | | | | |
+| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
+| **Open-Source** | | | | | | | | | | |
+| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
+| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
+| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
+| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
+| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
+| Task | Voice cloning | |
+|---|---|---|
+| Metric | SIMO↑ | SIMO↑ |
+| Dataset | Seed-TTS test-zh | Seed-TTS test-en |
+| F5-TTS | 76 | 67 |
+| CosyVoice | 75 | 64 |
+| FireRedTTS | 63 | 46 |
+| MiniCPM-o 2.6 | 57 | 47 |
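SIMO in the voice-cloning table is an objective speaker-similarity score: higher means the cloned voice's speaker embedding is closer to the reference speaker's. A generic sketch of such a score as cosine similarity between speaker-embedding vectors (the embedding model itself is out of scope here, and the 0-100 scaling is our assumption for illustration):

```python
import math

def simo(ref_emb, clone_emb):
    """Cosine similarity between two speaker embeddings, scaled to 0-100."""
    dot = sum(a * b for a, b in zip(ref_emb, clone_emb))
    norm = (math.sqrt(sum(a * a for a in ref_emb))
            * math.sqrt(sum(b * b for b in clone_emb)))
    return 100.0 * dot / norm

# Identical embeddings score 100; orthogonal embeddings score 0.
print(round(simo([1.0, 0.0], [1.0, 0.0])))  # -> 100
```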
+| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
+|---|---|---|---|---|---|
+| **Proprietary** | | | | | |
+| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
+| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
+| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
+| **Open-source** | | | | | |
+| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
+| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
+| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
+| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
+| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
+| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
+| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
+| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
+| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
+| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
+
+
+
-| Model | -Size | -OCRBench | -TextVQA val | -DocVQA test | -Open-Compass | -MME | -MMB test (en) | -MMB test (cn) | -MMMU val | -Math-Vista | -LLaVA Bench | -RealWorld QA | -Object HalBench | -
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | -|||||||||||||
| Gemini Pro | -- | -680 | -74.6 | -88.1 | -62.9 | -2148.9 | -73.6 | -74.3 | -48.9 | -45.8 | -79.9 | -60.4 | -- | -
| GPT-4V (2023.11.06) | -- | -645 | -78.0 | -88.4 | -63.5 | -1771.5 | -77.0 | -74.4 | -53.8 | -47.8 | -93.1 | -63.0 | -86.4 | -
| Open-source | -|||||||||||||
| Mini-Gemini | -2.2B | -- | -56.2 | -34.2* | -- | -1653.0 | -- | -- | -31.7 | -- | -- | -- | -- | -
| Qwen-VL-Chat | -9.6B | -488 | -61.5 | -62.6 | -51.6 | -1860.0 | -61.8 | -56.3 | -37.0 | -33.8 | -67.7 | -49.3 | -56.2 | -
| DeepSeek-VL-7B | -7.3B | -435 | -64.7* | -47.0* | -54.6 | -1765.4 | -73.8 | -71.4 | -38.3 | -36.8 | -77.8 | -54.2 | -- | -
| Yi-VL-34B | -34B | -290 | -43.4* | -16.9* | -52.2 | -2050.2 | -72.4 | -70.7 | -45.1 | -30.7 | -62.3 | -54.8 | -79.3 | -
| CogVLM-Chat | -17.4B | -590 | -70.4 | -33.3* | -54.2 | -1736.6 | -65.8 | -55.9 | -37.3 | -34.7 | -73.9 | -60.3 | -73.6 | -
| TextMonkey | -9.7B | -558 | -64.3 | -66.7 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -
| Idefics2 | -8.0B | -- | -73.0 | -74.0 | -57.2 | -1847.6 | -75.7 | -68.6 | -45.2 | -52.2 | -49.1 | -60.7 | -- | -
| Bunny-LLama-3-8B | -8.4B | -- | -- | -- | -54.3 | -1920.3 | -77.0 | -73.9 | -41.3 | -31.5 | -61.2 | -58.8 | -- | -
| LLaVA-NeXT Llama-3-8B | -8.4B | -- | -- | -78.2 | -- | -1971.5 | -- | -- | -41.7 | -37.5 | -80.1 | -60.0 | -- | -
| Phi-3-vision-128k-instruct | -4.2B | -639* | -70.9 | -- | -- | -1537.5* | -- | -- | -40.4 | -44.5 | -64.2* | -58.8* | -- | -
| MiniCPM-V 1.0 | -2.8B | -366 | -60.6 | -38.2 | -47.5 | -1650.2 | -64.1 | -62.6 | -38.3 | -28.9 | -51.3 | -51.2 | -78.4 | -
| MiniCPM-V 2.0 | -2.8B | -605 | -74.1 | -71.9 | -54.5 | -1808.6 | -69.1 | -66.5 | -38.2 | -38.7 | -69.2 | -55.8 | -85.5 | -
| MiniCPM-Llama3-V 2.5 | -8.5B | -725 | -76.6 | -84.8 | -65.1 | -2024.6 | -77.2 | -74.2 | -45.8 | -54.3 | -86.7 | -63.5 | -89.7 | -
-
+| MiniCPM-o 2.6 | GPU | 18 GB | The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
+| MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
+| MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
+| MiniCPM-V 2.6 | GPU | 17 GB | Strong end-side multimodal performance for single-image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
 | MiniCPM-V 2.6 gguf | CPU | 6 GB | The gguf version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
 | MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
-| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
-| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
-| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
-| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
-| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
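The memory figures in the table above roughly track weight precision: parameters × bytes per parameter, plus runtime overhead for activations, the KV cache, and the vision/audio modules. A back-of-envelope estimator for the weights alone (the comparison to the table's totals is illustrative; the overhead is not a measured value):

```python
def approx_weight_memory_gib(params_billion: float, bits_per_param: float) -> float:
    """Weight memory alone, in GiB: params * (bits / 8) bytes per parameter."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

# 8B parameters: bf16 weights vs. int4-quantized weights.
for name, bits in [("bf16", 16), ("int4", 4)]:
    print(f"{name}: ~{approx_weight_memory_gib(8, bits):.1f} GiB weights")
# bf16 -> ~14.9 GiB, int4 -> ~3.7 GiB; the 18 GB / 9 GB figures in the
# table additionally cover activations and non-quantized components.
```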
### Multi-turn Conversation
Please refer to the following code to run inference.
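As a sketch of the multi-turn flow: the `model.chat` interface used by MiniCPM-o/V checkpoints takes a running list of role-tagged messages and returns the assistant's reply, which is appended back to the history before the next question. The helper below only manages that history; the real model/tokenizer loading (shown in comments, based on the Hugging Face model card's documented usage) requires downloading the weights:

```python
# Loading the real model would look like (illustrative; needs the weights):
#
#   import torch
#   from transformers import AutoModel, AutoTokenizer
#   model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6',
#                                     trust_remote_code=True,
#                                     torch_dtype=torch.bfloat16).eval().cuda()
#   tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6',
#                                             trust_remote_code=True)

def ask(msgs, question, chat_fn):
    """Append a user turn, query the model, and record the assistant reply.

    `chat_fn` stands in for `lambda m: model.chat(msgs=m, tokenizer=tokenizer)`.
    """
    msgs.append({'role': 'user', 'content': [question]})
    answer = chat_fn(msgs)
    msgs.append({'role': 'assistant', 'content': [answer]})
    return answer

# Demonstration with a stand-in chat function (no GPU or download needed):
history = []
ask(history, 'What is in the image?', lambda m: 'a red bicycle')
ask(history, 'What color is it?', lambda m: 'red')
print([m['role'] for m in history])  # -> ['user', 'assistant', 'user', 'assistant']
```

In real use, `content` may mix strings with PIL images or audio segments, and the growing `history` list is what carries the conversation context between calls.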
+
[THUNLP](https://nlp.csai.tsinghua.edu.cn/)
-
[ModelBest](https://modelbest.cn/)
--
[Zhihu](https://www.zhihu.com/)
## 🌟 Star History
@@ -1676,14 +2557,14 @@ This project is developed by the following institutions:
## Key Techniques and Other Multimodal Projects
-👏 Welcome to explore key techniques of MiniCPM-V and other multimodal projects of our team:
+👏 Welcome to explore key techniques of MiniCPM-o/V and other multimodal projects of our team:
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
## Citation
-If you find our model/code/paper helpful, please consider cite our papers 📝 and star us ⭐️!
+If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
```bib
@article{yao2024minicpm,
diff --git a/README_en.md b/README_en.md
deleted file mode 100644
index e8e6c29..0000000
--- a/README_en.md
+++ /dev/null
@@ -1,1695 +0,0 @@
-
-
-**A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone**
-
- [中文](./README_zh.md) |
- English
-
-Join our 💬 WeChat | View MiniCPM-V 📖 best practices
-
-
-- MiniCPM-V 2.6 🤗 🤖 | MiniCPM-Llama3-V 2.5 🤗 🤖 | - MiniCPM-Llama3-V 2.5 Technical Report -
- -
-| Model | -Size | -Token Density+ | -OpenCompass | -MME | -MMVet | -OCRBench | -MMMU val | -MathVista mini | -MMB1.1 test | -AI2D | -TextVQA val | -DocVQA test | -HallusionBench | -Object HalBench | -
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | -||||||||||||||
| GPT-4o | -- | -1088 | -69.9 | -2328.7 | -69.1 | -736 | -69.2 | -61.3 | -82.2 | -84.6 | -- | -92.8 | -55.0 | -17.6 | -
| Claude 3.5 Sonnet | -- | -750 | -67.9 | -1920.0 | -66.0 | -788 | -65.9 | -61.6 | -78.5 | -80.2 | -- | -95.2 | -49.9 | -13.8 | -
| Gemini 1.5 Pro | -- | -- | -64.4 | -2110.6 | -64.0 | -754 | -60.6 | -57.7 | -73.9 | -79.1 | -73.5 | -86.5 | -45.6 | -- | -
| GPT-4o mini | -- | -1088 | -64.1 | -2003.4 | -66.9 | -785 | -60.0 | -52.4 | -76.0 | -77.8 | -- | -- | -46.1 | -12.4 | -
| GPT-4V | -- | -1088 | -63.5 | -2070.2 | -67.5 | -656 | -61.7 | -54.7 | -79.8 | -78.6 | -78.0 | -87.2 | -43.9 | -14.2 | -
| Step-1V | -- | -- | -59.5 | -2206.4 | -63.3 | -625 | -49.9 | -44.8 | -78.0 | -79.2 | -71.6 | -- | -48.4 | -- | -
| Qwen-VL-Max | -- | -784 | -58.3 | -2281.7 | -61.8 | -684 | -52.0 | -43.4 | -74.6 | -75.7 | -79.5 | -93.1 | -41.2 | -13.4 | -
| Open-source | -||||||||||||||
| LLaVA-NeXT-Yi-34B | -34B | -157 | -55.0 | -2006.5 | -50.7 | -574 | -48.8 | -40.4 | -77.8 | -78.9 | -69.3 | -- | -34.8 | -12.6 | -
| Mini-Gemini-HD-34B | -34B | -157 | -- | -2141.0 | -59.3 | -518 | -48.0 | -43.3 | -- | -80.5 | -74.1 | -78.9 | -- | -- | -
| Cambrian-34B | -34B | -1820 | -58.3 | -2049.9 | -53.2 | -591 | -50.4 | -50.3 | -77.8 | -79.5 | -76.7 | -75.5 | -41.6 | -14.7 | -
| GLM-4V-9B | -13B | -784 | -59.1 | -2018.8 | -58.0 | -776 | -46.9 | -51.1 | -67.9 | -71.2 | -- | -- | -45.0 | -- | -
| InternVL2-8B | -8B | -706 | -64.1 | -2215.1 | -54.3 | -794 | -51.2 | -58.3 | -79.4 | -83.6 | -77.4 | -91.6 | -45.0 | -21.3 | -
| MiniCPM-Llama-V 2.5 | -8B | -1882 | -58.8 | -2024.6 | -52.8 | -725 | -45.8 | -54.3 | -72.0 | -78.4 | -76.6 | -84.8 | -42.4 | -10.3 | -
| MiniCPM-V 2.6 | -8B | -2822 | -65.2 | -2348.4* | -60.0 | -852* | -49.8* | -60.6 | -78.0 | -82.1 | -80.1 | -90.8 | -48.1* | -8.2 | -
| Model | -Size | -Mantis Eval | -BLINK val | -Mathverse mv | -Sciverse mv | -MIRB | -
|---|---|---|---|---|---|---|
| Proprietary | -||||||
| GPT-4V | -- | -62.7 | -54.6 | -60.3 | -66.9 | -53.1 | -
| LLaVA-NeXT-Interleave-14B | -14B | -66.4 | -52.6 | -32.7 | -30.2 | -- | -
| Open-source | -||||||
| Emu2-Chat | -37B | -37.8 | -36.2 | -- | -27.2 | -- | -
| CogVLM | -17B | -45.2 | -41.1 | -- | -- | -- | -
| VPG-C | -7B | -52.4 | -43.1 | -24.3 | -23.1 | -- | -
| VILA 8B | -8B | -51.2 | -39.3 | -- | -36.5 | -- | -
| InternLM-XComposer-2.5 | -8B | -53.1* | -48.9 | -32.1* | -- | -42.5 | -
| InternVL2-8B | -8B | -59.0* | -50.9 | -30.5* | -34.4* | -56.9* | -
| MiniCPM-V 2.6 | -8B | -69.1 | -53.0 | -84.9 | -74.9 | -53.8 | -
| Model | -Size | -Video-MME | -Video-ChatGPT | -|||||
|---|---|---|---|---|---|---|---|---|
| - | - | w/o subs | -w subs | -Correctness | -Detail | -Context | -Temporal | -Consistency | -
| Proprietary | -||||||||
| Claude 3.5 Sonnet | -- | -60.0 | -62.9 | -- | -- | -- | -- | -- | -
| GPT-4V | -- | -59.9 | -63.3 | -- | -- | -- | -- | -- | -
| Open-source | -||||||||
| LLaVA-NeXT-7B | -7B | -- | -- | -3.39 | -3.29 | -3.92 | -2.60 | -3.12 | -
| LLaVA-NeXT-34B | -34B | -- | -- | -3.29 | -3.23 | -3.83 | -2.51 | -3.47 | -
| CogVLM2-Video | -12B | -- | -- | -3.49 | -3.46 | -3.23 | -2.98 | -3.64 | -
| LongVA | -7B | -52.4 | -54.3 | -3.05 | -3.09 | -3.77 | -2.44 | -3.64 | -
| InternVL2-8B | -8B | -54.0 | -56.9 | -- | -- | -- | -- | -- | -
| InternLM-XComposer-2.5 | -8B | -55.8 | -- | -- | -- | -- | -- | -- | -
| LLaVA-NeXT-Video | -32B | -60.2 | -63.0 | -3.48 | -3.37 | -3.95 | -2.64 | -3.28 | -
| MiniCPM-V 2.6 | -8B | -60.9 | -63.6 | -3.59 | -3.28 | -3.93 | -2.73 | -3.62 | -
| Model | -Size | -Shot | -TextVQA val | -VizWiz test-dev | -VQAv2 test-dev | -OK-VQA val | -
|---|---|---|---|---|---|---|
| Flamingo | -80B | -0* | -35.0 | -31.6 | -56.3 | -40.6 | -
| 4 | -36.5 | -39.6 | -63.1 | -57.4 | -||
| 8 | -37.3 | -44.8 | -65.6 | -57.5 | -||
| IDEFICS | -80B | -0* | -30.9 | -36.0 | -60.0 | -45.2 | -
| 4 | -34.3 | -40.4 | -63.6 | -52.4 | -||
| 8 | -35.7 | -46.1 | -64.8 | -55.1 | -||
| OmniCorpus | -7B | -0* | -43.0 | -49.8 | -63.2 | -45.5 | -
| 4 | -45.4 | -51.3 | -64.5 | -46.5 | -||
| 8 | -45.6 | -52.2 | -64.7 | -46.6 | -||
| Emu2 | -37B | -0 | -26.4 | -40.4 | -33.5 | -26.7 | -
| 4 | -48.2 | -54.6 | -67.0 | -53.2 | -||
| 8 | -49.3 | -54.7 | -67.8 | -54.1 | -||
| MM1 | -30B | -0 | -26.2 | -40.4 | -48.9 | -26.7 | -
| 8 | -49.3 | -54.7 | -70.9 | -54.1 | -||
| MiniCPM-V 2.6+ | -8B | -0 | -43.9 | -33.8 | -45.4 | -23.9 | -
| 4 | -63.6 | -60.5 | -65.5 | -50.1 | -||
| 8 | -64.6 | -63.4 | -68.2 | -51.4 | -
-
-
-
-
-
-
-
-| Model | -Size | -OCRBench | -TextVQA val | -DocVQA test | -Open-Compass | -MME | -MMB test (en) | -MMB test (cn) | -MMMU val | -Math-Vista | -LLaVA Bench | -RealWorld QA | -Object HalBench | -
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | -|||||||||||||
| Gemini Pro | -- | -680 | -74.6 | -88.1 | -62.9 | -2148.9 | -73.6 | -74.3 | -48.9 | -45.8 | -79.9 | -60.4 | -- | -
| GPT-4V (2023.11.06) | -- | -645 | -78.0 | -88.4 | -63.5 | -1771.5 | -77.0 | -74.4 | -53.8 | -47.8 | -93.1 | -63.0 | -86.4 | -
| Open-source | -|||||||||||||
| Mini-Gemini | -2.2B | -- | -56.2 | -34.2* | -- | -1653.0 | -- | -- | -31.7 | -- | -- | -- | -- | -
| Qwen-VL-Chat | -9.6B | -488 | -61.5 | -62.6 | -51.6 | -1860.0 | -61.8 | -56.3 | -37.0 | -33.8 | -67.7 | -49.3 | -56.2 | -
| DeepSeek-VL-7B | -7.3B | -435 | -64.7* | -47.0* | -54.6 | -1765.4 | -73.8 | -71.4 | -38.3 | -36.8 | -77.8 | -54.2 | -- | -
| Yi-VL-34B | -34B | -290 | -43.4* | -16.9* | -52.2 | -2050.2 | -72.4 | -70.7 | -45.1 | -30.7 | -62.3 | -54.8 | -79.3 | -
| CogVLM-Chat | -17.4B | -590 | -70.4 | -33.3* | -54.2 | -1736.6 | -65.8 | -55.9 | -37.3 | -34.7 | -73.9 | -60.3 | -73.6 | -
| TextMonkey | -9.7B | -558 | -64.3 | -66.7 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -
| Idefics2 | -8.0B | -- | -73.0 | -74.0 | -57.2 | -1847.6 | -75.7 | -68.6 | -45.2 | -52.2 | -49.1 | -60.7 | -- | -
| Bunny-LLama-3-8B | -8.4B | -- | -- | -- | -54.3 | -1920.3 | -77.0 | -73.9 | -41.3 | -31.5 | -61.2 | -58.8 | -- | -
| LLaVA-NeXT Llama-3-8B | -8.4B | -- | -- | -78.2 | -- | -1971.5 | -- | -- | -41.7 | -37.5 | -80.1 | -60.0 | -- | -
| Phi-3-vision-128k-instruct | -4.2B | -639* | -70.9 | -- | -- | -1537.5* | -- | -- | -40.4 | -44.5 | -64.2* | -58.8* | -- | -
| MiniCPM-V 1.0 | -2.8B | -366 | -60.6 | -38.2 | -47.5 | -1650.2 | -64.1 | -62.6 | -38.3 | -28.9 | -51.3 | -51.2 | -78.4 | -
| MiniCPM-V 2.0 | -2.8B | -605 | -74.1 | -71.9 | -54.5 | -1808.6 | -69.1 | -66.5 | -38.2 | -38.7 | -69.2 | -55.8 | -85.5 | -
| MiniCPM-Llama3-V 2.5 | -8.5B | -725 | -76.6 | -84.8 | -65.1 | -2024.6 | -77.2 | -74.2 | -45.8 | -54.3 | -86.7 | -63.5 | -89.7 | -
-
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
-| MiniCPM-V 2.6 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
-| MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
-| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
-| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
-| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
-| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balance the performance the computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
-| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
-
-### Multi-turn Conversation
-
-Please refer to the following codes to run.
-
-
-
[THUNLP](https://nlp.csai.tsinghua.edu.cn/)
--
[ModelBest](https://modelbest.cn/)
--
[Zhihu](https://www.zhihu.com/ )
-
-## 🌟 Star History
-
-
-
-
-
-**端侧可用的 GPT-4V 级单图、多图、视频多模态大模型**
+**端侧可用的 GPT-4o 级视觉、语音、多模态实时流式大模型**
中文 |
- [English](./README_en.md)
+ [English](./README.md)
+
+
+
+
+ 微信社区 |
+
+
+ MiniCPM-V 📖 最佳实践
+
- 加入我们的 💬 微信社区
-| 了解 MiniCPM-V 📖 最佳实践
-
-  MiniCPM-V 2.6 🤗 🤖 | MiniCPM-Llama3-V 2.5 🤗 🤖 |
-  MiniCPM-Llama3-V 2.5 技术报告
+
-
+
+| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| **Proprietary** | | | | | | | | | | | | | | | | | | |
+| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
+| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
+| Gemini-1.5-Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
+| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
+| **Open Source** | | | | | | | | | | | | | | | | | | |
+| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
+| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
+| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
+| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
+| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
+| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
+| InternVL-2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
+| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
+| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
+| Model | Size | BLINK-val | Mantis-Eval | MIRB | Video-MME (wo / w subs) |
+|---|---|---|---|---|---|
+| **Proprietary** | | | | | |
+| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
+| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
+| **Open-source** | | | | | |
+| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
+| LLaVA-One-Vision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
+| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
+| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
+| InternVL-2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
+| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
+| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
+| Task | Size | ASR (zh) | | | ASR (en) | | | AST | | Emotion |
+|---|---|---|---|---|---|---|---|---|---|---|
+| Metric | | CER↓ | | | WER↓ | | | BLEU↑ | | ACC↑ |
+| Dataset | | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
+| **Proprietary** | | | | | | | | | | |
+| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
+| Gemini-1.5-Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
+| **Open-Source** | | | | | | | | | | |
+| Qwen2-Audio-Base | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
+| Qwen2-Audio-Instruction | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
+| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
+| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
+| Task | Size | SpeechQA | | | | | | | | |
+|---|---|---|---|---|---|---|---|---|---|---|
+| Metric | | ACC↑ | | | G-Eval (10 point)↑ | Semantic ELO score↑ | Acoustic ELO score↑ | Overall ELO score↑ | UTMOS↑ | ASR-WER↓ |
+| Dataset | | Speech Llama Q. | Speech Web Q. | Speech Trivia QA | Speech AlpacaEval | AudioArena | | | | |
+| **Proprietary** | | | | | | | | | | |
+| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
+| **Open-Source** | | | | | | | | | | |
+| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
+| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
+| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
+| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
+| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
+| Task | TTS | |
+|---|---|---|
+| Metric | SIMO↑ | SIMO↑ |
+| Dataset | Seed-TTS test-zh | Seed-TTS test-en |
+| F5-TTS | 76 | 67 |
+| CosyVoice | 75 | 64 |
+| FireRedTTS | 63 | 46 |
+| MiniCPM-o 2.6 | 57 | 47 |
+| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
+|---|---|---|---|---|---|
+| **Proprietary** | | | | | |
+| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
+| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
+| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
+| **Open-source** | | | | | |
+| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
+| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
+| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
+| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
+| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
+| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
+| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
+| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
+| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
+| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
+
+
+
-| Model | -Size | -OCRBench | -TextVQA val | -DocVQA test | -Open-Compass | -MME | -MMB test (en) | -MMB test (cn) | -MMMU val | -Math-Vista | -LLaVA Bench | -RealWorld QA | -Object HalBench | -
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | -|||||||||||||
| Gemini Pro | -- | -680 | -74.6 | -88.1 | -62.9 | -2148.9 | -73.6 | -74.3 | -48.9 | -45.8 | -79.9 | -60.4 | -- | -
| GPT-4V (2023.11.06) | -- | -645 | -78.0 | -88.4 | -63.5 | -1771.5 | -77.0 | -74.4 | -53.8 | -47.8 | -93.1 | -63.0 | -86.4 | -
| Open-source | -|||||||||||||
| Mini-Gemini | -2.2B | -- | -56.2 | -34.2* | -- | -1653.0 | -- | -- | -31.7 | -- | -- | -- | -- | -
| Qwen-VL-Chat | -9.6B | -488 | -61.5 | -62.6 | -51.6 | -1860.0 | -61.8 | -56.3 | -37.0 | -33.8 | -67.7 | -49.3 | -56.2 | -
| DeepSeek-VL-7B | -7.3B | -435 | -64.7* | -47.0* | -54.6 | -1765.4 | -73.8 | -71.4 | -38.3 | -36.8 | -77.8 | -54.2 | -- | -
| Yi-VL-34B | -34B | -290 | -43.4* | -16.9* | -52.2 | -2050.2 | -72.4 | -70.7 | -45.1 | -30.7 | -62.3 | -54.8 | -79.3 | -
| CogVLM-Chat | -17.4B | -590 | -70.4 | -33.3* | -54.2 | -1736.6 | -65.8 | -55.9 | -37.3 | -34.7 | -73.9 | -60.3 | -73.6 | -
| TextMonkey | -9.7B | -558 | -64.3 | -66.7 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -
| Idefics2 | -8.0B | -- | -73.0 | -74.0 | -57.2 | -1847.6 | -75.7 | -68.6 | -45.2 | -52.2 | -49.1 | -60.7 | -- | -
| Bunny-LLama-3-8B | -8.4B | -- | -- | -- | -54.3 | -1920.3 | -77.0 | -73.9 | -41.3 | -31.5 | -61.2 | -58.8 | -- | -
| LLaVA-NeXT Llama-3-8B | -8.4B | -- | -- | -- | -- | -1971.5 | -- | -- | -41.7 | -- | -80.1 | -60.0 | -- | -
| Phi-3-vision-128k-instruct | -4.2B | -639* | -70.9 | -- | -- | -1537.5* | -- | -- | -40.4 | -44.5 | -64.2* | -58.8* | -- | -
| MiniCPM-V 1.0 | -2.8B | -366 | -60.6 | -38.2 | -47.5 | -1650.2 | -64.1 | -62.6 | -38.3 | -28.9 | -51.3 | -51.2 | -78.4 | -
| MiniCPM-V 2.0 | -2.8B | -605 | -74.1 | -71.9 | -54.5 | -1808.6 | -69.1 | -66.5 | -38.2 | -38.7 | -69.2 | -55.8 | -85.5 | -
| MiniCPM-Llama3-V 2.5 | -8.5B | -725 | -76.6 | -84.8 | -65.1 | -2024.6 | -77.2 | -74.2 | -45.8 | -54.3 | -86.7 | -63.5 | -89.7 | -
-
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
+| MiniCPM-o 2.6| GPU | 18 GB | 最新版本,提供端侧 GPT-4o 级的视觉、语音、多模态流式交互能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
+| MiniCPM-o 2.6 gguf | CPU | 8 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
+| MiniCPM-o 2.6 int4 | GPU | 9 GB | int4量化版,更低显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
+| MiniCPM-V 2.6| GPU | 17 GB | 提供出色的端侧单图、多图、视频理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
| MiniCPM-V 2.6 gguf | CPU | 6 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
| MiniCPM-V 2.6 int4 | GPU | 7 GB | int4量化版,更低显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
-| MiniCPM-Llama3-V 2.5| GPU | 19 GB | 提供出色的端侧多模态理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
-| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
-| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | int4量化版,更低显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
-| MiniCPM-V 2.0 | GPU | 8 GB | 轻量级版本,平衡计算开销和多模态理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
-| MiniCPM-V 1.0 | GPU | 7 GB | The lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
More [legacy models](#legacy-models)
+
### Multi-turn Conversation
-Please refer to the following code for inference.
+
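For multi-turn inference, the conversation history is passed to the model as a list of role/content messages. A minimal hedged sketch of assembling that history (the answer string is a placeholder rather than real model output, and image handling is omitted):

```python
# Hedged sketch: assembling a multi-turn history in the role/content
# format used by the MiniCPM-V `chat` interface. The answer string is a
# placeholder, not real model output; image handling is omitted.
msgs = [{'role': 'user', 'content': 'What is in this image?'}]

# After each model reply, append it plus the follow-up question so the
# next call sees the whole conversation.
answer = 'A bicycle leaning against a wall.'  # placeholder
msgs.append({'role': 'assistant', 'content': answer})
msgs.append({'role': 'user', 'content': 'What color is it?'})

print(len(msgs))  # 3 messages: user, assistant, user
```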
[THUNLP (Tsinghua University Natural Language Processing Lab)](https://nlp.csai.tsinghua.edu.cn/)
-
[ModelBest](https://modelbest.cn/)
--
[Zhihu](https://www.zhihu.com/)
## 🌟 Star History
@@ -1700,7 +2555,7 @@ print(outputs[0].outputs[0].text)
## Key Techniques and Other Multimodal Projects
-👏 Welcome to explore the key techniques behind MiniCPM-V and more of our multimodal projects!
+👏 Welcome to explore the key techniques behind MiniCPM-o/V and more of our multimodal projects!
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
diff --git a/assets/MiniCPM-o.png b/assets/MiniCPM-o.png
new file mode 100644
index 0000000..20fa91e
Binary files /dev/null and b/assets/MiniCPM-o.png differ
diff --git a/assets/discord.png b/assets/discord.png
new file mode 100644
index 0000000..c3067a4
Binary files /dev/null and b/assets/discord.png differ
diff --git a/assets/logo.html b/assets/logo.html
new file mode 100644
index 0000000..71257de
--- /dev/null
+++ b/assets/logo.html
@@ -0,0 +1,3 @@
+
+ MiniCPM-o
+
\ No newline at end of file
diff --git a/assets/minicpm-o-26-framework.png b/assets/minicpm-o-26-framework.png
new file mode 100644
index 0000000..459887e
Binary files /dev/null and b/assets/minicpm-o-26-framework.png differ
diff --git a/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png b/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png
new file mode 100644
index 0000000..eeef5f2
Binary files /dev/null and b/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png differ
diff --git a/assets/minicpmo2_6/minicpmo2_6_math_intersect.png b/assets/minicpmo2_6/minicpmo2_6_math_intersect.png
new file mode 100644
index 0000000..f526b1c
Binary files /dev/null and b/assets/minicpmo2_6/minicpmo2_6_math_intersect.png differ
diff --git a/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png b/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png
new file mode 100644
index 0000000..090337b
Binary files /dev/null and b/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png differ
diff --git a/assets/minicpmo2_6/show_demo.jpg b/assets/minicpmo2_6/show_demo.jpg
new file mode 100644
index 0000000..40ec4fb
Binary files /dev/null and b/assets/minicpmo2_6/show_demo.jpg differ
diff --git a/assets/o-2dot6-demo-video-preview.png b/assets/o-2dot6-demo-video-preview.png
new file mode 100644
index 0000000..8e34ab4
Binary files /dev/null and b/assets/o-2dot6-demo-video-preview.png differ
diff --git a/assets/radar.jpg b/assets/radar.jpg
new file mode 100644
index 0000000..51f75bc
Binary files /dev/null and b/assets/radar.jpg differ
diff --git a/assets/ref_audios/default.wav b/assets/ref_audios/default.wav
new file mode 100644
index 0000000..8171eee
Binary files /dev/null and b/assets/ref_audios/default.wav differ
diff --git a/assets/ref_audios/female_example.wav b/assets/ref_audios/female_example.wav
new file mode 100644
index 0000000..4f795b2
Binary files /dev/null and b/assets/ref_audios/female_example.wav differ
diff --git a/assets/ref_audios/male_example.wav b/assets/ref_audios/male_example.wav
new file mode 100644
index 0000000..09e725b
Binary files /dev/null and b/assets/ref_audios/male_example.wav differ
diff --git a/assets/ref_audios/video_default.wav b/assets/ref_audios/video_default.wav
new file mode 100644
index 0000000..2e6061b
Binary files /dev/null and b/assets/ref_audios/video_default.wav differ
diff --git a/assets/wechat.png b/assets/wechat.png
new file mode 100644
index 0000000..8a109ef
Binary files /dev/null and b/assets/wechat.png differ
diff --git a/docs/minicpm_llama3_v2dot5.md b/docs/minicpm_llama3_v2dot5.md
new file mode 100644
index 0000000..7ab8700
--- /dev/null
+++ b/docs/minicpm_llama3_v2dot5.md
@@ -0,0 +1,333 @@
+## MiniCPM-Llama3-V 2.5
+
+> Archived at: 2025-01-13
+
+
+**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
+
+- 🔥 **Leading Performance.**
+ MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.
+
+- 💪 **Strong OCR Capabilities.**
+ MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.
+
+- 🏆 **Trustworthy Behavior.**
+ Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best-level performance within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).
+
+- 🌏 **Multilingual Support.**
+ Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
+
+- 🚀 **Efficient Deployment.**
+ MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.
+
+- 💫 **Easy Usage.**
+MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).
+
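As a quick illustration of the resolution budget mentioned above, the hedged sketch below checks whether an image fits the ~1.8M-pixel limit at an arbitrary aspect ratio. `fits_resolution` is a hypothetical helper for pre-checking inputs, not part of the released API:

```python
# Hedged sketch: MiniCPM-Llama3-V 2.5 accepts images at any aspect ratio
# up to ~1.8 million pixels. `fits_resolution` is a hypothetical helper,
# not part of the released API.
MAX_PIXELS = 1344 * 1344  # the ~1.8M-pixel budget quoted above

def fits_resolution(width: int, height: int) -> bool:
    """True if an image of this size is within the supported pixel budget."""
    return width * height <= MAX_PIXELS

print(fits_resolution(1344, 1344))  # True: exactly at the limit
print(fits_resolution(448, 3584))   # True: extreme aspect ratio, ~1.6M pixels
print(fits_resolution(2048, 2048))  # False: ~4.2M pixels exceeds the budget
```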
+### Evaluation
+
+
+| Model | +Size | +OCRBench | +TextVQA val | +DocVQA test | +Open-Compass | +MME | +MMB test (en) | +MMB test (cn) | +MMMU val | +Math-Vista | +LLaVA Bench | +RealWorld QA | +Object HalBench | +
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | +|||||||||||||
| Gemini Pro | +- | +680 | +74.6 | +88.1 | +62.9 | +2148.9 | +73.6 | +74.3 | +48.9 | +45.8 | +79.9 | +60.4 | +- | +
| GPT-4V (2023.11.06) | +- | +645 | +78.0 | +88.4 | +63.5 | +1771.5 | +77.0 | +74.4 | +53.8 | +47.8 | +93.1 | +63.0 | +86.4 | +
| Open-source | +|||||||||||||
| Mini-Gemini | +2.2B | +- | +56.2 | +34.2* | +- | +1653.0 | +- | +- | +31.7 | +- | +- | +- | +- | +
| Qwen-VL-Chat | +9.6B | +488 | +61.5 | +62.6 | +51.6 | +1860.0 | +61.8 | +56.3 | +37.0 | +33.8 | +67.7 | +49.3 | +56.2 | +
| DeepSeek-VL-7B | +7.3B | +435 | +64.7* | +47.0* | +54.6 | +1765.4 | +73.8 | +71.4 | +38.3 | +36.8 | +77.8 | +54.2 | +- | +
| Yi-VL-34B | +34B | +290 | +43.4* | +16.9* | +52.2 | +2050.2 | +72.4 | +70.7 | +45.1 | +30.7 | +62.3 | +54.8 | +79.3 | +
| CogVLM-Chat | +17.4B | +590 | +70.4 | +33.3* | +54.2 | +1736.6 | +65.8 | +55.9 | +37.3 | +34.7 | +73.9 | +60.3 | +73.6 | +
| TextMonkey | +9.7B | +558 | +64.3 | +66.7 | +- | +- | +- | +- | +- | +- | +- | +- | +- | +
| Idefics2 | +8.0B | +- | +73.0 | +74.0 | +57.2 | +1847.6 | +75.7 | +68.6 | +45.2 | +52.2 | +49.1 | +60.7 | +- | +
| Bunny-Llama-3-8B | +8.4B | +- | +- | +- | +54.3 | +1920.3 | +77.0 | +73.9 | +41.3 | +31.5 | +61.2 | +58.8 | +- | +
| LLaVA-NeXT Llama-3-8B | +8.4B | +- | +- | +78.2 | +- | +1971.5 | +- | +- | +41.7 | +37.5 | +80.1 | +60.0 | +- | +
| Phi-3-vision-128k-instruct | +4.2B | +639* | +70.9 | +- | +- | +1537.5* | +- | +- | +40.4 | +44.5 | +64.2* | +58.8* | +- | +
| MiniCPM-V 1.0 | +2.8B | +366 | +60.6 | +38.2 | +47.5 | +1650.2 | +64.1 | +62.6 | +38.3 | +28.9 | +51.3 | +51.2 | +78.4 | +
| MiniCPM-V 2.0 | +2.8B | +605 | +74.1 | +71.9 | +54.5 | +1808.6 | +69.1 | +66.5 | +38.2 | +38.7 | +69.2 | +55.8 | +85.5 | +
| MiniCPM-Llama3-V 2.5 | +8.5B | +725 | +76.6 | +84.8 | +65.1 | +2024.6 | +77.2 | +74.2 | +45.8 | +54.3 | +86.7 | +63.5 | +89.7 | +
+
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
+| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
+| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
diff --git a/minicpm_v1.md b/docs/minicpm_v1.md
similarity index 100%
rename from minicpm_v1.md
rename to docs/minicpm_v1.md
diff --git a/docs/minicpm_v2.md b/docs/minicpm_v2.md
new file mode 100644
index 0000000..9dcb5a0
--- /dev/null
+++ b/docs/minicpm_v2.md
@@ -0,0 +1,294 @@
+## MiniCPM-V 2.0
+
+
+> Archived at: 2025-01-13
+
+
+
+**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.
+
+- 🔥 **State-of-the-art Performance.**
+
+ MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc.) among models under 7B parameters. It even **outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **comparable performance to Gemini Pro in scene-text understanding**, and **state-of-the-art performance on OCRBench** among open-source models.
+
+- 🏆 **Trustworthy Behavior.**
+
+ LMMs are known for suffering from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench.
+
+- 🌟 **High-Resolution Images at Any Aspect Ratio.**
+
+ MiniCPM-V 2.0 can accept **1.8 million pixels (e.g., 1344x1344) images at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf).
+
+- ⚡️ **High Efficiency.**
+
+ MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference even when dealing with high-resolution images**.
+
+- 🙌 **Bilingual Support.**
+
+ MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
+
+
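The compression idea behind the perceiver resampler described above can be sketched in a few lines: a fixed set of learned queries cross-attends over a variable number of image-patch features, so the output token count stays constant regardless of input resolution. The shapes and single-head attention here are illustrative, not the released architecture:

```python
import numpy as np

# Illustrative sketch of the perceiver-resampler idea (not the released
# architecture): fixed learned queries cross-attend over a variable
# number of image-patch features, yielding a constant token count.
rng = np.random.default_rng(0)
num_patches, dim, num_queries = 1024, 64, 64

patches = rng.standard_normal((num_patches, dim))  # visual-encoder features
queries = rng.standard_normal((num_queries, dim))  # learned latent queries

scores = queries @ patches.T / np.sqrt(dim)        # (64, 1024) attention logits
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over patches
compressed = weights @ patches                     # (64, 64): fixed-length output

print(compressed.shape)  # (64, 64), independent of num_patches
```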
+### Evaluation
+
+
+| Model | +Size | +TextVQA val | +DocVQA test | +OCRBench | +OpenCompass | +MME | +MMB dev(en) | +MMB dev(zh) | +MMMU val | +MathVista | +LLaVA Bench | +Object HalBench | +
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary models | +||||||||||||
| Gemini Pro Vision | +- | +74.6 | +88.1 | +680 | +63.8 | +2148.9 | +75.2 | +74.0 | +48.9 | +45.8 | +79.9 | +- | +
| GPT-4V | +- | +78.0 | +88.4 | +645 | +63.2 | +1771.5 | +75.1 | +75.0 | +53.8 | +47.8 | +93.1 | +86.4 / 92.7 | +
| Open-source models 6B~34B | +||||||||||||
| Yi-VL-6B | +6.7B | +45.5* | +17.1* | +290 | +49.3 | +1915.1 | +68.6 | +68.3 | +40.3 | +28.8 | +51.9 | +- | +
| Qwen-VL-Chat | +9.6B | +61.5 | +62.6 | +488 | +52.1 | +1860.0 | +60.6 | +56.7 | +37.0 | +33.8 | +67.7 | +56.2 / 80.0 | +
| Yi-VL-34B | +34B | +43.4* | +16.9* | +290 | +52.6 | +2050.2 | +71.1 | +71.4 | +45.1 | +30.7 | +62.3 | +- | +
| DeepSeek-VL-7B | +7.3B | +64.7* | +47.0* | +435 | +55.6 | +1765.4 | +74.1 | +72.8 | +38.3 | +36.8 | +77.8 | +- | +
| TextMonkey | +9.7B | +64.3 | +66.7 | +558 | +- | +- | +- | +- | +- | +- | +- | +- | +
| CogVLM-Chat | +17.4B | +70.4 | +33.3* | +590 | +52.5 | +1736.6 | +63.7 | +53.8 | +37.3 | +34.7 | +73.9 | +73.6 / 87.4 | +
| Open-source models 1B~3B | +||||||||||||
| DeepSeek-VL-1.3B | +1.7B | +58.4* | +37.9* | +413 | +46.0 | +1531.6 | +64.0 | +61.2 | +33.8 | +29.4 | +51.1 | +- | +
| MobileVLM V2 | +3.1B | +57.5 | +19.4* | +- | +- | +1440.5(P) | +63.2 | +- | +- | +- | +- | +- | +
| Mini-Gemini | +2.2B | +56.2 | +34.2* | +- | +- | +1653.0 | +59.8 | +- | +31.7 | +- | +- | +- | +
| MiniCPM-V | +2.8B | +60.6 | +38.2 | +366 | +47.6 | +1650.2 | +67.9 | +65.3 | +38.3 | +28.9 | +51.3 | +78.4 / 88.5 | +
| MiniCPM-V 2.0 | +2.8B | +74.1 | +71.9 | +605 | +55.0 | +1808.6 | +69.6 | +68.1 | +38.2 | +38.7 | +69.2 | +85.5 / 92.2 | +
](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
+| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [
](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
diff --git a/docs/minicpm_v2dot6.md b/docs/minicpm_v2dot6.md
new file mode 100644
index 0000000..9ef6dac
--- /dev/null
+++ b/docs/minicpm_v2dot6.md
@@ -0,0 +1,945 @@
+## MiniCPM-V 2.6
+
+> Archived at: 2025-01-13
+
+**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
+
+- 🔥 **Leading Performance.**
+ MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
+
+- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
+
+- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.
+
+- 💪 **Strong OCR Capability and Others.**
+ MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
+ Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
+
+
+- 🚀 **Superior Efficiency.**
+ In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
+
+- 💫 **Easy Usage.**
+MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
+
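The Token Density metric reported in the table below (pixels encoded per visual token) can be reproduced from the figures quoted above: for MiniCPM-V 2.6, a 1.8M-pixel (1344x1344) image is encoded into 640 visual tokens.

```python
# Token Density = number of encoded pixels per visual token. The value
# reproduces the MiniCPM-V 2.6 table entry from the figures quoted above:
# a 1344x1344 (~1.8M-pixel) image encoded into 640 tokens.
def token_density(width: int, height: int, num_tokens: int) -> int:
    return round(width * height / num_tokens)

print(token_density(1344, 1344, 640))  # 2822
```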
+### Evaluation
+
+| Model | +Size | +Token Density+ | +OpenCompass | +MME | +MMVet | +OCRBench | +MMMU val | +MathVista mini | +MMB1.1 test | +AI2D | +TextVQA val | +DocVQA test | +HallusionBench | +Object HalBench | +
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | +||||||||||||||
| GPT-4o | +- | +1088 | +69.9 | +2328.7 | +69.1 | +736 | +69.2 | +61.3 | +82.2 | +84.6 | +- | +92.8 | +55.0 | +17.6 | +
| Claude 3.5 Sonnet | +- | +750 | +67.9 | +1920.0 | +66.0 | +788 | +65.9 | +61.6 | +78.5 | +80.2 | +- | +95.2 | +49.9 | +13.8 | +
| Gemini 1.5 Pro | +- | +- | +64.4 | +2110.6 | +64.0 | +754 | +60.6 | +57.7 | +73.9 | +79.1 | +73.5 | +86.5 | +45.6 | +- | +
| GPT-4o mini | +- | +1088 | +64.1 | +2003.4 | +66.9 | +785 | +60.0 | +52.4 | +76.0 | +77.8 | +- | +- | +46.1 | +12.4 | +
| GPT-4V | +- | +1088 | +63.5 | +2070.2 | +67.5 | +656 | +61.7 | +54.7 | +79.8 | +78.6 | +78.0 | +87.2 | +43.9 | +14.2 | +
| Step-1V | +- | +- | +59.5 | +2206.4 | +63.3 | +625 | +49.9 | +44.8 | +78.0 | +79.2 | +71.6 | +- | +48.4 | +- | +
| Qwen-VL-Max | +- | +784 | +58.3 | +2281.7 | +61.8 | +684 | +52.0 | +43.4 | +74.6 | +75.7 | +79.5 | +93.1 | +41.2 | +13.4 | +
| Open-source | +||||||||||||||
| LLaVA-NeXT-Yi-34B | +34B | +157 | +55.0 | +2006.5 | +50.7 | +574 | +48.8 | +40.4 | +77.8 | +78.9 | +69.3 | +- | +34.8 | +12.6 | +
| Mini-Gemini-HD-34B | +34B | +157 | +- | +2141.0 | +59.3 | +518 | +48.0 | +43.3 | +- | +80.5 | +74.1 | +78.9 | +- | +- | +
| Cambrian-34B | +34B | +1820 | +58.3 | +2049.9 | +53.2 | +591 | +50.4 | +50.3 | +77.8 | +79.5 | +76.7 | +75.5 | +41.6 | +14.7 | +
| GLM-4V-9B | +13B | +784 | +59.1 | +2018.8 | +58.0 | +776 | +46.9 | +51.1 | +67.9 | +71.2 | +- | +- | +45.0 | +- | +
| InternVL2-8B | +8B | +706 | +64.1 | +2215.1 | +54.3 | +794 | +51.2 | +58.3 | +79.4 | +83.6 | +77.4 | +91.6 | +45.0 | +21.3 | +
| MiniCPM-Llama3-V 2.5 | +8B | +1882 | +58.8 | +2024.6 | +52.8 | +725 | +45.8 | +54.3 | +72.0 | +78.4 | +76.6 | +84.8 | +42.4 | +10.3 | +
| MiniCPM-V 2.6 | +8B | +2822 | +65.2 | +2348.4* | +60.0 | +852* | +49.8* | +60.6 | +78.0 | +82.1 | +80.1 | +90.8 | +48.1* | +8.2 | +
| Model | +Size | +Mantis Eval | +BLINK val | +Mathverse mv | +Sciverse mv | +MIRB | +
|---|---|---|---|---|---|---|
| Proprietary | +||||||
| GPT-4V | +- | +62.7 | +54.6 | +60.3 | +66.9 | +53.1 | +
| LLaVA-NeXT-Interleave-14B | +14B | +66.4 | +52.6 | +32.7 | +30.2 | +- | +
| Open-source | +||||||
| Emu2-Chat | +37B | +37.8 | +36.2 | +- | +27.2 | +- | +
| CogVLM | +17B | +45.2 | +41.1 | +- | +- | +- | +
| VPG-C | +7B | +52.4 | +43.1 | +24.3 | +23.1 | +- | +
| VILA 8B | +8B | +51.2 | +39.3 | +- | +36.5 | +- | +
| InternLM-XComposer-2.5 | +8B | +53.1* | +48.9 | +32.1* | +- | +42.5 | +
| InternVL2-8B | +8B | +59.0* | +50.9 | +30.5* | +34.4* | +56.9* | +
| MiniCPM-V 2.6 | +8B | +69.1 | +53.0 | +84.9 | +74.9 | +53.8 | +
| Model | +Size | +Video-MME | +Video-ChatGPT | +|||||
|---|---|---|---|---|---|---|---|---|
| + | + | w/o subs | +w subs | +Correctness | +Detail | +Context | +Temporal | +Consistency | +
| Proprietary | +||||||||
| Claude 3.5 Sonnet | +- | +60.0 | +62.9 | +- | +- | +- | +- | +- | +
| GPT-4V | +- | +59.9 | +63.3 | +- | +- | +- | +- | +- | +
| Open-source | +||||||||
| LLaVA-NeXT-7B | +7B | +- | +- | +3.39 | +3.29 | +3.92 | +2.60 | +3.12 | +
| LLaVA-NeXT-34B | +34B | +- | +- | +3.29 | +3.23 | +3.83 | +2.51 | +3.47 | +
| CogVLM2-Video | +12B | +- | +- | +3.49 | +3.46 | +3.23 | +2.98 | +3.64 | +
| LongVA | +7B | +52.4 | +54.3 | +3.05 | +3.09 | +3.77 | +2.44 | +3.64 | +
| InternVL2-8B | +8B | +54.0 | +56.9 | +- | +- | +- | +- | +- | +
| InternLM-XComposer-2.5 | +8B | +55.8 | +- | +- | +- | +- | +- | +- | +
| LLaVA-NeXT-Video | +32B | +60.2 | +63.0 | +3.48 | +3.37 | +3.95 | +2.64 | +3.28 | +
| MiniCPM-V 2.6 | +8B | +60.9 | +63.6 | +3.59 | +3.28 | +3.93 | +2.73 | +3.62 | +
| Model | +Size | +Shot | +TextVQA val | +VizWiz test-dev | +VQAv2 test-dev | +OK-VQA val | +
|---|---|---|---|---|---|---|
| Flamingo | +80B | +0* | +35.0 | +31.6 | +56.3 | +40.6 | +
| 4 | +36.5 | +39.6 | +63.1 | +57.4 | +||
| 8 | +37.3 | +44.8 | +65.6 | +57.5 | +||
| IDEFICS | +80B | +0* | +30.9 | +36.0 | +60.0 | +45.2 | +
| 4 | +34.3 | +40.4 | +63.6 | +52.4 | +||
| 8 | +35.7 | +46.1 | +64.8 | +55.1 | +||
| OmniCorpus | +7B | +0* | +43.0 | +49.8 | +63.2 | +45.5 | +
| 4 | +45.4 | +51.3 | +64.5 | +46.5 | +||
| 8 | +45.6 | +52.2 | +64.7 | +46.6 | +||
| Emu2 | +37B | +0 | +26.4 | +40.4 | +33.5 | +26.7 | +
| 4 | +48.2 | +54.6 | +67.0 | +53.2 | +||
| 8 | +49.3 | +54.7 | +67.8 | +54.1 | +||
| MM1 | +30B | +0 | +26.2 | +40.4 | +48.9 | +26.7 | +
| 8 | +49.3 | +54.7 | +70.9 | +54.1 | +||
| MiniCPM-V 2.6+ | +8B | +0 | +43.9 | +33.8 | +45.4 | +23.9 | +
| 4 | +63.6 | +60.5 | +65.5 | +50.1 | +||
| 8 | +64.6 | +63.4 | +68.2 | +51.4 | +
+
+
+
+
+
+
+
+