public MiniCPM-V 4.0

This commit is contained in:
tc-mb
2025-08-05 22:19:54 +08:00
parent 8185ac321d
commit e2559a5ca2
2 changed files with 1209 additions and 88 deletions

680
README.md
View File

@@ -22,15 +22,17 @@
<p align="center">
MiniCPM-V 4.0 <a href="https://huggingface.co/openbmb/MiniCPM-V-4">🤗</a> <a href="https://minicpm-v.openbmb.cn/"> 🤖</a> | MiniCPM-o 2.6 <a href="https://huggingface.co/openbmb/MiniCPM-o-2_6">🤗</a> <a href="https://minicpm-omni-webdemo-us.modelbest.cn/"> 🤖</a> | MiniCPM-V 2.6 <a href="https://huggingface.co/openbmb/MiniCPM-V-2_6">🤗</a> <a href="http://120.92.209.146:8887/">🤖</a> | <a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook">🍳 Cookbook</a> |
📄 Technical Blog [<a href="https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9">English</a>/<a href="https://openbmb.notion.site/MiniCPM-o-2-6-GPT-4o-188ede1b7a558084b3aedd669cb80730">中文</a>]
</p>
</div>
**MiniCPM-o** is the latest series of end-side multimodal LLMs (MLLMs) upgraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion. Since February 2024, we have released 6 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in the series currently include:
- **MiniCPM-V 4.0**: 🚀🚀🚀 The latest efficient model in the MiniCPM-V series. With a total of 4B parameters, the model **surpasses GPT-4.1-mini-20250414, Qwen2.5-VL-3B-Instruct, and InternVL2.5-8B** in image understanding on the OpenCompass evaluation. With its small parameter size and efficient architecture, MiniCPM-V 4.0 is an ideal choice for on-device deployment on phones (e.g., **less than 2s first-token delay and more than 17 tokens/s decoding** on iPhone 16 Pro Max using the open-sourced iOS App).
- **MiniCPM-o 2.6**: 🔥🔥🔥 The most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model **achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming**, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 **supports bilingual real-time speech conversation with configurable voices**, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc. It also advances MiniCPM-V 2.6's visual capabilities such as **strong OCR capability, trustworthy behavior, multilingual support, and video understanding**. Due to its superior token density, MiniCPM-o 2.6 can for the first time **support multimodal live streaming on end-side devices** such as iPad.
- **MiniCPM-V 2.6**: The most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single-image, multi-image and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet** in single image understanding, and can for the first time support real-time video understanding on iPad.
@@ -40,11 +42,12 @@
#### 📌 Pinned
* [2025.08.02] 🚀🚀🚀 We open-source MiniCPM-V 4.0, which outperforms GPT-4.1-mini-20250414 in image understanding. It advances the popular features of MiniCPM-V 2.6 and largely improves efficiency. We also open-source the iOS App for iPhone and iPad. Try it now!
* [2025.08.01] 🔥🔥🔥 We've open-sourced the [MiniCPM-V & o Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook)! It provides comprehensive guides for diverse user scenarios, paired with our new [Docs Site](https://minicpm-o.readthedocs.io/en/latest/index.html) for smoother onboarding.
* [2025.06.20] ⭐️⭐️⭐️ Our official [Ollama repository](https://ollama.com/openbmb) is released. Try our latest models with [one click](https://ollama.com/openbmb/minicpm-o2.6)
* [2025.03.01] 🚀🚀🚀 RLAIF-V, which is the alignment technique of MiniCPM-o, is accepted by CVPR 2025! The [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), and [paper](https://arxiv.org/abs/2405.17220) are open-sourced!
@@ -52,7 +55,7 @@
* [2025.01.23] 💡💡💡 MiniCPM-o 2.6 is now supported by [Align-Anything](https://github.com/PKU-Alignment/align-anything), a framework by PKU-Alignment Team for aligning any-to-any modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio. Try it now!
* [2025.01.19] 📢 **ATTENTION!** We are currently working on merging MiniCPM-o 2.6 into the official repositories of llama.cpp, Ollama, and vllm. Until the merge is complete, please USE OUR LOCAL FORKS of [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md), [Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md), and [vllm](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm). **Using the official repositories before the merge may lead to unexpected issues**.
* [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!
@@ -80,7 +83,7 @@
* [2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See [here](#inference-with-vllm).
* [2024.06.03] Now you can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across multiple GPUs. For more details, check this [link](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md).
* [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported in llama.cpp and Ollama! Please pull the latest code **of our provided forks** ([llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md), [Ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)). GGUF models in various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). The MiniCPM-Llama3-V 2.5 series is **not supported by the official repositories yet**, and we are working hard to merge PRs. Please stay tuned!
* [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics [here](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics).
@@ -104,8 +107,11 @@
## Contents <!-- omit in toc -->
- [MiniCPM-V 4.0](#minicpm-v-40)
- [Examples](#examples)
- [MiniCPM-o 2.6](#minicpm-o-26)
- [MiniCPM-V 2.6](#minicpm-v-26)
- [MiniCPM-V \& o Cookbook](#minicpm-v--o-cookbook)
- [Chat with Our Demo on Gradio 🤗](#chat-with-our-demo-on-gradio-)
- [Inference](#inference)
- [Model Zoo](#model-zoo)
@@ -114,23 +120,570 @@
- [In-context Few-shot Learning](#in-context-few-shot-learning)
- [Chat with Video](#chat-with-video)
- [Speech and Audio Mode](#speech-and-audio-mode)
- [Mimick](#mimick)
- [General Speech Conversation with Configurable Voices](#general-speech-conversation-with-configurable-voices)
- [Speech Conversation as an AI Assistant](#speech-conversation-as-an-ai-assistant)
- [Instruction-to-Speech](#instruction-to-speech)
- [Voice Cloning](#voice-cloning)
- [Addressing Various Audio Understanding Tasks](#addressing-various-audio-understanding-tasks)
- [Multimodal Live Streaming](#multimodal-live-streaming)
- [Inference on Multiple GPUs](#inference-on-multiple-gpus)
- [Inference on Mac](#inference-on-mac)
- [Efficient Inference with llama.cpp, Ollama, vLLM](#efficient-inference-with-llamacpp-ollama-vllm)
- [Fine-tuning](#fine-tuning)
- [MiniCPM-V \& o Cookbook](#minicpm-v--o-cookbook)
- [Awesome work using MiniCPM-V \& MiniCPM-o](#awesome-work-using-minicpm-v--minicpm-o)
- [FAQs](#faqs)
- [Limitations](#limitations)
## MiniCPM-V 4.0
**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. The model is built on SigLIP2-400M and MiniCPM4-3B with a total of 4.1B parameters. It inherits the strong single-image, multi-image, and video understanding performance of MiniCPM-V 2.6 with largely improved efficiency. Notable features of MiniCPM-V 4.0 include:
- 🔥 **Leading Visual Capability.**
With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2) and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also shows good performance in multi-image understanding and video understanding.
- 🚀 **Superior Efficiency.**
Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it delivers **less than 2s first-token delay and more than 17 tokens/s decoding on iPhone 16 Pro Max**, without heating problems. It also shows superior throughput under concurrent requests.
- 💫 **Easy Usage.**
MiniCPM-V 4.0 can be easily used in various ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and a local web demo** (see the sketch below). We also open-source an iOS App that runs on iPhone and iPad. Get started easily with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), featuring detailed instructions and practical examples.
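As a concrete illustration of the Ollama path above, the snippet below sketches a call through the `ollama` Python client. The model tag `openbmb/minicpm-v4` is a placeholder assumption; check the official [Ollama page](https://ollama.com/openbmb) for the published tag and pull it first.

```python
import ollama  # pip install ollama; assumes a local Ollama server with the model already pulled

# NOTE: the tag below is a placeholder -- use the tag published on https://ollama.com/openbmb
response = ollama.chat(
    model='openbmb/minicpm-v4',
    messages=[{
        'role': 'user',
        'content': 'Describe this image in one sentence.',
        'images': ['./example.jpg'],  # local image path
    }],
)
print(response['message']['content'])
```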
### Evaluation <!-- omit in toc -->
<details>
<summary>Click to view single image results on OpenCompass. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>Opencompass</th>
<th>OCRBench</th>
<th>MathVista</th>
<th>HallusionBench</th>
<th>MMMU</th>
<th>MMVet</th>
<th>MMBench V1.1</th>
<th>MMStar</th>
<th>AI2D</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>63.5</td>
<td>656</td>
<td>55.2</td>
<td>43.9</td>
<td>61.7</td>
<td>67.5</td>
<td>79.8</td>
<td>56.0</td>
<td>78.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>64.5</td>
<td>754</td>
<td>58.3</td>
<td>45.6</td>
<td>60.6</td>
<td>64.0</td>
<td>73.9</td>
<td>59.1</td>
<td>79.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>68.9</td>
<td>840</td>
<td>70.9</td>
<td>49.3</td>
<td>55.0</td>
<td>74.3</td>
<td>80.9</td>
<td>60.9</td>
<td>76.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>70.6</td>
<td>798</td>
<td>65.3</td>
<td>55.5</td>
<td>66.4</td>
<td>70.1</td>
<td>81.7</td>
<td>65.1</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>64.5</td>
<td>828</td>
<td>61.2</td>
<td>46.6</td>
<td>51.2</td>
<td>60.0</td>
<td>76.8</td>
<td>56.3</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>65.1</td>
<td>820</td>
<td>60.8</td>
<td>46.6</td>
<td>51.8</td>
<td>61.5</td>
<td>78.2</td>
<td>58.7</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>70.9</td>
<td>888</td>
<td>68.1</td>
<td>51.9</td>
<td>58.0</td>
<td>69.7</td>
<td>82.2</td>
<td>64.1</td>
<td>84.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>68.1</td>
<td>821</td>
<td>64.5</td>
<td>49.0</td>
<td>56.2</td>
<td>62.8</td>
<td>82.5</td>
<td>63.2</td>
<td>84.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>65.2</td>
<td>852</td>
<td>60.8</td>
<td>48.1</td>
<td>49.8</td>
<td>60.0</td>
<td>78.0</td>
<td>57.5</td>
<td>82.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>70.2</td>
<td>889</td>
<td>73.3</td>
<td>51.1</td>
<td>50.9</td>
<td>67.2</td>
<td>80.6</td>
<td>63.3</td>
<td>86.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>69.0</td>
<td>894</td>
<td>66.9</td>
<td>50.8</td>
<td>51.2</td>
<td>68.0</td>
<td>79.7</td>
<td>62.8</td>
<td>82.9</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view single image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench and MM HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>ChartQA</th>
<th>MME</th>
<th>RealWorldQA</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>MathVision</th>
<th>DynaMath</th>
<th>WeMath</th>
<th colspan="2">Obj Hal</th>
<th colspan="2">MM Hal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>CHAIRs↓</td>
<td>CHAIRi↓</td>
<td nowrap="nowrap">score avg@3</td>
<td nowrap="nowrap">hall rate avg@3</td>
</tr>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>78.5</td>
<td>1927</td>
<td>61.4</td>
<td>78.0</td>
<td>88.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>87.2</td>
<td>-</td>
<td>67.5</td>
<td>78.8</td>
<td>93.1</td>
<td>41.0</td>
<td>31.5</td>
<td>50.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>45.3</td>
<td>47.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>90.8</td>
<td>-</td>
<td>60.1</td>
<td>74.1</td>
<td>95.2</td>
<td>35.6</td>
<td>35.7</td>
<td>44.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>84.0</td>
<td>2157</td>
<td>65.4</td>
<td>79.3</td>
<td>93.9</td>
<td>21.9</td>
<td>13.2</td>
<td>22.9</td>
<td>18.3</td>
<td>10.8</td>
<td>3.9 </td>
<td>33.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>84.0</td>
<td>2338</td>
<td>64.3</td>
<td>76.8</td>
<td>91.6</td>
<td>18.4</td>
<td>15.2</td>
<td>21.2</td>
<td>13.7</td>
<td>8.7</td>
<td>3.2 </td>
<td>46.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>87.3</td>
<td>2347</td>
<td>68.5</td>
<td>84.9</td>
<td>95.7</td>
<td>25.4</td>
<td>21.8</td>
<td>36.2</td>
<td>13.3</td>
<td>7.9</td>
<td>4.1 </td>
<td>31.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>84.8</td>
<td>2344</td>
<td>70.1</td>
<td>79.1</td>
<td>93.0</td>
<td>17.0</td>
<td>9.4</td>
<td>23.5</td>
<td>18.3</td>
<td>11.6</td>
<td>3.6 </td>
<td>37.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>79.4</td>
<td>2348</td>
<td>65.0</td>
<td>80.1</td>
<td>90.8</td>
<td>17.5</td>
<td>9.0</td>
<td>20.4</td>
<td>7.3</td>
<td>4.7</td>
<td>4.0 </td>
<td>29.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>86.9</td>
<td>2372</td>
<td>68.1</td>
<td>82.0</td>
<td>93.5</td>
<td>21.7</td>
<td>10.4</td>
<td>25.2</td>
<td>6.3</td>
<td>3.4</td>
<td>4.1 </td>
<td>31.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>84.4</td>
<td>2298</td>
<td>68.5</td>
<td>80.8</td>
<td>92.9</td>
<td>20.7</td>
<td>14.2</td>
<td>32.7</td>
<td>6.3</td>
<td>3.5</td>
<td>4.1 </td>
<td>29.2 </td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multi-image and video understanding results on Mantis, Blink and Video-MME. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>Mantis</th>
<th>Blink</th>
<th nowrap="nowrap" colspan="2" >Video-MME</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>wo subs</td>
<td>w subs</td>
</tr>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>59.9</td>
<td>63.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>-</td>
<td>59.1</td>
<td>75.0</td>
<td>81.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>-</td>
<td>68.0</td>
<td>71.9</td>
<td>77.2</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>-</td>
<td>47.6</td>
<td>61.5</td>
<td>67.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>62.7</td>
<td>50.8</td>
<td>62.3</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>-</td>
<td>56.4</td>
<td>65.1</td>
<td>71.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>67.7</td>
<td>54.8</td>
<td>64.2</td>
<td>66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>69.1</td>
<td>53.0</td>
<td>60.9</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>71.9</td>
<td>56.7</td>
<td>63.9</td>
<td>69.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>71.4</td>
<td>54.0</td>
<td>61.2</td>
<td>65.8</td>
</tr>
</tbody>
</table>
</div>
</details>
### Examples
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
</div>
We deploy MiniCPM-V 4.0 on iPhone 16 Pro Max using the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md). The demo videos below are raw screen recordings without any editing.
<table align="center">
<p align="center">
<img src="assets/minicpmv4/iphone_en.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
</p>
<p align="center">
<img src="assets/minicpmv4/iphone_cn.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
</p>
</table>
## MiniCPM-o 2.6
**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
@@ -1117,7 +1670,7 @@ We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw-speed recordin
In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
### Evaluation <!-- omit in toc -->
<div align="center">
@@ -1874,6 +2427,32 @@ We deploy MiniCPM-V 2.6 on end devices. The demo video is the raw screen recordi
| OmniLMM-12B | [Document](././docs/omnilmm_en.md) |
## MiniCPM-V & o Cookbook
Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
**Easy Usage Documentation**
Our comprehensive [documentation website](https://minicpm-o.readthedocs.io/en/latest/index.html) presents every recipe in a clear, well-organized manner.
All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
**Broad User Spectrum**
We support a wide range of users, from individuals to enterprises and researchers.
* **Individuals**: Enjoy effortless inference using [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md) and [Llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md) with minimal setup.
* **Enterprises**: Achieve high-throughput, scalable performance with [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md).
* **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
**Versatile Deployment Scenarios**
Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.
* **Web demo**: Launch an interactive multimodal AI web demo with [FastAPI](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/README.md).
* **Quantized deployment**: Maximize efficiency and minimize resource consumption using [GGUF](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_gguf_quantize.md) and [BNB](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_bnb_quantize.md).
* **End devices**: Bring powerful AI experiences to [iPhone and iPad](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md), supporting offline and privacy-sensitive applications.
## Chat with Our Demo on Gradio 🤗
We provide online and local demos powered by Hugging Face Gradio <a href='https://github.com/gradio-app/gradio'><img src='https://img.shields.io/github/stars/gradio-app/gradio'></a>, the most popular model deployment framework nowadays. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.
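The repository ships ready-made web demo scripts; purely as an illustration of how such a demo is wired, here is a minimal sketch that wraps the `model.chat` API (shown in the Inference section below) in a Gradio interface. The model name and chat signature follow the examples later in this README and should be treated as assumptions.

```python
import gradio as gr
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True)

def answer(image, question):
    # single-turn vision chat: one PIL image plus one text question
    msgs = [{'role': 'user', 'content': [image, question]}]
    return model.chat(msgs=msgs, tokenizer=tokenizer)

demo = gr.Interface(fn=answer,
                    inputs=[gr.Image(type='pil'), gr.Textbox(label='Question')],
                    outputs=gr.Textbox(label='Answer'),
                    title='MiniCPM-V 4.0 chat (sketch)')
demo.launch()  # serves a local web UI, by default at http://127.0.0.1:7860
```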
@@ -1932,6 +2511,10 @@ Open `http://localhost:8000/` in browser and enjoy the vision mode chatbot.
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-V 4.0| GPU | 9 GB | The latest version, strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4) |
| MiniCPM-V 4.0 gguf | CPU | 4 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-gguf) |
| MiniCPM-V 4.0 int4 | GPU | 5 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-int4) |
| MiniCPM-V 4.0 AWQ | GPU | 5 GB | The AWQ quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-AWQ) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-AWQ) |
| MiniCPM-o 2.6| GPU | 18 GB | The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
| MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
| MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
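Loading the quantized variants in the table follows the same pattern as the full-precision checkpoints; below is a minimal sketch, assuming the int4 checkpoint loads through the same `AutoModel` entry point and places its weights on the GPU at load time.

```python
from transformers import AutoModel, AutoTokenizer

# int4 checkpoint: roughly 5 GB of GPU memory according to the table above
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4-int4', trust_remote_code=True)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4-int4', trust_remote_code=True)
```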
@@ -1960,10 +2543,10 @@ from transformers import AutoModel, AutoTokenizer
torch.manual_seed(100)
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
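# A minimal sketch of the rest of this example (cut off in this diff); the chat call follows the
# msgs-based API used throughout this README, so treat the exact wording as an assumption.
question = 'What is the landform in the picture?'
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)

# Second round: append the answer and ask a follow-up question (multi-turn chat).
msgs.append({'role': 'assistant', 'content': [answer]})
msgs.append({'role': 'user', 'content': ['What should I pay attention to when I travel here?']})
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)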
@@ -1991,24 +2574,24 @@ print(answer)
You will get the following output:
```
"The landform in the picture is a mountain range. The mountains appear to be karst formations, characterized by their steep, rugged peaks and smooth, rounded shapes. These types of mountains are often found in regions with limestone bedrock and are shaped by processes such as erosion and weathering. The reflection of the mountains in the water adds to the scenic beauty of the landscape."
"The landform in the picture is karst topography, characterized by its unique and striking limestone formations that rise dramatically from the surrounding landscape."
"When traveling to this scenic location, it's important to pay attention to the weather conditions, as the area appears to be prone to fog and mist, especially during sunrise or sunset. Additionally, ensure you have proper footwear for navigating the potentially slippery terrain around the water. Lastly, respect the natural environment by not disturbing the local flora and fauna."
"When traveling to this picturesque location, you should pay attention to the weather conditions as they can change rapidly in such areas. It's also important to respect local ecosystems and wildlife by staying on designated paths and not disturbing natural habitats. Additionally, bringing appropriate gear for photography is advisable due to the stunning reflections and lighting during sunrise or sunset."
```
#### Chat with Multiple Images
<details>
<summary> Click to view Python code running MiniCPM-V-4 with multiple images input. </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
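# A minimal sketch of the rest of this example (cut off in this diff): both images and the question
# go into a single user turn; the chat call follows the msgs-based API used throughout this README.
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)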
@@ -2026,17 +2609,17 @@ print(answer)
#### In-context Few-shot Learning
<details>
<summary> Click to view Python code running MiniCPM-V-4 with few-shot input. </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
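# A minimal sketch of the rest of this few-shot example (cut off in this diff). File names and the
# example answers below are placeholders; the alternating user/assistant turns follow the msgs-based
# chat API used throughout this README.
answer1 = '2023.08.04'                                  # placeholder ground-truth for example 1
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = '2007.04.24'                                  # placeholder ground-truth for example 2
image_test = Image.open('test.jpg').convert('RGB')      # the image we actually want to query

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]},
]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)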
@@ -2061,7 +2644,7 @@ print(answer)
#### Chat with Video
<details>
<summary> Click to view Python code running MiniCPM-V-4 with video input. </summary>
```python
import torch
@@ -2069,10 +2652,10 @@ from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
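# A minimal sketch of the rest of this example (cut off in this diff). The frame-sampling policy
# below (about 1 fps, capped at MAX_NUM_FRAMES) and the use_image_id/max_slice_nums settings are
# assumptions that mirror the video recipes elsewhere in this repository.
def encode_video(video_path):
    vr = VideoReader(video_path, ctx=cpu(0))
    fps = max(round(vr.get_avg_fps()), 1)
    frame_idx = list(range(0, len(vr), fps))              # roughly one frame per second
    if len(frame_idx) > MAX_NUM_FRAMES:                   # uniformly subsample down to the cap
        step = len(frame_idx) / MAX_NUM_FRAMES
        frame_idx = [frame_idx[int(i * step)] for i in range(MAX_NUM_FRAMES)]
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

frames = encode_video('video.mp4')                        # path is a placeholder
question = 'Describe the video.'
msgs = [{'role': 'user', 'content': frames + [question]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer, use_image_id=False, max_slice_nums=2)
print(answer)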
@@ -2135,7 +2718,7 @@ model.tts.float()
<hr/>
##### Mimick <!-- omit in toc -->
The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.
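Below is a minimal sketch of a Mimick call. The omni-mode loading flags (`init_vision`, `init_audio`, `init_tts`) and the speech-related chat parameters (`use_tts_template`, `generate_audio`, `output_audio_path`) follow the speech examples of this README and should be treated as assumptions; check the model card if your version differs.

```python
import librosa
import torch
from transformers import AutoModel, AutoTokenizer

# Load in omni mode so the audio encoder and TTS head are initialized (flag names assumed).
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16,
                                  init_vision=True, init_audio=True, init_tts=True)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
model.init_tts()

mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('input.wav', sr=16000, mono=True)  # 16 kHz mono waveform; path is a placeholder
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,              # assumed: format the turn for speech output
    generate_audio=True,                # assumed: synthesize audio in addition to text
    output_audio_path='output_mimick.wav',
)
print(res)
```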
@@ -2163,7 +2746,7 @@ res = model.chat(
<hr/>
##### General Speech Conversation with Configurable Voices <!-- omit in toc -->
A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner.
@@ -2205,7 +2788,7 @@ print(res)
<hr/>
##### Speech Conversation as an AI Assistant <!-- omit in toc -->
An enhanced feature of `MiniCPM-o-2.6` is to act as an AI assistant, but with only a limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**, and it follows instructions more closely. For demos, we suggest using `assistant_female_voice`, `assistant_male_voice`, and `assistant_default_female_voice`. Other voices may work, but they are not as stable as the default voices.
@@ -2248,7 +2831,7 @@ print(res)
<hr/>
##### Instruction-to-Speech <!-- omit in toc -->
`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.
@@ -2271,7 +2854,7 @@ res = model.chat(
<hr/>
##### Voice Cloning <!-- omit in toc -->
`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts like a TTS model.
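A minimal sketch is shown below, assuming the model exposes a `get_sys_prompt` helper with a voice-cloning mode (the exact mode string and chat parameters are assumptions based on the other speech examples in this README); `model` and `tokenizer` are loaded in omni mode as in the Mimick sketch above.

```python
import librosa

# Reference audio that provides the voice to clone (16 kHz mono); the path is a placeholder.
ref_audio, _ = librosa.load('reference_voice.wav', sr=16000, mono=True)

# Build the system turn from the reference audio; the mode name is an assumption.
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')

text_to_read = 'MiniCPM-o 2.6 can clone a voice from a short reference clip.'
msgs = [sys_prompt, {'role': 'user', 'content': ['Please read the text below.', text_to_read]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    output_audio_path='result_voice_cloning.wav',
)
```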
@@ -2298,7 +2881,7 @@ res = model.chat(
<hr/>
##### Addressing Various Audio Understanding Tasks <!-- omit in toc -->
`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
@@ -2515,11 +3098,11 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
</details>
### Efficient Inference with llama.cpp, Ollama, vLLM
See [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md) for more detail. This implementation supports smooth inference of 16~18 token/s on iPad (test environment: iPad Pro + M4).
See [our fork of Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more detail. This implementation supports smooth inference of 16~18 token/s on iPad (test environment: iPad Pro + M4).
<details>
@@ -2565,31 +3148,6 @@ We now support MiniCPM-V series fine-tuning with the SWIFT framework. SWIFT supp
Best practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md), [MiniCPM-V 2.6](https://github.com/modelscope/ms-swift/issues/1613).
## Awesome work using MiniCPM-V & MiniCPM-o
- [text-extract-api](https://github.com/CatchTheTornado/text-extract-api): Document extraction API using OCRs and Ollama supported models ![GitHub Repo stars](https://img.shields.io/github/stars/CatchTheTornado/text-extract-api)

View File

@@ -13,12 +13,12 @@
<a href="docs/wechat.md" target="_blank"> 微信社区</a> &nbsp;|
</span>
<span style="display: inline-flex; align-items: center; margin-left: 2px;">
MiniCPM-V <a href="docs/best_practice_summary_zh.md" target="_blank">&nbsp; 📖 Best Practices</a>&nbsp;
<a href="https://github.com/OpenSQZ/MiniCPM-V-CookBook" target="_blank">&nbsp;🍳 Cookbook</a>&nbsp;
</span>
<p align="center">
MiniCPM-V 4.0 <a href="https://huggingface.co/openbmb/MiniCPM-V-4">🤗</a> <a href="https://minicpm-v.openbmb.cn/"> 🤖</a> | MiniCPM-o 2.6 <a href="https://huggingface.co/openbmb/MiniCPM-o-2_6">🤗</a> <a href="https://minicpm-omni-webdemo-us.modelbest.cn/"> 🤖</a> | MiniCPM-V 2.6 <a href="https://huggingface.co/openbmb/MiniCPM-V-2_6">🤗</a> <a href="http://120.92.209.146:8887/">🤖</a> |
📄 Technical Report [<a href="https://openbmb.notion.site/MiniCPM-o-2-6-GPT-4o-188ede1b7a558084b3aedd669cb80730">Chinese</a>/<a href="https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9">English</a>]
</p>
@@ -27,6 +27,8 @@
**MiniCPM-o** is the latest series of end-side multimodal LLMs upgraded from MiniCPM-V. The models can take images, video, text, and audio as inputs and generate high-quality text and speech outputs in an end-to-end fashion. Since February 2024 we have released 6 versions of the model, aiming for strong performance and efficient deployment. The most notable models in the series currently include:
- **MiniCPM-V 4.0**: 🚀🚀🚀 The latest efficient model in the MiniCPM-V series, with a total of 4B parameters. It surpasses GPT-4.1-mini-20250414, Qwen2.5-VL-3B-Instruct, and InternVL2.5-8B in image understanding on the OpenCompass evaluation. With its small parameter size and efficient architecture, MiniCPM-V 4.0 is an ideal choice for on-device deployment on phones (e.g., less than 2s first-token delay and more than 17 tokens/s decoding on iPhone 16 Pro Max using the open-sourced iOS App).
- **MiniCPM-o 2.6**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-o series, with a total of 8B parameters. **Its vision, speech, and multimodal live-streaming capabilities reach the level of GPT-4o-202405**, making it one of the most versatile and performant models in the open-source community. In the new voice mode, MiniCPM-o 2.6 **supports bilingual (Chinese/English) real-time speech conversation with configurable voices, plus advanced capabilities such as emotion/speed/style control, end-to-end voice cloning, and role play**. It also further improves MiniCPM-V 2.6's **OCR, trustworthy behavior, multilingual support, and video understanding**. Thanks to its leading visual token density, MiniCPM-o 2.6 is the **first MLLM to support multimodal live streaming on end-side devices** such as iPad.
@@ -37,10 +39,12 @@
#### 📌 Pinned
* [2025.08.02] 🚀🚀🚀 We open-source MiniCPM-V 4.0, which surpasses GPT-4.1-mini-20250414 in image understanding. It inherits the many practical features of MiniCPM-V 2.6 and greatly improves inference efficiency. We also open-source an iOS App for iPhone and iPad. Try it now!
* [2025.08.01] 🔥🔥🔥 We open-source the [MiniCPM-V & o Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which provides full-scenario usage guides for different audiences, together with the new [documentation site](https://minicpm-o.readthedocs.io/en/latest/index.html) for easier onboarding!
* [2025.06.20] ⭐️⭐️⭐️ The official MiniCPM-o [Ollama repository](https://ollama.com/openbmb) now supports MiniCPM-o 2.6 and other models. Try them with [one click](https://ollama.com/openbmb/minicpm-o2.6)!
* [2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique behind the MiniCPM-o series, is accepted by CVPR 2025! Its [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), and [paper](https://arxiv.org/abs/2405.17220) are open-sourced.
@@ -48,7 +52,7 @@
* [2025.01.23] 💡💡💡 MiniCPM-o 2.6 is now integrated into [Align-Anything](https://github.com/PKU-Alignment/align-anything), a framework developed by the PKU-Alignment team for aligning any-to-any-modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio modalities. Try it out!
* [2025.01.19] 📢 **ATTENTION!** We are working on merging MiniCPM-o 2.6 support into the official llama.cpp, Ollama, and vLLM repositories, but the merge is not finished yet. Until then, please use our forks for deployment: [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md), [Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md), [vllm](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm). **Using the official repositories before the merge may lead to unexpected issues.**
* [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!
@@ -76,7 +80,7 @@
* [2024.07.19] MiniCPM-Llama3-V 2.5 now supports [vLLM](#vllm-部署-)!
* [2024.06.03] You can now run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. See this [document](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) for details.
* [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5. More memory-usage statistics can be found [here](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics).
* [2024.05.28] 💥 MiniCPM-Llama3-V 2.5 is now fully supported in llama.cpp and Ollama! **Please pull the latest code of our forks**: [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) & [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5). We also release GGUF versions in various sizes [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). Note that **the official repositories do not support MiniCPM-Llama3-V 2.5 yet**; we are actively working to merge these features into the official llama.cpp & Ollama repositories. Stay tuned!
* [2024.05.25] MiniCPM-Llama3-V 2.5 now [supports streaming outputs and custom system prompts](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage). Try it out!
* [2024.05.24] We open-source the MiniCPM-Llama3-V 2.5 [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf), supporting [llama.cpp](#llamacpp-部署) inference with smooth end-side decoding of 6-8 tokens/s. Try it now!
* [2024.05.23] 🔍 We add a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, covering benchmark evaluations, multilingual capability, and inference efficiency 🌟📊🌍🚀. Click [here](./docs/compare_with_phi-3_vision.md) for details.
@@ -94,6 +98,7 @@
## Contents <!-- omit in toc -->
- [MiniCPM-V 4.0](#minicpm-v-40)
- [MiniCPM-o 2.6](#minicpm-o-26)
- [MiniCPM-V 2.6](#minicpm-v-26)
- [Chat with Our Demo on Gradio 🤗](#chat-with-our-demo-on-gradio-)
@@ -104,19 +109,573 @@
- [In-context Few-shot Learning](#少样本上下文对话)
- [Chat with Video](#视频对话)
- [Speech Conversation](#语音对话)
- [Mimick](#mimick)
- [Speech Conversation with Configurable Voices](#可配置声音的语音对话)
- [More Speech Tasks](#更多语音任务)
- [Multimodal Live Streaming](#多模态流式交互)
- [Inference on Multiple GPUs](#多卡推理)
- [Inference on Mac](#mac-推理)
- [Efficient Inference with llama.cpp, Ollama, vLLM](#基于-llamacppollamavllm-的高效推理)
- [Fine-tuning](#微调)
- [MiniCPM-V \& o Cookbook](#minicpm-v--o-使用手册)
- [Awesome Work Using MiniCPM-V \& MiniCPM-o](#基于-minicpm-v--minicpm-o-的更多项目)
- [FAQs](#faqs)
- [Limitations](#模型局限性)
## MiniCPM-V 4.0
MiniCPM-V 4.0 is the latest model in the MiniCPM-V series. Built on SigLIP2-400M and MiniCPM4-3B with a total of 4.1B parameters, it carries over MiniCPM-V 2.6's strong single-image, multi-image, and video understanding while greatly improving inference efficiency. Notable features of MiniCPM-V 4.0 include:
- 🔥 **Leading visual capability.**
MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, surpassing MiniCPM-V 2.6 (8.1B, 65.2), Qwen2.5-VL-3B-Instruct (3.8B, 64.5), and **the widely used proprietary model GPT-4.1-mini-20250414**. It also performs well on multi-image and video understanding tasks.
- 🚀 **Superior efficiency.**
Optimized for end-side devices, MiniCPM-V 4.0 **runs smoothly on iPhone 16 Pro Max with a first-token delay as low as 2 seconds and a decoding speed of 17.9 tokens/s**, without heating problems. It also delivers leading throughput under concurrent requests.
- 💫 **Easy to use.**
MiniCPM-V 4.0 supports many inference options, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and a local web demo**. We also open-source an iOS App that runs on iPhone and iPad. See our well-structured **[Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook)** for detailed deployment guides and real-world examples.
### Evaluation <!-- omit in toc -->
<details>
<summary>Click to view single image results on OpenCompass. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>Opencompass</th>
<th>OCRBench</th>
<th>MathVista</th>
<th>HallusionBench</th>
<th>MMMU</th>
<th>MMVet</th>
<th>MMBench V1.1</th>
<th>MMStar</th>
<th>AI2D</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>63.5</td>
<td>656</td>
<td>55.2</td>
<td>43.9</td>
<td>61.7</td>
<td>67.5</td>
<td>79.8</td>
<td>56.0</td>
<td>78.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>64.5</td>
<td>754</td>
<td>58.3</td>
<td>45.6</td>
<td>60.6</td>
<td>64.0</td>
<td>73.9</td>
<td>59.1</td>
<td>79.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>68.9</td>
<td>840</td>
<td>70.9</td>
<td>49.3</td>
<td>55.0</td>
<td>74.3</td>
<td>80.9</td>
<td>60.9</td>
<td>76.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>70.6</td>
<td>798</td>
<td>65.3</td>
<td>55.5</td>
<td>66.4</td>
<td>70.1</td>
<td>81.7</td>
<td>65.1</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>64.5</td>
<td>828</td>
<td>61.2</td>
<td>46.6</td>
<td>51.2</td>
<td>60.0</td>
<td>76.8</td>
<td>56.3</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>65.1</td>
<td>820</td>
<td>60.8</td>
<td>46.6</td>
<td>51.8</td>
<td>61.5</td>
<td>78.2</td>
<td>58.7</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>70.9</td>
<td>888</td>
<td>68.1</td>
<td>51.9</td>
<td>58.0</td>
<td>69.7</td>
<td>82.2</td>
<td>64.1</td>
<td>84.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>68.1</td>
<td>821</td>
<td>64.5</td>
<td>49.0</td>
<td>56.2</td>
<td>62.8</td>
<td>82.5</td>
<td>63.2</td>
<td>84.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>65.2</td>
<td>852</td>
<td>60.8</td>
<td>48.1</td>
<td>49.8</td>
<td>60.0</td>
<td>78.0</td>
<td>57.5</td>
<td>82.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>70.2</td>
<td>889</td>
<td>73.3</td>
<td>51.1</td>
<td>50.9</td>
<td>67.2</td>
<td>80.6</td>
<td>63.3</td>
<td>86.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>69.0</td>
<td>894</td>
<td>66.9</td>
<td>50.8</td>
<td>51.2</td>
<td>68.0</td>
<td>79.7</td>
<td>62.8</td>
<td>82.9</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view results on chart understanding, document understanding, math reasoning, hallucination, and other benchmarks. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>ChartQA</th>
<th>MME</th>
<th>RealWorldQA</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>MathVision</th>
<th>DynaMath</th>
<th>WeMath</th>
<th colspan="2">Obj Hal</th>
<th colspan="2">MM Hal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>CHAIRs↓</td>
<td>CHAIRi↓</td>
<td nowrap="nowrap">score avg@3</td>
<td nowrap="nowrap">hall rate avg@3</td>
</tr>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>78.5</td>
<td>1927</td>
<td>61.4</td>
<td>78.0</td>
<td>88.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>87.2</td>
<td>-</td>
<td>67.5</td>
<td>78.8</td>
<td>93.1</td>
<td>41.0</td>
<td>31.5</td>
<td>50.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>45.3</td>
<td>47.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>90.8</td>
<td>-</td>
<td>60.1</td>
<td>74.1</td>
<td>95.2</td>
<td>35.6</td>
<td>35.7</td>
<td>44.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>84.0</td>
<td>2157</td>
<td>65.4</td>
<td>79.3</td>
<td>93.9</td>
<td>21.9</td>
<td>13.2</td>
<td>22.9</td>
<td>18.3</td>
<td>10.8</td>
<td>3.9 </td>
<td>33.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>84.0</td>
<td>2338</td>
<td>64.3</td>
<td>76.8</td>
<td>91.6</td>
<td>18.4</td>
<td>15.2</td>
<td>21.2</td>
<td>13.7</td>
<td>8.7</td>
<td>3.2 </td>
<td>46.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>87.3</td>
<td>2347</td>
<td>68.5</td>
<td>84.9</td>
<td>95.7</td>
<td>25.4</td>
<td>21.8</td>
<td>36.2</td>
<td>13.3</td>
<td>7.9</td>
<td>4.1 </td>
<td>31.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>84.8</td>
<td>2344</td>
<td>70.1</td>
<td>79.1</td>
<td>93.0</td>
<td>17.0</td>
<td>9.4</td>
<td>23.5</td>
<td>18.3</td>
<td>11.6</td>
<td>3.6 </td>
<td>37.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>79.4</td>
<td>2348</td>
<td>65.0</td>
<td>80.1</td>
<td>90.8</td>
<td>17.5</td>
<td>9.0</td>
<td>20.4</td>
<td>7.3</td>
<td>4.7</td>
<td>4.0 </td>
<td>29.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>86.9</td>
<td>2372</td>
<td>68.1</td>
<td>82.0</td>
<td>93.5</td>
<td>21.7</td>
<td>10.4</td>
<td>25.2</td>
<td>6.3</td>
<td>3.4</td>
<td>4.1 </td>
<td>31.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>84.4</td>
<td>2298</td>
<td>68.5</td>
<td>80.8</td>
<td>92.9</td>
<td>20.7</td>
<td>14.2</td>
<td>32.7</td>
<td>6.3</td>
<td>3.5</td>
<td>4.1 </td>
<td>29.2 </td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multi-image and video understanding results. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>Mantis</th>
<th>Blink</th>
<th nowrap="nowrap" colspan="2" >Video-MME</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>wo subs</td>
<td>w subs</td>
</tr>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>59.9</td>
<td>63.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>-</td>
<td>59.1</td>
<td>75.0</td>
<td>81.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>-</td>
<td>68.0</td>
<td>71.9</td>
<td>77.2</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>-</td>
<td>47.6</td>
<td>61.5</td>
<td>67.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>62.7</td>
<td>50.8</td>
<td>62.3</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>-</td>
<td>56.4</td>
<td>65.1</td>
<td>71.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>67.7</td>
<td>54.8</td>
<td>64.2</td>
<td>66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>69.1</td>
<td>53.0</td>
<td>60.9</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>71.9</td>
<td>56.7</td>
<td>63.9</td>
<td>69.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>71.4</td>
<td>54.0</td>
<td>61.2</td>
<td>65.8</td>
</tr>
</tbody>
</table>
</div>
</details>
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="assets/minicpmv4/minicpm-v-4-case.jpeg" alt="math" style="margin-bottom: 5px;">
</div>
We deployed MiniCPM-V 4.0 on an iPhone 16 Pro Max using the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md) and recorded the following screen captures. The videos are shown as recorded, without acceleration or any other editing:
<table align="center">
<p align="center">
<img src="assets/minicpmv4/iphone_en.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
</p>
<p align="center">
<img src="assets/minicpmv4/iphone_cn.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
</p>
</table>
## MiniCPM-o 2.6
@@ -1874,6 +2433,10 @@ python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:--------------|:-:|:----------:|:-------------------|:---------------:|
| MiniCPM-V 4.0 | GPU | 9 GB | Provides excellent end-side single-image, multi-image, and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4) |
| MiniCPM-V 4.0 gguf | CPU | 4 GB | The gguf version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-gguf) |
| MiniCPM-V 4.0 int4 | GPU | 5 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-int4) |
| MiniCPM-V 4.0 AWQ | GPU | 5 GB | The AWQ quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-AWQ) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-AWQ) |
| MiniCPM-o 2.6 | GPU | 18 GB | The latest version, with GPT-4o-level vision, speech, and multimodal live-streaming capabilities on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
| MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
| MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
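The quantized checkpoints listed above are drop-in replacements for the full-precision repos used in the `transformers` examples below. A minimal sketch, assuming the int4 repo exposes the same remote-code interface as the full-precision checkpoint (the weights are already quantized, so no explicit dtype is passed):

```python
# Hedged sketch: loading the int4-quantized MiniCPM-V 4.0 listed above.
# Assumes the repo follows the same trust_remote_code interface as the
# full-precision checkpoint; chat usage is then identical to the examples below.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4-int4', trust_remote_code=True)
model.eval()
```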
@@ -1903,10 +2466,10 @@ from transformers import AutoModel, AutoTokenizer
torch.manual_seed(100)
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True)
image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
@@ -1934,24 +2497,24 @@ print(answer)
You can get the following inference results:
```
"The landform in the picture is a mountain range. The mountains appear to be karst formations, characterized by their steep, rugged peaks and smooth, rounded shapes. These types of mountains are often found in regions with limestone bedrock and are shaped by processes such as erosion and weathering. The reflection of the mountains in the water adds to the scenic beauty of the landscape."
"The landform in the picture is karst topography, characterized by its unique and striking limestone formations that rise dramatically from the surrounding landscape."
"When traveling to this scenic location, it's important to pay attention to the weather conditions, as the area appears to be prone to fog and mist, especially during sunrise or sunset. Additionally, ensure you have proper footwear for navigating the potentially slippery terrain around the water. Lastly, respect the natural environment by not disturbing the local flora and fauna."
"When traveling to this picturesque location, you should pay attention to the weather conditions as they can change rapidly in such areas. It's also important to respect local ecosystems and wildlife by staying on designated paths and not disturbing natural habitats. Additionally, bringing appropriate gear for photography is advisable due to the stunning reflections and lighting during sunrise or sunset."
```
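For reference, the answers above come from a two-turn conversation over the same image. A minimal sketch of that flow, reusing `model`, `tokenizer`, and `image` from the snippet above (the question strings and the plain non-streaming call are illustrative, not verbatim from this example):

```python
# First turn: ask about the image.
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)

# Second turn: append the previous answer so the model sees the conversation history.
msgs.append({'role': 'assistant', 'content': [answer]})
msgs.append({'role': 'user', 'content': ["What should I pay attention to when traveling here?"]})
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```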
#### Chat with Multiple Images
<details>
<summary> Click to view Python code for chatting with MiniCPM-o 2.6 using multiple images. </summary>
<summary> Click to view Python code for chatting with MiniCPM-V-4 using multiple images. </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True)
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
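# --- Hedged sketch of the rest of this example (illustrative, not verbatim) ---
# Both images go into a single user turn; model.chat() takes the same
# msgs/tokenizer arguments as the single-image example above.
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)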
@@ -1969,17 +2532,17 @@ print(answer)
#### In-context Few-shot Learning
<details>
<summary> Click to view Python code for in-context few-shot learning with MiniCPM-o 2.6. </summary>
<summary> Click to view Python code for in-context few-shot learning with MiniCPM-V-4. </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True)
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
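# --- Hedged sketch of the remaining few-shot setup (illustrative, not verbatim) ---
# Each in-context example is a (user, assistant) pair; the answer strings below are
# placeholder labels for the example images, and the final user turn is the query.
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)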
@@ -2004,7 +2567,7 @@ print(answer)
#### Chat with Video
<details>
<summary> Click to view Python code for chatting with MiniCPM-o 2.6 using video input. </summary>
<summary> Click to view Python code for chatting with MiniCPM-V-4 using video input. </summary>
```python
import torch
@@ -2012,10 +2575,10 @@ from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True)
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
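# --- Hedged sketch of the rest of this example (illustrative, not verbatim) ---
# Sample at most MAX_NUM_FRAMES frames (roughly one per second), convert them to
# PIL images, and send them as a single user turn. use_image_id=False and a small
# max_slice_nums are the usual decode settings for video input with these models.
def encode_video(video_path):
    def uniform_sample(seq, n):
        gap = len(seq) / n
        return [seq[int(i * gap + gap / 2)] for i in range(n)]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps())  # sample ~1 frame per second
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

frames = encode_video('video_test.mp4')
question = "Describe the video"
msgs = [{'role': 'user', 'content': frames + [question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=2,  # use 1 if CUDA OOM and video resolution > 448*448
)
print(answer)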
@@ -2076,7 +2639,7 @@ model.tts.float()
</details>
##### Mimick
##### Mimick <!-- omit in toc -->
<details> <summary> Click to view Python code for end-to-end speech understanding and generation with MiniCPM-o 2.6. </summary>
@@ -2099,7 +2662,7 @@ res = model.chat(
</details>
##### Speech Conversation with Configurable Voices
##### Speech Conversation with Configurable Voices <!-- omit in toc -->
<details> <summary> Click to view Python code for customizing the conversation voice of MiniCPM-o 2.6.</summary>
```python
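# --- Hedged sketch (assumed interface, not verbatim from this example) ---
# A short reference recording configures the assistant's voice. The helper name
# `get_sys_prompt`, the 'audio_assistant' mode, and the chat kwargs below are
# assumptions based on the MiniCPM-o 2.6 model card; the file paths are hypothetical.
import librosa

ref_audio, _ = librosa.load('ref_voice.wav', sr=16000, mono=True)        # hypothetical path
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')

user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True)   # hypothetical path
msgs = [sys_prompt, {'role': 'user', 'content': [user_audio]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)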
@@ -2145,7 +2708,7 @@ print(res)
</details>
##### More Speech Tasks
##### More Speech Tasks <!-- omit in toc -->
<details>
<summary> Click to view Python code for performing more speech tasks with MiniCPM-o 2.6. </summary>
@@ -2398,11 +2961,11 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
</details>
### Efficient Inference with llama.cpp, ollama, and vLLM
### Efficient Inference with llama.cpp, Ollama, and vLLM
For llama.cpp usage, please refer to [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md). It supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).
For ollama usage, please refer to [our fork of ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md). It supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).
For Ollama usage, please refer to [our fork of Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md). It supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).
<details>
<summary>Click to view: vLLM now officially supports MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5, and MiniCPM-V 2.0. </summary>