add minicpm-v-4.5 (#963)

Co-authored-by: wangchongyi <>
YuzaChongyi committed 2025-08-26 05:20:58 +08:00 (committed by GitHub)
parent 2ef22c138e
commit 06e220c8f4
27 changed files with 3154 additions and 2689 deletions

docs/minicpm_v2dot6_en.md Normal file

@@ -0,0 +1,945 @@
## MiniCPM-V 2.6
> Archived at: 2025-01-13
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
- 🔥 **Leading Performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.
- 💪 **Strong OCR Capability and Others.**
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
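As a quick start for option (2), here is a minimal sketch of loading the int4-quantized checkpoint; it assumes the Hugging Face repo linked above keeps the same `chat` interface as the bf16 model used in the examples later in this document:
```python
# Minimal sketch (assumptions: the int4 repo id linked above, and an unchanged
# chat() interface; device placement of the quantized weights may differ).
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')  # hypothetical local image
msgs = [{'role': 'user', 'content': [image, 'What is in this image?']}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```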
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/radar_final.png width=66% />
</div>
<details>
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>MME</th>
<th>MMVet</th>
<th>OCRBench</th>
<th>MMMU val</th>
<th>MathVista mini</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>HallusionBench</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="15" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o</td>
<td>-</td>
<td>1088</td>
<td>69.9</td>
<td>2328.7</td>
<td>69.1</td>
<td>736</td>
<td>69.2</td>
<td>61.3</td>
<td>82.2</td>
<td>84.6</td>
<td>-</td>
<td>92.8</td>
<td>55.0</td>
<td>17.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>1920.0</td>
<td>66.0</td>
<td>788</td>
<td>65.9</td>
<td>61.6</td>
<td>78.5</td>
<td>80.2</td>
<td>-</td>
<td>95.2</td>
<td>49.9</td>
<td>13.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>2110.6</td>
<td>64.0</td>
<td>754</td>
<td>60.6</td>
<td>57.7</td>
<td>73.9</td>
<td>79.1</td>
<td>73.5</td>
<td>86.5</td>
<td>45.6</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o mini</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>2003.4</td>
<td>66.9</td>
<td>785</td>
<td>60.0</td>
<td>52.4</td>
<td>76.0</td>
<td>77.8</td>
<td>-</td>
<td>-</td>
<td>46.1</td>
<td>12.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>1088</td>
<td>63.5</td>
<td>2070.2</td>
<td>67.5</td>
<td>656</td>
<td>61.7</td>
<td>54.7</td>
<td>79.8</td>
<td>78.6</td>
<td>78.0</td>
<td>87.2</td>
<td>43.9</td>
<td>14.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Step-1V</td>
<td>-</td>
<td>-</td>
<td>59.5</td>
<td>2206.4</td>
<td>63.3</td>
<td>625</td>
<td>49.9</td>
<td>44.8</td>
<td>78.0</td>
<td>79.2</td>
<td>71.6</td>
<td>-</td>
<td>48.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
<td>-</td>
<td>784</td>
<td>58.3</td>
<td>2281.7</td>
<td>61.8</td>
<td>684</td>
<td>52.0</td>
<td>43.4</td>
<td>74.6</td>
<td>75.7</td>
<td>79.5</td>
<td>93.1</td>
<td>41.2</td>
<td>13.4</td>
</tr>
<tr>
<td colspan="15" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
<td>34B</td>
<td>157</td>
<td>55.0</td>
<td>2006.5</td>
<td>50.7</td>
<td>574</td>
<td>48.8</td>
<td>40.4</td>
<td>77.8</td>
<td>78.9</td>
<td>69.3</td>
<td>-</td>
<td>34.8</td>
<td>12.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
<td>34B</td>
<td>157</td>
<td>-</td>
<td>2141.0</td>
<td>59.3</td>
<td>518</td>
<td>48.0</td>
<td>43.3</td>
<td>-</td>
<td>80.5</td>
<td>74.1</td>
<td>78.9</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td>1820</td>
<td>58.3</td>
<td>2049.9</td>
<td>53.2</td>
<td>591</td>
<td>50.4</td>
<td>50.3</td>
<td>77.8</td>
<td>79.5</td>
<td>76.7</td>
<td>75.5</td>
<td>41.6</td>
<td>14.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>2018.8</td>
<td>58.0</td>
<td>776</td>
<td>46.9</td>
<td>51.1</td>
<td>67.9</td>
<td>71.2</td>
<td>-</td>
<td>-</td>
<td>45.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>706</td>
<td>64.1</td>
<td>2215.1</td>
<td>54.3</td>
<td>794</td>
<td><strong>51.2</strong></td>
<td>58.3</td>
<td><strong>79.4</strong></td>
<td><strong>83.6</strong></td>
<td>77.4</td>
<td><strong>91.6</strong></td>
<td>45.0</td>
<td>21.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
<td>8B</td>
<td>1882</td>
<td>58.8</td>
<td>2024.6</td>
<td>52.8</td>
<td>725</td>
<td>45.8</td>
<td>54.3</td>
<td>72.0</td>
<td>78.4</td>
<td>76.6</td>
<td>84.8</td>
<td>42.4</td>
<td>10.3</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>65.2</strong></td>
<td><strong>2348.4</strong>*</td>
<td><strong>60.0</strong></td>
<td><strong>852</strong>*</td>
<td>49.8*</td>
<td><strong>60.6</strong></td>
<td>78.0</td>
<td>82.1</td>
<td><strong>80.1</strong></td>
<td>90.8</td>
<td><strong>48.1</strong>*</td>
<td><strong>8.2</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the starred benchmarks using chain-of-thought prompting. Specifically, for MME, we use this technique only for the Cognition set.
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
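To make the definition concrete, the arithmetic below reproduces the 2822 token density reported for MiniCPM-V 2.6, assuming the 1344x1344 example maximum resolution mentioned above:
```python
# Token density = (# pixels at maximum resolution) / (# visual tokens)
max_pixels = 1344 * 1344      # ~1.8M pixels, the example maximum resolution cited above
visual_tokens = 640           # visual tokens MiniCPM-V 2.6 produces for such an image
print(round(max_pixels / visual_tokens))  # -> 2822, matching the table
```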
</details>
<details>
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Mantis Eval</th>
<th>BLINK val</th>
<th>Mathverse mv</th>
<th>Sciverse mv</th>
<th>MIRB</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>60.3</td>
<td>66.9</td>
<td>53.1</td>
</tr>
<tr>
<td colspan="7" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
<td>14B</td>
<td>66.4</td>
<td>52.6</td>
<td>32.7</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Emu2-Chat</td>
<td>37B</td>
<td>37.8</td>
<td>36.2</td>
<td>-</td>
<td>27.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM</td>
<td>17B</td>
<td>45.2</td>
<td>41.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VPG-C</td>
<td>7B</td>
<td>52.4</td>
<td>43.1</td>
<td>24.3</td>
<td>23.1</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA 8B</td>
<td>8B</td>
<td>51.2</td>
<td>39.3</td>
<td>-</td>
<td>36.5</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>53.1*</td>
<td>48.9</td>
<td>32.1*</td>
<td>-</td>
<td>42.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>59.0*</td>
<td>50.9</td>
<td>30.5*</td>
<td>34.4*</td>
<td><strong>56.9*</strong></td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>69.1</strong></td>
<td><strong>53.0</strong></td>
<td><strong>84.9</strong></td>
<td><strong>74.9</strong></td>
<td>53.8</td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoints ourselves.
</details>
<details>
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th colspan="2">Video-MME</th>
<th colspan="5">Video-ChatGPT</th>
</tr>
<tr>
<th align="left"></th>
<th></th>
<th>w/o subs</th>
<th>w subs</th>
<th>Correctness</th>
<th>Detail</th>
<th>Context</th>
<th>Temporal</th>
<th>Consistency</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="9" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>60.0</td>
<td>62.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>59.9</td>
<td>63.3</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
<td>7B</td>
<td>-</td>
<td>-</td>
<td>3.39</td>
<td>3.29</td>
<td>3.92</td>
<td>2.60</td>
<td>3.12</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
<td>34B</td>
<td>-</td>
<td>-</td>
<td>3.29</td>
<td>3.23</td>
<td>3.83</td>
<td>2.51</td>
<td>3.47</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
<td>12B</td>
<td>-</td>
<td>-</td>
<td>3.49</td>
<td><strong>3.46</strong></td>
<td>3.23</td>
<td><strong>2.98</strong></td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>52.4</td>
<td>54.3</td>
<td>3.05</td>
<td>3.09</td>
<td>3.77</td>
<td>2.44</td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>54.0</td>
<td>56.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>55.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
<td>32B</td>
<td>60.2</td>
<td>63.0</td>
<td>3.48</td>
<td>3.37</td>
<td><strong>3.95</strong></td>
<td>2.64</td>
<td>3.28</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>60.9</strong></td>
<td><strong>63.6</strong></td>
<td><strong>3.59</strong></td>
<td>3.28</td>
<td>3.93</td>
<td>2.73</td>
<td>3.62</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Shot</th>
<th>TextVQA val</th>
<th>VizWiz test-dev</th>
<th>VQAv2 test-dev</th>
<th>OK-VQA val</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>35.0</td>
<td>31.6</td>
<td>56.3</td>
<td>40.6</td>
</tr>
<tr>
<td>4</td>
<td>36.5</td>
<td>39.6</td>
<td>63.1</td>
<td><strong>57.4</strong></td>
</tr>
<tr>
<td>8</td>
<td>37.3</td>
<td>44.8</td>
<td>65.6</td>
<td>57.5</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>30.9</td>
<td>36.0</td>
<td>60.0</td>
<td>45.2</td>
</tr>
<tr>
<td>4</td>
<td>34.3</td>
<td>40.4</td>
<td>63.6</td>
<td>52.4</td>
</tr>
<tr>
<td>8</td>
<td>35.7</td>
<td>46.1</td>
<td>64.8</td>
<td>55.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
<td rowspan="3">7B</td>
<td>0*</td>
<td>43.0</td>
<td>49.8</td>
<td>63.2</td>
<td>45.5</td>
</tr>
<tr>
<td>4</td>
<td>45.4</td>
<td>51.3</td>
<td>64.5</td>
<td>46.5</td>
</tr>
<tr>
<td>8</td>
<td>45.6</td>
<td>52.2</td>
<td>64.7</td>
<td>46.6</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
<td rowspan="3">37B</td>
<td>0</td>
<td>26.4</td>
<td>40.4</td>
<td>33.5</td>
<td>26.7</td>
</tr>
<tr>
<td>4</td>
<td>48.2</td>
<td>54.6</td>
<td>67.0</td>
<td>53.2</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td>67.8</td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
<td rowspan="2">30B</td>
<td>0</td>
<td>26.2</td>
<td>40.4</td>
<td>48.9</td>
<td>26.7</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td><strong>70.9</strong></td>
<td>54.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
<td rowspan="3">8B</td>
<td>0</td>
<td>43.9</td>
<td>33.8</td>
<td>45.4</td>
<td>23.9</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>4</td>
<td>63.6</td>
<td>60.5</td>
<td>65.5</td>
<td>50.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>8</td>
<td><strong>64.6</strong></td>
<td><strong>63.4</strong></td>
<td>68.2</td>
<td>51.4</td>
</tr>
</tbody>
</table>
</div>
* denotes zero image shots with two additional text shots, following Flamingo.
<sup>+</sup> We evaluate the pretrained checkpoint without supervised fine-tuning (SFT).
</details>
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more cases.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
</div>
</details>
We deploy MiniCPM-V 2.6 on end-side devices. The demo videos are raw screen recordings on an iPad Pro without any editing.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ai.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/beer.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ticket.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/wfh.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
</p>
</table>
### Multi-turn Conversation
<div align="center">
<img src="../assets/airplane.jpeg" width="500px">
</div>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
torch.manual_seed(0)
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
image = Image.open('./assets/airplane.jpeg').convert('RGB')
# First round chat
question = "Tell me the model of this aircraft."
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
You should get output similar to the following:
```
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
```
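The `chat` interface can also stream the answer as it is generated. Below is a minimal sketch; the `sampling` and `stream` keyword arguments are assumptions based on the interface exposed by the MiniCPM-V series, and with `stream=True` the call returns a generator of text chunks rather than a single string:
```python
# Streaming variant of the call above (assumes chat() supports the
# `sampling` and `stream` kwargs; adjust if your checkpoint differs).
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,  # enable sampling-based decoding
    stream=True     # yield text chunks instead of returning one string
)
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```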
#### Multi-image Understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### Few-shot In-Context-Learning
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 few-shot in-context-learning example </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### Video understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if CUDA runs out of memory, set a smaller number
def encode_video(video_path):
    def uniform_sample(l, n):
        # pick n indices spread uniformly over the list
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample at roughly 1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]
# Set decoding params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM occurs and the video resolution is larger than 448*448
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
</details>

docs/minicpm_v2dot6_zh.md Normal file

@@ -0,0 +1,763 @@
## MiniCPM-V 2.6
> Archived at: 2025-08-25
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It delivers a significant performance improvement over MiniCPM-Llama3-V 2.5 and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
- 🔥 **Leading Performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest OpenCompass leaderboard (a comprehensive evaluation over 8 popular multimodal benchmarks). **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single-image understanding.
- 🖼️ **Multi-image Understanding and In-context Learning.**
MiniCPM-V 2.6 also supports **multi-image conversation and reasoning**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows promising in-context learning capability.
- 🎬 **Video Understanding.**
MiniCPM-V 2.6 can also **accept video inputs**, supporting conversation and providing detailed video descriptions that cover temporal and spatial information. It outperforms **GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B** on Video-MME both with and without subtitles.
- 💪 **Strong OCR Capability and More.**
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it exhibits **trustworthy multimodal behavior**, with a hallucination rate on Object HalBench significantly lower than GPT-4o and GPT-4V, and supports **multiple languages** including English, Chinese, German, French, Italian, and Korean.
- 🚀 **Superior Efficiency.**
Beyond its user-friendly size, MiniCPM-V 2.6 also shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8-megapixel image, 75% fewer than most models**, which improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 supports efficient **real-time video understanding** on end-side devices such as the iPad.
- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) quantized models in 16 sizes, (3) [vLLM](#vllm-部署-) support for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#本地-webui-demo-), and (6) an online web [demo](http://120.92.209.146:8887/).
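As a quick start for option (2), here is a minimal sketch of loading the int4-quantized checkpoint; it assumes the Hugging Face repo linked above keeps the same `chat` interface as the bf16 model:
```python
# Minimal sketch (assumptions: the int4 repo id linked above, and an unchanged
# chat() interface; device placement of the quantized weights may differ).
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')  # hypothetical local image
msgs = [{'role': 'user', 'content': [image, 'What is in this image?']}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```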
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/radar_final.png width=90% />
</div>
<details>
<summary>Click to view detailed single-image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, and Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>MME</th>
<th>MMVet</th>
<th>OCRBench</th>
<th>MMMU val</th>
<th>MathVista mini</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>HallusionBench</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="15" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o</td>
<td>-</td>
<td>1088</td>
<td>69.9</td>
<td>2328.7</td>
<td>69.1</td>
<td>736</td>
<td>69.2</td>
<td>61.3</td>
<td>82.2</td>
<td>84.6</td>
<td>-</td>
<td>92.8</td>
<td>55.0</td>
<td>17.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>1920.0</td>
<td>66.0</td>
<td>788</td>
<td>65.9</td>
<td>61.6</td>
<td>78.5</td>
<td>80.2</td>
<td>-</td>
<td>95.2</td>
<td>49.9</td>
<td>13.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>2110.6</td>
<td>64.0</td>
<td>754</td>
<td>60.6</td>
<td>57.7</td>
<td>73.9</td>
<td>79.1</td>
<td>73.5</td>
<td>86.5</td>
<td>45.6</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o mini</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>2003.4</td>
<td>66.9</td>
<td>785</td>
<td>60.0</td>
<td>52.4</td>
<td>76.0</td>
<td>77.8</td>
<td>-</td>
<td>-</td>
<td>46.1</td>
<td>12.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>1088</td>
<td>63.5</td>
<td>2070.2</td>
<td>67.5</td>
<td>656</td>
<td>61.7</td>
<td>54.7</td>
<td>79.8</td>
<td>78.6</td>
<td>78.0</td>
<td>87.2</td>
<td>43.9</td>
<td>14.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Step-1V</td>
<td>-</td>
<td>-</td>
<td>59.5</td>
<td>2206.4</td>
<td>63.3</td>
<td>625</td>
<td>49.9</td>
<td>44.8</td>
<td>78.0</td>
<td>79.2</td>
<td>71.6</td>
<td>-</td>
<td>48.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
<td>-</td>
<td>784</td>
<td>58.3</td>
<td>2281.7</td>
<td>61.8</td>
<td>684</td>
<td>52.0</td>
<td>43.4</td>
<td>74.6</td>
<td>75.7</td>
<td>79.5</td>
<td>93.1</td>
<td>41.2</td>
<td>13.4</td>
</tr>
<tr>
<td colspan="15" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
<td>34B</td>
<td>157</td>
<td>55.0</td>
<td>2006.5</td>
<td>50.7</td>
<td>574</td>
<td>48.8</td>
<td>40.4</td>
<td>77.8</td>
<td>78.9</td>
<td>69.3</td>
<td>-</td>
<td>34.8</td>
<td>12.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
<td>34B</td>
<td>157</td>
<td>-</td>
<td>2141</td>
<td>59.3</td>
<td>518</td>
<td>48.0</td>
<td>43.3</td>
<td>-</td>
<td>80.5</td>
<td>74.1</td>
<td>78.9</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td>1820</td>
<td>58.3</td>
<td>2049.9</td>
<td>53.2</td>
<td>591</td>
<td>50.4</td>
<td>50.3</td>
<td>77.8</td>
<td>79.5</td>
<td>76.7</td>
<td>75.5</td>
<td>41.6</td>
<td>14.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>2018.8</td>
<td>58.0</td>
<td>776</td>
<td>46.9</td>
<td>51.1</td>
<td>67.9</td>
<td>71.2</td>
<td>-</td>
<td>-</td>
<td>45.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>706</td>
<td>64.1</td>
<td>2215.1</td>
<td>54.3</td>
<td>794</td>
<td><strong>51.2</strong></td>
<td>58.3</td>
<td><strong>79.4</strong></td>
<td><strong>83.6</strong></td>
<td>77.4</td>
<td><strong>91.6</strong></td>
<td>45.0</td>
<td>21.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
<td>8B</td>
<td>1882</td>
<td>58.8</td>
<td>2024.6</td>
<td>52.8</td>
<td>725</td>
<td>45.8</td>
<td>54.3</td>
<td>72.0</td>
<td>78.4</td>
<td>76.6</td>
<td>84.8</td>
<td>42.4</td>
<td>10.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>65.2</strong></td>
<td><strong>2348.4</strong>*</td>
<td><strong>60.0</strong></td>
<td><strong>852</strong>*</td>
<td>49.8*</td>
<td><strong>60.6</strong></td>
<td>78.0</td>
<td>82.1</td>
<td><strong>80.1</strong></td>
<td>90.8</td>
<td><strong>48.1</strong>*</td>
<td><strong>8.2</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate these benchmarks using chain-of-thought prompting.
<sup>+</sup> Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, token density is estimated from the official API's image-encoding pricing, which gives an upper bound.
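As a concrete check of this definition, the arithmetic below reproduces the 2822 token density reported for MiniCPM-V 2.6, assuming the 1344x1344 example maximum resolution mentioned earlier:
```python
# Token density = (# pixels at maximum resolution) / (# visual tokens)
max_pixels = 1344 * 1344      # ~1.8M pixels
visual_tokens = 640           # visual tokens MiniCPM-V 2.6 produces for such an image
print(round(max_pixels / visual_tokens))  # -> 2822, matching the table
```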
</details>
<details>
<summary>Click to view detailed multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, and MIRB.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Mantis Eval</th>
<th>BLINK val</th>
<th>Mathverse mv</th>
<th>Sciverse mv</th>
<th>MIRB</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>60.3</td>
<td>66.9</td>
<td>53.1</td>
</tr>
<tr>
<td colspan="7" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
<td>14B</td>
<td>66.4</td>
<td>52.6</td>
<td>32.7</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Emu2-Chat</td>
<td>37B</td>
<td>37.8</td>
<td>36.2</td>
<td>-</td>
<td>27.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM</td>
<td>17B</td>
<td>45.2</td>
<td>41.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VPG-C</td>
<td>7B</td>
<td>52.4</td>
<td>43.1</td>
<td>24.3</td>
<td>23.1</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA 8B</td>
<td>8B</td>
<td>51.2</td>
<td>39.3</td>
<td>-</td>
<td>36.5</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>53.1*</td>
<td>48.9</td>
<td>32.1*</td>
<td>-</td>
<td>42.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>59.0*</td>
<td>50.9</td>
<td>30.5*</td>
<td>34.4*</td>
<td><strong>56.9*</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>69.1</strong></td>
<td><strong>53.0</strong></td>
<td><strong>84.9</strong></td>
<td><strong>74.9</strong></td>
<td>53.8</td>
</tr>
</tbody>
</table>
</div>
* Evaluation results of the officially released model weights.
</details>
<details>
<summary>Click to view detailed video results on Video-MME and Video-ChatGPT.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th colspan="2">Video-MME</th>
<th colspan="5">Video-ChatGPT</th>
</tr>
<tr>
<th align="left"></th>
<th></th>
<th>w/o subs</th>
<th>w subs</th>
<th>Correctness</th>
<th>Detail</th>
<th>Context</th>
<th>Temporal</th>
<th>Consistency</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="9" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>60.0</td>
<td>62.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>59.9</td>
<td>63.3</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
<td>7B</td>
<td>-</td>
<td>-</td>
<td>3.39</td>
<td>3.29</td>
<td>3.92</td>
<td>2.60</td>
<td>3.12</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
<td>34B</td>
<td>-</td>
<td>-</td>
<td>3.29</td>
<td>3.23</td>
<td>3.83</td>
<td>2.51</td>
<td>3.47</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
<td>12B</td>
<td>-</td>
<td>-</td>
<td>3.49</td>
<td><strong>3.46</strong></td>
<td>3.23</td>
<td><strong>2.98</strong></td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>52.4</td>
<td>54.3</td>
<td>3.05</td>
<td>3.09</td>
<td>3.77</td>
<td>2.44</td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>54.0</td>
<td>56.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>55.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
<td>32B</td>
<td>60.2</td>
<td>63.0</td>
<td>3.48</td>
<td>3.37</td>
<td><strong>3.95</strong></td>
<td>2.64</td>
<td>3.28</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>60.9</strong></td>
<td><strong>63.6</strong></td>
<td><strong>3.59</strong></td>
<td>3.28</td>
<td>3.93</td>
<td>2.73</td>
<td>3.62</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view detailed few-shot results on TextVQA, VizWiz, VQAv2, and OK-VQA.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Shot</th>
<th>TextVQA val</th>
<th>VizWiz test-dev</th>
<th>VQAv2 test-dev</th>
<th>OK-VQA val</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>35.0</td>
<td>31.6</td>
<td>56.3</td>
<td>40.6</td>
</tr>
<tr>
<td>4</td>
<td>36.5</td>
<td>39.6</td>
<td>63.1</td>
<td><strong>57.4</strong></td>
</tr>
<tr>
<td>8</td>
<td>37.3</td>
<td>44.8</td>
<td>65.6</td>
<td>57.5</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>30.9</td>
<td>36.0</td>
<td>60.0</td>
<td>45.2</td>
</tr>
<tr>
<td>4</td>
<td>34.3</td>
<td>40.4</td>
<td>63.6</td>
<td>52.4</td>
</tr>
<tr>
<td>8</td>
<td>35.7</td>
<td>46.1</td>
<td>64.8</td>
<td>55.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
<td rowspan="3">7B</td>
<td>0*</td>
<td>43.0</td>
<td>49.8</td>
<td>63.2</td>
<td>45.5</td>
</tr>
<tr>
<td>4</td>
<td>45.4</td>
<td>51.3</td>
<td>64.5</td>
<td>46.5</td>
</tr>
<tr>
<td>8</td>
<td>45.6</td>
<td>52.2</td>
<td>64.7</td>
<td>46.6</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
<td rowspan="3">37B</td>
<td>0</td>
<td>26.4</td>
<td>40.4</td>
<td>33.5</td>
<td>26.7</td>
</tr>
<tr>
<td>4</td>
<td>48.2</td>
<td>54.6</td>
<td>67.0</td>
<td>53.2</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td>67.8</td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
<td rowspan="2">30B</td>
<td>0</td>
<td>26.2</td>
<td>40.4</td>
<td>48.9</td>
<td>26.7</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td><strong>70.9</strong></td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
<td rowspan="3">8B</td>
<td>0</td>
<td>43.9</td>
<td>33.8</td>
<td>45.4</td>
<td>23.9</td>
</tr>
<tr>
<td>4</td>
<td>63.6</td>
<td>60.5</td>
<td>65.5</td>
<td>50.1</td>
</tr>
<tr>
<td>8</td>
<td><strong>64.6</strong></td>
<td><strong>63.4</strong></td>
<td>68.2</td>
<td>51.4</td>
</tr>
</tbody>
</table>
</div>
* Zero-shot performance is evaluated with zero image shots and two additional text shots, following Flamingo.
<sup>+</sup> We evaluate the pretrained checkpoint without supervised fine-tuning (SFT).
</details>
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more cases.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
</div>
</details>
We deploy MiniCPM-V 2.6 on an iPad Pro and record the following demo videos.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ai.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/beer.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
</p>
</table>

docs/minicpm_v4_en.md Normal file

@@ -0,0 +1,556 @@
## MiniCPM-V 4.0
> Archived at: 2025-08-25
**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. The model is built on SigLIP2-400M and MiniCPM4-3B with a total of 4.1B parameters. It inherits the strong single-image, multi-image, and video understanding performance of MiniCPM-V 2.6 with greatly improved efficiency. Notable features of MiniCPM-V 4.0 include:
- 🔥 **Leading Visual Capability.**
With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2), and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also shows good performance in multi-image and video understanding.
- 🚀 **Superior Efficiency.**
Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it delivers **a first-token delay of less than 2 s and a decoding speed of more than 17 tokens/s on an iPhone 16 Pro Max**, without heating problems. It also shows superior throughput under concurrent requests.
- 💫 **Easy Usage.**
MiniCPM-V 4.0 can be easily used in various ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and a local web demo**. We also open-source an iOS app that runs on iPhone and iPad. Get started easily with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), featuring detailed instructions and practical examples; a minimal Transformers sketch is shown below.
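As a local starting point, here is a minimal Hugging Face Transformers sketch; the repo id 'openbmb/MiniCPM-V-4' and the `chat` interface are assumptions carried over from earlier models in the MiniCPM-V series, so check the Cookbook for the exact usage:
```python
# Minimal sketch (assumed repo id and chat() interface, following the
# conventions used by earlier MiniCPM-V models).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')  # hypothetical local image
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```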
### Evaluation <!-- omit in toc -->
<details>
<summary>Click to view single image results on OpenCompass. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista</th>
<th>HallusionBench</th>
<th>MMMU</th>
<th>MMVet</th>
<th>MMBench V1.1</th>
<th>MMStar</th>
<th>AI2D</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>63.5</td>
<td>656</td>
<td>55.2</td>
<td>43.9</td>
<td>61.7</td>
<td>67.5</td>
<td>79.8</td>
<td>56.0</td>
<td>78.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>64.5</td>
<td>754</td>
<td>58.3</td>
<td>45.6</td>
<td>60.6</td>
<td>64.0</td>
<td>73.9</td>
<td>59.1</td>
<td>79.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>68.9</td>
<td>840</td>
<td>70.9</td>
<td>49.3</td>
<td>55.0</td>
<td>74.3</td>
<td>80.9</td>
<td>60.9</td>
<td>76.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>70.6</td>
<td>798</td>
<td>65.3</td>
<td>55.5</td>
<td>66.4</td>
<td>70.1</td>
<td>81.7</td>
<td>65.1</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>64.5</td>
<td>828</td>
<td>61.2</td>
<td>46.6</td>
<td>51.2</td>
<td>60.0</td>
<td>76.8</td>
<td>56.3</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>65.1</td>
<td>820</td>
<td>60.8</td>
<td>46.6</td>
<td>51.8</td>
<td>61.5</td>
<td>78.2</td>
<td>58.7</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>70.9</td>
<td>888</td>
<td>68.1</td>
<td>51.9</td>
<td>58.0</td>
<td>69.7</td>
<td>82.2</td>
<td>64.1</td>
<td>84.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>68.1</td>
<td>821</td>
<td>64.5</td>
<td>49.0</td>
<td>56.2</td>
<td>62.8</td>
<td>82.5</td>
<td>63.2</td>
<td>84.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>65.2</td>
<td>852</td>
<td>60.8</td>
<td>48.1</td>
<td>49.8</td>
<td>60.0</td>
<td>78.0</td>
<td>57.5</td>
<td>82.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>70.2</td>
<td>889</td>
<td>73.3</td>
<td>51.1</td>
<td>50.9</td>
<td>67.2</td>
<td>80.6</td>
<td>63.3</td>
<td>86.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>69.0</td>
<td>894</td>
<td>66.9</td>
<td>50.8</td>
<td>51.2</td>
<td>68.0</td>
<td>79.7</td>
<td>62.8</td>
<td>82.9</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view single image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench and MM Halbench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>ChartQA</th>
<th>MME</th>
<th>RealWorldQA</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>MathVision</th>
<th>DynaMath</th>
<th>WeMath</th>
<th colspan="2">Obj Hal</th>
<th colspan="2">MM Hal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>CHAIRs↓</td>
<td>CHAIRi↓</td>
<td nowrap="nowrap">score avg@3</td>
<td nowrap="nowrap">hall rate avg@3</td>
</tr>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>78.5</td>
<td>1927</td>
<td>61.4</td>
<td>78.0</td>
<td>88.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>87.2</td>
<td>-</td>
<td>67.5</td>
<td>78.8</td>
<td>93.1</td>
<td>41.0</td>
<td>31.5</td>
<td>50.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>45.3</td>
<td>47.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>90.8</td>
<td>-</td>
<td>60.1</td>
<td>74.1</td>
<td>95.2</td>
<td>35.6</td>
<td>35.7</td>
<td>44.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>84.0</td>
<td>2157</td>
<td>65.4</td>
<td>79.3</td>
<td>93.9</td>
<td>21.9</td>
<td>13.2</td>
<td>22.9</td>
<td>18.3</td>
<td>10.8</td>
<td>3.9 </td>
<td>33.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>84.0</td>
<td>2338</td>
<td>64.3</td>
<td>76.8</td>
<td>91.6</td>
<td>18.4</td>
<td>15.2</td>
<td>21.2</td>
<td>13.7</td>
<td>8.7</td>
<td>3.2 </td>
<td>46.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>87.3</td>
<td>2347</td>
<td>68.5</td>
<td>84.9</td>
<td>95.7</td>
<td>25.4</td>
<td>21.8</td>
<td>36.2</td>
<td>13.3</td>
<td>7.9</td>
<td>4.1 </td>
<td>31.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>84.8</td>
<td>2344</td>
<td>70.1</td>
<td>79.1</td>
<td>93.0</td>
<td>17.0</td>
<td>9.4</td>
<td>23.5</td>
<td>18.3</td>
<td>11.6</td>
<td>3.6 </td>
<td>37.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>79.4</td>
<td>2348</td>
<td>65.0</td>
<td>80.1</td>
<td>90.8</td>
<td>17.5</td>
<td>9.0</td>
<td>20.4</td>
<td>7.3</td>
<td>4.7</td>
<td>4.0 </td>
<td>29.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>86.9</td>
<td>2372</td>
<td>68.1</td>
<td>82.0</td>
<td>93.5</td>
<td>21.7</td>
<td>10.4</td>
<td>25.2</td>
<td>6.3</td>
<td>3.4</td>
<td>4.1 </td>
<td>31.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>84.4</td>
<td>2298</td>
<td>68.5</td>
<td>80.8</td>
<td>92.9</td>
<td>20.7</td>
<td>14.2</td>
<td>32.7</td>
<td>6.3</td>
<td>3.5</td>
<td>4.1 </td>
<td>29.2 </td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multi-image and video understanding results on Mantis, Blink and Video-MME. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>Mantis</th>
<th>Blink</th>
<th nowrap="nowrap" colspan="2" >Video-MME</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>w/o subs</td>
<td>w subs</td>
</tr>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>59.9</td>
<td>63.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>-</td>
<td>59.1</td>
<td>75.0</td>
<td>81.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>-</td>
<td>68.0</td>
<td>71.9</td>
<td>77.2</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>-</td>
<td>47.6</td>
<td>61.5</td>
<td>67.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>62.7</td>
<td>50.8</td>
<td>62.3</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>-</td>
<td>56.4</td>
<td>65.1</td>
<td>71.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>67.7</td>
<td>54.8</td>
<td>64.2</td>
<td>66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>69.1</td>
<td>53.0</td>
<td>60.9</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>71.9</td>
<td>56.7</td>
<td>63.9</td>
<td>69.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>71.4</td>
<td>54.0</td>
<td>61.2</td>
<td>65.8</td>
</tr>
</tbody>
</table>
</div>
</details>
### Examples
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
</div>
We deploy MiniCPM-V 4.0 on an iPhone 16 Pro Max with the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md). The demo videos are raw screen recordings without any editing.
<table align="center">
<p align="center">
<img src="../assets/minicpmv4/iphone_en.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
</p>
<p align="center">
<img src="../assets/minicpmv4/iphone_cn.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
</p>
</table>

docs/minicpm_v4_zh.md Normal file

@@ -0,0 +1,557 @@
## MiniCPM-V 4.0
> Archived at: 2025-08-25
MiniCPM-V 4.0 is the latest model in the MiniCPM-V series. The model is built on SigLIP2-400M and MiniCPM4-3B with a total of 4.1B parameters. It retains the strong single-image, multi-image, and video understanding capabilities of MiniCPM-V 2.6 while greatly improving inference efficiency. Notable features of MiniCPM-V 4.0 include:
- 🔥 **Leading Visual Capability.**
MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, surpassing MiniCPM-V 2.6 (8.1B, 65.2), Qwen2.5-VL-3B-Instruct (3.8B, 64.5), and **the widely used proprietary model GPT-4.1-mini-20250414**. It also performs well on multi-image and video understanding tasks.
- 🚀 **Superior Efficiency.**
Optimized for end-side devices, MiniCPM-V 4.0 **runs smoothly on an iPhone 16 Pro Max, with first-token latency as low as 2 s and a decoding speed of 17.9 tokens/s**, without heating problems. It also delivers leading throughput under concurrent requests.
- 💫 **Easy Usage.**
MiniCPM-V 4.0 supports multiple inference options, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and a local web demo**. We also open-source an iOS app that runs on iPhone and iPad. See our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for detailed deployment guides and practical examples; a minimal Transformers sketch is shown below.
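As a local starting point, here is a minimal Hugging Face Transformers sketch; the repo id 'openbmb/MiniCPM-V-4' and the `chat` interface are assumptions carried over from earlier models in the MiniCPM-V series, so check the Cookbook for the exact usage:
```python
# Minimal sketch (assumed repo id and chat() interface, following the
# conventions used by earlier MiniCPM-V models).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')  # hypothetical local image
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```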
### Evaluation <!-- omit in toc -->
<details>
<summary>Click to view single-image understanding results on OpenCompass.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista</th>
<th>HallusionBench</th>
<th>MMMU</th>
<th>MMVet</th>
<th>MMBench V1.1</th>
<th>MMStar</th>
<th>AI2D</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>63.5</td>
<td>656</td>
<td>55.2</td>
<td>43.9</td>
<td>61.7</td>
<td>67.5</td>
<td>79.8</td>
<td>56.0</td>
<td>78.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>64.5</td>
<td>754</td>
<td>58.3</td>
<td>45.6</td>
<td>60.6</td>
<td>64.0</td>
<td>73.9</td>
<td>59.1</td>
<td>79.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>68.9</td>
<td>840</td>
<td>70.9</td>
<td>49.3</td>
<td>55.0</td>
<td>74.3</td>
<td>80.9</td>
<td>60.9</td>
<td>76.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>70.6</td>
<td>798</td>
<td>65.3</td>
<td>55.5</td>
<td>66.4</td>
<td>70.1</td>
<td>81.7</td>
<td>65.1</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>64.5</td>
<td>828</td>
<td>61.2</td>
<td>46.6</td>
<td>51.2</td>
<td>60.0</td>
<td>76.8</td>
<td>56.3</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>65.1</td>
<td>820</td>
<td>60.8</td>
<td>46.6</td>
<td>51.8</td>
<td>61.5</td>
<td>78.2</td>
<td>58.7</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>70.9</td>
<td>888</td>
<td>68.1</td>
<td>51.9</td>
<td>58.0</td>
<td>69.7</td>
<td>82.2</td>
<td>64.1</td>
<td>84.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>68.1</td>
<td>821</td>
<td>64.5</td>
<td>49.0</td>
<td>56.2</td>
<td>62.8</td>
<td>82.5</td>
<td>63.2</td>
<td>84.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>65.2</td>
<td>852</td>
<td>60.8</td>
<td>48.1</td>
<td>49.8</td>
<td>60.0</td>
<td>78.0</td>
<td>57.5</td>
<td>82.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>70.2</td>
<td>889</td>
<td>73.3</td>
<td>51.1</td>
<td>50.9</td>
<td>67.2</td>
<td>80.6</td>
<td>63.3</td>
<td>86.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>69.0</td>
<td>894</td>
<td>66.9</td>
<td>50.8</td>
<td>51.2</td>
<td>68.0</td>
<td>79.7</td>
<td>62.8</td>
<td>82.9</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view results on chart understanding, document understanding, math reasoning, hallucination, and related benchmarks. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>ChartQA</th>
<th>MME</th>
<th>RealWorldQA</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>MathVision</th>
<th>DynaMath</th>
<th>WeMath</th>
<th colspan="2">Obj Hal</th>
<th colspan="2">MM Hal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>CHAIRs↓</td>
<td>CHAIRi↓</td>
<td nowrap="nowrap">score avg@3</td>
<td nowrap="nowrap">hall rate avg@3</td>
</tr>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>78.5</td>
<td>1927</td>
<td>61.4</td>
<td>78.0</td>
<td>88.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>87.2</td>
<td>-</td>
<td>67.5</td>
<td>78.8</td>
<td>93.1</td>
<td>41.0</td>
<td>31.5</td>
<td>50.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>45.3</td>
<td>47.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>90.8</td>
<td>-</td>
<td>60.1</td>
<td>74.1</td>
<td>95.2</td>
<td>35.6</td>
<td>35.7</td>
<td>44.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>84.0</td>
<td>2157</td>
<td>65.4</td>
<td>79.3</td>
<td>93.9</td>
<td>21.9</td>
<td>13.2</td>
<td>22.9</td>
<td>18.3</td>
<td>10.8</td>
<td>3.9 </td>
<td>33.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>84.0</td>
<td>2338</td>
<td>64.3</td>
<td>76.8</td>
<td>91.6</td>
<td>18.4</td>
<td>15.2</td>
<td>21.2</td>
<td>13.7</td>
<td>8.7</td>
<td>3.2 </td>
<td>46.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>87.3</td>
<td>2347</td>
<td>68.5</td>
<td>84.9</td>
<td>95.7</td>
<td>25.4</td>
<td>21.8</td>
<td>36.2</td>
<td>13.3</td>
<td>7.9</td>
<td>4.1 </td>
<td>31.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>84.8</td>
<td>2344</td>
<td>70.1</td>
<td>79.1</td>
<td>93.0</td>
<td>17.0</td>
<td>9.4</td>
<td>23.5</td>
<td>18.3</td>
<td>11.6</td>
<td>3.6 </td>
<td>37.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>79.4</td>
<td>2348</td>
<td>65.0</td>
<td>80.1</td>
<td>90.8</td>
<td>17.5</td>
<td>9.0</td>
<td>20.4</td>
<td>7.3</td>
<td>4.7</td>
<td>4.0 </td>
<td>29.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>86.9</td>
<td>2372</td>
<td>68.1</td>
<td>82.0</td>
<td>93.5</td>
<td>21.7</td>
<td>10.4</td>
<td>25.2</td>
<td>6.3</td>
<td>3.4</td>
<td>4.1 </td>
<td>31.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>84.4</td>
<td>2298</td>
<td>68.5</td>
<td>80.8</td>
<td>92.9</td>
<td>20.7</td>
<td>14.2</td>
<td>32.7</td>
<td>6.3</td>
<td>3.5</td>
<td>4.1 </td>
<td>29.2 </td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multi-image and video understanding results. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>Mantis</th>
<th>Blink</th>
<th nowrap="nowrap" colspan="2" >Video-MME</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>w/o subs</td>
<td>w subs</td>
</tr>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>59.9</td>
<td>63.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>-</td>
<td>59.1</td>
<td>75.0</td>
<td>81.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>-</td>
<td>68.0</td>
<td>71.9</td>
<td>77.2</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>-</td>
<td>47.6</td>
<td>61.5</td>
<td>67.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>62.7</td>
<td>50.8</td>
<td>62.3</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>-</td>
<td>56.4</td>
<td>65.1</td>
<td>71.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>67.7</td>
<td>54.8</td>
<td>64.2</td>
<td>66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>69.1</td>
<td>53.0</td>
<td>60.9</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>71.9</td>
<td>56.7</td>
<td>63.9</td>
<td>69.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>71.4</td>
<td>54.0</td>
<td>61.2</td>
<td>65.8</td>
</tr>
</tbody>
</table>
</div>
</details>
### Examples
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
</div>
We deploy the MiniCPM-V 4.0 [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md) on an iPhone 16 Pro Max and record the following screen recordings. The videos are not sped up or otherwise edited:
<table align="center">
<p align="center">
<img src="../assets/minicpmv4/iphone_en.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
</p>
<p align="center">
<img src="../assets/minicpmv4/iphone_cn.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
</p>
</table>