## MiniCPM-V 2.0
> Archive at: 2025-01-13
**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.

- 🔥 **State-of-the-art Performance.**

  MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc.) among models under 7B parameters. It even **outperforms the strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **performance comparable to Gemini Pro in scene-text understanding** and **state-of-the-art performance on OCRBench** among open-source models.

- 🏆 **Trustworthy Behavior.**

  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench (see the preference-learning sketch after this list).

- 🌟 **High-Resolution Images at Any Aspect Ratio.**

  MiniCPM-V 2.0 can accept **images of up to 1.8 million pixels (e.g., 1344x1344) at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, and is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf) (a slicing sketch follows this list).

- ⚡️ **High Efficiency.**

  MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into far fewer tokens via a perceiver resampler (see the resampler sketch after this list). This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference, even when dealing with high-resolution images**.

- 🙌 **Bilingual Support.**

  MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].

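For intuition on the alignment step above, here is a minimal sketch of a DPO-style preference loss of the kind multimodal RLHF pipelines such as RLHF-V optimize: the policy is pushed to prefer a human-corrected response over a hallucinating one, relative to a frozen reference model. The function name and the pair construction are illustrative assumptions, not this repository's actual training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style objective (illustrative): inputs are summed log-probs of
    the corrected (chosen) and hallucinating (rejected) responses for the
    same image + prompt, under the policy and a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the reward margin of the corrected response over the hallucination.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy call with fabricated log-probabilities:
loss = preference_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                       torch.tensor([-13.0]), torch.tensor([-14.2]))
```
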
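The any-aspect-ratio input can be pictured as an adaptive slicing step in the spirit of LLaVA-UHD: keep the image within a pixel budget, then choose a slice grid whose shape best matches the image's aspect ratio. The per-slice size, slice cap, and scoring heuristic below are illustrative assumptions, not the model's exact configuration.

```python
import math

SLICE_SIDE = 448  # assumed per-slice resolution fed to the vision encoder

def choose_grid(width: int, height: int, max_slices: int = 9):
    """Pick a (cols, rows) slice grid whose aspect ratio is closest to the
    input image's, capped at max_slices slices."""
    target = width / height
    n = min(max_slices, max(1, math.ceil(width * height / SLICE_SIDE ** 2)))
    best, best_err = (1, 1), float("inf")
    for cols in range(1, n + 1):
        rows = max(1, round(n / cols))
        err = abs(math.log((cols / rows) / target))  # log-ratio distance
        if err < best_err:
            best, best_err = (cols, rows), err
    return best

print(choose_grid(1344, 896))  # wide 1.5:1 image -> (3, 2) grid
```
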
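The perceiver resampler that connects the vision encoder to the LLM compresses a variable-length sequence of patch features into a small, fixed set of tokens by cross-attending learnable queries to the image features. A minimal PyTorch sketch follows; the dimensions, query count, and single-layer design are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress n_patches image features into num_queries tokens via
    cross-attention, so the LLM sees a fixed, small number of image
    tokens regardless of input resolution."""
    def __init__(self, dim: int = 1152, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, n_patches, dim) from the vision encoder
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)  # queries attend to patches
        return self.norm(out)  # (batch, num_queries, dim)

feats = torch.randn(1, 1024, 1152)    # fake SigLip-style patch features
tokens = PerceiverResampler()(feats)  # -> torch.Size([1, 64, 1152])
```
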
### Evaluation <!-- omit in toc -->

<div align="center">
<img src="../assets/minicpmv-2-peformance.png" width="66%" />
</div>

<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench.</summary>
<div align="center">

<table style="margin: 0px auto;">
<thead>
<tr><th align="left">Model</th><th>Size</th><th>TextVQA val</th><th>DocVQA test</th><th>OCRBench</th><th>OpenCompass</th><th nowrap="nowrap">MME</th><th>MMB dev (en)</th><th>MMB dev (zh)</th><th>MMMU val</th><th>MathVista</th><th>LLaVA Bench</th><th nowrap="nowrap">Object HalBench</th></tr>
</thead>
<tbody align="center">
<tr><td colspan="13" align="left"><strong>Proprietary models</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Gemini Pro Vision</td><td>-</td><td>74.6</td><td>88.1</td><td>680</td><td>63.8</td><td>2148.9</td><td>75.2</td><td>74.0</td><td>48.9</td><td>45.8</td><td>79.9</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">GPT-4V</td><td>-</td><td>78.0</td><td>88.4</td><td>645</td><td>63.2</td><td>1771.5</td><td>75.1</td><td>75.0</td><td>53.8</td><td>47.8</td><td>93.1</td><td>86.4 / 92.7</td></tr>
<tr><td colspan="13" align="left"><strong>Open-source models 6B~34B</strong></td></tr>
<tr><td nowrap="nowrap" align="left">Yi-VL-6B</td><td align="right">6.7B</td><td>45.5*</td><td>17.1*</td><td>290</td><td>49.3</td><td>1915.1</td><td>68.6</td><td>68.3</td><td>40.3</td><td>28.8</td><td>51.9</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Qwen-VL-Chat</td><td align="right">9.6B</td><td>61.5</td><td>62.6</td><td>488</td><td>52.1</td><td>1860.0</td><td>60.6</td><td>56.7</td><td>37.0</td><td>33.8</td><td>67.7</td><td>56.2 / 80.0</td></tr>
<tr><td nowrap="nowrap" align="left">Yi-VL-34B</td><td align="right">34B</td><td>43.4*</td><td>16.9*</td><td>290</td><td>52.6</td><td>2050.2</td><td>71.1</td><td>71.4</td><td>45.1</td><td>30.7</td><td>62.3</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">DeepSeek-VL-7B</td><td align="right">7.3B</td><td>64.7*</td><td>47.0*</td><td>435</td><td>55.6</td><td>1765.4</td><td>74.1</td><td>72.8</td><td>38.3</td><td>36.8</td><td>77.8</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">TextMonkey</td><td align="right">9.7B</td><td>64.3</td><td>66.7</td><td>558</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">CogVLM-Chat</td><td align="right">17.4B</td><td>70.4</td><td>33.3*</td><td>590</td><td>52.5</td><td>1736.6</td><td>63.7</td><td>53.8</td><td>37.3</td><td>34.7</td><td>73.9</td><td>73.6 / 87.4</td></tr>
<tr><td colspan="13" align="left"><strong>Open-source models 1B~3B</strong></td></tr>
<tr><td nowrap="nowrap" align="left">DeepSeek-VL-1.3B</td><td align="right">1.7B</td><td>58.4*</td><td>37.9*</td><td>413</td><td>46.0</td><td>1531.6</td><td>64.0</td><td>61.2</td><td>33.8</td><td>29.4</td><td>51.1</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">MobileVLM V2</td><td align="right">3.1B</td><td>57.5</td><td>19.4*</td><td>-</td><td>-</td><td>1440.5(P)</td><td>63.2</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">Mini-Gemini</td><td align="right">2.2B</td><td>56.2</td><td>34.2*</td><td>-</td><td>-</td><td>1653.0</td><td>59.8</td><td>-</td><td>31.7</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td nowrap="nowrap" align="left">MiniCPM-V</td><td align="right">2.8B</td><td>60.6</td><td>38.2</td><td>366</td><td>47.6</td><td>1650.2</td><td>67.9</td><td>65.3</td><td><strong>38.3</strong></td><td>28.9</td><td>51.3</td><td>78.4 / 88.5</td></tr>
<tr><td nowrap="nowrap" align="left"><strong>MiniCPM-V 2.0</strong></td><td align="right">2.8B</td><td><strong>74.1</strong></td><td><strong>71.9</strong></td><td><strong>605</strong></td><td><strong>55.0</strong></td><td><strong>1808.6</strong></td><td><strong>69.6</strong></td><td><strong>68.1</strong></td><td>38.2</td><td><strong>38.7</strong></td><td><strong>69.2</strong></td><td><strong>85.5 / 92.2</strong></td></tr>
</tbody>
</table>

</div>
* We evaluate the officially released checkpoint by ourselves.
</details>

### Examples <!-- omit in toc -->

<table align="center">
<p align="center">
  <img src="../assets/minicpmv2-cases_2.png" width=95%/>
</p>
</table>

We deploy MiniCPM-V 2.0 on end devices. The demo video below is a raw screen recording on a Xiaomi 14 Pro, without editing.

<table align="center">
<p align="center">
  <img src="../assets/gif_cases/station.gif" width=36%/>
  <img src="../assets/gif_cases/london_car.gif" width=36%/>
</p>
</table>

### Model Zoo

| Model | Device | Memory | Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
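
For a quick test of the GPU checkpoint, the sketch below loads MiniCPM-V 2.0 with 🤗 Transformers via `trust_remote_code` and calls the checkpoint's custom `chat` helper. Argument names and return values may differ slightly across model revisions, so treat this as a starting point rather than a pinned API.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')  # any local test image
msgs = [{'role': 'user', 'content': 'What is in the image?'}]

# The remote-code `chat` helper handles image slicing and prompt templating.
answer, context, _ = model.chat(image=image, msgs=msgs, context=None,
                                tokenizer=tokenizer, sampling=True, temperature=0.7)
print(answer)
```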