## MiniCPM-Llama3-V 2.5

> Archived at: 2025-01-13

**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

- 🔥 **Leading Performance.**
  MiniCPM-Llama3-V 2.5 achieves an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.

- 💪 **Strong OCR Capabilities.**
  MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, it now offers enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, along with stronger instruction following and complex reasoning for better multimodal interaction.

- 🏆 **Trustworthy Behavior.**
  Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best level within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).

- 🌏 **Multilingual Support.**
  Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages, including German, French, Spanish, Italian, and Korean.** [All Supported Languages](../docs/minicpm-llama-v-2-5_languages.md).

- 🚀 **Efficient Deployment.**
  MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 realizes a **150x speedup in end-side MLLM image encoding** and a **3x speedup in language decoding**.

- 💫 **Easy Usage.**
  MiniCPM-Llama3-V 2.5 can be used in a variety of ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage) (see the sketch after this list), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).
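
To make item (4) concrete, below is a minimal streaming-inference sketch using the Hugging Face checkpoint. It is a sketch under stated assumptions, not the canonical usage: it presumes a CUDA GPU that fits the fp16 weights, a placeholder local image `demo.jpg`, and the `chat` interface exposed by the checkpoint's `trust_remote_code` implementation; see the [usage documentation](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage) for the authoritative example.

```python
# Sketch: streaming chat with MiniCPM-Llama3-V 2.5 via transformers.
# Assumptions: CUDA GPU with ~19 GB memory; `demo.jpg` is a placeholder path.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16).to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('demo.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': 'Describe this image.'}]

# With stream=True, `chat` yields text chunks as they are decoded.
for chunk in model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                        sampling=True, temperature=0.7, stream=True):
    print(chunk, end='', flush=True)
```

Swapping the prompt for something like "Convert the table in this image to markdown." exercises the OCR and table-to-markdown capabilities described above.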
### Evaluation <!-- omit in toc -->

<div align="center">
<img src="../assets/MiniCPM-Llama3-V-2.5-peformance.png" width="66%" />
</div>

<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, and Object HalBench.</summary>
<div align="center">

<table style="margin: 0px auto;">
  <thead>
    <tr><th align="left">Model</th><th>Size</th><th>OCRBench</th><th>TextVQA val</th><th>DocVQA test</th><th>Open-Compass</th><th>MME</th><th>MMB test (en)</th><th>MMB test (cn)</th><th>MMMU val</th><th>Math-Vista</th><th>LLaVA Bench</th><th>RealWorld QA</th><th>Object HalBench</th></tr>
  </thead>
  <tbody align="center">
    <tr><td colspan="14" align="left"><strong>Proprietary</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Gemini Pro</td><td>-</td><td>680</td><td>74.6</td><td>88.1</td><td>62.9</td><td>2148.9</td><td>73.6</td><td>74.3</td><td>48.9</td><td>45.8</td><td>79.9</td><td>60.4</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td><td>-</td><td>645</td><td>78.0</td><td>88.4</td><td>63.5</td><td>1771.5</td><td>77.0</td><td>74.4</td><td>53.8</td><td>47.8</td><td>93.1</td><td>63.0</td><td>86.4</td></tr>
    <tr><td colspan="14" align="left"><strong>Open-source</strong></td></tr>
    <tr><td nowrap="nowrap" align="left">Mini-Gemini</td><td>2.2B</td><td>-</td><td>56.2</td><td>34.2*</td><td>-</td><td>1653.0</td><td>-</td><td>-</td><td>31.7</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Qwen-VL-Chat</td><td>9.6B</td><td>488</td><td>61.5</td><td>62.6</td><td>51.6</td><td>1860.0</td><td>61.8</td><td>56.3</td><td>37.0</td><td>33.8</td><td>67.7</td><td>49.3</td><td>56.2</td></tr>
    <tr><td nowrap="nowrap" align="left">DeepSeek-VL-7B</td><td>7.3B</td><td>435</td><td>64.7*</td><td>47.0*</td><td>54.6</td><td>1765.4</td><td>73.8</td><td>71.4</td><td>38.3</td><td>36.8</td><td>77.8</td><td>54.2</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Yi-VL-34B</td><td>34B</td><td>290</td><td>43.4*</td><td>16.9*</td><td>52.2</td><td><strong>2050.2</strong></td><td>72.4</td><td>70.7</td><td>45.1</td><td>30.7</td><td>62.3</td><td>54.8</td><td>79.3</td></tr>
    <tr><td nowrap="nowrap" align="left">CogVLM-Chat</td><td>17.4B</td><td>590</td><td>70.4</td><td>33.3*</td><td>54.2</td><td>1736.6</td><td>65.8</td><td>55.9</td><td>37.3</td><td>34.7</td><td>73.9</td><td>60.3</td><td>73.6</td></tr>
    <tr><td nowrap="nowrap" align="left">TextMonkey</td><td>9.7B</td><td>558</td><td>64.3</td><td>66.7</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Idefics2</td><td>8.0B</td><td>-</td><td>73.0</td><td>74.0</td><td>57.2</td><td>1847.6</td><td>75.7</td><td>68.6</td><td>45.2</td><td>52.2</td><td>49.1</td><td>60.7</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td><td>8.4B</td><td>-</td><td>-</td><td>-</td><td>54.3</td><td>1920.3</td><td>77.0</td><td>73.9</td><td>41.3</td><td>31.5</td><td>61.2</td><td>58.8</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td><td>8.4B</td><td>-</td><td>-</td><td>78.2</td><td>-</td><td>1971.5</td><td>-</td><td>-</td><td>41.7</td><td>37.5</td><td>80.1</td><td>60.0</td><td>-</td></tr>
    <tr><td nowrap="nowrap" align="left">Phi-3-vision-128k-instruct</td><td>4.2B</td><td>639*</td><td>70.9</td><td>-</td><td>-</td><td>1537.5*</td><td>-</td><td>-</td><td>40.4</td><td>44.5</td><td>64.2*</td><td>58.8*</td><td>-</td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 1.0</td><td>2.8B</td><td>366</td><td>60.6</td><td>38.2</td><td>47.5</td><td>1650.2</td><td>64.1</td><td>62.6</td><td>38.3</td><td>28.9</td><td>51.3</td><td>51.2</td><td>78.4</td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-V 2.0</td><td>2.8B</td><td>605</td><td>74.1</td><td>71.9</td><td>54.5</td><td>1808.6</td><td>69.1</td><td>66.5</td><td>38.2</td><td>38.7</td><td>69.2</td><td>55.8</td><td>85.5</td></tr>
    <tr style="background-color: #e6f2ff;"><td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td><td>8.5B</td><td><strong>725</strong></td><td><strong>76.6</strong></td><td><strong>84.8</strong></td><td><strong>65.1</strong></td><td>2024.6</td><td><strong>77.2</strong></td><td><strong>74.2</strong></td><td><strong>45.8</strong></td><td><strong>54.3</strong></td><td><strong>86.7</strong></td><td><strong>63.5</strong></td><td><strong>89.7</strong></td></tr>
  </tbody>
</table>

</div>

* We evaluate the officially released checkpoint ourselves.

</details>

<div align="center">
  <img src="../assets/llavabench_compare_3.png" width="100%" />
  <br>
  Evaluation results of multilingual LLaVA Bench
</div>

### Examples <!-- omit in toc -->

<table align="center">
  <p align="center">
    <img src="../assets/minicpmv-llama3-v2.5/cases_all.png" />
  </p>
</table>

### Model Zoo

| Model | Device | Memory | Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
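
The int4 build loads through the same remote-code interface as the full model. The sketch below is an assumption-laden example, not an official recipe: the repo id comes from the table above, and device placement of the quantized weights is assumed to be handled by the loading code itself (hence no explicit `.to('cuda')`); consult the int4 model card for the authoritative usage.

```python
# Sketch: load the int4-quantized checkpoint (~8 GB of GPU memory).
# Assumption: the quantized build exposes the same `chat` interface as the
# full-precision model, and the loader places weights on the GPU itself.
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5-int4'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# model.chat(image=..., msgs=..., tokenizer=tokenizer) then works as in the
# streaming sketch above.
```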