Update to MiniCPM-Llama3-V 2.5

README_en.md
@@ -1,30 +1,31 @@
<div align="center">

<img src="./assets/minicpmv-omnilmm.png" width="400em" ></img>
<img src="./assets/minicpmv.png" width="300em" ></img>

**Large multi-modal models for strong performance and efficient deployment**
**A GPT-4V Level Multimodal LLM on Your Phone**

<strong>[中文](./README.md) |
English</strong>

<p align="center">
MiniCPM-Llama3-V 2.5 <a href="https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/">🤗</a> <a href="http://120.92.209.146:8889/">🤖</a> |
MiniCPM-V 2.0 <a href="https://huggingface.co/openbmb/MiniCPM-V-2/">🤗</a> <a href="https://huggingface.co/spaces/openbmb/MiniCPM-V-2">🤖</a> |
OmniLMM-12B <a href="https://huggingface.co/openbmb/OmniLMM-12B/">🤗</a> <a href="http://120.92.209.146:8081">🤖</a> | <a href="https://openbmb.vercel.app/minicpm-v-2-en"> Technical Blog </a>
<a href="https://openbmb.vercel.app/minicpm-v-2-en"> Technical Blog </a>
</p>

</div>

**MiniCPM-V** and **OmniLMM** are a family of open-source large multimodal models (LMMs) adept at vision & language modeling. The models process image and text inputs and deliver high-quality text outputs. We release two featured versions targeted at **strong performance and efficient deployment**:
**MiniCPM-V** is a series of end-side multimodal LLMs designed for image-text understanding. These models accept image and text inputs and provide high-quality text outputs. Since February 2024, we have released four versions of the model, aiming to achieve **strong performance and efficient deployment**. The most noteworthy models in this series currently include:

- **MiniCPM-V 2.8B**: State-of-the-art end-side large multimodal model. Our latest MiniCPM-V 2.0 accepts images of up to 1.8 million pixels (e.g., 1344x1344) at any aspect ratio and has strong OCR capability. It achieves performance comparable to Gemini Pro in scene-text understanding and matches GPT-4V in preventing hallucinations.

- **OmniLMM 12B**: The most capable version, with leading performance among comparable-sized models on multiple benchmarks. The model also achieves state-of-the-art performance in trustworthy behaviors, with even less hallucination than GPT-4V.
- **MiniCPM-Llama3-V 2.5**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 in overall performance. Its OCR and instruction-following capabilities have been further enhanced. The model supports multimodal interaction in over 30 languages, including English, Chinese, French, Spanish, German, etc. Equipped with model quantization, efficient CPU and NPU inference, and compilation optimizations, MiniCPM-Llama3-V 2.5 can be efficiently deployed on edge devices.

- **MiniCPM-V 2.0**: The lightest model in the MiniCPM-V series. With 2B parameters, it surpasses larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It accepts image inputs of any aspect ratio up to 1.8 million pixels (e.g., 1344x1344), achieves performance comparable to Gemini Pro in scene-text understanding, and matches GPT-4V in preventing hallucinations.

## News <!-- omit in toc -->

* [2024.05.20] We open-source MiniCPM-Llama3-V 2.5! It has improved OCR capability and supports 30+ languages, making it the first edge-side multimodal LLM to achieve GPT-4V-level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md). Try it now!
* [2024.04.23] MiniCPM-V-2.0 now supports vLLM! Click [here](#vllm) for more details.
* [2024.04.18] We created a Hugging Face Space to host the demo of MiniCPM-V 2.0 [here](https://huggingface.co/spaces/openbmb/MiniCPM-V-2)!
* [2024.04.17] MiniCPM-V-2.0 now supports deploying a [WebUI Demo](#webui-demo)!
@@ -38,8 +39,8 @@
## Contents <!-- omit in toc -->

- [MiniCPM-V 2.8B](#minicpm-v-28b)
- [OmniLMM-12B](#omnilmm-12b)
- [MiniCPM-Llama3-V 2.5](#minicpm-llama3-v-25)
- [MiniCPM-V 2.0](#minicpm-v-20)
- [Online Demo](#online-demo)
- [Install](#install)
- [Inference](#inference)
@@ -48,13 +49,334 @@
- [Inference on Mac](#inference-on-mac)
- [Deployment on Mobile Phone](#deployment-on-mobile-phone)
- [WebUI Demo](#webui-demo)
- [Finetune](#finetune)
- [Inference with vLLM](#inference-with-vllm)
- [Fine-tuning](#fine-tuning)
- [TODO](#todo)
- [Citation](#citation)

## MiniCPM-Llama3-V 2.5

## MiniCPM-V 2.8B

**MiniCPM-V 2.8B** is an efficient version with promising performance for deployment. The model is built on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

- 🔥 **Leading Performance.**
  MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With 8B parameters, it surpasses widely used proprietary models such as GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max**, greatly outperforming other multimodal LLMs built on Llama 3.

- 💪 **Strong OCR Capabilities.**
  MiniCPM-Llama3-V 2.5 can process images with any aspect ratio up to 1.8 million pixels, achieving a **score of 700+ on OCRBench and surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex-reasoning abilities, improving the multimodal interaction experience (a minimal usage sketch follows this list).

- 🏆 **Trustworthy Behavior.**
  Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits trustworthy multimodal behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best level within the open-source community.

- 🌏 **Multilingual Support.**
  Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its foundational bilingual (Chinese-English) multimodal capabilities to support **30+ languages, including German, French, Spanish, Italian, Russian, etc.** We achieve this extension with only minimal instruction tuning on translated multimodal data. [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).

- 🚀 **Efficient Deployment.**
  MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations** as acceleration techniques, achieving high-efficiency deployment on edge devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150-fold acceleration in edge-side image encoding for multimodal large models** and a **3-fold increase in language decoding speed**.

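As a quick illustration of the OCR-oriented capabilities above, the sketch below drives MiniCPM-Llama3-V 2.5 through the `OmniLMMChat` helper used in the multi-turn conversation example later in this README; the image path and prompt are placeholders you can swap for your own.

```python
import json

from chat import OmniLMMChat, img2base64

# Load MiniCPM-Llama3-V 2.5 through the repository's chat wrapper.
chat_model = OmniLMMChat('openbmb/MiniCPM-Llama3-V-2_5')

# Ask for full-text OCR extraction plus table-to-markdown conversion.
im_64 = img2base64('./assets/hk_OCR.jpg')
msgs = [{"role": "user", "content": "Extract all text in this image. Convert any tables to Markdown."}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)
```
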
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/MiniCPM-Llama3-V-2.5-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench. </summary>

| Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB test (en) | MMB test (cn) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
|:----------------------|:------:|:--------:|:-----------:|:-----------:|:-----------:|:------:|:-------------:|:-------------:|:--------:|:---------:|:-----------:|:------------:|:---------------:|
| **Proprietary** | | | | | | | | | | | | | |
| Gemini Pro | - | 680 | 74.6 | 88.1 | 62.9 | 2148.9 | 73.6 | 74.3 | 48.9 | 45.8 | 79.9 | 60.4 | - |
| GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.5 | 1771.5 | 77.0 | 74.4 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
| **Open-source** | | | | | | | | | | | | | |
| Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | - | - | 31.7 | - | - | - | - |
| Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 51.6 | 1860.0 | 61.8 | 56.3 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 |
| DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 54.6 | 1765.4 | 73.8 | 71.4 | 38.3 | 36.8 | 77.8 | 54.2 | |
| Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.2 | 2050.2 | 72.4 | 70.7 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 |
| CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 54.2 | 1736.6 | 65.8 | 55.9 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 |
| TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
| IDEFICS2-8B | 8.0B | - | 73.0 | 74.0 | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
| Bunny-LLama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
| LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | - | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
| MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.5 | 1650.2 | 64.1 | 62.6 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 |
| MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 54.5 | 1808.6 | 69.1 | 66.5 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 |
| MiniCPM-Llama3-V 2.5 | 8.5B | **725** | **76.6** | **84.8** | **65.1** | 2024.6 | **77.2** | **74.2** | **45.8** | **54.3** | **86.7** | **63.5** | **89.7** |

* We evaluate the officially released checkpoint by ourselves.

</details>

<div align="center">
<img src="assets/llavabench_compare.png" width="66%" />
<br>
Evaluation results of LLaVABench in multiple languages
</div>

### Examples <!-- omit in toc -->

<table align="center" >
<p align="center" >
<img src="assets/minicpmv-llama3-v2.5/cases_all.png" />
</p>
</table>

We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is a raw screen recording on a Xiaomi 14 Pro, played at double speed.

<table align="center">
<p align="center">
<img src="assets/gif_cases/ticket.gif" width=32%/>
<img src="assets/gif_cases/meal_plan.gif" width=32%/>
</p>
</table>

<table align="center">
<p align="center">
<img src="assets/gif_cases/1-4.gif" width=64%/>
</p>
</table>

## MiniCPM-V 2.0

<details>
<summary>Click to view more details of MiniCPM-V 2.0</summary>

**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.

- 🔥 **State-of-the-art Performance.**

@@ -72,252 +394,10 @@
  MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into far fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference, even when dealing with high-resolution images** (see the illustrative resampler sketch after this list).

- 🙌 **Bilingual Support.**

  MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].

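To make the token-compression idea above concrete, here is an illustrative PyTorch sketch of a perceiver-resampler-style module: a small set of learned query vectors cross-attends to the image-patch features, so the language model only ever sees a fixed, much smaller number of visual tokens. Layer sizes, depth and names are assumptions for illustration, not the exact MiniCPM-V module.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative sketch: compress a variable number of image-patch features
    into a fixed number of visual tokens via cross-attention.
    Dimensions and layer choices are assumptions, not MiniCPM-V's exact config."""

    def __init__(self, dim=1152, num_queries=64, num_heads=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats):                        # patch_feats: (B, N_patches, dim)
        b = patch_feats.size(0)
        x = self.queries.unsqueeze(0).expand(b, -1, -1)    # (B, num_queries, dim)
        for attn in self.layers:
            # Learned queries attend to all image patches; output length stays num_queries.
            out, _ = attn(x, patch_feats, patch_feats)
            x = x + out
        return self.norm(x)                                # (B, num_queries, dim), fed to the LLM

# Example: many patch tokens from a high-resolution image compressed to 64 tokens.
feats = torch.randn(1, 1835, 1152)
tokens = PerceiverResampler()(feats)
print(tokens.shape)  # torch.Size([1, 64, 1152])
```
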
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/minicpmv-2-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench. </summary>

| Model | Size | TextVQA val | DocVQA test | OCRBench | OpenCompass | MME | MMB dev (en) | MMB dev (zh) | MMMU val | MathVista | LLaVA Bench | Object HalBench |
|:----------------------|:------:|:-----------:|:-----------:|:--------:|:-----------:|:----------:|:------------:|:------------:|:--------:|:---------:|:-----------:|:---------------:|
| **Proprietary models** | | | | | | | | | | | | |
| Gemini Pro Vision | - | 74.6 | 88.1 | 680 | 63.8 | 2148.9 | 75.2 | 74.0 | 48.9 | 45.8 | 79.9 | - |
| GPT-4V | - | 78.0 | 88.4 | 645 | 63.2 | 1771.5 | 75.1 | 75.0 | 53.8 | 47.8 | 93.1 | 86.4 / 92.7 |
| **Open-source models 6B~34B** | | | | | | | | | | | | |
| Yi-VL-6B | 6.7B | 45.5* | 17.1* | 290 | 49.3 | 1915.1 | 68.6 | 68.3 | 40.3 | 28.8 | 51.9 | - |
| Qwen-VL-Chat | 9.6B | 61.5 | 62.6 | 488 | 52.1 | 1860.0 | 60.6 | 56.7 | 37.0 | 33.8 | 67.7 | 56.2 / 80.0 |
| Yi-VL-34B | 34B | 43.4* | 16.9* | 290 | 52.6 | 2050.2 | 71.1 | 71.4 | 45.1 | 30.7 | 62.3 | - |
| DeepSeek-VL-7B | 7.3B | 64.7* | 47.0* | 435 | 55.6 | 1765.4 | 74.1 | 72.8 | 38.3 | 36.8 | 77.8 | - |
| TextMonkey | 9.7B | 64.3 | 66.7 | 558 | - | - | - | - | - | - | - | - |
| CogVLM-Chat | 17.4B | 70.4 | 33.3* | 590 | 52.5 | 1736.6 | 63.7 | 53.8 | 37.3 | 34.7 | 73.9 | 73.6 / 87.4 |
| **Open-source models 1B~3B** | | | | | | | | | | | | |
| DeepSeek-VL-1.3B | 1.7B | 58.4* | 37.9* | 413 | 46.0 | 1531.6 | 64.0 | 61.2 | 33.8 | 29.4 | 51.1 | - |
| MobileVLM V2 | 3.1B | 57.5 | 19.4* | - | - | 1440.5 (P) | 63.2 | - | - | - | - | - |
| Mini-Gemini | 2.2B | 56.2 | 34.2* | - | - | 1653.0 | 59.8 | - | 31.7 | - | - | - |
| MiniCPM-V | 2.8B | 60.6 | 38.2 | 366 | 47.6 | 1650.2 | 67.9 | 65.3 | **38.3** | 28.9 | 51.3 | 78.4 / 88.5 |
| **MiniCPM-V 2.0** | 2.8B | **74.1** | **71.9** | **605** | **55.0** | **1808.6** | **69.6** | **68.1** | 38.2 | **38.7** | **69.2** | **85.5 / 92.2** |

* We evaluate the officially released checkpoint by ourselves.

</details>

### Examples <!-- omit in toc -->

<table align="center">
@@ -335,157 +415,19 @@ We deploy MiniCPM-V 2.0 on end devices. The demo video is the raw screen recordi
</p>
</table>

### MiniCPM-V 1.0 <!-- omit in toc -->
Please see the information about MiniCPM-V 1.0 [here](./minicpm_v1.md).

## OmniLMM-12B
**OmniLMM-12B** is the most capable version. The model is built on EVA02-5B and Zephyr-7B-β, connected with a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:

- 🔥 **Strong Performance.**

  OmniLMM-12B achieves **leading performance** among models with comparable sizes, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.). The model also embodies rich multimodal world knowledge.

- 🏆 **Trustworthy Behavior.**

  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., faithfully describing non-existing objects in images). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) technique). It **ranks #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench), and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).

- 🕹 **Real-time Multimodal Interaction.**

  We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.

### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/radar_omnilmm12b.png width=66% />
</div>
<details>
<summary>Click to view results on MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench, MathVista. </summary>

| Model | Size | MME | MMB dev (en) | MMMU val | MMHal-Bench | Object HalBench | SeedBench-I | MathVista | LLaVA Bench |
|:----------------------|:------:|:------:|:------------:|:--------:|:-----------:|:---------------:|:-----------:|:---------:|:-----------:|
| GPT-4V† | - | 1771.5 | 75.1 | 56.8 | 3.53 / 70.8 | 86.4 / 92.7 | 71.6 | 47.8 | 93.1 |
| Qwen-VL-Plus† | - | 2183.4 | 66.2 | 45.2 | - | - | 65.7 | 36.0 | 73.7 |
| Yi-VL 6B | 6.7B | 1915.1 | 68.6 | 40.3 | - | - | 67.5 | 28.8 | 51.9 |
| Qwen-VL-Chat | 9.6B | 1860.0 | 60.6 | 35.9 | 2.93 / 59.4 | 56.2 / 80.0 | 64.8 | 33.8 | 67.7 |
| CogVLM-Chat | 17.4B | 1736.6 | 63.7 | 32.1 | 2.68 / 52.1 | 73.6 / 87.4 | 68.8 | 34.7 | 73.9 |
| LLaVA 1.5 | 13.6B | 1808.4 | 68.2 | 36.4 | 2.71 / 51.0 | 53.7 / 77.4 | 68.1 | 26.4 | 64.6 |
| **OmniLMM-12B** | 11.6B | 1935.8 | 71.6 | 40.7 | 3.45 / 68.8 | 90.3 / 95.5 | 71.1 | 34.9 | 72.0 |

<small>†: Proprietary models</small>
<br>
</details>

### Examples <!-- omit in toc -->
## Legacy Models <!-- omit in toc -->

<table align="center" >
<p align="center" >
<img src="assets/omnilmm-12b-examples_2.png" />
</p>
</table>

| Model | Introduction and Guidance |
|:----------------------|:-------------------:|
| MiniCPM-V 1.0 | [Document](./minicpm_v1.md) |
| OmniLMM-12B | [Document](./omnilmm_en.md) |

We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. Video frames are described in text using OmniLMM-12B, and ChatGPT 3.5 (text-only) is employed to generate responses according to the descriptions and user prompts. The demo video is a raw recording without editing (a schematic sketch of the loop follows the video).

<div align="center" >
<video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/485a8f52-fb4d-4eca-8fee-506347efcfc6" type="video/mp4" width=80%/>
</div>

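For readers who want the shape of that loop in code, below is a minimal schematic sketch. The callables passed in (`capture_frame`, `caption_frame`, `transcribe_speech`, `text_llm`, `speak`) are hypothetical hooks for the camera, OmniLMM-12B captioning, speech recognition, the text-only LLM and text-to-speech; only the describe-then-respond structure comes from the description above.

```python
from collections import deque

def assistant_loop(capture_frame, caption_frame, transcribe_speech, text_llm, speak,
                   max_context=8):
    """Schematic sketch of the real-time assistant described above.
    All five callables are hypothetical hooks supplied by the caller."""
    descriptions = deque(maxlen=max_context)       # rolling text descriptions of recent frames
    while True:
        frame = capture_frame()                    # grab the latest camera frame
        descriptions.append(caption_frame(frame))  # OmniLMM-12B describes the frame in text
        user_query = transcribe_speech()           # speech recognition; None when the user is silent
        if user_query:
            prompt = ("Recent scene descriptions:\n"
                      + "\n".join(descriptions)
                      + f"\nUser: {user_query}")
            speak(text_llm(prompt))                # text-only GPT-3.5 answers from the descriptions
```
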
## Online Demo
Click here to try out the demo of [MiniCPM-V 2.0](http://120.92.209.146:80/) and [OmniLMM-12B](http://120.92.209.146:8081).
Click here to try out the demo of [MiniCPM-Llama3-V 2.5](http://120.92.209.146:8889/) | [MiniCPM-V 2.0](http://120.92.209.146:80).

## Install
@@ -514,9 +456,10 @@ pip install -r requirements.txt
### Model Zoo
| Model | Description | Download Link |
|:----------------------|:-------------------|:---------------:|
| MiniCPM-V 2.0 | The latest version for state-of-the-art end-side capabilities with high efficiency. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
| MiniCPM-V | The first version of MiniCPM-V. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
| OmniLMM-12B | The most capable version with leading performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
| MiniCPM-Llama3-V 2.5 | The latest version, achieving state-of-the-art edge-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5/files) |
| MiniCPM-Llama3-V 2.5 int4 | int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4/files) |
| MiniCPM-V 2.0 | Lighter version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
| MiniCPM-V 1.0 | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |

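The checkpoints above are standard Hugging Face repositories, so they can be loaded with `transformers` in the usual way. Below is a minimal sketch for the MiniCPM-Llama3-V 2.5 checkpoints on a CUDA GPU; it assumes the int4 checkpoint loads through the same `AutoModel` interface as the full-precision one (see each model card for the authoritative usage).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Full-precision checkpoint in fp16 on a CUDA GPU.
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                  trust_remote_code=True,
                                  torch_dtype=torch.float16).to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                          trust_remote_code=True)

# int4 checkpoint (assumed to share the same interface; lower GPU memory usage).
# model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4',
#                                   trust_remote_code=True).eval()
```
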
### Multi-turn Conversation
Please refer to the following code to run `MiniCPM-V` and `OmniLMM`.

@@ -529,9 +472,9 @@ Please refer to the following codes to run `MiniCPM-V` and `OmniLMM`.
```python
import json  # needed for json.dumps below
import torch
from chat import OmniLMMChat, img2base64

torch.manual_seed(0)
torch.manual_seed(20)

chat_model = OmniLMMChat('openbmb/MiniCPM-V-2') # or 'openbmb/OmniLMM-12B'
chat_model = OmniLMMChat('openbmb/MiniCPM-Llama3-V-2_5')

im_64 = img2base64('./assets/hk_OCR.jpg')

@@ -545,7 +488,7 @@ print(answer)
# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Where is this store in the image?"})
msgs.append({"role": "user", "content": "请用中文回答"})  # "Please answer in Chinese"

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
@@ -555,27 +498,27 @@ print(answer)
We can obtain the following results:

```
"You should go to the Canon store for a camera."
"You should go to the Nikon store, as indicated by the neon sign on the right side of the image."

"The Canon store is located on the right side of the image."
"你应该去到尼康店,正如指示在图片的右侧。"
```

(The last reply is in Chinese, as requested: "You should go to the Nikon store, as indicated on the right side of the image.")

### Inference on Mac
<details>
<summary>Click to view an example of running MiniCPM-V 2.0 on 💻 Mac with MPS (Apple silicon or AMD GPUs). </summary>
<summary>Click to view an example of running MiniCPM-Llama3-V 2.5 on 💻 Mac with MPS (Apple silicon or AMD GPUs). </summary>

```python
# test.py
# test.py  (requires more than 16 GB of memory)
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='mps', dtype=torch.float16)
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
@@ -598,7 +541,7 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
</details>

### Deployment on Mobile Phone
Currently MiniCPM-V 2.0 can be deployed on mobile phones with Android and Harmony operating systems. 🚀 Try it out [here](https://github.com/OpenBMB/mlc-MiniCPM).
MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click [here](https://github.com/OpenBMB/mlc-MiniCPM) to install the apk. Support for MiniCPM-Llama3-V 2.5 is coming soon.

### WebUI Demo
@@ -610,14 +553,11 @@ pip install -r requirements.txt
```

```shell
# For NVIDIA GPUs that support BF16 (e.g., A100, H100, RTX 3090), run:
python web_demo.py --device cuda --dtype bf16

# For NVIDIA GPUs that do NOT support BF16 (e.g., V100, T4, RTX 2080), run:
python web_demo.py --device cuda --dtype fp16
# For NVIDIA GPUs, run:
python web_demo_2.5.py --device cuda

# For Mac with MPS (Apple silicon or AMD GPUs), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo.py --device mps --dtype fp16
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
```
</details>

@@ -646,20 +586,25 @@ python examples/minicpmv_example.py
```
</details>

## Finetune
## Fine-tuning

### MiniCPM-V <!-- omit in toc -->
### Simple Fine-tuning <!-- omit in toc -->

We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs (multimodal large models). It supports the lightweight training solutions provided by PEFT and a complete Adapters Library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.
We support simple fine-tuning with Hugging Face for MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5.

Best Practices: [MiniCPM-V](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V-2](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md)
[Reference Document](./finetune/readme.md)

### With the SWIFT Framework <!-- omit in toc -->

We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete Adapters Library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.

Best Practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md)

## TODO

- [x] MiniCPM-V fine-tuning support
- [ ] OmniLMM fine-tuning support
- [ ] Code release for real-time interactive assistant

## Model License <!-- omit in toc -->