Update to MiniCPM-Llama3-V 2.5

yiranyyu
2024-05-20 12:44:33 +08:00
parent c0e39dbfe2
commit 2c75097411
27 changed files with 1944 additions and 985 deletions

802
README.md

@@ -2,32 +2,34 @@
<!-- <h1 style="color: #33A6B8; font-family: Helvetica"> OmniLMM </h1> -->
<img src="./assets/minicpmv-omnilmm.png" width="400em" ></img>
<img src="./assets/minicpmv.png" width="300em" ></img>
**性能领先且部署高效的多模态大模型**
**端侧可用的 GPT-4V 级多模态大模型**
<strong>中文 |
[English](./README_en.md)</strong>
<p align="center">
MiniCPM-Llama3-V 2.5 <a href="https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/">🤗</a> <a href="http://120.92.209.146:8889/">🤖</a> |
MiniCPM-V 2.0 <a href="https://huggingface.co/openbmb/MiniCPM-V-2/">🤗</a> <a href="https://huggingface.co/spaces/openbmb/MiniCPM-V-2">🤖</a> |
OmniLMM-12B <a href="https://huggingface.co/openbmb/OmniLMM-12B/">🤗</a> <a href="http://120.92.209.146:8081">🤖</a> |
<a href="https://openbmb.vercel.app/minicpm-v-2">MiniCPM-V 2.0 技术博客</a>
</p>
</div>
**MiniCPM-V** 和 **OmniLMM** 是面向图文理解的开源多模态大模型系列。该系列模型接受图像和文本输入,并提供高质量的文本输出。我们发布了 2 个版本的模型,旨在实现**领先的性能和高效的部署**:
**MiniCPM-V** 是面向图文理解的端侧多模态大模型系列。该系列模型接受图像和文本输入,并提供高质量的文本输出。自 2024 年 2 月以来,我们发布了 4 个版本的模型,旨在实现**领先的性能和高效的部署**,目前该系列最值得关注的模型包括:
- **MiniCPM-V 2.8B**:可在终端设备上部署的先进多模态大模型。最新发布的 MiniCPM-V 2.0 可以接受 180 万像素的任意长宽比图像输入,实现了和 Gemini Pro 相近的场景文字识别能力,以及和 GPT-4V 相匹敌的低幻觉率。
- **MiniCPM-Llama3-V 2.5**🔥🔥🔥 MiniCPM-V系列的最新、性能最佳模型。总参数量8B多模态综合性能超越 GPT-4V-1106、Gemini Pro、Claude 3、Qwen-VL-Max 等商用闭源模型OCR 能力及指令跟随能力进一步提升并支持超过30种语言的多模态交互。通过系统使用模型量化、CPU、NPU、编译优化等高效推理技术MiniCPM-Llama3-V 2.5 可以实现高效的终端设备部署
- **MiniCPM-V 2.0**:MiniCPM-V系列的最轻量级模型。总参数量2B,多模态综合性能超越 Yi-VL 34B、CogVLM-Chat 17B、Qwen-VL-Chat 10B 等更大参数规模的模型,可接受 180 万像素的任意长宽比图像输入,实现了和 Gemini Pro 相近的场景文字识别能力,以及和 GPT-4V 相匹敌的低幻觉率。
- **OmniLMM-12B**:相比同规模其他模型在多个基准测试中具有领先性能,实现了相比 GPT-4V 更低的幻觉率。
## 更新日志 <!-- omit in toc -->
* [2024.04.23] 我们增加了对 [vllm](#vllm) 的支持,欢迎体验
* [2024.05.20] 我们开源了 MiniCPM-Llama3-V 2.5,增强了 OCR 能力,支持 30 多种语言,并首次在端侧实现了 GPT-4V 级的多模态能力!我们提供了[高效推理](#手机端部署)和[简易微调](./finetune/readme.md)的支持,欢迎试用
* [2024.04.23] 我们增加了对 [vLLM](#vllm) 的支持,欢迎体验!
* [2024.04.18] 我们在 HuggingFace Space 新增了 MiniCPM-V 2.0 的 [demo](https://huggingface.co/spaces/openbmb/MiniCPM-V-2),欢迎体验!
* [2024.04.17] MiniCPM-V 2.0 现在支持用户部署本地 [WebUI Demo](#本地webui-demo部署) 了,欢迎试用!
* [2024.04.15] MiniCPM-V 2.0 现在可以通过 SWIFT 框架 [微调](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md) 了,支持流式输出!
@@ -39,9 +41,10 @@
## 目录 <!-- omit in toc -->
- [MiniCPM-V 2.8B](#minicpm-v-28b)
- [OmniLMM-12B](#omnilmm-12b)
- [MiniCPM-Llama3-V 2.5](#minicpm-llama3-v-25)
- [性能评估](#性能评估)
- [典型示例](#典型示例)
- [MiniCPM-V 2.0](#minicpm-v-20)
- [Online Demo](#online-demo)
- [安装](#安装)
- [推理](#推理)
@@ -50,14 +53,334 @@
- [Mac 推理](#mac-推理)
- [手机端部署](#手机端部署)
- [本地WebUI Demo部署](#本地webui-demo部署)
- [vLLM 部署 ](#vllm-部署-)
- [微调](#微调)
- [未来计划](#未来计划)
- [引用](#引用)
## MiniCPM-V 2.8B
## MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** 是 MiniCPM-V 系列的最新版本模型,基于 SigLip-400M 和 Llama3-8B-Instruct 构建,共 8B 参数量,相较于 MiniCPM-V 2.0 性能取得较大幅度提升。MiniCPM-Llama3-V 2.5 值得关注的特点包括:
**MiniCPM-V 2.8B**可以高效部署到终端设备。该模型基于 SigLip-400M 和 [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/) 构建,通过 perceiver resampler 连接。最新发布的 MiniCPM-V 2.0 的特点包括:
- 🔥 **领先的性能。**
MiniCPM-Llama3-V 2.5 在综合了 11 个主流多模态大模型评测基准的 OpenCompass 榜单上平均得分 65.1**以 8B 量级的大小超过了 GPT-4V-1106、Gemini Pro、Claude 3、Qwen-VL-Max 等主流商用闭源多模态大模型**大幅超越基于Llama 3构建的其他多模态大模型。
- 💪 **优秀的 OCR 能力。**
MiniCPM-Llama3-V 2.5 可接受 180 万像素的任意宽高比图像输入,**OCRBench 得分达到 725,超越 GPT-4o、GPT-4V、Gemini Pro、Qwen-VL-Max 等商用闭源模型**,达到最佳水平。基于近期用户反馈建议,MiniCPM-Llama3-V 2.5 增强了全文 OCR 信息提取、表格图像转 markdown 等高频实用能力,并且进一步加强了指令跟随、复杂推理能力,带来更好的多模态交互体感。
- 🏆 **可信行为。**
借助最新的 [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) 对齐技术([RLHF-V](https://github.com/RLHF-V/) [CVPR'24] 系列的最新技术),MiniCPM-Llama3-V 2.5 具有更加可信的多模态行为,在 Object HalBench 上的幻觉率降低到了 **10.3%**,显著低于 GPT-4V-1106 (13.6%),达到开源社区最佳水平。
- 🌏 **多语言支持。**
得益于 Llama 3 强大的多语言能力和 VisCPM 的跨语言泛化技术,MiniCPM-Llama3-V 2.5 在中英双语多模态能力的基础上,仅通过少量翻译的多模态数据的指令微调,高效泛化支持了**德语、法语、西班牙语、意大利语、俄语等 30+ 种语言**的多模态能力,并表现出了良好的多语言多模态对话性能。[查看所有支持语言](./assets/minicpm-llama-v-2-5_languages.md)
- 🚀 **高效部署。**
MiniCPM-Llama3-V 2.5 较为系统地通过**模型量化、CPU、NPU、编译优化**等高效加速技术,实现高效的终端设备部署。对于高通芯片的移动手机,我们首次将 NPU 加速框架 QNN 整合进了 llama.cpp。经过系统优化后MiniCPM-Llama3-V 2.5 实现了多模态大模型端侧**语言解码速度 3 倍加速**、**图像编码 150 倍加速**的巨大提升。
### 性能评估
<div align="center">
<img src="assets/MiniCPM-Llama3-V-2.5-peformance.png" width="66%" />
</div>
<details>
<summary>TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench上的详细评测结果。 </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>OCRBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>Open-Compass</th>
<th>MME</th>
<th>MMB test (en)</th>
<th>MMB test (cn)</th>
<th>MMMU val</th>
<th>Math-Vista</th>
<th>LLaVA Bench</th>
<th>RealWorld QA</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro</td>
<td>-</td>
<td>680</td>
<td>74.6</td>
<td>88.1</td>
<td>62.9</td>
<td>2148.9</td>
<td>73.6</td>
<td>74.3</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>60.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td>
<td>-</td>
<td>645</td>
<td>78.0</td>
<td>88.4</td>
<td>63.5</td>
<td>1771.5</td>
<td>77.0</td>
<td>74.4</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>63.0</td>
<td>86.4</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini</td>
<td>2.2B</td>
<td>-</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>1653.0</td>
<td>-</td>
<td>-</td>
<td>31.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
<td>9.6B</td>
<td>488</td>
<td>61.5</td>
<td>62.6</td>
<td>51.6</td>
<td>1860.0</td>
<td>61.8</td>
<td>56.3</td>
<td>37.0</td>
<td>33.8</td>
<td>67.7</td>
<td>49.3</td>
<td>56.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL-7B</td>
<td>7.3B</td>
<td>435</td>
<td>64.7*</td>
<td>47.0*</td>
<td>54.6</td>
<td>1765.4</td>
<td>73.8</td>
<td>71.4</td>
<td>38.3</td>
<td>36.8</td>
<td>77.8</td>
<td>54.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Yi-VL-34B</td>
<td>34B</td>
<td>290</td>
<td>43.4*</td>
<td>16.9*</td>
<td>52.2</td>
<td>2050.2</td>
<td>72.4</td>
<td>70.7</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>54.8</td>
<td>79.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM-Chat</td>
<td>17.4B</td>
<td>590</td>
<td>70.4</td>
<td>33.3*</td>
<td>54.2</td>
<td>1736.6</td>
<td>65.8</td>
<td>55.9</td>
<td>37.3</td>
<td>34.7</td>
<td>73.9</td>
<td>60.3</td>
<td>73.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">TextMonkey</td>
<td>9.7B</td>
<td>558</td>
<td>64.3</td>
<td>66.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">IDEFICS2-8B</td>
<td>8.0B</td>
<td>-</td>
<td>73.0</td>
<td>74.0</td>
<td>57.2</td>
<td>1847.6</td>
<td>75.7</td>
<td>68.6</td>
<td>45.2</td>
<td>52.2</td>
<td>49.1</td>
<td>60.7</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>54.3</td>
<td>1920.3</td>
<td>77.0</td>
<td>73.9</td>
<td>41.3</td>
<td>31.5</td>
<td>61.2</td>
<td>58.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>1971.5</td>
<td>-</td>
<td>-</td>
<td>41.7</td>
<td>-</td>
<td>80.1</td>
<td>60.0</td>
<td>-</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 1.0</td>
<td>2.8B</td>
<td>366</td>
<td>60.6</td>
<td>38.2</td>
<td>47.5</td>
<td>1650.2</td>
<td>64.1</td>
<td>62.6</td>
<td>38.3</td>
<td>28.9</td>
<td>51.3</td>
<td>51.2</td>
<td>78.4</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.0</td>
<td>2.8B</td>
<td>605</td>
<td>74.1</td>
<td>71.9</td>
<td>54.5</td>
<td>1808.6</td>
<td>69.1</td>
<td>66.5</td>
<td>38.2</td>
<td>38.7</td>
<td>69.2</td>
<td>55.8</td>
<td>85.5</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td>
<td>8.5B</td>
<td><strong>725</strong></td>
<td><strong>76.6</strong></td>
<td><strong>84.8</strong></td>
<td><strong>65.1</strong></td>
<td>2024.6</td>
<td><strong>77.2</strong></td>
<td><strong>74.2</strong></td>
<td><strong>45.8</strong></td>
<td><strong>54.3</strong></td>
<td><strong>86.7</strong></td>
<td><strong>63.5</strong></td>
<td><strong>89.7</strong></td>
</tr>
</tbody>
</table>
</div>
* 我们自己评测了正式开源的模型权重。
</details>
<div align="center">
<img src="assets/llavabench_compare.png" width="66%" />
<br>
多语言LLaVABench评测结果
</div>
### 典型示例
<table align="center">
<p align="center">
<img src="assets/minicpmv-llama3-v2.5/cases_all.png" width=95%/>
</p>
</table>
我们将 MiniCPM-Llama3-V 2.5 部署在小米 14 Pro 上并录制了以下演示视频,视频以 2 倍速播放。
<table align="center">
<p align="center">
<img src="assets/gif_cases/ticket.gif" width=32%/>
<img src="assets/gif_cases/meal_plan.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center" width=80%>
<img src="assets/gif_cases/1-4.gif" width=72%/>
</p>
</table>
## MiniCPM-V 2.0
<details>
<summary>查看 MiniCPM-V 2.0 的详细信息</summary>
**MiniCPM-V 2.0**可以高效部署到终端设备。该模型基于 SigLip-400M 和 [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/) 构建,通过 perceiver resampler 连接。其特点包括:
- 🔥 **优秀的性能。**
@@ -83,246 +406,6 @@
MiniCPM-V 2.0 **提供领先的中英双语多模态能力支持**
该能力通过 [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24] 论文中提出的多模态能力的跨语言泛化技术实现。
### 性能评估 <!-- omit in toc -->
<div align="center">
<img src=assets/minicpmv-2-peformance.png width=66% />
</div>
<details>
<summary>TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench 上的详细评测结果。 </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>OCRBench</th>
<th>OpenCompass</th>
<th nowrap="nowrap" >MME</th>
<th>MMB dev(en)</th>
<th>MMB dev(zh)</th>
<th>MMMU val</th>
<th>MathVista</th>
<th>LLaVA Bench</th>
<th nowrap="nowrap">Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="12" align="left"><strong>Proprietary models</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro Vision</td>
<td>- </td>
<td>74.6</td>
<td>88.1</td>
<td>680</td>
<td>63.8</td>
<td>2148.9</td>
<td>75.2</td>
<td>74.0</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>- </td>
<td>78.0</td>
<td>88.4</td>
<td>645</td>
<td>63.2</td>
<td>1771.5</td>
<td>75.1</td>
<td>75.0</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>86.4 / 92.7</td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 6B~34B</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-6B</td>
<td align="right" >6.7B</td>
<td>45.5*</td>
<td>17.1*</td>
<td>290</td>
<td>49.3</td>
<td>1915.1 </td>
<td>68.6 </td>
<td>68.3 </td>
<td>40.3 </td>
<td>28.8 </td>
<td>51.9 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right" >9.6B</td>
<td>61.5</td>
<td>62.6</td>
<td>488 </td>
<td>52.1 </td>
<td>1860.0 </td>
<td>60.6 </td>
<td>56.7 </td>
<td>37.0 </td>
<td>33.8 </td>
<td>67.7 </td>
<td>56.2 / 80.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-34B</td>
<td align="right" >34B</td>
<td>43.4*</td>
<td>16.9*</td>
<td>290</td>
<td>52.6 </td>
<td>2050.2</td>
<td>71.1</td>
<td>71.4</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-7B</td>
<td align="right" >7.3B</td>
<td>64.7*</td>
<td>47.0* </td>
<td>435</td>
<td>55.6 </td>
<td>1765.4 </td>
<td>74.1 </td>
<td>72.8 </td>
<td>38.3 </td>
<td>36.8</td>
<td>77.8 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >TextMonkey</td>
<td align="right" >9.7B</td>
<td>64.3</td>
<td>66.7 </td>
<td>558</td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >CogVLM-Chat</td>
<td align="right" >17.4B</td>
<td>70.4</td>
<td>33.3*</td>
<td>590 </td>
<td>52.5 </td>
<td>1736.6 </td>
<td>63.7 </td>
<td>53.8 </td>
<td>37.3 </td>
<td>34.7 </td>
<td>73.9 </td>
<td>73.6 / 87.4 </td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 1B~3B </strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-1.3B</td>
<td align="right" >1.7B</td>
<td>58.4*</td>
<td>37.9*</td>
<td>413</td>
<td>46.0 </td>
<td>1531.6 </td>
<td>64.0 </td>
<td>61.2 </td>
<td>33.8 </td>
<td>29.4 </td>
<td>51.1 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MobileVLM V2</td>
<td align="right" >3.1B</td>
<td>57.5</td>
<td>19.4*</td>
<td>-</td>
<td>-</td>
<td>1440.5(P) </td>
<td>63.2 </td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Mini-Gemini</td>
<td align="right" >2.2B</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>-</td>
<td>1653.0 </td>
<td>59.8 </td>
<td>- </td>
<td>31.7 </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MiniCPM-V</td>
<td align="right" >2.8B </td>
<td>60.6</td>
<td>38.2 </td>
<td>366</td>
<td>47.6</td>
<td>1650.2 </td>
<td>67.9 </td>
<td>65.3 </td>
<td><strong>38.3</strong></td>
<td>28.9</td>
<td>51.3 </td>
<td>78.4 / 88.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><strong>MiniCPM-V 2.0</strong></td>
<td align="right" >2.8B </td>
<td><strong>74.1</strong></td>
<td><strong>71.9</strong> </td>
<td><strong>605</strong></td>
<td><strong>55.0</strong></td>
<td><strong>1808.6</strong> </td>
<td><strong>69.6</strong> </td>
<td><strong>68.1</strong> </td>
<td>38.2 </td>
<td><strong>38.7</strong></td>
<td><strong>69.2</strong> </td>
<td><strong>85.5 / 92.2 </strong></td>
</tr>
</tbody>
</table>
</div>
* 我们自己评测了正式开源的模型权重。
</details>
### 典型示例 <!-- omit in toc -->
@@ -341,158 +424,23 @@
</p>
</table>
### MiniCPM-V 1.0 <!-- omit in toc -->
请参考[这里](./minicpm_v1.md)了解 MiniCPM-V 1.0 的信息和使用教程。
## OmniLMM-12B
**OmniLMM-12B** 是当前系列中性能最佳的版本。该模型基于 EVA02-5B 和 Zephyr-7B-β 初始化构建,使用 perceiver resampler 连接,并采用课程学习的方法在多模态数据上进行训练。该模型具有三个特点:
- 🔥 **性能领先。**
OmniLMM-12B 相比其他同规模模型在多个基准测试中取得**领先的性能**(包括 MME、MMBench、SEED-Bench 等),模型掌握了较为丰富的多模态世界知识。
- 🏆 **行为可信。**
多模态大模型的幻觉问题备受关注,模型经常生成和图像中的事实不符的文本(例如,确信地描述图片中并不存在的物体)。OmniLMM-12B 是**第一个通过多模态 RLHF 对齐的综合能力优秀的开源多模态大模型**(借助 [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] 系列技术)。该模型在 [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) 幻觉评测基准上达到**开源模型最佳水平**,并在 [Object HalBench](https://arxiv.org/abs/2312.00849) 中**优于 GPT-4V**。
- 🕹 **实时多模态交互。**
我们尝试结合OmniLMM-12B和GPT-3.5 (纯文本模型) ,实现**实时多模态交互助手**。该模型接受来自摄像头的视频流,并借助工具处理语音输入输出。虽然还很初步,我们发现该模型无需视频编辑可以**复现Gemini演示视频中的一些有趣例子**。
### 评测结果 <!-- omit in toc -->
<div align="center">
<img src=assets/radar_omnilmm12b.png width=66% />
</div>
<details>
<summary> MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench W, MathVista 上的详细评测结果。 </summary>
<table>
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>MME</th>
<th nowrap="nowrap">MMB dev (en)</th>
<th nowrap="nowrap" >MMMU val</th>
<th nowrap="nowrap" >MMHal-Bench</th>
<th nowrap="nowrap" >Object HalBench</th>
<th nowrap="nowrap" >SeedBench-I</th>
<th>MathVista</th>
<th nowrap="nowrap" >LLaVA Bench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left">GPT-4V†</td>
<td>-</td>
<td>1771.5</td>
<td>75.1 </td>
<td>56.8</td>
<td>3.53 / 70.8</td>
<td>86.4 / 92.7</td>
<td>71.6 </td>
<td>47.8 </td>
<td>93.1 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
<td>-</td>
<td>2183.4</td>
<td>66.2 </td>
<td>45.2</td>
<td>- </td>
<td>- </td>
<td>65.7 </td>
<td>36.0 </td>
<td>73.7 </td>
</tr>
<tr>
<td align="left">Yi-VL 6B</td>
<td align="right">6.7B </td>
<td>1915.1 </td>
<td>68.6 </td>
<td>40.3 </td>
<td>- </td>
<td>- </td>
<td>67.5 </td>
<td>28.8 </td>
<td>51.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right">9.6B</td>
<td>1860.0</td>
<td>60.6 </td>
<td>35.9</td>
<td>2.93 / 59.4</td>
<td>56.2 / 80.0</td>
<td>64.8 </td>
<td>33.8 </td>
<td>67.7 </td>
</tr>
<tr>
<td align="left" >CogVLM-Chat</td>
<td align="right">17.4B</td>
<td>1736.6</td>
<td>63.7 </td>
<td>32.1 </td>
<td>2.68 / 52.1 </td>
<td>73.6 / 87.4 </td>
<td>68.8 </td>
<td>34.7 </td>
<td>73.9 </td>
</tr>
<tr>
<td align="left" >LLaVA 1.5</td>
<td align="right">13.6B </td>
<td>1808.4 </td>
<td>68.2 </td>
<td>36.4 </td>
<td>2.71 / 51.0 </td>
<td>53.7 / 77.4 </td>
<td>68.1 </td>
<td>26.4 </td>
<td>64.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
<td align="right">11.6B </td>
<td>1935.8 </td>
<td>71.6 </td>
<td>40.7 </td>
<td>3.45 / 68.8 </td>
<td>90.3 / 95.5 </td>
<td>71.1 </td>
<td>34.9 </td>
<td>72.0 </td>
</tr>
</tbody>
</table>
<small>†: 闭源模型</small>
<br>
</details>
### 典型示例 <!-- omit in toc -->
<table align="center" >
<p align="center" >
<img src="assets/omnilmm-12b-examples_2.png" />
</p>
</table>
<a id='legacy-models'></a>
## 历史版本模型 <!-- omit in toc -->
我们结合 OmniLMM-12B 和 ChatGPT-3.5(纯文本模型)尝试构建**实时多模态交互助手**。OmniLMM-12B 将视频帧转为对应的图像描述,并输入给 ChatGPT-3.5 来生成对用户指令的响应。演示视频未经编辑。
| 模型 | 介绍信息和使用教程 |
|:----------------------|:-------------------:|
| MiniCPM-V 1.0 | [文档](./minicpm_v1.md) |
| OmniLMM-12B | [文档](./omnilmm.md) |
<div align="center" >
<video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/8fec13bf-bb47-4bf8-8f8c-d0b716a964ec" type="video/mp4" width=80%/>
</div>
## Online Demo
欢迎通过以下链接使用我们的网页端推理服务: [OmniLMM-12B](http://120.92.209.146:8081) [MiniCPM-V 2.0](http://120.92.209.146:80).
欢迎通过以下链接使用我们的网页端推理服务: [MiniCPM-Llama3-V 2.5](http://120.92.209.146:8889/) 和 [MiniCPM-V 2.0](http://120.92.209.146:80).
## 安装
@@ -522,14 +470,16 @@ pip install -r requirements.txt
| 模型 | 简介 | 下载链接 |
|:----------------------|:-------------------|:---------------:|
| MiniCPM-V 2.0 | 最新版本,提供高效而领先的端侧双语多模态理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
| MiniCPM-V | 第一版 MiniCPM-V | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
| OmniLMM-12B | 性能最强的版本 | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
| MiniCPM-Llama3-V 2.5 | 最新版本,提供最佳的端侧多模态理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5/files) |
| MiniCPM-Llama3-V 2.5 int4 | int4量化版更低显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4/files) |
| MiniCPM-V 2.0 | 轻量级版本,平衡计算开销和多模态理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
| MiniCPM-V 1.0 | 最轻量版本, 提供最快的推理速度。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
更多[历史版本模型](#legacy-models)
### 多轮对话
请参考以下代码使用 `MiniCPM-V``OmniLMM` 进行推理。
请参考以下代码进行推理。
<div align="center">
<img src="assets/hk_OCR.jpg" width="500px">
@@ -538,8 +488,10 @@ pip install -r requirements.txt
```python
from chat import OmniLMMChat, img2base64
import torch
torch.manual_seed(20)
chat_model = OmniLMMChat('openbmb/MiniCPM-V-2') # or 'openbmb/OmniLMM-12B'
chat_model = OmniLMMChat('openbmb/MiniCPM-Llama3-V-2_5')
im_64 = img2base64('./assets/hk_OCR.jpg')
@@ -553,7 +505,7 @@ print(answer)
# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Where is this store in the image?"})
msgs.append({"role": "user", "content": "请用中文回答"})
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
@@ -563,27 +515,27 @@ print(answer)
可以得到以下输出:
```
"You should go to the Canon store for a camera."
"You should go to the Nikon store, as indicated by the neon sign on the right side of the image."
"The Canon store is located on the right side of the image."
"你应该去到尼康店,正如指示在图片的右侧。"
```
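上面的片段取自多轮对话示例的不同位置。为方便整体查看,下面给出一份根据这些片段拼接的、可独立运行的示例草稿(其中第一轮的提问内容仅为示意,实际问题可自行替换;需要标准库 `json`):

```python
# 多轮对话完整示例(根据上方片段整理的草稿,提问内容仅为示意)
import json
import torch
from chat import OmniLMMChat, img2base64

torch.manual_seed(20)

chat_model = OmniLMMChat('openbmb/MiniCPM-Llama3-V-2_5')

im_64 = img2base64('./assets/hk_OCR.jpg')

# 第一轮对话
msgs = [{"role": "user", "content": "Where should I go to buy a camera?"}]
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# 第二轮对话:把上一轮的回答与新的问题追加到历史中
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "请用中文回答"})
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)
```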
### Mac 推理
<details>
<summary>点击查看 MiniCPM-V 2.0 基于Mac MPS运行 (Apple silicon or AMD GPUs)的示例。 </summary>
<summary>点击查看 MiniCPM-Llama3-V 2.5 / MiniCPM-V 2.0 基于 Mac MPS 运行(Apple silicon 或 AMD GPUs)的示例。 </summary>
```python
# test.py
# test.py Need more than 16GB memory to run.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='mps', dtype=torch.float16)
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()
image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
@@ -607,25 +559,22 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
### 手机端部署
MiniCPM-V 2.0 目前可以部署在Android和Harmony操作系统的手机上。 🚀 点击[这里](https://github.com/OpenBMB/mlc-MiniCPM)开始手机端部署
MiniCPM-V 2.0 可运行在 Android 手机上,点击[这里](https://github.com/OpenBMB/mlc-MiniCPM)安装 apk 使用;MiniCPM-Llama3-V 2.5 将很快推出,敬请期待。
### 本地WebUI Demo部署
<details>
<summary>点击查看本地WebUI demo在Nvidia GPU, Mac等不同设备部署方法 </summary>
<summary>点击查看本地 WebUI demo 在 NVIDIA GPU、Mac 等不同设备上的部署方法 </summary>
```shell
pip install -r requirements.txt
```
```shell
# For Nvidia GPUs support BF16 (like A100, H100, RTX3090), run:
python web_demo.py --device cuda --dtype bf16
# For Nvidia GPUs do NOT support BF16 (like V100, T4, RTX2080), run:
python web_demo.py --device cuda --dtype fp16
# For NVIDIA GPUs, run:
python web_demo_2.5.py --device cuda
# For Mac with MPS (Apple silicon or AMD GPUs), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo.py --device mps --dtype fp16
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
```
</details>
@@ -658,16 +607,21 @@ python examples/minicpmv_example.py
## 微调
### MiniCPM-V <!-- omit in toc -->
### 简易微调 <!-- omit in toc -->
我们支持使用 SWIFT 框架微调 MiniCPM-V 系列模型。SWIFT 支持近 200 种 LLM 和 MLLM(多模态大模型)的训练、推理、评测和部署。支持 PEFT 提供的轻量训练方案和完整的 Adapters 库支持的最新训练技术,如 NEFTune、LoRA+、LLaMA-PRO 等。
我们支持使用 Huggingface Transformers 库简易地微调 MiniCPM-V 2.0 和 MiniCPM-Llama3-V 2.5 模型。
参考文档:[MiniCPM-V](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V-2](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md)
[参考文档](./finetune/readme.md)
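作为补充,下面给出一份训练数据组织方式的示意(根据本次提交中 finetune/dataset.py 读取样本时使用的 `image` 与 `conversations` 字段推测;对话内部的具体字段名与格式请以 [finetune/readme.md](./finetune/readme.md) 为准):

```python
# 训练数据组织方式示意(字段细节为推测,请以 finetune/readme.md 为准)
train_samples = [
    {
        # 本地图片路径,SupervisedDataset 会用 PIL 按 RGB 读取
        "image": "path/to/example_image.jpg",
        # 与该图片对应的多轮对话
        "conversations": [
            {"role": "user", "content": "图中的招牌写了什么?"},
            {"role": "assistant", "content": "招牌上写着店铺名称。"},
        ],
    },
]
```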
### 使用 SWIFT 框架 <!-- omit in toc -->
我们支持使用 SWIFT 框架微调 MiniCPM-V 系列模型。SWIFT 支持近 200 种大语言模型和多模态大模型的训练、推理、评测和部署。支持 PEFT 提供的轻量训练方案和完整的 Adapters 库支持的最新训练技术如 NEFTune、LoRA+、LLaMA-PRO 等。
参考文档:[MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md)
## 未来计划
- [x] 支持 MiniCPM-V 系列模型微调
- [ ] 支持 OmniLMM 系列模型微调
- [ ] 实时多模态交互代码开源
@@ -676,18 +630,18 @@ python examples/minicpmv_example.py
本仓库中代码依照 Apache-2.0 协议开源
OmniLMM 模型权重的使用遵循 “[通用模型许可协议-来源说明-宣传限制-商业授权](https://github.com/OpenBMB/General-Model-License/blob/main/通用模型许可协议-来源说明-宣传限制-商业授权.md)”。
本项目中模型权重的使用遵循 “[通用模型许可协议-来源说明-宣传限制-商业授权](https://github.com/OpenBMB/General-Model-License/blob/main/通用模型许可协议-来源说明-宣传限制-商业授权.md)”。
OmniLMM 模型权重对学术研究完全开放。
本项目中模型权重对学术研究完全开放。
如需将模型用于商业用途,请联系 cpm@modelbest.cn 来获取书面授权,登记后可以免费商业使用。
## 声明 <!-- omit in toc -->
作为多模态大模型,MiniCPM-V 和 OmniLMM 通过学习大量的多模态数据来生成内容,但它无法理解、表达个人观点或价值判断,它所输出的任何内容都不代表模型开发者的观点和立场。
作为多模态大模型,MiniCPM-V 系列模型(包括 OmniLMM)通过学习大量的多模态数据来生成内容,但它无法理解、表达个人观点或价值判断,它所输出的任何内容都不代表模型开发者的观点和立场。
因此用户在使用 MiniCPM-V 和 OmniLMM 生成的内容时,应自行负责对其进行评估和验证。如果由于使用 OmniLMM 开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。
因此用户在使用本项目的系列模型生成的内容时,应自行负责对其进行评估和验证。如果由于使用本项目的系列开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。
## 机构 <!-- omit in toc -->

@@ -1,30 +1,31 @@
<div align="center">
<img src="./assets/minicpmv-omnilmm.png" width="400em" ></img>
<img src="./assets/minicpmv.png" width="300em" ></img>
**Large multi-modal models for strong performance and efficient deployment**
**A GPT-4V Level Multimodal LLM on Your Phone**
<strong>[中文](./README.md) |
English</strong>
<p align="center">
MiniCPM-Llama3-V 2.5 <a href="https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/">🤗</a> <a href="http://120.92.209.146:8889/">🤖</a> |
MiniCPM-V 2.0 <a href="https://huggingface.co/openbmb/MiniCPM-V-2/">🤗</a> <a href="https://huggingface.co/spaces/openbmb/MiniCPM-V-2">🤖</a> |
OmniLMM-12B <a href="https://huggingface.co/openbmb/OmniLMM-12B/">🤗</a> <a href="http://120.92.209.146:8081">🤖</a> | <a href="https://openbmb.vercel.app/minicpm-v-2-en"> Technical Blog </a>
<a href="https://openbmb.vercel.app/minicpm-v-2-en"> Technical Blog </a>
</p>
</div>
**MiniCPM-V** and **OmniLMM** are a family of open-source large multimodal models (LMMs) adept at vision & language modeling. The models process images and text inputs and deliver high-quality text outputs. We release two featured versions that are targeted at **strong performance and efficient deployment**:
**MiniCPM-V** is a series of end-side multimodal LLMs designed for image-text understanding. These models accept image and text inputs and provide high-quality text outputs. Since February 2024, we have released four versions of the model, aiming to achieve **strong performance and efficient deployment**. The most noteworthy models in this series currently include:
- **MiniCPM-V 2.8B**: State-of-the-art end-side large multimodal models. Our latest MiniCPM-V 2.0 can accept 1.8 million pixels (e.g., 1344x1344) images at any aspect ratio, and is adept at OCR capability. It achieves comparable performance with Gemini Pro in understanding scene-text and matches GPT-4V in preventing hallucinations.
- **OmniLMM 12B**: The most capable version with leading performance among comparable-sized models on multiple benchmarks. The model also achieves state-of-the-art performance in trustworthy behaviors, with even less hallucination than GPT-4V.
- **MiniCPM-Llama3-V 2.5**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 in overall performance. Its OCR capability and instruction-following capability have been further enhanced. The model supports multimodal interaction in over 30 languages including English, Chinese, French, Spanish, German etc. Equipped with model quantization and efficient inference technologies on CPUs, NPUs and compilation optimizations, MiniCPM-Llama3-V 2.5 can be efficiently deployed on edge devices.
- **MiniCPM-V 2.0**: The lightest model in the MiniCPM-V series. With 2B parameters, it surpasses larger-scale models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It accepts image inputs of any aspect ratio up to 1.8 million pixels (e.g., 1344x1344), achieving comparable performance with Gemini Pro in understanding scene-text and matches GPT-4V in preventing hallucinations.
## News <!-- omit in toc -->
* [2024.05.20] We open-source MiniCPM-Llama3-V 2.5, which has improved OCR capability and supports 30+ languages, representing the first edge-side multimodal LLM to achieve GPT-4V level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md), try it now!
* [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#vllm) to view more details.
* [2024.04.18] We create a HuggingFace Space to host the demo of MiniCPM-V 2.0 at [here](https://huggingface.co/spaces/openbmb/MiniCPM-V-2)!
* [2024.04.17] MiniCPM-V-2.0 supports deploying [WebUI Demo](#webui-demo) now!
@@ -38,8 +39,8 @@
## Contents <!-- omit in toc -->
- [MiniCPM-V 2.8B](#minicpm-v-28b)
- [OmniLMM-12B](#omnilmm-12b)
- [MiniCPM-Llama3-V 2.5](#minicpm-llama3-v-25)
- [MiniCPM-V 2.0](#minicpm-v-20)
- [Online Demo](#online-demo)
- [Install](#install)
- [Inference](#inference)
@@ -48,13 +49,334 @@
- [Inference on Mac](#inference-on-mac)
- [Deployment on Mobile Phone](#deployment-on-mobile-phone)
- [WebUI Demo](#webui-demo)
- [Finetune](#finetune)
- [Inference with vLLM](#inference-with-vllm)
- [Fine-tuning](#fine-tuning)
- [TODO](#todo)
- [Citation](#citation)
## MiniCPM-Llama3-V 2.5
## MiniCPM-V 2.8B
**MiniCPM-V 2.8B** is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
- 🔥 **Leading Performance.**
MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **It surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max with 8B parameters**, greatly outperforming other multimodal LLMs built on Llama 3.
- 💪 **Strong OCR Capabilities.**
MiniCPM-Llama3-V 2.5 can process images with any aspect ratio up to 1.8 million pixels, achieving a **725 score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.
- 🏆 **Trustworthy Behavior.**
Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technology in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits trustworthy multimodal behavior. It achieves **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best level within the open-source community.
- 🌏 **Multilingual Support.**
Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its foundational bilingual (Chinese-English) multimodal capabilities to support **30+ languages including German, French, Spanish, Italian, Russian etc.** We achieve this extension through only minimal instruction-tuning with translated multimodal data. [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
- 🚀 **Efficient Deployment.**
MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations** as acceleration techniques, achieving high-efficiency deployment on edge devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150-fold acceleration in multimodal large model edge-side image encoding** and a **3-fold increase in language decoding speed**.
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/MiniCPM-Llama3-V-2.5-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>OCRBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>Open-Compass</th>
<th>MME</th>
<th>MMB test (en)</th>
<th>MMB test (cn)</th>
<th>MMMU val</th>
<th>Math-Vista</th>
<th>LLaVA Bench</th>
<th>RealWorld QA</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro</td>
<td>-</td>
<td>680</td>
<td>74.6</td>
<td>88.1</td>
<td>62.9</td>
<td>2148.9</td>
<td>73.6</td>
<td>74.3</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>60.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td>
<td>-</td>
<td>645</td>
<td>78.0</td>
<td>88.4</td>
<td>63.5</td>
<td>1771.5</td>
<td>77.0</td>
<td>74.4</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>63.0</td>
<td>86.4</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini</td>
<td>2.2B</td>
<td>-</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>1653.0</td>
<td>-</td>
<td>-</td>
<td>31.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
<td>9.6B</td>
<td>488</td>
<td>61.5</td>
<td>62.6</td>
<td>51.6</td>
<td>1860.0</td>
<td>61.8</td>
<td>56.3</td>
<td>37.0</td>
<td>33.8</td>
<td>67.7</td>
<td>49.3</td>
<td>56.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL-7B</td>
<td>7.3B</td>
<td>435</td>
<td>64.7*</td>
<td>47.0*</td>
<td>54.6</td>
<td>1765.4</td>
<td>73.8</td>
<td>71.4</td>
<td>38.3</td>
<td>36.8</td>
<td>77.8</td>
<td>54.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Yi-VL-34B</td>
<td>34B</td>
<td>290</td>
<td>43.4*</td>
<td>16.9*</td>
<td>52.2</td>
<td>2050.2</td>
<td>72.4</td>
<td>70.7</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>54.8</td>
<td>79.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM-Chat</td>
<td>17.4B</td>
<td>590</td>
<td>70.4</td>
<td>33.3*</td>
<td>54.2</td>
<td>1736.6</td>
<td>65.8</td>
<td>55.9</td>
<td>37.3</td>
<td>34.7</td>
<td>73.9</td>
<td>60.3</td>
<td>73.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">TextMonkey</td>
<td>9.7B</td>
<td>558</td>
<td>64.3</td>
<td>66.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">IDEFICS2-8B</td>
<td>8.0B</td>
<td>-</td>
<td>73.0</td>
<td>74.0</td>
<td>57.2</td>
<td>1847.6</td>
<td>75.7</td>
<td>68.6</td>
<td>45.2</td>
<td>52.2</td>
<td>49.1</td>
<td>60.7</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>54.3</td>
<td>1920.3</td>
<td>77.0</td>
<td>73.9</td>
<td>41.3</td>
<td>31.5</td>
<td>61.2</td>
<td>58.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>78.2</td>
<td>-</td>
<td>1971.5</td>
<td>-</td>
<td>-</td>
<td>41.7</td>
<td>37.5</td>
<td>80.1</td>
<td>60.0</td>
<td>-</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 1.0</td>
<td>2.8B</td>
<td>366</td>
<td>60.6</td>
<td>38.2</td>
<td>47.5</td>
<td>1650.2</td>
<td>64.1</td>
<td>62.6</td>
<td>38.3</td>
<td>28.9</td>
<td>51.3</td>
<td>51.2</td>
<td>78.4</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.0</td>
<td>2.8B</td>
<td>605</td>
<td>74.1</td>
<td>71.9</td>
<td>54.5</td>
<td>1808.6</td>
<td>69.1</td>
<td>66.5</td>
<td>38.2</td>
<td>38.7</td>
<td>69.2</td>
<td>55.8</td>
<td>85.5</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td>
<td>8.5B</td>
<td><strong>725</strong></td>
<td><strong>76.6</strong></td>
<td><strong>84.8</strong></td>
<td><strong>65.1</strong></td>
<td>2024.6</td>
<td><strong>77.2</strong></td>
<td><strong>74.2</strong></td>
<td><strong>45.8</strong></td>
<td><strong>54.3</strong></td>
<td><strong>86.7</strong></td>
<td><strong>63.5</strong></td>
<td><strong>89.7</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoint by ourselves.
</details>
<div align="center">
<img src="assets/llavabench_compare.png" width="66%" />
<br>
Evaluation results of LLaVABench in multiple languages
</div>
### Examples <!-- omit in toc -->
<table align="center" >
<p align="center" >
<img src="assets/minicpmv-llama3-v2.5/cases_all.png" />
</p>
</table>
We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is the raw screen recording on a Xiaomi 14 Pro at double speed.
<table align="center">
<p align="center">
<img src="assets/gif_cases/ticket.gif" width=32%/>
<img src="assets/gif_cases/meal_plan.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<img src="assets/gif_cases/1-4.gif" width=64%/>
</p>
</table>
## MiniCPM-V 2.0
<details>
<summary>Click to view more details of MiniCPM-V 2.0</summary>
**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.
- 🔥 **State-of-the-art Performance.**
@@ -72,252 +394,10 @@
MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference even when dealing with high-resolution images**.
- 🙌 **Bilingual Support.**
MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/minicpmv-2-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>OCRBench</th>
<th>OpenCompass</th>
<th nowrap="nowrap" >MME</th>
<th>MMB dev(en)</th>
<th>MMB dev(zh)</th>
<th>MMMU val</th>
<th>MathVista</th>
<th>LLaVA Bench</th>
<th nowrap="nowrap">Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="12" align="left"><strong>Proprietary models</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro Vision</td>
<td>- </td>
<td>74.6</td>
<td>88.1</td>
<td>680</td>
<td>63.8</td>
<td>2148.9</td>
<td>75.2</td>
<td>74.0</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>- </td>
<td>78.0</td>
<td>88.4</td>
<td>645</td>
<td>63.2</td>
<td>1771.5</td>
<td>75.1</td>
<td>75.0</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>86.4 / 92.7</td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 6B~34B</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-6B</td>
<td align="right" >6.7B</td>
<td>45.5*</td>
<td>17.1*</td>
<td>290</td>
<td>49.3</td>
<td>1915.1 </td>
<td>68.6 </td>
<td>68.3 </td>
<td>40.3 </td>
<td>28.8 </td>
<td>51.9 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right" >9.6B</td>
<td>61.5</td>
<td>62.6</td>
<td>488 </td>
<td>52.1 </td>
<td>1860.0 </td>
<td>60.6 </td>
<td>56.7 </td>
<td>37.0 </td>
<td>33.8 </td>
<td>67.7 </td>
<td>56.2 / 80.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-34B</td>
<td align="right" >34B</td>
<td>43.4*</td>
<td>16.9*</td>
<td>290</td>
<td>52.6 </td>
<td>2050.2</td>
<td>71.1</td>
<td>71.4</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-7B</td>
<td align="right" >7.3B</td>
<td>64.7*</td>
<td>47.0* </td>
<td>435</td>
<td>55.6 </td>
<td>1765.4 </td>
<td>74.1 </td>
<td>72.8 </td>
<td>38.3 </td>
<td>36.8</td>
<td>77.8 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >TextMonkey</td>
<td align="right" >9.7B</td>
<td>64.3</td>
<td>66.7 </td>
<td>558</td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >CogVLM-Chat</td>
<td align="right" >17.4B</td>
<td>70.4</td>
<td>33.3*</td>
<td>590 </td>
<td>52.5 </td>
<td>1736.6 </td>
<td>63.7 </td>
<td>53.8 </td>
<td>37.3 </td>
<td>34.7 </td>
<td>73.9 </td>
<td>73.6 / 87.4 </td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 1B~3B </strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-1.3B</td>
<td align="right" >1.7B</td>
<td>58.4*</td>
<td>37.9*</td>
<td>413</td>
<td>46.0 </td>
<td>1531.6 </td>
<td>64.0 </td>
<td>61.2 </td>
<td>33.8 </td>
<td>29.4 </td>
<td>51.1 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MobileVLM V2</td>
<td align="right" >3.1B</td>
<td>57.5</td>
<td>19.4*</td>
<td>-</td>
<td>-</td>
<td>1440.5(P) </td>
<td>63.2 </td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Mini-Gemini</td>
<td align="right" >2.2B</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>-</td>
<td>1653.0 </td>
<td>59.8 </td>
<td>- </td>
<td>31.7 </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MiniCPM-V</td>
<td align="right" >2.8B </td>
<td>60.6</td>
<td>38.2 </td>
<td>366</td>
<td>47.6</td>
<td>1650.2 </td>
<td>67.9 </td>
<td>65.3 </td>
<td><strong>38.3</strong></td>
<td>28.9</td>
<td>51.3 </td>
<td>78.4 / 88.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><strong>MiniCPM-V 2.0</strong></td>
<td align="right" >2.8B </td>
<td><strong>74.1</strong></td>
<td><strong>71.9</strong> </td>
<td><strong>605</strong></td>
<td><strong>55.0</strong></td>
<td><strong>1808.6</strong> </td>
<td><strong>69.6</strong> </td>
<td><strong>68.1</strong> </td>
<td>38.2 </td>
<td><strong>38.7</strong></td>
<td><strong>69.2</strong> </td>
<td><strong>85.5 / 92.2 </strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoint by ourselves.
</details>
### Examples <!-- omit in toc -->
<table align="center">
@@ -335,157 +415,19 @@ We deploy MiniCPM-V 2.0 on end devices. The demo video is the raw screen recordi
</p>
</table>
### MiniCPM-V 1.0 <!-- omit in toc -->
Please see the info about MiniCPM-V 1.0 [here](./minicpm_v1.md).
## OmniLMM-12B
**OmniLMM-12B** is the most capable version. The model is built based on EVA02-5B and Zephyr-7B-β, connected with a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:
- 🔥 **Strong Performance.**
OmniLMM-12B achieves **leading performance** among models with comparable sizes, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc). The model also endows rich multi-modal world knowledge.
- 🏆 **Trustworthy Behavior.**
LMMs are known for suffering from hallucination, often generating text that is not factually grounded in images (e.g., faithfully describing non-existing objects in images). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) technique). It **ranks #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench), and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).
- 🕹 **Real-time Multimodal Interaction.**
We combine the OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini Demo video, without any video editing**.
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/radar_omnilmm12b.png width=66% />
</div>
<details>
<summary>Click to view results on MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench, MathVista. </summary>
<table>
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>MME</th>
<th nowrap="nowrap">MMB dev (en)</th>
<th nowrap="nowrap" >MMMU val</th>
<th nowrap="nowrap" >MMHal-Bench</th>
<th nowrap="nowrap" >Object HalBench</th>
<th nowrap="nowrap" >SeedBench-I</th>
<th>MathVista</th>
<th nowrap="nowrap" >LLaVA Bench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left">GPT-4V†</td>
<td>-</td>
<td>1771.5</td>
<td>75.1 </td>
<td>56.8</td>
<td>3.53 / 70.8</td>
<td>86.4 / 92.7</td>
<td>71.6 </td>
<td>47.8 </td>
<td>93.1 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
<td>-</td>
<td>2183.4</td>
<td>66.2 </td>
<td>45.2</td>
<td>- </td>
<td>- </td>
<td>65.7 </td>
<td>36.0 </td>
<td>73.7 </td>
</tr>
<tr>
<td align="left">Yi-VL 6B</td>
<td align="right">6.7B </td>
<td>1915.1 </td>
<td>68.6 </td>
<td>40.3 </td>
<td>- </td>
<td>- </td>
<td>67.5 </td>
<td>28.8 </td>
<td>51.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right">9.6B</td>
<td>1860.0</td>
<td>60.6 </td>
<td>35.9</td>
<td>2.93 / 59.4</td>
<td>56.2 / 80.0</td>
<td>64.8 </td>
<td>33.8 </td>
<td>67.7 </td>
</tr>
<tr>
<td align="left" >CogVLM-Chat</td>
<td align="right">17.4B</td>
<td>1736.6</td>
<td>63.7 </td>
<td>32.1 </td>
<td>2.68 / 52.1 </td>
<td>73.6 / 87.4 </td>
<td>68.8 </td>
<td>34.7 </td>
<td>73.9 </td>
</tr>
<tr>
<td align="left" >LLaVA 1.5</td>
<td align="right">13.6B </td>
<td>1808.4 </td>
<td>68.2 </td>
<td>36.4 </td>
<td>2.71 / 51.0 </td>
<td>53.7 / 77.4 </td>
<td>68.1 </td>
<td>26.4 </td>
<td>64.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
<td align="right">11.6B </td>
<td>1935.8 </td>
<td>71.6 </td>
<td>40.7 </td>
<td>3.45 / 68.8 </td>
<td>90.3 / 95.5 </td>
<td>71.1 </td>
<td>34.9 </td>
<td>72.0 </td>
</tr>
</tbody>
</table>
<small>†: Proprietary models</small>
<br>
</details>
### Examples <!-- omit in toc -->
## Legacy Models <!-- omit in toc -->
<table align="center" >
<p align="center" >
<img src="assets/omnilmm-12b-examples_2.png" />
</p>
</table>
| Model | Introduction and Guidance |
|:----------------------|:-------------------:|
| MiniCPM-V 1.0 | [Document](./minicpm_v1.md) |
| OmniLMM-12B | [Document](./omnilmm_en.md) |
We combine the OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. Video frames are described in text using OmniLMM-12B, and ChatGPT-3.5 (text-only) is employed to generate responses according to the descriptions and user prompts. The demo video is a raw recording without editing.
<div align="center" >
<video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/485a8f52-fb4d-4eca-8fee-506347efcfc6" type="video/mp4" width=80%/>
</div>
## Online Demo
Click here to try out the Demo of [MiniCPM-V 2.0](http://120.92.209.146:80/) and [OmniLMM-12B](http://120.92.209.146:8081).
Click here to try out the Demo of [MiniCPM-Llama3-V 2.5](http://120.92.209.146:8889/) and [MiniCPM-V 2.0](http://120.92.209.146:80).
## Install
@@ -514,9 +456,10 @@ pip install -r requirements.txt
### Model Zoo
| Model | Description | Download Link |
|:----------------------|:-------------------|:---------------:|
| MiniCPM-V 2.0 | The latest version for state-of-the-art end-side capabilities with high efficiency. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
| MiniCPM-V | The first version of MiniCPM-V. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
| OmniLMM-12B | The most capable version with leading performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
| MiniCPM-Llama3-V 2.5 | The latest version, achieving state-of-the-art edge-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5/files) |
| MiniCPM-Llama3-V 2.5 int4 | int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4/files) |
| MiniCPM-V 2.0 | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
| MiniCPM-V 1.0 | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
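As a rough sketch (an assumption, not taken from the docs in this commit), the int4 checkpoint should load through the same `AutoModel` interface with `trust_remote_code=True`, with quantization handled by the code shipped in the model repository; the `model.chat` call mirrors the one used in `chat.py`:

```python
# Hypothetical loading sketch for the int4 checkpoint; exact arguments and
# device handling may differ, see the model card for the authoritative usage.
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': 'What text is on the sign?'}]
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
print(answer)
```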
### Multi-turn Conversation
Please refer to the following code to run `MiniCPM-V` and `OmniLMM`.
@@ -529,9 +472,9 @@ Please refer to the following codes to run `MiniCPM-V` and `OmniLMM`.
```python
import torch
from chat import OmniLMMChat, img2base64
torch.manual_seed(0)
torch.manual_seed(20)
chat_model = OmniLMMChat('openbmb/MiniCPM-V-2') # or 'openbmb/OmniLMM-12B'
chat_model = OmniLMMChat('openbmb/MiniCPM-Llama3-V-2_5')
im_64 = img2base64('./assets/hk_OCR.jpg')
@@ -545,7 +488,7 @@ print(answer)
# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Where is this store in the image?"})
msgs.append({"role": "user", "content": "请用中文回答"})
inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
@@ -555,27 +498,27 @@ print(answer)
We can obtain the following results:
```
"You should go to the Canon store for a camera."
"You should go to the Nikon store, as indicated by the neon sign on the right side of the image."
"The Canon store is located on the right side of the image."
"你应该去到尼康店,正如指示在图片的右侧。"
```
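To continue the conversation beyond two rounds, the same pattern extends naturally into a loop; the sketch below is assembled from the snippets above and assumes `chat_model` and `im_64` are already set up as shown (the prompt strings are illustrative):

```python
# Minimal interactive multi-turn loop (sketch); reuses chat_model and im_64
# from the snippets above.
import json

msgs = []
while True:
    question = input("User (empty line to quit): ").strip()
    if not question:
        break
    msgs.append({"role": "user", "content": question})
    inputs = {"image": im_64, "question": json.dumps(msgs)}
    answer = chat_model.chat(inputs)
    # Keep the assistant reply in the history so later rounds see full context.
    msgs.append({"role": "assistant", "content": answer})
    print("Assistant:", answer)
```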
### Inference on Mac
<details>
<summary>Click to view an example, to run MiniCPM-V 2.0 on 💻 Mac with MPS (Apple silicon or AMD GPUs). </summary>
<summary>Click to view an example, to run MiniCPM-Llama3-V 2.5 on 💻 Mac with MPS (Apple silicon or AMD GPUs). </summary>
```python
# test.py
# test.py Need more than 16GB memory.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='mps', dtype=torch.float16)
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()
image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
@@ -598,7 +541,7 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
</details>
### Deployment on Mobile Phone
Currently MiniCPM-V 2.0 can be deployed on mobile phones with Android and Harmony operating systems. 🚀 Try it out [here](https://github.com/OpenBMB/mlc-MiniCPM).
MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click [here](https://github.com/OpenBMB/mlc-MiniCPM) to install the APK. MiniCPM-Llama3-V 2.5 is coming soon.
### WebUI Demo
@@ -610,14 +553,11 @@ pip install -r requirements.txt
```
```shell
# For Nvidia GPUs support BF16 (like A100, H100, RTX3090), run:
python web_demo.py --device cuda --dtype bf16
# For Nvidia GPUs do NOT support BF16 (like V100, T4, RTX2080), run:
python web_demo.py --device cuda --dtype fp16
# For NVIDIA GPUs, run:
python web_demo_2.5.py --device cuda
# For Mac with MPS (Apple silicon or AMD GPUs), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo.py --device mps --dtype fp16
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
```
</details>
@@ -646,20 +586,25 @@ python examples/minicpmv_example.py
```
</details>
## Finetune
## Fine-tuning
### MiniCPM-V <!-- omit in toc -->
### Simple Fine-tuning <!-- omit in toc -->
We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports the training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs (multimodal large models). It supports the lightweight training solutions provided by PEFT and a complete Adapters Library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.
We support simple fine-tuning with Hugging Face Transformers for MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5.
Best practices: [MiniCPM-V](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V-2](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md)
[Reference Document](./finetune/readme.md)
### With the SWIFT Framework <!-- omit in toc -->
We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports the training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete Adapters Library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.
Best practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md)
## TODO
- [x] MiniCPM-V fine-tuning support
- [ ] OmniLMM fine-tuning support
- [ ] Code release for real-time interactive assistant
## Model License <!-- omit in toc -->
BIN  assets/gif_cases/1-4.gif  (new binary file, 11 MiB)
BIN  assets/gif_cases/ticket.gif  (new binary file, 14 MiB)
(three further binary assets changed in this commit: 156 KiB, 13 MiB, 348 KiB; previews not shown)


@@ -0,0 +1,176 @@
- English
- 中文
- 한국어
- 日本語
- Deutsch
- Français
- Português
- Español
- မြန်မာဘာသာ
- ไทย
- Tiếng Việt
- Türkçe
- ܣܘܪܝܝܐ
- العربية
- हिन्दी
- বাংলা
- नेपाली
- Türkmençe
- Тоҷикӣ
- Кыргызча
- Русский
- Українська
- Беларуская
- ქართული
- Azərbaycanca
- Հայերեն
- Polski
- Lietuvių
- Eesti
- Latviešu
- Čeština
- Slovenčina
- Magyar
- Slovenščina
- Hrvatski
- Bosanski
- Crnogorski
- Српски
- Shqip
- Română
- Български
- Македонски
## 支持语言
英语
中文
韩语
日语
德语
法语
葡萄牙语
西班牙语
缅甸语
泰语
越南语
土耳其语
叙利亚语
阿拉伯语
印地语
孟加拉语
尼泊尔语
土库曼语
塔吉克语
吉尔吉斯语
俄语
乌克兰语
白俄罗斯语
格鲁吉亚语
阿塞拜疆语
亚美尼亚语
波兰语
立陶宛语
爱沙尼亚语
拉脱维亚语
捷克语
斯洛伐克语
匈牙利语
斯洛文尼亚语
克罗地亚语
波斯尼亚语
黑山语
塞尔维亚语
阿尔巴尼亚语
罗马尼亚语
保加利亚语
马其顿语
## Supported Languages
English
Chinese
Korean
Japanese
German
French
Portuguese
Spanish
Burmese
Thai
Vietnamese
Turkish
Syriac
Arabic
Hindi
Bengali
Nepali
Turkmen
Tajik
Kyrgyz
Russian
Ukrainian
Belarusian
Georgian
Azerbaijani
Armenian
Polish
Lithuanian
Estonian
Latvian
Czech
Slovak
Hungarian
Slovenian
Croatian
Bosnian
Montenegrin
Serbian
Albanian
Romanian
Bulgarian
Macedonian
BIN  (binary file changed: 804 KiB before, 158 KiB after)
BIN  (six new binary assets added: 5.6 MiB, 1.6 MiB, 2.5 MiB, 3.2 MiB, 1.7 MiB, 12 MiB; previews not shown)

View File

@@ -0,0 +1 @@

BIN
assets/minicpmv.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 27 KiB

25
chat.py
View File

@@ -160,11 +160,36 @@ class OmniLMM3B:
)
return answer
class MiniCPMV2_5:
def __init__(self, model_path) -> None:
self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(dtype=torch.float16)
self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
self.model.eval().cuda()
def chat(self, input):
try:
image = Image.open(io.BytesIO(base64.b64decode(input['image']))).convert('RGB')
except Exception as e:
return "Image decode error"
msgs = json.loads(input['question'])
answer = self.model.chat(
image=image,
msgs=msgs,
tokenizer=self.tokenizer,
sampling=True,
temperature=0.7
)
return answer
class OmniLMMChat:
def __init__(self, model_path) -> None:
if '12B' in model_path:
self.model = OmniLMM12B(model_path)
elif 'MiniCPM-Llama3-V' in model_path:
self.model = MiniCPMV2_5(model_path)
else:
self.model = OmniLMM3B(model_path)
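# Usage sketch (comments only): build the request dict expected by MiniCPMV2_5.chat,
# i.e. a base64-encoded image plus a JSON-encoded message list, and dispatch it via
# OmniLMMChat (assuming OmniLMMChat exposes a chat(input) method that delegates to
# the selected backend, as elsewhere in chat.py; the image path is a placeholder):
#
#   chat_model = OmniLMMChat('openbmb/MiniCPM-Llama3-V-2_5')
#   with open('./assets/example_image.jpg', 'rb') as f:
#       image_b64 = base64.b64encode(f.read()).decode('utf-8')
#   msgs = [{"role": "user", "content": "What is shown in this image?"}]
#   answer = chat_model.chat({"image": image_b64, "question": json.dumps(msgs)})
#   print(answer)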

View File

@@ -1,90 +1,115 @@
import os
import math
import json
import copy
import json
import logging
import math
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
from typing import Dict, Optional, List
from PIL import Image
from dataclasses import dataclass, field
from transformers import AutoTokenizer, AutoProcessor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset
from transformers import AutoProcessor, AutoTokenizer
class SupervisedDataset(Dataset):
"""Dataset for supervised fine-tuning."""
def __init__(self, raw_data, transform, tokenizer, slice_config):
def __init__(
self,
raw_data,
transform,
tokenizer,
slice_config,
llm_type="minicpm",
patch_size=14,
query_nums=64,
batch_vision=False,
):
super(SupervisedDataset, self).__init__()
self.raw_data = raw_data
self.tokenizer = tokenizer
self.transform = transform
self.slice_config = slice_config
self.llm_type = llm_type
self.patch_size = patch_size
self.query_nums=query_nums
self.batch_vision = batch_vision
def __len__(self):
return len(self.raw_data)
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
image = Image.open(self.raw_data[i]["image"]).convert("RGB")
ret = preprocess(image, self.raw_data[i]["conversations"], self.tokenizer, self.transform, slice_config=self.slice_config)
ret = preprocess(
image,
self.raw_data[i]["conversations"],
self.tokenizer,
self.transform,
query_nums=self.query_nums,
slice_config=self.slice_config,
llm_type=self.llm_type,
patch_size=self.patch_size,
batch_vision=self.batch_vision,
)
ret = dict(
input_ids=ret["input_ids"],
labels=ret["target"],
attention_mask=ret["input_ids"].ne(self.tokenizer.pad_token_id),
attention_mask=torch.ones_like(ret["input_ids"], dtype=torch.bool),
pixel_values=ret["pixel_values"],
tgt_sizes=ret["tgt_sizes"],
image_bound=ret["image_bound"],
)
return ret
def data_collator(examples, padding_value=0):
input_ids = pad_sequence([example["input_ids"] for example in examples], batch_first=True, padding_value=padding_value)
targets = pad_sequence([example["labels"] for example in examples], batch_first=True, padding_value=padding_value)
attention_mask = pad_sequence([example["attention_mask"] for example in examples], batch_first=True, padding_value=padding_value)
input_ids = pad_sequence(
[example["input_ids"] for example in examples],
batch_first=True,
padding_value=padding_value,
)
targets = pad_sequence(
[example["labels"] for example in examples],
batch_first=True,
padding_value=padding_value,
)
attention_mask = pad_sequence(
[example["attention_mask"] for example in examples],
batch_first=True,
padding_value=padding_value,
)
pixel_values = [example["pixel_values"] for example in examples]
image_bound = [example["image_bound"] for example in examples]
return {"input_ids": input_ids, "labels":targets, "attention_mask": attention_mask, "image_bound": image_bound, "pixel_values": pixel_values}
tgt_sizes = [example["tgt_sizes"] for example in examples]
return {
"input_ids": input_ids,
"labels": targets,
"attention_mask": attention_mask,
"image_bound": image_bound,
"tgt_sizes": tgt_sizes,
"pixel_values": pixel_values,
}
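# Illustration of the collation step (comments only): pad_sequence right-pads the
# variable-length sequences in a batch, while pixel_values / image_bound / tgt_sizes
# stay as per-sample lists because each sample may contain a different number of
# image slices. For example, two samples with 5 and 8 tokens collate to a (2, 8) batch:
#
#   a = {"input_ids": torch.arange(5), "labels": torch.arange(5),
#        "attention_mask": torch.ones(5, dtype=torch.bool),
#        "pixel_values": [], "image_bound": [], "tgt_sizes": []}
#   b = {"input_ids": torch.arange(8), "labels": torch.arange(8),
#        "attention_mask": torch.ones(8, dtype=torch.bool),
#        "pixel_values": [], "image_bound": [], "tgt_sizes": []}
#   batch = data_collator([a, b])   # batch["input_ids"].shape == (2, 8)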
def conversation_to_ids(conversation, tokenizer):
def conversation_to_ids(conversation, tokenizer, llm_type=None):
"""
for single image multi-turn conversation
conversation: [{'role': 'user', 'content': 'Describe this image'},
{'role': 'assistant', 'content': 'This is a cat.'}]
"""
raw_msg = ''
input_ids = []
context = []
for idx, msg in enumerate(conversation):
role = msg['role']
message = msg['content']
assert role in ['user', 'assistant']
if role == 'user':
prefix = '<用户>'
else:
prefix = '<AI>'
# append eos
if idx == len(conversation) - 1:
message = message + tokenizer.eos_token
prefix_ids = tokenizer.encode(prefix)[1:] # remove bos
message_ids = tokenizer.encode(message)[1:]
if llm_type == "llama3":
input_ids, context, raw_msg = conversation_to_ids_llama3(
conversation, tokenizer
)
else:
input_ids, context, raw_msg = conversation_to_ids_minicpm(
conversation, tokenizer
)
input_ids.append(prefix_ids)
input_ids.append(message_ids)
context.append(np.ones((len(prefix_ids),), dtype=np.int8))
if role == 'assistant':
context.append(np.zeros((len(message_ids),), dtype=np.int8))
else:
context.append(np.ones((len(message_ids),), dtype=np.int8))
raw_msg += (prefix + message)
ids = torch.from_numpy(np.hstack(input_ids, dtype=np.int32))
context = torch.from_numpy(np.hstack(context, dtype=np.int8))
@@ -94,45 +119,137 @@ def conversation_to_ids(conversation, tokenizer):
if context[i] == 0:
target[i - 1] = ids[i]
if context[i] == 1 and context[i - 1] == 0:
target[i - 1] = tokenizer.eos_id
if hasattr(tokenizer, "eot_id"):
target[i - 1] = tokenizer.eot_id
else:
target[i - 1] = tokenizer.eos_id
# build image bound
image_start_tokens = torch.where(ids == tokenizer.im_start_id)[0]
image_start_tokens += 1
image_end_tokens = torch.where(ids == tokenizer.im_end_id)[0]
if len(image_start_tokens) != len(image_end_tokens):
print('image start token != image end tokens')
if len(image_start_tokens)>0:
image_bound = torch.hstack([image_start_tokens.unsqueeze(-1), image_end_tokens.unsqueeze(-1)])
print("image start token != image end tokens")
if len(image_start_tokens) > 0:
image_bound = torch.hstack(
[image_start_tokens.unsqueeze(-1), image_end_tokens.unsqueeze(-1)]
)
else:
image_bound = []
return {
'input_ids': ids,
'target': target,
'image_bound': image_bound,
'raw_msg': raw_msg,
"input_ids": ids,
"target": target,
"image_bound": image_bound,
"raw_msg": raw_msg,
}
def preprocess(image, conversation, tokenizer, transform, query_nums=64, slice_config=None):
def conversation_to_ids_minicpm(conversation, tokenizer):
raw_msg = ""
input_ids = []
context = []
for idx, msg in enumerate(conversation):
role = msg["role"]
message = msg["content"]
assert role in ["user", "assistant"]
if role == "user":
prefix = "<用户>"
else:
prefix = "<AI>"
# append eos
if idx == len(conversation) - 1:
message = message + tokenizer.eos_token
prefix_ids = tokenizer.encode(prefix)[1:] # remove bos
message_ids = tokenizer.encode(message)[1:]
input_ids.append(prefix_ids)
input_ids.append(message_ids)
context.append(np.ones((len(prefix_ids),), dtype=np.int8))
if role == "assistant":
context.append(np.zeros((len(message_ids),), dtype=np.int8))
else:
context.append(np.ones((len(message_ids),), dtype=np.int8))
raw_msg += prefix + message
return input_ids, context, raw_msg
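# Example (MiniCPM prompt format, comments only): the conversation
#   [{"role": "user", "content": "Describe this image"},
#    {"role": "assistant", "content": "This is a cat."}]
# is serialized as "<用户>Describe this image<AI>This is a cat." plus the tokenizer's
# EOS token. The context mask is 1 for prompt tokens (no loss) and 0 for assistant
# tokens, so only the assistant reply (and its EOS) contributes to the training target.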
def conversation_to_ids_llama3(conversation, tokenizer):
raw_msg = ""
input_ids = []
context = []
raw_msg = tokenizer.apply_chat_template(
conversation, tokenize=False, add_generation_prompt=False
)
input_ids = tokenizer.apply_chat_template(
conversation, tokenize=True, add_generation_prompt=False
)
input_ids = np.array(input_ids)
start_header_idxs = np.where(
input_ids == tokenizer.convert_tokens_to_ids("<|start_header_id|>")
)[0]
assistant_idxs = np.where(
input_ids == tokenizer.convert_tokens_to_ids("assistant")
)[0]
end_header_idxs = np.where(
input_ids == tokenizer.convert_tokens_to_ids("<|end_header_id|>")
)[0]
eot_idxs = np.where(
input_ids == tokenizer.convert_tokens_to_ids("<|eot_id|>"))[0]
context = np.ones_like(input_ids, dtype=np.int8)
for assistant_idx in assistant_idxs:
if assistant_idx in set((start_header_idxs + end_header_idxs) / 2):
st = assistant_idx + 3 # assistant<|end_header_id|>\n\n
for eot_idx in eot_idxs:
if eot_idx > st:
context[st: eot_idx + 1] = 0
break
input_ids = np.hstack(input_ids)
context = np.hstack(context)
return input_ids, context, raw_msg
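# Example (Llama-3 chat template, comments only): apply_chat_template renders each
# turn as <|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>. The loop
# above finds every "assistant" token that sits between a <|start_header_id|> /
# <|end_header_id|> pair and zeroes the context mask from assistant_idx + 3 (the start
# of the reply, after "<|end_header_id|>\n\n") up to and including the following
# <|eot_id|>, so only the assistant replies are supervised.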
def preprocess(
image,
conversation,
tokenizer,
transform,
query_nums=64,
slice_config=None,
llm_type=None,
patch_size=14,
batch_vision=False,
):
"""
single image preprocess, the image will be placed at the top of the conversation
"""
conversation = copy.deepcopy(conversation)
assert len(conversation) > 1, "conversation length must be greater than 1"
assert conversation[0]['role'] == 'user', "the first role must be user"
assert conversation[0]["role"] == "user", "the first role must be user"
if slice_config is not None:
assert isinstance(slice_config, Dict)
assert 'patch_size' in slice_config
assert 'max_slice_nums' in slice_config
assert 'scale_resolution' in slice_config
default_image_placeholder = tokenizer.im_start + tokenizer.unk_token * query_nums + tokenizer.im_end
assert "patch_size" in slice_config
assert "max_slice_nums" in slice_config
assert "scale_resolution" in slice_config
default_image_placeholder = (
tokenizer.im_start + tokenizer.unk_token * query_nums + tokenizer.im_end
)
if slice_config:
images = []
source_image, patches, best_grid = slice_image(
image, slice_config['max_slice_nums'], slice_config['scale_resolution'], slice_config['patch_size']
image,
slice_config["max_slice_nums"],
slice_config["scale_resolution"],
slice_config["patch_size"],
)
images.append(source_image)
image_placeholder = default_image_placeholder
@@ -142,30 +259,51 @@ def preprocess(image, conversation, tokenizer, transform, query_nums=64, slice_c
images.append(patches[i][j])
image_placeholder += get_grid_placeholder(
tokenizer, best_grid, query_nums
)
tokenizer, best_grid, query_nums)
images = [transform(i) for i in images]
else:
images = [transform(image)]
image_placeholder = default_image_placeholder
if '<image>' in conversation[0]['content']:
conversation[0]['content'] = conversation[0]['content'].replace('<image>', image_placeholder)
if "<image>" in conversation[0]["content"]:
conversation[0]["content"] = conversation[0]["content"].replace(
"<image>", image_placeholder
)
else:
conversation[0]['content'] = image_placeholder + '\n' + conversation[0]['content']
conversation[0]["content"] = (
image_placeholder + "\n" + conversation[0]["content"]
)
input_dict = conversation_to_ids(conversation, tokenizer, llm_type)
if batch_vision:
tgt_sizes = []
reshape_images = []
for image in images:
H, W = image.shape[1:]
reshape_image = reshape_by_patch(image, patch_size)
reshape_images.append(reshape_image)
tgt_sizes.append([H // patch_size, W // patch_size])
if tgt_sizes:
tgt_sizes = torch.Tensor(tgt_sizes).type(torch.int32)
input_dict["pixel_values"] = reshape_images
input_dict["tgt_sizes"] = tgt_sizes
else:
input_dict["pixel_values"] = images
input_dict["tgt_sizes"] = []
input_dict = conversation_to_ids(conversation, tokenizer)
input_dict['pixel_values'] = images
return input_dict
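# Note on batch_vision (comments only): when the model config sets batch_vision_input,
# every image slice is flattened into patch columns by reshape_by_patch and its grid
# size is recorded in tgt_sizes as [H // patch_size, W // patch_size]. For a 448 x 672
# slice with patch_size 14 this gives tgt_sizes == [32, 48].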
def slice_image(
image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False
):
original_size = image.size
original_width, original_height = original_size
log_ratio = math.log(original_width / original_height)
ratio = original_width * original_height / (scale_resolution * scale_resolution)
ratio = original_width * original_height / \
(scale_resolution * scale_resolution)
multiple = min(math.ceil(ratio), max_slice_nums)
source_image = None
@@ -186,7 +324,8 @@ def slice_image(
candidate_split_grids_nums.append(i)
# source image, down-sampling and ensure divided by patch_size
best_resize = find_best_resize(original_size, scale_resolution, patch_size)
best_resize = find_best_resize(
original_size, scale_resolution, patch_size)
source_image = image.copy().resize(best_resize, Image.Resampling.BICUBIC)
candidate_grids = []
@@ -285,6 +424,22 @@ def get_grid_placeholder(tokenizer, grid, query_num):
for j in range(cols):
lines.append(image_placeholder)
slices.append("".join(lines))
slice_placeholder = tokenizer.slice_start + "\n".join(slices) + tokenizer.slice_end
slice_placeholder = tokenizer.slice_start + \
"\n".join(slices) + tokenizer.slice_end
return slice_placeholder
def reshape_by_patch(image_tensor, patch_size):
"""
:param image_tensor: shape [3, H, W]
:param patch_size:
:return: [3, patch_size, HW/patch_size]
"""
patches = torch.nn.functional.unfold(
image_tensor, (patch_size, patch_size), stride=(patch_size, patch_size)
)
patches = patches.reshape(image_tensor.size(0), patch_size, patch_size, -1)
patches = patches.permute(0, 1, 3, 2).reshape(
image_tensor.size(0), patch_size, -1)
return patches
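# Shape check (comments only): for a 3 x 28 x 42 tensor and patch_size 14, unfold
# extracts (28/14) * (42/14) = 6 patches of 14 x 14, and the result is reshaped to
# [3, 14, 28*42/14] == [3, 14, 84]:
#   x = torch.randn(3, 28, 42)
#   assert reshape_by_patch(x, 14).shape == (3, 14, 84)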

View File

@@ -1,22 +1,22 @@
import os
import glob
import json
import logging
import os
from dataclasses import dataclass, field
from typing import Dict, Optional, List
from typing import Dict, List, Optional
import torch
from torch.utils.data import Dataset
import transformers
from trainer import CPMTrainer
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
from deepspeed import zero
from dataset import data_collator, SupervisedDataset
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from accelerate.utils import DistributedType
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
from PIL import Image
from torch.utils.data import Dataset
from transformers import AutoModel, AutoTokenizer
from dataset import SupervisedDataset, data_collator
from trainer import CPMTrainer
@dataclass
class ModelArguments:
@@ -44,6 +44,8 @@ class TrainingArguments(transformers.TrainingArguments):
"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
},
)
tune_vision: Optional[bool] = field(default=True)
tune_llm: Optional[bool] = field(default=True)
def rank0_print(*args):
@@ -52,7 +54,15 @@ def rank0_print(*args):
def make_supervised_data_module(
tokenizer: transformers.PreTrainedTokenizer, data_args, transform, data_collator=None, slice_config=None,
tokenizer: transformers.PreTrainedTokenizer,
data_args,
transform,
data_collator=None,
llm_type="minicpm",
slice_config=None,
patch_size=14,
query_nums=64,
batch_vision=False,
) -> Dict:
"""Make dataset and collator for supervised fine-tuning."""
dataset_cls = SupervisedDataset
@@ -60,19 +70,57 @@ def make_supervised_data_module(
rank0_print("Loading data...")
train_json = json.load(open(data_args.data_path, "r"))
train_dataset = dataset_cls(train_json, transform, tokenizer, slice_config=slice_config)
train_dataset = dataset_cls(
train_json,
transform,
tokenizer,
slice_config=slice_config,
llm_type=llm_type,
patch_size=patch_size,
query_nums=query_nums,
batch_vision=batch_vision,
)
if data_args.eval_data_path:
eval_json = json.load(open(data_args.eval_data_path, "r"))
eval_dataset = dataset_cls(eval_json, transform, tokenizer, slice_config=slice_config)
eval_dataset = dataset_cls(
eval_json,
transform,
tokenizer,
slice_config=slice_config,
llm_type=llm_type,
patch_size=patch_size,
query_nums=query_nums,
batch_vision=batch_vision,
)
else:
eval_dataset = None
return dict(train_dataset=train_dataset, eval_dataset=eval_dataset, data_collator=data_collator)
return dict(
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
)
def get_parameter_number(model):
trainable_params, all_param = 0, 0
for param in model.parameters():
num_params = param.numel()
# if using DS Zero 3 and the weights are initialized empty
if num_params == 0 and hasattr(param, "ds_numel"):
num_params = param.ds_numel
all_param += num_params
if param.requires_grad:
trainable_params += num_params
return {'Total': all_param, 'Trainable': trainable_params}
local_rank = 0
def train():
global local_rank
@@ -85,8 +133,8 @@ def train():
data_args,
training_args,
) = parser.parse_args_into_dataclasses()
if getattr(training_args, 'deepspeed', None):
if getattr(training_args, "deepspeed", None):
training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED
compute_dtype = (
@@ -99,14 +147,50 @@ def train():
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
model = AutoModel.from_pretrained(model_args.model_name_or_path, trust_remote_code=True, torch_dtype=compute_dtype, device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
#Load data
device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else None
model = AutoModel.from_pretrained(
model_args.model_name_or_path,
trust_remote_code=True,
torch_dtype=compute_dtype,
device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path, trust_remote_code=True
)
if not training_args.tune_vision:
model.vpm.requires_grad_(False)
if not training_args.tune_llm:
model.llm.requires_grad_(False)
rank0_print(get_parameter_number(model))
llm_type = "minicpm"
if "llama3" in model.name_or_path.lower():
tokenizer.chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}"
llm_type = "llama3"
# Load data
if hasattr(model.config, "slice_config"):
slice_config = model.config.slice_config.to_dict()
else:
slice_config = model.config.to_dict()
if hasattr(model.config, "batch_vision_input"):
batch_vision = model.config.batch_vision_input
else:
batch_vision = False
data_module = make_supervised_data_module(
tokenizer=tokenizer, data_args=data_args, transform=model.transform, data_collator=data_collator, slice_config=model.config.__dict__,
tokenizer=tokenizer,
data_args=data_args,
transform=model.transform,
data_collator=data_collator,
slice_config=slice_config,
llm_type=llm_type,
patch_size=model.config.patch_size,
query_nums=model.config.query_num,
batch_vision=batch_vision,
)
trainer = CPMTrainer(
@@ -115,11 +199,10 @@ def train():
args=training_args,
**data_module,
)
trainer.train()
trainer.save_state()
if __name__ == "__main__":
train()

View File

@@ -1,18 +1,13 @@
# MiniCPM-V Finetuning
<div align="center">
[English](README.md)
</div>
We offer the official scripts for easy finetuning of the pretrained **MiniCPM-Llama3-V 2.5** and **MiniCPM-V 2.0** on downstream tasks. Our finetuning scripts use the transformers Trainer and DeepSpeed by default.
### Data preparation
To prepare your finetuning data, you should formulate each sample as a dictionary consisting of an id, an image path, and a list of conversations, and then save the data samples in JSON files.

For a vision-language example with an image, you are required to provide **\<image\>** to define the position to insert the image embeddings. If you don't provide \<image\>, the image will be placed at the front of the conversation.
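As a quick illustration, here is a minimal single-sample sketch consistent with how `finetune/dataset.py` reads the fields (see also the example in the collapsed section below; the file names and contents here are placeholders):

```python
import json

sample = {
    "id": "0",
    "image": "path/to/image.jpg",
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is in this picture?"},
        {"role": "assistant", "content": "A cat sitting on a sofa."},
    ],
}

# The training file is a JSON list of such samples.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```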
<details>
<summary>
@@ -57,10 +52,19 @@ For the vision-language example with image, you are required to define placehold
### Full-parameter finetuning
Full-parameter finetuning updates all parameters of the LLM during the whole training process. Please specify the correct MODEL path and DATA path in the shell script.
```shell
MODEL="openbmb/MiniCPM-Llama3-V-2_5" # or openbmb/MiniCPM-V-2
DATA="path/to/training_data" # json file
EVAL_DATA="path/to/test_data" # json file
```
To launch your training, run the following script:
```
sh finetune_ds.sh
```
#### Customizing Hyperparameters
To tailor the training process according to your specific requirements, you can adjust various hyperparameters. For comprehensive documentation on available hyperparameters and their functionalities, you can refer to the [official Transformers documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). Experimentation and fine-tuning of these parameters are essential for achieving optimal model performance tailored to your specific task and dataset.
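As a hedged illustration, a few commonly adjusted arguments can be set through the shell script or directly on `transformers.TrainingArguments`; the values below are placeholders, not the project's defaults:

```python
from transformers import TrainingArguments

# Illustrative values only; tune them for your task, hardware and dataset.
training_args = TrainingArguments(
    output_dir="output/minicpmv_finetune",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-6,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,                      # bfloat16 on supported GPUs
    gradient_checkpointing=True,    # trade compute for memory
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
)
```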

View File

@@ -1,23 +1,22 @@
from typing import Any, Dict, List, Optional, Tuple, Union
import torch
import torch.nn as nn
from typing import Tuple, Union, Optional, List, Dict, Any
from transformers import Trainer
from transformers.trainer_pt_utils import nested_detach
from transformers.utils import is_sagemaker_mp_enabled
class CPMTrainer(Trainer):
def compute_loss(
self,
model,
inputs,
return_outputs=False
):
def compute_loss(self, model, inputs, return_outputs=False):
if "labels" in inputs:
labels = inputs.pop("labels")
else:
labels = None
vllm_embedding, vision_hidden_states = self.model.get_vllm_embedding(inputs)
vllm_embedding, vision_hidden_states = self.model.get_vllm_embedding(
inputs)
outputs = self.model.llm(
inputs_embeds=vllm_embedding,
use_cache=False,
@@ -26,7 +25,8 @@ class CPMTrainer(Trainer):
if labels is not None:
# Flatten the tokens
loss_fct = nn.CrossEntropyLoss()
logits = outputs.logits.view(-1, self.model.config.vocab_size).contiguous()
logits = outputs.logits.view(-1,
self.model.config.vocab_size).contiguous()
labels = labels.view(-1).long().contiguous()
# Enable model parallelism
labels = labels.to(logits.device)
@@ -35,19 +35,20 @@ class CPMTrainer(Trainer):
if isinstance(outputs, dict) and "loss" not in outputs:
raise ValueError(
"The model did not return a loss from the inputs, only the following keys: "
f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
f"{','.join(outputs.keys())}. For reference, the inputs it received are "
f"{','.join(inputs.keys())}."
)
# We don't use .loss here since the model may return tuples instead of ModelOutput.
loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
return (loss, outputs) if return_outputs else loss
def prediction_step(
self,
model: nn.Module,
inputs:Dict[str, Union[torch.Tensor, Any]],
prediction_loss_only: bool,
ignore_keys: Optional[List[str]] = None,
self,
model: nn.Module,
inputs: Dict[str, Union[torch.Tensor, Any]],
prediction_loss_only: bool,
ignore_keys: Optional[List[str]] = None,
) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]:
"""
Perform an evaluation step on `model` using `inputs`.
@@ -72,25 +73,34 @@ class CPMTrainer(Trainer):
Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss,
logits and labels (each being optional).
"""
has_labels = False if len(self.label_names) == 0 else all(inputs.get(k) is not None for k in self.label_names)
has_labels = (
False
if len(self.label_names) == 0
else all(inputs.get(k) is not None for k in self.label_names)
)
# For CLIP-like models capable of returning loss values.
# If `return_loss` is not specified or being `None` in `inputs`, we check if the default value of `return_loss`
# is `True` in `model.forward`.
return_loss = inputs.get("return_loss", None)
if return_loss is None:
return_loss = self.can_return_loss
loss_without_labels = True if len(self.label_names) == 0 and return_loss else False
loss_without_labels = (
True if len(self.label_names) == 0 and return_loss else False
)
inputs = self._prepare_inputs(inputs)
if ignore_keys is None:
if hasattr(self.model, "config"):
ignore_keys = getattr(self.model.config, "keys_to_ignore_at_inference", [])
ignore_keys = getattr(
self.model.config, "keys_to_ignore_at_inference", []
)
else:
ignore_keys = []
# labels may be popped when computing the loss (label smoothing for instance) so we grab them first.
if has_labels or loss_without_labels:
labels = nested_detach(tuple(inputs.get(name) for name in self.label_names))
labels = nested_detach(tuple(inputs.get(name)
for name in self.label_names))
if len(labels) == 1:
labels = labels[0]
else:
@@ -102,7 +112,11 @@ class CPMTrainer(Trainer):
if has_labels or loss_without_labels:
if isinstance(raw_outputs, dict):
loss_mb = raw_outputs["loss"]
logits_mb = tuple(v for k, v in raw_outputs.items() if k not in ignore_keys + ["loss"])
logits_mb = tuple(
v
for k, v in raw_outputs.items()
if k not in ignore_keys + ["loss"]
)
else:
loss_mb = raw_outputs[0]
logits_mb = raw_outputs[1:]
@@ -112,18 +126,26 @@ class CPMTrainer(Trainer):
else:
loss = None
if isinstance(raw_outputs, dict):
logits_mb = tuple(v for k, v in raw_outputs.items() if k not in ignore_keys)
logits_mb = tuple(
v for k, v in raw_outputs.items() if k not in ignore_keys
)
else:
logits_mb = raw_outputs
logits = smp_nested_concat(logits_mb)
else:
if has_labels or loss_without_labels:
with self.compute_loss_context_manager():
loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
loss, outputs = self.compute_loss(
model, inputs, return_outputs=True
)
loss = loss.mean().detach()
if isinstance(outputs, dict):
logits = tuple(v for k, v in outputs.items() if k not in ignore_keys + ["loss"])
logits = tuple(
v
for k, v in outputs.items()
if k not in ignore_keys + ["loss"]
)
else:
logits = outputs[1:]
else:
@@ -131,7 +153,9 @@ class CPMTrainer(Trainer):
with self.compute_loss_context_manager():
outputs = model(**inputs)
if isinstance(outputs, dict):
logits = tuple(v for k, v in outputs.items() if k not in ignore_keys)
logits = tuple(
v for k, v in outputs.items() if k not in ignore_keys
)
else:
logits = outputs
# TODO: this needs to be fixed and made cleaner later.
@@ -146,5 +170,3 @@ class CPMTrainer(Trainer):
logits = logits[0]
return (loss, logits, labels)

View File

@@ -1,4 +1,8 @@
## MiniCPM-V 1.0
> Archived at: 2024-05-19
MiniCPM-V 1.0 is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Notable features of MiniCPM-V 1.0 include:
- ⚡️ **High Efficiency.**

183
omnilmm.md Normal file
View File

@@ -0,0 +1,183 @@
## OmniLMM-12B
> OmniLMM-12B 发布于本项目早期。推荐您使用我们[最新发布的模型](./README.md),以获得更高效的推理和更强大的性能体验。
> 归档时间2024-05-19
**OmniLMM-12B** 是当前系列中性能最佳的版本。该模型基于EVA02-5B和Zephyr-7B-β初始化构建并使用perceiver resampler连接采用了课程学习的方法在多模态数据上进行训练。该模型具有三个特点
- 🔥 **性能领先。**
OmniLMM-12B 相比其他同规模模型在多个基准测试中取得**领先的性能**(包括 MME、MMBench、SEED-Bench 等),模型掌握了较为丰富的多模态世界知识。
- 🏆 **行为可信。**
多模态大模型的幻觉问题备受关注模型经常生成和图像中的事实不符的文本例如确信地描述图片中并不存在的物体。OmniLMM-12B是 **第一个通过多模态 RLHF 对齐的综合能力优秀的开源多模态大模型**(借助 [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] 系列技术)。该模型在 [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) 幻觉评测基准上达到**开源模型最佳水平**,并在 [Object HalBench](https://arxiv.org/abs/2312.00849) 中**优于GPT-4V**。
- 🕹 **实时多模态交互。**
我们尝试结合OmniLMM-12B和GPT-3.5 (纯文本模型) ,实现**实时多模态交互助手**。该模型接受来自摄像头的视频流,并借助工具处理语音输入输出。虽然还很初步,我们发现该模型无需视频编辑可以**复现Gemini演示视频中的一些有趣例子**。
### 评测结果 <!-- omit in toc -->
<div align="center">
<img src=assets/radar_omnilmm12b.png width=66% />
</div>
<details>
<summary> MME, MMBench, MMMU, MMBench, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench W, MathVista 上的详细评测结果。 </summary>
<table>
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>MME</th>
<th nowrap="nowrap">MMB dev (en)</th>
<th nowrap="nowrap" >MMMU val</th>
<th nowrap="nowrap" >MMHal-Bench</th>
<th nowrap="nowrap" >Object HalBench</th>
<th nowrap="nowrap" >SeedBench-I</th>
<th>MathVista</th>
<th nowrap="nowrap" >LLaVA Bench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left">GPT-4V†</td>
<td>-</td>
<td>1771.5</td>
<td>75.1 </td>
<td>56.8</td>
<td>3.53 / 70.8</td>
<td>86.4 / 92.7</td>
<td>71.6 </td>
<td>47.8 </td>
<td>93.1 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
<td>-</td>
<td>2183.4</td>
<td>66.2 </td>
<td>45.2</td>
<td>- </td>
<td>- </td>
<td>65.7 </td>
<td>36.0 </td>
<td>73.7 </td>
</tr>
<tr>
<td align="left">Yi-VL 6B</td>
<td align="right">6.7B </td>
<td>1915.1 </td>
<td>68.6 </td>
<td>40.3 </td>
<td>- </td>
<td>- </td>
<td>67.5 </td>
<td>28.8 </td>
<td>51.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right">9.6B</td>
<td>1860.0</td>
<td>60.6 </td>
<td>35.9</td>
<td>2.93 / 59.4</td>
<td>56.2 / 80.0</td>
<td>64.8 </td>
<td>33.8 </td>
<td>67.7 </td>
</tr>
<tr>
<td align="left" >CogVLM-Chat</td>
<td align="right">17.4B</td>
<td>1736.6</td>
<td>63.7 </td>
<td>32.1 </td>
<td>2.68 / 52.1 </td>
<td>73.6 / 87.4 </td>
<td>68.8 </td>
<td>34.7 </td>
<td>73.9 </td>
</tr>
<tr>
<td align="left" >LLaVA 1.5</td>
<td align="right">13.6B </td>
<td>1808.4 </td>
<td>68.2 </td>
<td>36.4 </td>
<td>2.71 / 51.0 </td>
<td>53.7 / 77.4 </td>
<td>68.1 </td>
<td>26.4 </td>
<td>64.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
<td align="right">11.6B </td>
<td>1935.8 </td>
<td>71.6 </td>
<td>40.7 </td>
<td>3.45 / 68.8 </td>
<td>90.3 / 95.5 </td>
<td>71.1 </td>
<td>34.9 </td>
<td>72.0 </td>
</tr>
</tbody>
</table>
<small>†: 闭源模型</small>
<br>
</details>
### 典型示例 <!-- omit in toc -->
<table align="center" >
<p align="center" >
<img src="assets/omnilmm-12b-examples_2.png" />
</p>
</table>
我们结合 OmniLMM-12B 和 ChatGPT-3.5 (纯文本模型) 尝试构建 **实时多模态交互助手**. OmniLMM-12B 将视频帧转为对应的图像描述并输入给ChatGPT-3.5来生成对用户指令的响应。演示视频未经编辑。
<div align="center" >
<video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/8fec13bf-bb47-4bf8-8f8c-d0b716a964ec" type="video/mp4" width=80%/>
</div>
## Online Demo
欢迎通过以下链接使用我们的网页端推理服务: [OmniLMM-12B](http://120.92.209.146:8081) [MiniCPM-V 2.0](http://120.92.209.146:80).
## 安装
1. 克隆我们的仓库并跳转到相应目录
```bash
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
```
1. 创建 conda 环境
```Shell
conda create -n MiniCPMV python=3.10 -y
conda activate MiniCPMV
```
3. 安装依赖
```shell
pip install -r requirements.txt
```
## 推理
### 模型库
| 模型 | 简介 | 下载链接 |
|:----------------------|:-------------------|:---------------:|
| OmniLMM-12B | 性能最强的版本 | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |

155
omnilmm_en.md Normal file
View File

@@ -0,0 +1,155 @@
## OmniLMM-12B
> OmniLMM-12B was released in the early stages of this project. We recommend using our [recently released models](./README.md) for better performance and efficiency.
> Archived at: 2024-05-19
**OmniLMM-12B** is the most capable version. The model is built based on EVA02-5B and Zephyr-7B-β, connected with a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:
- 🔥 **Strong Performance.**
OmniLMM-12B achieves **leading performance** among models with comparable sizes, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.). The model is also endowed with rich multimodal world knowledge.
- 🏆 **Trustworthy Behavior.**
LMMs are known for suffering from hallucination, often generating text that is not factually grounded in images (e.g., faithfully describing non-existing objects in images). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) technique). It **ranks #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench), and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).
- 🕹 **Real-time Multimodal Interaction.**
We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/radar_omnilmm12b.png width=66% />
</div>
<details>
<summary>Click to view results on MME, MMBench, MMMU, MMBench, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench, MathVista. </summary>
<table>
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>MME</th>
<th nowrap="nowrap">MMB dev (en)</th>
<th nowrap="nowrap" >MMMU val</th>
<th nowrap="nowrap" >MMHal-Bench</th>
<th nowrap="nowrap" >Object HalBench</th>
<th nowrap="nowrap" >SeedBench-I</th>
<th>MathVista</th>
<th nowrap="nowrap" >LLaVA Bench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left">GPT-4V†</td>
<td>-</td>
<td>1771.5</td>
<td>75.1 </td>
<td>56.8</td>
<td>3.53 / 70.8</td>
<td>86.4 / 92.7</td>
<td>71.6 </td>
<td>47.8 </td>
<td>93.1 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
<td>-</td>
<td>2183.4</td>
<td>66.2 </td>
<td>45.2</td>
<td>- </td>
<td>- </td>
<td>65.7 </td>
<td>36.0 </td>
<td>73.7 </td>
</tr>
<tr>
<td align="left">Yi-VL 6B</td>
<td align="right">6.7B </td>
<td>1915.1 </td>
<td>68.6 </td>
<td>40.3 </td>
<td>- </td>
<td>- </td>
<td>67.5 </td>
<td>28.8 </td>
<td>51.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right">9.6B</td>
<td>1860.0</td>
<td>60.6 </td>
<td>35.9</td>
<td>2.93 / 59.4</td>
<td>56.2 / 80.0</td>
<td>64.8 </td>
<td>33.8 </td>
<td>67.7 </td>
</tr>
<tr>
<td align="left" >CogVLM-Chat</td>
<td align="right">17.4B</td>
<td>1736.6</td>
<td>63.7 </td>
<td>32.1 </td>
<td>2.68 / 52.1 </td>
<td>73.6 / 87.4 </td>
<td>68.8 </td>
<td>34.7 </td>
<td>73.9 </td>
</tr>
<tr>
<td align="left" >LLaVA 1.5</td>
<td align="right">13.6B </td>
<td>1808.4 </td>
<td>68.2 </td>
<td>36.4 </td>
<td>2.71 / 51.0 </td>
<td>53.7 / 77.4 </td>
<td>68.1 </td>
<td>26.4 </td>
<td>64.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><b>OmniLMM-12B</b></td>
<td align="right">11.6B </td>
<td>1935.8 </td>
<td>71.6 </td>
<td>40.7 </td>
<td>3.45 / 68.8 </td>
<td>90.3 / 95.5 </td>
<td>71.1 </td>
<td>34.9 </td>
<td>72.0 </td>
</tr>
</tbody>
</table>
<small>†: Proprietary models</small>
<br>
</details>
### Examples <!-- omit in toc -->
<table align="center" >
<p align="center" >
<img src="assets/omnilmm-12b-examples_2.png" />
</p>
</table>
We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. Video frames are described in text using OmniLMM-12B, and ChatGPT-3.5 (text-only) is employed to generate responses according to the descriptions and user prompts. The demo video is a raw recording without editing.
<div align="center" >
<video controls src="https://github.com/OpenBMB/OmniLMM/assets/157115220/485a8f52-fb4d-4eca-8fee-506347efcfc6" type="video/mp4" width=80%/>
</div>
### Model Zoo
| Model | Description | Download Link |
|:----------------------|:-------------------|:---------------:|
| OmniLMM-12B | The most capable version with leading performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |

View File

@@ -21,10 +21,10 @@ torch==2.1.2
torchvision==0.16.2
tqdm==4.66.1
protobuf==4.25.0
transformers==4.36.0
transformers==4.40.0
typing_extensions==4.8.0
uvicorn==0.24.0.post1
#xformers==0.0.22.post7
#flash_attn==2.3.4
sentencepiece==0.1.99
accelerate==0.24.1
accelerate==0.30.1

252
web_demo_2.5.py Normal file
View File

@@ -0,0 +1,252 @@
#!/usr/bin/env python
# encoding: utf-8
import gradio as gr
from PIL import Image
import traceback
import re
import torch
import argparse
from transformers import AutoModel, AutoTokenizer
# README, How to run demo on different devices
# For Nvidia GPUs.
# python web_demo_2.5.py --device cuda
# For Mac with MPS (Apple silicon or AMD GPUs).
# PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
# Argparser
parser = argparse.ArgumentParser(description='demo')
parser.add_argument('--device', type=str, default='cuda', help='cuda or mps')
args = parser.parse_args()
device = args.device
assert device in ['cuda', 'mps']
# Load model
model_path = 'openbmb/MiniCPM-Llama3-V-2_5'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = model.to(device=device)
model.eval()
ERROR_MSG = "Error, please retry"
model_name = 'MiniCPM-V 2.5'
form_radio = {
'choices': ['Beam Search', 'Sampling'],
#'value': 'Beam Search',
'value': 'Sampling',
'interactive': True,
'label': 'Decode Type'
}
# Beam Form
num_beams_slider = {
'minimum': 0,
'maximum': 5,
'value': 3,
'step': 1,
'interactive': True,
'label': 'Num Beams'
}
repetition_penalty_slider = {
'minimum': 0,
'maximum': 3,
'value': 1.2,
'step': 0.01,
'interactive': True,
'label': 'Repetition Penalty'
}
repetition_penalty_slider2 = {
'minimum': 0,
'maximum': 3,
'value': 1.05,
'step': 0.01,
'interactive': True,
'label': 'Repetition Penalty'
}
max_new_tokens_slider = {
'minimum': 1,
'maximum': 4096,
'value': 1024,
'step': 1,
'interactive': True,
'label': 'Max New Tokens'
}
top_p_slider = {
'minimum': 0,
'maximum': 1,
'value': 0.8,
'step': 0.05,
'interactive': True,
'label': 'Top P'
}
top_k_slider = {
'minimum': 0,
'maximum': 200,
'value': 100,
'step': 1,
'interactive': True,
'label': 'Top K'
}
temperature_slider = {
'minimum': 0,
'maximum': 2,
'value': 0.7,
'step': 0.05,
'interactive': True,
'label': 'Temperature'
}
def create_component(params, comp='Slider'):
if comp == 'Slider':
return gr.Slider(
minimum=params['minimum'],
maximum=params['maximum'],
value=params['value'],
step=params['step'],
interactive=params['interactive'],
label=params['label']
)
elif comp == 'Radio':
return gr.Radio(
choices=params['choices'],
value=params['value'],
interactive=params['interactive'],
label=params['label']
)
elif comp == 'Button':
return gr.Button(
value=params['value'],
interactive=True
)
def chat(img, msgs, ctx, params=None, vision_hidden_states=None):
default_params = {"num_beams":3, "repetition_penalty": 1.2, "max_new_tokens": 1024}
if params is None:
params = default_params
if img is None:
return -1, "Error, invalid image, please upload a new image", None, None
try:
image = img.convert('RGB')
answer = model.chat(
image=image,
msgs=msgs,
tokenizer=tokenizer,
**params
)
res = re.sub(r'(<box>.*</box>)', '', answer)
res = res.replace('<ref>', '')
res = res.replace('</ref>', '')
res = res.replace('<box>', '')
answer = res.replace('</box>', '')
return 0, answer, None, None  # 0 signals success so respond() keeps the multi-turn context
except Exception as err:
print(err)
traceback.print_exc()
return -1, ERROR_MSG, None, None
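# Note (comments only): MiniCPM-V may emit grounding markup such as <ref>object</ref>
# and <box>x1 y1 x2 y2</box> in its raw output; the regex and replacements above strip
# these tags so that only plain text is shown in the web demo.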
def upload_img(image, _chatbot, _app_session):
image = Image.fromarray(image)
_app_session['sts']=None
_app_session['ctx']=[]
_app_session['img']=image
_chatbot.append(('', 'Image uploaded successfully, you can talk to me now'))
return _chatbot, _app_session
def respond(_question, _chat_bot, _app_cfg, params_form, num_beams, repetition_penalty, repetition_penalty_2, top_p, top_k, temperature):
if _app_cfg.get('ctx', None) is None:
_chat_bot.append((_question, 'Please upload an image to start'))
return '', _chat_bot, _app_cfg
_context = _app_cfg['ctx'].copy()
if _context:
_context.append({"role": "user", "content": _question})
else:
_context = [{"role": "user", "content": _question}]
print('<User>:', _question)
if params_form == 'Beam Search':
params = {
'sampling': False,
'num_beams': num_beams,
'repetition_penalty': repetition_penalty,
"max_new_tokens": 896
}
else:
params = {
'sampling': True,
'top_p': top_p,
'top_k': top_k,
'temperature': temperature,
'repetition_penalty': repetition_penalty_2,
"max_new_tokens": 896
}
code, _answer, _, sts = chat(_app_cfg['img'], _context, None, params)
print('<Assistant>:', _answer)
_context.append({"role": "assistant", "content": _answer})
_chat_bot.append((_question, _answer))
if code == 0:
_app_cfg['ctx']=_context
_app_cfg['sts']=sts
return '', _chat_bot, _app_cfg
def regenerate_button_clicked(_question, _chat_bot, _app_cfg, params_form, num_beams, repetition_penalty, repetition_penalty_2, top_p, top_k, temperature):
if len(_chat_bot) <= 1:
_chat_bot.append(('Regenerate', 'No question for regeneration.'))
return '', _chat_bot, _app_cfg
elif _chat_bot[-1][0] == 'Regenerate':
return '', _chat_bot, _app_cfg
else:
_question = _chat_bot[-1][0]
_chat_bot = _chat_bot[:-1]
_app_cfg['ctx'] = _app_cfg['ctx'][:-2]
return respond(_question, _chat_bot, _app_cfg, params_form, num_beams, repetition_penalty, repetition_penalty_2, top_p, top_k, temperature)
with gr.Blocks() as demo:
with gr.Row():
with gr.Column(scale=1, min_width=300):
params_form = create_component(form_radio, comp='Radio')
with gr.Accordion("Beam Search") as beams_according:
num_beams = create_component(num_beams_slider)
repetition_penalty = create_component(repetition_penalty_slider)
with gr.Accordion("Sampling") as sampling_according:
top_p = create_component(top_p_slider)
top_k = create_component(top_k_slider)
temperature = create_component(temperature_slider)
repetition_penalty_2 = create_component(repetition_penalty_slider2)
regenerate = create_component({'value': 'Regenerate'}, comp='Button')
with gr.Column(scale=3, min_width=500):
app_session = gr.State({'sts':None,'ctx':None,'img':None})
bt_pic = gr.Image(label="Upload an image to start")
chat_bot = gr.Chatbot(label=f"Chat with {model_name}")
txt_message = gr.Textbox(label="Input text")
regenerate.click(
regenerate_button_clicked,
[txt_message, chat_bot, app_session, params_form, num_beams, repetition_penalty, repetition_penalty_2, top_p, top_k, temperature],
[txt_message, chat_bot, app_session]
)
txt_message.submit(
respond,
[txt_message, chat_bot, app_session, params_form, num_beams, repetition_penalty, repetition_penalty_2, top_p, top_k, temperature],
[txt_message, chat_bot, app_session]
)
bt_pic.upload(lambda: None, None, chat_bot, queue=False).then(upload_img, inputs=[bt_pic,chat_bot,app_session], outputs=[chat_bot,app_session])
# launch
demo.launch(share=False, debug=True, show_api=False, server_port=8080, server_name="0.0.0.0")