diff --git a/README.md b/README.md index a89a0c2..223073e 100644 --- a/README.md +++ b/README.md @@ -1,35 +1,33 @@
- OmniLMM-3B 🤗 🤖 | + MiniCPM-V 2.0 🤗 🤖 | OmniLMM-12B 🤗 🤖
@@ -186,132 +468,24 @@ We combine the OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal
-## OmniLMM-3B
-**OmniLMM-3B** (i.e., MiniCPM-V) is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Notable features of OmniLMM-3B include:
-
-- ⚡️ **High Efficiency.**
-
-  OmniLMM-3B can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. In terms of visual encoding, we compress the image representations into 64 tokens via a perceiver resampler, which is significantly fewer than other LMMs based on MLP architecture (typically > 512 tokens). This allows OmniLMM-3B to operate with **much less memory cost and higher speed during inference**.
-
-- 🔥 **Promising Performance.**
-
-  OmniLMM-3B achieves **state-of-the-art performance** on multiple benchmarks (including MMMU, MME, and MMbech, etc) among models with comparable sizes, surpassing existing LMMs built on Phi-2. It even **achieves comparable or better performance than the 9.6B Qwen-VL-Chat**.
-
-- 🙌 **Bilingual Support.**
-
-  OmniLMM-3B is **the first end-deployable LMM supporting bilingual multimodal interaction in English and Chinese**. This is achieved by generalizing multimodal capabilities across languages, a technique from the ICLR 2024 spotlight [paper](https://arxiv.org/abs/2308.12038).
-
-### Evaluation
-
-| Model | Size | Visual Tokens | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
-|:--|:--|:--|:--|:--|:--|:--|:--|
-| LLaVA-Phi | 3B | 576 | 1335 | 59.8 | - | - | - |
-| MobileVLM | 3B | 144 | 1289 | 59.6 | - | - | - |
-| Imp-v1 | 3B | 576 | 1434 | 66.5 | - | - | - |
-| Qwen-VL-Chat | 9.6B | 256 | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
-| CogVLM | 17.4B | 1225 | 1438 | 63.7 | 53.8 | 32.1 | - |
-| OmniLMM-3B | 3B | 64 | 1452 | 67.9 | 65.3 | 37.2 | 32.1 |
-
-### Examples
-
-We deploy OmniLLM-3B on end devices. The demo video is the raw screen recording on a OnePlus 9R without edition.
-
 ## Demo
-Click here to try out the Demo of [OmniLMM-12B](http://120.92.209.146:8081) and [OmniLMM-3B](http://120.92.209.146:80).
+Click here to try out the demo of [MiniCPM-V 2.0](http://120.92.209.146:80/) and [OmniLMM-12B](http://120.92.209.146:8081).

## Install

1. Clone this repository and navigate to the source folder

```bash
-git clone https://github.com/OpenBMB/OmniLMM.git
-cd OmniLMM
+git clone https://github.com/OpenBMB/MiniCPM-V.git
+cd MiniCPM-V
```

2. Create conda environment

```Shell
-conda create -n OmniLMM python=3.10 -y
-conda activate OmniLMM
+conda create -n MiniCPM-V python=3.10 -y
+conda activate MiniCPM-V
```

3. Install dependencies

@@ -325,27 +499,27 @@ pip install -r requirements.txt

### Model Zoo

| Model | Description | Download Link |
|:----------------------|:-------------------|:---------------:|
-| OmniLMM-12B | The most capable version with strong performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) [](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
-| OmniLMM-3B | The efficient version for end device deployment. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
+| MiniCPM-V 2.0 | The latest version, with state-of-the-art end-side capabilities and high efficiency. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
+| MiniCPM-V | The first version of MiniCPM-V. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
+| OmniLMM-12B | The most capable version with leading performance. | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) [](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
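+If you want to fetch the model weights ahead of time instead of downloading them on first use, a sketch like the following should work. It assumes the standard `huggingface_hub` API (not part of this repo) and reuses the MiniCPM-V 2.0 repo id from the table above:
+
+```python
+# Hypothetical pre-download sketch using huggingface_hub.
+from huggingface_hub import snapshot_download
+
+# Fetches the whole model repo into the local cache and returns its path;
+# pass local_dir=... to choose an explicit folder instead.
+local_path = snapshot_download('openbmb/MiniCPM-V-2')
+print(local_path)
+```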
### Multi-turn Conversation

-Please refer to the following codes to run `OmniLMM`.
+Please refer to the following code to run `MiniCPM-V` and `OmniLMM`.

```python
from chat import OmniLMMChat, img2base64
+import json  # needed below to serialize the multi-turn message history

-chat_model = OmniLMMChat('openbmb/OmniLMM-12B') # or 'openbmb/MiniCPM-V'
+chat_model = OmniLMMChat('openbmb/OmniLMM-12B') # or 'openbmb/MiniCPM-V-2'

-im_64 = img2base64('./assets/worldmap_ck.jpg')
+im_64 = img2base64('./assets/hk_OCR.jpg')

# First round chat
-msgs = [{"role": "user", "content": "What is interesting about this image?"}]
+msgs = [{"role": "user", "content": "Where should I go to buy a camera?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
@@ -354,7 +528,7 @@ print(answer)

# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
-msgs.append({"role": "user", "content": "Where is China in the image"})
+msgs.append({"role": "user", "content": "Where is this store in the image?"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
@@ -362,15 +536,18 @@ print(answer)
```

We can obtain the following results:

-```
-"The interesting aspect of this image is the shape of the chicken nuggets on the pan. The nuggets are shaped like the continents of the world, which is an unusual and creative way to present the food. It adds a fun and playful element to the meal, making it more visually appealing and engaging."
-"In the image, China is located on the right side of the pan. It is one of the nuggets shaped like the continents of the world, and its placement on the right side of the pan is consistent with its geographical location in the real world"
-```
+```
+"You should go to the Canon store for a camera."
+
+"The Canon store is located on the right side of the image."
+```
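+The two rounds above repeat the same pattern: serialize the history, ask, append the answer. A small wrapper can make that reusable. This is a hypothetical convenience sketch built only on the `OmniLMMChat` interface shown above, not an API provided by this repo:
+
+```python
+import json
+from chat import OmniLMMChat, img2base64
+
+def multi_turn(chat_model, image_path, questions):
+    """Ask several questions about one image, threading the history through."""
+    im_64 = img2base64(image_path)
+    msgs, answers = [], []
+    for q in questions:
+        msgs.append({"role": "user", "content": q})
+        answer = chat_model.chat({"image": im_64, "question": json.dumps(msgs)})
+        msgs.append({"role": "assistant", "content": answer})
+        answers.append(answer)
+    return answers
+
+chat_model = OmniLMMChat('openbmb/OmniLMM-12B')
+print(multi_turn(chat_model, './assets/hk_OCR.jpg',
+                 ["Where should I go to buy a camera?",
+                  "Where is this store in the image?"]))
+```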
### Inference on Mac

-Click to view example, OmniLMM-3B (i.e., MiniCPM-V) can run on Mac with MPS (Apple silicon or AMD GPUs).
+Click to view an example of running MiniCPM-V 2.0 on 💻 Mac with MPS (Apple silicon or AMD GPUs).

```python
# test.py
@@ -378,14 +555,14 @@ import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

-model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=torch.bfloat16)
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='mps', dtype=torch.float16)

-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)
model.eval()

-image = Image.open('./assets/worldmap_ck.jpg').convert('RGB')
-question = 'What is interesting about this image?'
+image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
+question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
@@ -404,7 +581,7 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py

### Deployment on Mobile Phone

-Currently OmniLMM-3B (i.e., MiniCPM-V) can be deployed on mobile phones with Android and Harmony operating systems. 🚀 Try it out [here](https://github.com/OpenBMB/mlc-MiniCPM).
+Currently MiniCPM-V 2.0 can be deployed on mobile phones with Android and Harmony operating systems. 🚀 Try it out [here](https://github.com/OpenBMB/mlc-MiniCPM).

## TODO

@@ -412,24 +589,24 @@
- [ ] Local Web-UI deployment
- [ ] Code release for real-time interactive assistant

-## Model License
+## Model License

The code in this repo is released according to [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)

-The usage of OmniLMMs' parameters is subject to "[General Model License Agreement - Source Notes - Publicity Restrictions - Commercial License](https://github.com/OpenBMB/General-Model-License/blob/main/通用模型许可协议-来源说明-宣传限制-商业授权.md)"
+The usage of MiniCPM-V's and OmniLMM's parameters is subject to "[General Model License Agreement - Source Notes - Publicity Restrictions - Commercial License](https://github.com/OpenBMB/General-Model-License/blob/main/通用模型许可协议-来源说明-宣传限制-商业授权.md)"

-The parameters are fully open to acedemic research
+The parameters are fully open to academic research.

-Please contact cpm@modelbest.cn to obtain a written authorization for commercial uses. Free commercial use is also allowed after registration.
+Please contact cpm@modelbest.cn to obtain written authorization for commercial use. Free commercial use is also allowed after registration.

-## Statement
+## Statement

-As LMMs, OmniLMMs generate contents by learning a large mount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by OmniLMMs does not represent the views and positions of the model developers
+As LMMs, MiniCPM-V and OmniLMM generate content by learning a large amount of multimodal corpora, but they cannot comprehend or express personal opinions or make value judgments. Anything generated by these models does not represent the views and positions of the model developers.

We will not be liable for any problems arising from the use of OmniLMM open source models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.
-## Institutions
+## Institutions

This project is developed by the following institutions:

diff --git a/README_zh.md b/README_zh.md
index 746e1c6..547418d 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -2,49 +2,334 @@

**性能领先且部署高效的多模态大模型**

- OmniLMM-3B 🤗 🤖 |
+ MiniCPM-V 2.0 🤗 🤖 |
  OmniLMM-12B 🤗 🤖

-**OmniLMM** 是面向图文理解的开源多模态大模型系列。该系列模型接受图像和文本输入,并提供高质量的文本输出。我们发布了两个版本的 OmniLMM,旨在实现**领先的性能和高效的部署**:
+**MiniCPM-V** 和 **OmniLMM** 是面向图文理解的开源多模态大模型系列。该系列模型接受图像和文本输入,并提供高质量的文本输出。我们发布了两个版本的模型,旨在实现**领先的性能和高效的部署**:

-- **OmniLMM-12B**:相比同规模其他模型在多个基准测试中具有领先性能。
+- **MiniCPM-V 2.8B**:可在终端设备上部署的先进多模态大模型。最新发布的 MiniCPM-V 2.0 可以接受 180 万像素的任意长宽比图像输入,实现了和 Gemini Pro 相近的场景文字识别能力,以及和 GPT-4V 相匹敌的低幻觉率。
+
+- **OmniLMM-12B**:相比同规模其他模型在多个基准测试中具有领先性能,实现了相比 GPT-4V 更低的幻觉率。
-- **OmniLMM-3B**:可在终端设备上部署并具备先进的多模态对话能力。

[English Document](./README.md)

-## 目录
+## 目录

-- [目录](#目录)
+- [MiniCPM-V 2.8B](#minicpm-v-28b)
- [OmniLMM-12B](#omnilmm-12b)
-  - [评测结果](#评测结果)
-  - [典型示例](#典型示例)
-- [OmniLMM-3B](#omnilmm-3b)
-  - [性能评估](#性能评估)
-  - [部署示例](#部署示例)
- [Demo](#demo)
- [安装](#安装)
- [推理](#推理)
  - [模型库](#模型库)
  - [多轮对话](#多轮对话)
-  - [Mac推理](#mac推理)
+  - [Mac 推理](#mac-推理)
  - [手机端部署](#手机端部署)
- [未来计划](#未来计划)
-- [模型协议](#模型协议)
-- [声明](#声明)
-- [机构](#机构)

+## MiniCPM-V 2.8B
+
+**MiniCPM-V 2.8B** 可以高效部署到终端设备。该模型基于 SigLip-400M 和 [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/) 构建,通过 perceiver resampler 连接。最新发布的 MiniCPM-V 2.0 的特点包括:
+
+- 🔥 **优秀的性能。**
+
+  MiniCPM-V 2.0 在多个测试基准(如 OCRBench, TextVQA, MME, MMB, MathVista 等)中实现了 7B 以下模型的**最佳性能**。**在综合了 11 个主流多模态大模型评测基准的 OpenCompass 榜单上超过了 Qwen-VL-Chat 9.6B、CogVLM-Chat 17.4B 和 Yi-VL 34B 等更大参数规模的模型**。MiniCPM-V 2.0 还展现出**领先的 OCR 能力**,在场景文字识别能力上**接近 Gemini Pro**,OCRBench 得分达到**开源模型第一**。
+
+- 🏆 **可信行为。**
+
+  多模态大模型深受幻觉问题困扰,模型经常生成和图像中的事实不符的文本。MiniCPM-V 2.0 是**第一个通过多模态 RLHF 对齐的端侧多模态大模型**(借助 [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] 系列技术)。该模型在 [Object HalBench](https://arxiv.org/abs/2312.00849) 上达到**和 GPT-4V 相仿**的性能。
+
+- 🌟 **高清图像高效编码。**
+
+  MiniCPM-V 2.0 可以接受 **180 万像素的任意长宽比图像输入**(基于最新的 [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf) 技术),这使得模型可以感知到小物体、密集文字等更加细粒度的视觉信息。
+
+- ⚡️ **高效部署。**
+
+  MiniCPM-V 2.0 可以**高效部署在大多数 GPU 和个人电脑上**,包括**移动手机等终端设备**。在视觉编码方面,我们通过 perceiver resampler 将图像表示压缩为更少的 token(其核心思想可参考本节之后的示意代码),这使得 MiniCPM-V 2.0 即便是**面对高分辨率图像,也能占用较低的存储并展现优秀的推理速度**。
+
+- 🙌 **双语支持。**
+
+  MiniCPM-V 2.0 **提供领先的中英双语多模态能力支持**。该能力通过 [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24] 论文中提出的多模态能力的跨语言泛化技术实现。
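+下面用一段简化的 PyTorch 代码示意 perceiver resampler 的核心思想:一组可学习的 query 向量通过交叉注意力,把数量可变的图像特征压缩为固定数量的 token。这只是一个假设性的示意草图(其中的维度、头数等均为虚构举例),并非本仓库的实际实现:
+
+```python
+import torch
+import torch.nn as nn
+
+class PerceiverResampler(nn.Module):
+    """示意:用可学习 query 做交叉注意力,把 N 个图像 patch 特征压缩成固定数量的 token。"""
+    def __init__(self, dim=2304, num_queries=64, num_heads=8):
+        super().__init__()
+        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)  # 可学习 query
+        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
+        self.ln_q, self.ln_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)
+
+    def forward(self, image_feats):  # image_feats: (B, N_patches, dim)
+        q = self.ln_q(self.queries).unsqueeze(0).expand(image_feats.size(0), -1, -1)
+        kv = self.ln_kv(image_feats)
+        out, _ = self.attn(q, kv, kv)
+        return out  # (B, num_queries, dim):token 数量与输入 patch 数无关
+
+feats = torch.randn(1, 1024, 2304)        # 假设的 ViT patch 特征
+print(PerceiverResampler()(feats).shape)  # torch.Size([1, 64, 2304])
+```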
+### 性能评估
+
+![MiniCPM-V 2.0 性能](./assets/minicpmv-2-peformance.png)
+
+TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench 上的详细评测结果。
+
+| Model | Size | TextVQA val | DocVQA test | OCRBench | OpenCompass | MME | MMB dev(en) | MMB dev(zh) | MMMU val | MathVista | LLaVA Bench | Object HalBench |
+|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
+| **Proprietary models** | | | | | | | | | | | | |
+| Gemini Pro Vision | - | 74.6 | 88.1 | 680 | 63.8 | 2148.9 | 75.2 | 74.0 | 48.9 | 45.8 | 79.9 | - |
+| GPT-4V | - | 78.0 | 88.4 | 645 | 63.2 | 1771.5 | 75.1 | 75.0 | 53.8 | 47.8 | 93.1 | 86.4 / 92.7 |
+| **Open-source models 6B~34B** | | | | | | | | | | | | |
+| Yi-VL-6B | 6.7B | 45.5* | 17.1* | 290 | 49.3 | 1915.1 | 68.6 | 68.3 | 40.3 | 28.8 | 51.9 | - |
+| Qwen-VL-Chat | 9.6B | 61.5 | 62.6 | 488 | 52.1 | 1860.0 | 60.6 | 56.7 | 37.0 | 33.8 | 67.7 | 56.2 / 80.0 |
+| Yi-VL-34B | 34B | 43.4* | 16.9* | 290 | 52.6 | 2050.2 | 71.1 | 71.4 | 45.1 | 30.7 | 62.3 | - |
+| DeepSeek-VL-7B | 7.3B | 64.7* | 47.0* | 435 | 55.6 | 1765.4 | 74.1 | 72.8 | 38.3 | 36.8 | 77.8 | - |
+| TextMonkey | 9.7B | 64.3 | 66.7 | 558 | - | - | - | - | - | - | - | - |
+| CogVLM-Chat | 17.4B | 70.4 | 33.3* | 590 | 52.5 | 1736.6 | 63.7 | 53.8 | 37.3 | 34.7 | 73.9 | 73.6 / 87.4 |
+| **Open-source models 1B~3B** | | | | | | | | | | | | |
+| DeepSeek-VL-1.3B | 1.7B | 58.4* | 37.9* | 413 | 46.0 | 1531.6 | 64.0 | 61.2 | 33.8 | 29.4 | 51.1 | - |
+| MobileVLM V2 | 3.1B | 57.5 | 19.4* | - | - | 1440.5(P) | 63.2 | - | - | - | - | - |
+| Mini-Gemini | 2.2B | 56.2 | 34.2* | - | - | 1653.0 | 59.8 | - | 31.7 | - | - | - |
+| MiniCPM-V | 2.8B | 60.6 | 38.2 | 366 | 47.6 | 1650.2 | 67.9 | 65.3 | 38.3 | 28.9 | 51.3 | 78.4 / 88.5 |
+| MiniCPM-V 2.0 | 2.8B | 74.1 | 71.9 | 605 | 55.0 | 1808.6 | 69.6 | 68.1 | 38.2 | 38.7 | 69.2 | 85.5 / 92.2 |
+
+\* 我们自己评测了正式开源的模型权重。
+
+### 典型示例
+
+![MiniCPM-V 2.0 典型示例](./assets/minicpmv2-cases_2.png)
+
+我们将 MiniCPM-V 2.0 部署在小米 14 Pro 上,并录制了以下演示视频,未经任何视频剪辑。
+
+![station](./assets/gif_cases/station.gif)
+![english_menu](./assets/gif_cases/english_menu.gif)
+![hong_kong_street](./assets/gif_cases/hong_kong_street.gif)
+
+### MiniCPM-V 1.0
+
+请参考[这里](./minicpm_v1.md)了解 MiniCPM-V 1.0 的信息和使用教程。

## OmniLMM-12B

**OmniLMM-12B** 是当前系列中性能最佳的版本。该模型基于EVA02-5B和Zephyr-7B-β初始化构建,并使用perceiver resampler连接,采用了课程学习的方法在多模态数据上进行训练。该模型具有三个特点:

@@ -54,19 +339,19 @@

- 🏆 **行为可信。**

-  多模态大模型的幻觉问题备受关注,模型经常生成和图像中的事实不符的文本(例如,确信地描述图片中并不存在的物体)。OmniLMM-12B是 **第一个通过多模态 RLHF 对齐的综合能力优秀的开源多模态大模型**(借助最新的 [RLHF-V](https://rlhf-v.github.io/) 技术)。该模型在 [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) 幻觉评测基准上达到**开源模型最佳水平**,并在 [Object HalBench](https://arxiv.org/abs/2312.00849) 中**优于GPT-4V**。
+  多模态大模型的幻觉问题备受关注,模型经常生成和图像中的事实不符的文本(例如,确信地描述图片中并不存在的物体)。OmniLMM-12B是 **第一个通过多模态 RLHF 对齐的综合能力优秀的开源多模态大模型**(借助 [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] 系列技术)。该模型在 [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) 幻觉评测基准上达到**开源模型最佳水平**,并在 [Object HalBench](https://arxiv.org/abs/2312.00849) 中**优于GPT-4V**。

- 🕹 **实时多模态交互。**

  我们尝试结合OmniLMM-12B和GPT-3.5 (纯文本模型) ,实现**实时多模态交互助手**。该模型接受来自摄像头的视频流,并借助工具处理语音输入输出。虽然还很初步,我们发现该模型无需视频编辑可以**复现Gemini演示视频中的一些有趣例子**。

-### 评测结果
+### 评测结果

-![评测雷达图](./assets/eval_radar.png)
+![OmniLMM-12B 评测雷达图](./assets/radar_omnilmm12b.png)

-  MME, MMBench, MMMU, MMBench, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench W, MathVista 上的详细评测结果.
+ MME, MMBench, MMMU, MMBench, MMHal-Bench, Object HalBench, SeedBench, LLaVA Bench W, MathVista 上的详细评测结果。 @@ -80,14 +365,14 @@ Object HalBench SeedBench-I MathVista - LLaVA Bench W + LLaVA Bench GPT-4V† - - 1409 + 1771.5 75.1 56.8 3.53 / 70.8 @@ -99,7 +384,7 @@ Qwen-VL-Plus† - - 1681 + 2183.4 66.2 45.2 - @@ -111,19 +396,19 @@ Yi-VL 6B 6.7B - - - 68.2 - 39.1 + 1915.1 + 68.6 + 40.3 - - - 66.1 - 28.0 - 39.9 + 67.5 + 28.8 + 51.9 Qwen-VL-Chat 9.6B - 1488 + 1860.0 60.6 35.9 2.93 / 59.4 @@ -133,9 +418,9 @@ 67.7 - CogVLM + CogVLM-Chat 17.4B - 1438 + 1736.6 63.7 32.1 2.68 / 52.1 @@ -147,7 +432,7 @@ LLaVA 1.5 13.6B - 1531 + 1808.4 68.2 36.4 2.71 / 51.0 @@ -159,7 +444,7 @@ OmniLMM-12B 11.6B - 1637 + 1935.8 71.6 40.7 3.45 / 68.8 @@ -171,10 +456,10 @@ †: 闭源模型 - + -### 典型示例 +### 典型示例 @@ -189,120 +474,9 @@ -## OmniLMM-3B - -**OmniLMM-3B**(即 MiniCPM-V)可以高效部署到终端设备。该模型基于 SigLip-400M 和 [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/)构建,通过perceiver resampler连接。OmniLMM-3B的特点包括: - -- ⚡️ **高效部署。** - - OmniLMM-3B 可以**高效部署在大多数 GPU 和个人电脑上**,包括**移动手机等终端设备**。在视觉编码方面,我们通过perceiver resampler将图像表示压缩为64个token,远远少于基于MLP架构的其他多模态大模型(通常大于512个token)。这使得 OmniLMM-3B 在推理期间**存储占用更低并且速度更快**。 - -- 🔥 **优秀的性能。** - - OmniLMM-3B 在多个测试基准中实现了同规模**最佳性能**,超过了基于Phi-2构建的多模态大模型。该模型甚至在部分基准中实现了**与9.6B Qwen-VL-Chat匹配或更好的性能**。 - -- 🙌 **双语支持。** - - OmniLMM-3B 是**第一个支持中英双语的端侧多模态大模型**。 - 该能力通过ICLR 2024 spotlight [论文](https://arxiv.org/abs/2308.12038)中提出的多模态能力的跨语言泛化技术实现。 - -### 性能评估 - - - - - - - Model - Size - Visual Tokens - MME - MMB dev (en) - MMB dev (zh) - MMMU val - CMMMU val - - - - - LLaVA-Phi - 3B - 576 - 1335 - 59.8 - - - - - - - - - MobileVLM - 3B - 144 - 1289 - 59.6 - - - - - - - - - Imp-v1 - 3B - 576 - 1434 - 66.5 - - - - - - - - - Qwen-VL-Chat - 9.6B - 256 - 1487 - 60.6 - 56.7 - 35.9 - 30.7 - - - CogVLM - 17.4B - 1225 - 1438 - 63.7 - 53.8 - 32.1 - - - - - OmniLMM-3B - 3B - 64 - 1452 - 67.9 - 65.3 - 37.2 - 32.1 - - - - - - -### 部署示例 - -我们在手机上部署了OmniLMM-3B。演示视频是OnePlus 9R上的原始录屏结果。 - - - - - - - - ## Demo -欢迎通过以下链接使用我们的网页端推理服务: [OmniLMM-12B](http://120.92.209.146:8081) | [OmniLMM-3B](http://120.92.209.146:80). +欢迎通过以下链接使用我们的网页端推理服务: [OmniLMM-12B](http://120.92.209.146:8081) | [MiniCPM-V 2.0](http://120.92.209.146:80). 
## 安装

@@ -332,28 +506,29 @@ pip install -r requirements.txt

### 模型库

| 模型 | 简介 | 下载链接 |
|:----------------------|:-------------------|:---------------:|
+| MiniCPM-V 2.0 | 最新版本,提供高效而领先的端侧双语多模态理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/files) |
+| MiniCPM-V | 第一版 MiniCPM-V | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
| OmniLMM-12B | 性能最强的版本 | [🤗](https://huggingface.co/openbmb/OmniLMM-12B) [](https://modelscope.cn/models/OpenBMB/OmniLMM-12B/files) |
-| OmniLMM-3B | 支持端侧高效部署,性能优秀 | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |

### 多轮对话

-请参考以下代码使用 `OmniLMM` 进行推理。
+请参考以下代码使用 `MiniCPM-V` 和 `OmniLMM` 进行推理。

```python
from chat import OmniLMMChat, img2base64
+import json  # 序列化多轮对话历史时需要

-chat_model = OmniLMMChat('openbmb/OmniLMM-12B') # or 'openbmb/MiniCPM-V'
+chat_model = OmniLMMChat('openbmb/OmniLMM-12B') # or 'openbmb/MiniCPM-V-2'

-im_64 = img2base64('./assets/worldmap_ck.jpg')
+im_64 = img2base64('./assets/hk_OCR.jpg')

# First round chat
-msgs = [{"role": "user", "content": "What is interesting about this image?"}]
+msgs = [{"role": "user", "content": "Where should I go to buy a camera?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
@@ -362,7 +537,7 @@ print(answer)

# Second round chat
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
-msgs.append({"role": "user", "content": "Where is China in the image"})
+msgs.append({"role": "user", "content": "Where is this store in the image?"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
@@ -370,15 +545,18 @@ print(answer)
```

可以得到以下输出:

-```
-"The interesting aspect of this image is the shape of the chicken nuggets on the pan. The nuggets are shaped like the continents of the world, which is an unusual and creative way to present the food. It adds a fun and playful element to the meal, making it more visually appealing and engaging."
-"In the image, China is located on the right side of the pan. It is one of the nuggets shaped like the continents of the world, and its placement on the right side of the pan is consistent with its geographical location in the real world"
-```
+```
+"You should go to the Canon store for a camera."
+
+"The Canon store is located on the right side of the image."
+```
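+第二轮提问前,代码会把上一轮的回答以 `assistant` 角色追加到 `msgs` 中。两轮过后,传给 `json.dumps` 的历史结构大致如下(由上面的示例整理而来的示意):
+
+```python
+msgs = [
+    {"role": "user", "content": "Where should I go to buy a camera?"},
+    {"role": "assistant", "content": "You should go to the Canon store for a camera."},
+    {"role": "user", "content": "Where is this store in the image?"},
+]
+```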
### Mac 推理

-点击查看, OmniLMM-3B (即MiniCPM-V) 可基于Mac MPS运行 (Apple silicon or AMD GPUs).
+点击查看 MiniCPM-V 2.0 基于 Mac MPS(Apple silicon 或 AMD GPU)运行的示例。

```python
# test.py
@@ -386,14 +564,14 @@ import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

-model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=torch.bfloat16)
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='mps', dtype=torch.float16)

-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)
model.eval()

-image = Image.open('./assets/worldmap_ck.jpg').convert('RGB')
-question = 'What is interesting about this image?'
+image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
+question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
@@ -413,7 +591,7 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py

### 手机端部署

-OmniLMM-3B (即MiniCPM-V) 目前可以部署在Android和Harmony操作系统的手机上。 🚀 点击[这里](https://github.com/OpenBMB/mlc-MiniCPM)开始手机端部署。
+MiniCPM-V 2.0 目前可以部署在 Android 和 Harmony 操作系统的手机上。🚀 点击[这里](https://github.com/OpenBMB/mlc-MiniCPM)开始手机端部署。

## 未来计划

@@ -424,7 +602,7 @@

-## 模型协议
+## 模型协议

本仓库中代码依照 Apache-2.0 协议开源

OmniLMM 模型权重对学术研究完全开放。

如需将模型用于商业用途,请联系 cpm@modelbest.cn 来获取书面授权,登记后可以免费商业使用。

-## 声明
+## 声明

-作为多模态大模型,OmniLMM 通过学习大量的多模态数据来生成内容,但它无法理解、表达个人观点或价值判断,它所输出的任何内容都不代表模型开发者的观点和立场。
+作为多模态大模型,MiniCPM-V 和 OmniLMM 通过学习大量的多模态数据来生成内容,但它们无法理解、表达个人观点或做出价值判断,它们所输出的任何内容都不代表模型开发者的观点和立场。

-因此用户在使用 OmniLMM 生成的内容时,应自行负责对其进行评估和验证。如果由于使用 OmniLMM 开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。
+因此用户在使用 MiniCPM-V 和 OmniLMM 生成的内容时,应自行负责对其进行评估和验证。如果由于使用 MiniCPM-V 和 OmniLMM 开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。

-## 机构
+## 机构

本项目由以下机构共同开发:

diff --git a/assets/eval_radar.png b/assets/eval_radar.png
deleted file mode 100644
index 43aaf6f..0000000
Binary files a/assets/eval_radar.png and /dev/null differ
diff --git a/assets/gif_cases/english_menu.gif b/assets/gif_cases/english_menu.gif
new file mode 100644
index 0000000..ef2c3f7
Binary files /dev/null and b/assets/gif_cases/english_menu.gif differ
diff --git a/assets/gif_cases/hong_kong_street.gif b/assets/gif_cases/hong_kong_street.gif
new file mode 100644
index 0000000..1b34082
Binary files /dev/null and b/assets/gif_cases/hong_kong_street.gif differ
diff --git a/assets/gif_cases/station.gif b/assets/gif_cases/station.gif
new file mode 100644
index 0000000..c0318ae
Binary files /dev/null and b/assets/gif_cases/station.gif differ
diff --git a/assets/hk_OCR.jpg b/assets/hk_OCR.jpg
new file mode 100644
index 0000000..d27d27c
Binary files /dev/null and b/assets/hk_OCR.jpg differ
diff --git a/assets/minicpmv-2-peformance.png b/assets/minicpmv-2-peformance.png
new file mode 100644
index 0000000..12c85bc
Binary files /dev/null and b/assets/minicpmv-2-peformance.png differ
diff --git a/assets/minicpmv2-cases.png b/assets/minicpmv2-cases.png
new file mode 100644
index 0000000..1bcf815
Binary files /dev/null and b/assets/minicpmv2-cases.png differ
diff --git a/assets/minicpmv2-cases_1.png b/assets/minicpmv2-cases_1.png
new file mode 100644
index 0000000..4751c2f
Binary files /dev/null and b/assets/minicpmv2-cases_1.png differ
diff --git a/assets/minicpmv2-cases_2.png b/assets/minicpmv2-cases_2.png
new file mode 100644
index 0000000..6ef6ab8
Binary files /dev/null and b/assets/minicpmv2-cases_2.png differ
diff --git a/assets/radar_omnilmm12b.png b/assets/radar_omnilmm12b.png
new file mode 100644
index 0000000..35a2601
Binary files /dev/null and b/assets/radar_omnilmm12b.png differ
diff --git a/minicpm_v1.md b/minicpm_v1.md
new file mode 100644
index 0000000..ecf1d98
--- /dev/null
+++ b/minicpm_v1.md
@@ -0,0 +1,210 @@
+## MiniCPM-V 1.0
+MiniCPM-V 1.0 is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Notable features of MiniCPM-V 1.0 include:
+
+- ⚡️ **High Efficiency.**
+
+  MiniCPM-V 1.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**.
In terms of visual encoding, we compress the image representations into 64 tokens via a perceiver resampler, which is significantly fewer than other LMMs based on MLP architecture (typically > 512 tokens). This allows MiniCPM-V 1.0 to operate with **much less memory cost and higher speed during inference**.
+
+- 🔥 **Promising Performance.**
+
+  MiniCPM-V 1.0 achieves **state-of-the-art performance** on multiple benchmarks (including MMMU, MME, MMBench, etc.) among models with comparable sizes, surpassing existing LMMs built on Phi-2. It even **achieves comparable or better performance than the 9.6B Qwen-VL-Chat**.
+
+- 🙌 **Bilingual Support.**
+
+  MiniCPM-V 1.0 is **the first end-deployable LMM supporting bilingual multimodal interaction in English and Chinese**. This is achieved by generalizing multimodal capabilities across languages, a technique from the ICLR 2024 spotlight [paper](https://arxiv.org/abs/2308.12038).
+
+### Evaluation
+
+| Model | Size | Visual Tokens | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
+|:--|:--|:--|:--|:--|:--|:--|:--|
+| LLaVA-Phi | 3B | 576 | 1335 | 59.8 | - | - | - |
+| MobileVLM | 3B | 144 | 1289 | 59.6 | - | - | - |
+| Imp-v1 | 3B | 576 | 1434 | 66.5 | - | - | - |
+| Qwen-VL-Chat | 9.6B | 256 | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
+| CogVLM | 17.4B | 1225 | 1438 | 63.7 | 53.8 | 32.1 | - |
+| MiniCPM-V 1.0 | 3B | 64 | 1452 | 67.9 | 65.3 | 37.2 | 32.1 |
+
+### Examples
+
+We deploy MiniCPM-V 1.0 on end devices. The demo video is a raw, unedited screen recording on a OnePlus 9R.
+
+## Install
+
+1. Clone this repository and navigate to the source folder
+
+```bash
+git clone https://github.com/OpenBMB/MiniCPM-V.git
+cd MiniCPM-V
+```
+
+2. Create conda environment
+
+```Shell
+conda create -n MiniCPM-V python=3.10 -y
+conda activate MiniCPM-V
+```
+
+3. Install dependencies
+
+```shell
+pip install -r requirements.txt
+```
+
+## Inference
+
+### Model Zoo
+| Model | Description | Download Link |
+|:----------------------|:-------------------|:---------------:|
+| MiniCPM-V 1.0 | The efficient version for end device deployment. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [](https://modelscope.cn/models/OpenBMB/MiniCPM-V/files) |
+
+### Multi-turn Conversation
+Please refer to the following code to run `MiniCPM-V 1.0`.
+
+```python
+from chat import OmniLMMChat, img2base64
+import json  # needed below to serialize the multi-turn message history
+
+chat_model = OmniLMMChat('openbmb/MiniCPM-V')
+
+im_64 = img2base64('./assets/worldmap_ck.jpg')
+
+# First round chat
+msgs = [{"role": "user", "content": "What is interesting about this image?"}]
+
+inputs = {"image": im_64, "question": json.dumps(msgs)}
+answer = chat_model.chat(inputs)
+print(answer)
+
+# Second round chat
+# pass history context of multi-turn conversation
+msgs.append({"role": "assistant", "content": answer})
+msgs.append({"role": "user", "content": "Where is China in the image?"})
+
+inputs = {"image": im_64, "question": json.dumps(msgs)}
+answer = chat_model.chat(inputs)
+print(answer)
+```
+
+### Inference on Mac
+
+Click to view an example of running MiniCPM-V 1.0 on Mac with MPS (Apple silicon or AMD GPUs).
+ +```python +# test.py +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=torch.bfloat16) +model = model.to(device='mps', dtype=torch.float16) + +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True) +model.eval() + +image = Image.open('./assets/worldmap_ck.jpg').convert('RGB') +question = 'What is interesting about this image?' +msgs = [{'role': 'user', 'content': question}] + +answer, context, _ = model.chat( + image=image, + msgs=msgs, + context=None, + tokenizer=tokenizer, + sampling=True +) +print(answer) +``` +Run with command: +```shell +PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py +``` + + +### Deployment on Mobile Phone + +Currently MiniCPM-V 1.0 can be deployed on mobile phones with Android and Harmony operating systems. 🚀 Try it out [here](https://github.com/OpenBMB/mlc-MiniCPM).