Compare commits: qyc-98-4.5 ... 2.6-sft (18 commits)

Commits: 6f2e885144, 8befa86b5e, ad1d40c802, 29c86db9a0, 3f0d6974cd, e225f19571, 3e0abd9c51, 16fd0f2cda, e7c73972f9, 5b6032c322, 5be4e3ec28, 7e439c97a6, 69aed84270, 6acf99fddf, 2c6a96f148, edf9a58fae, 3b3b9331cb, 7842ec1228
.vscode/settings.json (vendored, new file, +5 lines)

@@ -0,0 +1,5 @@
{
    "githubPullRequests.ignoredPullRequestBranches": [
        "main"
    ]
}
@@ -1,41 +0,0 @@

Version 1.0, June 5, 2024
© 2024 OpenBMB. All rights reserved.

## Part One: Preamble

We are open-sourcing the entire series of the globally leading MiniCPM edge-side large language models, including the flagship edge-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the world's most powerful edge-side multimodal models, the MiniCPM-V series. The above weights are fully open for all academic research. Commercial use is also permitted after completing a registration questionnaire. Community use of the MiniCPM series models must comply with Apache 2.0 and the "MiniCPM Model Community License Agreement."

Accordingly, you and the MiniCPM development team agree to the following "MiniCPM Model Community License Agreement":

## Part Two: Licensing and Redistribution

#### 1. Grant of Rights

You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license to use, copy, distribute, reproduce, create derivative works from, and modify the MiniCPM materials, in accordance with OpenBMB's intellectual property rights or other rights in the MiniCPM materials.

#### 2. Distribution and Redistribution

- If you distribute or provide the MiniCPM series model materials (or any derivative works thereof), or any product or service using them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you create, train, fine-tune, or improve an AI model using the MiniCPM series models, the model's name must include "MiniCPM".

- You must retain the following attribution statement in all distributed MiniCPM-related materials: "MiniCPM is licensed under the MiniCPM Model Community License, © OpenBMB Platforms, Inc. All rights reserved."

- Your use of the MiniCPM materials must comply with applicable laws and regulations and with the "MiniCPM Model Community License Agreement," which is incorporated into this agreement by reference.

- You may not use the MiniCPM series models or their outputs and results to improve any other large language model (other than MiniCPM or its derivatives).

#### 3. Additional Commercial Terms

If you or your affiliates deploy the model on no more than 5,000 edge-side devices, or provide applications with fewer than 1 million daily active users (DAU), you may apply to OpenBMB for permission and, after completing the registration questionnaire, may be allowed to use it commercially free of charge. Otherwise, please email cpm@modelbest.cn to apply for authorization from OpenBMB, which may decide at its sole discretion whether to grant authorization and determine its term and scope; until written authorization is granted, you may not exercise any commercial rights or use the model for any commercial purpose.

#### 4. Usage-based Restrictions

The restrictions set forth in Appendix A are considered usage-based restrictions. Accordingly, you may not use the model or its derivatives for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with its terms. Use includes creating any content, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You must require all users of the model or its derivatives to comply with the terms of this section.

## Part Three: Other Terms

#### 5. Trademarks and Related Rights

This license does not grant you the right to use the OpenBMB, 面壁智能 (ModelBest), or MiniCPM trademarks, trade names, or logos, or to otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.

#### 6. Disclaimer

Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplemental materials "as is," without warranties or conditions of any kind, express or implied, including but not limited to warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivatives, and the supplemental materials, and you assume any risks associated with exercising the permissions granted under this license.

## Appendix A: Usage Restrictions

You agree not to use the model or its derivatives for:

- Any use that violates applicable national or international laws or regulations, or infringes the legal rights and interests of any third party;
- Any military purpose;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiably false information and/or content with the intent to harm others;
- Generating or disseminating inappropriate content that fails to comply with applicable regulatory requirements;
- Generating or disseminating personally identifiable information without authorization, or making unreasonable use of it;
- Defaming, disparaging, or otherwise harassing others;
- Fully automated decision-making that adversely affects individuals' legal rights or creates or modifies binding, enforceable obligations;
- Any use intended to, or having the effect of, discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal characteristics;
- Exploiting the vulnerabilities of a specific group of people based on their age or social, physical, or psychological characteristics in a way that materially distorts the behavior of members of that group and causes, or is likely to cause, physical or psychological harm to them or others;
- Any use intended to, or having the effect of, discriminating against individuals or groups based on legally protected characteristics or categories.
@@ -1,43 +0,0 @@

Version 1.0, June 5, 2024
Copyright © 2024 OpenBMB

## Part One: Preamble

We are open-sourcing the entire series of the globally leading MiniCPM edge-side models, including the flagship edge-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the globally leading edge-side multimodal models of the MiniCPM-V series. The above weights are fully open for all academic research. Commercial use is also permitted after completing the registration questionnaire. Community use of the MiniCPM series models must comply with Apache 2.0 and the "MiniCPM Model Community License Agreement."

Accordingly, you and the MiniCPM development team agree to the following "MiniCPM Model Commercial License Agreement":

## Part Two: Licensing and Redistribution

#### 1. Grant of Rights

You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license to use, copy, distribute, reproduce, create derivative works from, and modify the MiniCPM materials, in accordance with the intellectual property rights or other rights that OpenBMB holds in the MiniCPM materials.

#### 2. Distribution and Redistribution

- If you distribute or provide the MiniCPM series model materials (or any derivative works thereof), or any product or service using them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you create, train, fine-tune, or improve an AI model using the MiniCPM series models, the model's name must include "MiniCPM".

- You must retain the following attribution statement in all distributed MiniCPM-related materials: "面壁MiniCPM is licensed under the MiniCPM Model Community License, © 面壁智能 Platforms, Inc. All rights reserved."

- Your use of the MiniCPM materials must comply with applicable laws and regulations and with the "MiniCPM Model Community License Agreement," which is incorporated into this agreement by reference.

- You may not use the MiniCPM series models or their outputs and results to improve any other large language model (other than MiniCPM or its derivatives).

#### 3. Additional Commercial Terms

If you or your affiliates deploy the model on edge-side devices and the deployment does not exceed 5,000 devices, or your application's daily active users (DAU) are below 1 million, you may apply to 面壁智能 (ModelBest) directly for permission and, after completing the registration questionnaire, may be allowed free commercial use. Otherwise, please email cpm@modelbest.cn to apply for authorization from 面壁智能, which may decide at its sole discretion whether to grant authorization and determine the term and scope of that authorization. Until written authorization is granted, you may not exercise any commercial rights or use the model for any commercial purpose.

#### 4. Usage-based Restrictions

The restrictions set forth in Appendix A are considered usage-based restrictions. Accordingly, you may not use the model or its derivative works for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with its terms. Use includes creating any content, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You must require all users of the model or its derivative works to comply with the terms of this section.

## Part Three: Other Terms

#### 5. Trademarks and Related Rights

This license does not grant you the right to use the OpenBMB, 面壁智能, or MiniCPM trademarks, trade names, or logos, or to otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.

#### 6. Disclaimer

Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplemental materials "as is," without warranties or conditions of any kind, express or implied, including but not limited to warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivative works, and the supplemental materials, and you assume any risks arising from exercising the rights granted under this license.

## Appendix A: Usage Restrictions

You agree not to use the model or its derivative works for:

- Any manner that violates applicable national or international laws or regulations, or infringes the lawful rights and interests of any third party;
- Any military purpose;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiably false information and/or content with the intent to harm others;
- Generating or disseminating inappropriate content that fails to comply with applicable regulatory requirements;
- Generating or disseminating personally identifiable information without authorization, or making unreasonable use of it;
- Defaming, disparaging, or otherwise harassing others;
- Fully automated decision-making that adversely affects individuals' legal rights or creates or modifies binding, enforceable obligations;
- Any use intended to, or having the effect of, discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal characteristics;
- Exploiting the vulnerabilities of a specific group based on their age or social, physical, or psychological characteristics in a way that materially distorts the behavior of members of that group and causes, or is likely to cause, physical or psychological harm to them or others;
- Any use intended to, or having the effect of, discriminating against individuals or groups based on legally protected characteristics or categories.
Changed files (summary)

- README_en.md: new file, 1693 lines
- README_zh.md: 2717 lines
- Binary asset changes under assets/: changes to assets/join.png and assets/radar.jpg; added images include assets/minicpm-v17.png, assets/minicpm-v18.png, assets/minicpm-v21.png, assets/minicpm-v21-2.png, assets/minicpm-v22.png, assets/minicpm-v23.png, assets/minicpm-v24.png, assets/minicpm-v25.png, assets/minicpm-v_wechat.png, and assets/minicpmv22.jpeg; several removed images are listed only by size.

@@ -1,3 +0,0 @@ (deleted MiniCPM-o logo snippet)
<span style="color:#56A7DA; font-size: 10em; font-weight: bold;">
MiniCPM-<span>o</span>
</span>
@@ -1,23 +0,0 @@

# MiniCPM-V Best Practices

**MiniCPM-V** is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image, video, and text as inputs and provide high-quality text output, aiming to achieve **strong performance and efficient deployment**. The most notable models in this series currently include MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.6. The following sections provide detailed tutorials and guidance for each version of the MiniCPM-V models.

## MiniCPM-V 2.6

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single-image, multi-image, and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet** in single-image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Thanks to its superior token density, MiniCPM-V 2.6 can, for the first time, support real-time video understanding on end-side devices such as the iPad.

* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/GeHMwLMa0i2FhUkV0f6cz3HWnV1)
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/YvsPwnPwWiqUjlkmW0scQ76TnBb)

## MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0.

* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/Kc7ywV4X1ipSaAkuPFOc9SFun8b)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/UpSiw63o9iGDhIklmwScX4a6nhW)
* [End-side Deployment](https://modelbest.feishu.cn/wiki/Lwr9wpOQdinr6AkLzHrc9LlgnJD)
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/LTOKw3Hz7il9kGkCLX9czsennKe)
* [HD Decoding Tutorial](https://modelbest.feishu.cn/wiki/Ug8iwdXfhiHVsDk2gGEco6xnnVg)
* [Model Structure](https://modelbest.feishu.cn/wiki/ACtAw9bOgiBQ9lkWyafcvtVEnQf)
@@ -1,22 +0,0 @@

# MiniCPM-V Best Practices (Chinese version)

**MiniCPM-V** is a series of end-side multimodal large models for vision-language understanding. The models accept image and text inputs and provide high-quality text output. Since February 2024 we have released five model versions, aiming for **leading performance and efficient deployment**. The most notable models in the series currently include:

## MiniCPM-V 2.6

The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, it **surpasses GPT-4V** in single-image, multi-image, and video understanding. In single-image understanding it outperforms commercial closed-source models such as **GPT-4o mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet**, and further improves on MiniCPM-Llama3-V 2.5's OCR, trustworthy behavior, multilingual support, and end-side deployment. Thanks to its leading visual token density, MiniCPM-V 2.6 is the first multimodal large model to support real-time video understanding on end-side devices such as the iPad.

* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/LZxLwp4Lzi29vXklYLFchwN5nCf)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/HvfLwYzlIihqzXkmeCdczs6onmd)
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/PAsHw6N6xiEy0DkJWpJcIocRnz9)

## MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct, with a total of 8 billion parameters. Its performance is significantly improved over MiniCPM-V 2.0.

* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/O0KTwQV5piUPzTkRXl9cSFyHnQb)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/MPkPwvONEiZm3BkWMnyc83Tin4d)
* [End-side Deployment](https://modelbest.feishu.cn/wiki/CZZJw1EDGitSSZka664cZwbWnrb)
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/BcHIwjOLGihJXCkkSdMc2WhbnZf)
* [HD Decoding Tutorial](https://modelbest.feishu.cn/wiki/L0ajwm8VAiiPY6kDZfJce3B7nRg)
* [Model Structure](https://modelbest.feishu.cn/wiki/X15nwGzqpioxlikbi2RcXDpJnjd)
@@ -1,446 +0,0 @@

# Best Practice with LLaMA-Factory

## Contents <!-- omit in toc -->

- [Support Models](#support-models)
- [LLaMA-Factory Installation](#llama-factory-installation)
- [Dataset Prepare](#dataset-prepare)
  - [Image Dataset](#image-dataset)
  - [Video Dataset](#video-dataset)
  - [Audio Dataset](#audio-dataset)
- [Lora Fine-Tuning](#lora-fine-tuning)
- [Full Parameters Fine-Tuning](#full-parameters-fine-tuning)
- [Inference](#inference)

## Support Models

* [openbmb/MiniCPM-V-4](https://huggingface.co/openbmb/MiniCPM-V-4)
* [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6)
* [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)

## LLaMA-Factory Installation

You can install LLaMA-Factory with the commands below.

```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,minicpm_v]"
mkdir configs # let's put all yaml files here
```
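A quick way to confirm that the editable install registered the command-line entry point is to print its version banner (this assumes `llamafactory-cli` ended up on your PATH after the `pip install -e` step above):

```shell
# Should print the installed LLaMA-Factory version and a usage banner
llamafactory-cli version
```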
## Dataset Prepare

Refer to [data/dataset_info.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) to register your customized dataset. As examples, we use the three existing demo datasets `mllm_demo`, `mllm_video_demo`, and `mllm_audio_demo` (audio is only for MiniCPM-o-2.6); a sketch of a custom registration entry is shown right after this paragraph.
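If you later register your own image dataset instead of the bundled demos, its entry in `data/dataset_info.json` might look like the sketch below. The key names follow the sharegpt-style schema used by the `mllm_demo` entry, but the dataset name `my_image_sft` and its file are hypothetical, and the exact fields should be checked against the upstream `dataset_info.json`.

```json
"my_image_sft": {
  "file_name": "my_image_sft.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant"
  }
}
```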
### Image Dataset

Refer to the image SFT demo data: [data/mllm_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_demo.json)

<details>
<summary><b>data/mllm_demo.json</b></summary>

```json
[
  {
    "messages": [
      {"content": "<image>Who are they?", "role": "user"},
      {"content": "They're Kane and Gretzka from Bayern Munich.", "role": "assistant"},
      {"content": "What are they doing?", "role": "user"},
      {"content": "They are celebrating on the soccer field.", "role": "assistant"}
    ],
    "images": ["mllm_demo_data/1.jpg"]
  },
  {
    "messages": [
      {"content": "<image>Who is he?", "role": "user"},
      {"content": "He's Thomas Muller from Bayern Munich.", "role": "assistant"},
      {"content": "Why is he on the ground?", "role": "user"},
      {"content": "Because he's sliding on his knees to celebrate.", "role": "assistant"}
    ],
    "images": ["mllm_demo_data/2.jpg"]
  },
  {
    "messages": [
      {"content": "<image>Please describe this image", "role": "user"},
      {"content": "Chinese astronaut Gui Haichao is giving a speech.", "role": "assistant"},
      {"content": "What has he accomplished?", "role": "user"},
      {"content": "He was appointed to be a payload specialist on Shenzhou 16 mission in June 2022, thus becoming the first Chinese civilian of Group 3 in space on 30 May 2023. He is responsible for the on-orbit operation of space science experimental payloads.", "role": "assistant"}
    ],
    "images": ["mllm_demo_data/3.jpg"]
  }
]
```

</details>

### Video Dataset

Refer to the video SFT demo data: [data/mllm_video_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_video_demo.json)

<details>
<summary><b>data/mllm_video_demo.json</b></summary>

```json
[
  {
    "messages": [
      {"content": "<video>Why is this video funny?", "role": "user"},
      {"content": "Because a baby is reading, and he is so cute!", "role": "assistant"}
    ],
    "videos": ["mllm_demo_data/1.mp4"]
  },
  {
    "messages": [
      {"content": "<video>What is she doing?", "role": "user"},
      {"content": "She is cooking.", "role": "assistant"}
    ],
    "videos": ["mllm_demo_data/2.avi"]
  },
  {
    "messages": [
      {"content": "<video>What's in the video?", "role": "user"},
      {"content": "A baby is playing in the living room.", "role": "assistant"}
    ],
    "videos": ["mllm_demo_data/3.mp4"]
  }
]
```

</details>

### Audio Dataset

Refer to the audio SFT demo data: [data/mllm_audio_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_audio_demo.json)

<details>
<summary><b>data/mllm_audio_demo.json</b></summary>

```json
[
  {
    "messages": [
      {"content": "<audio>What's that sound?", "role": "user"},
      {"content": "It is the sound of glass shattering.", "role": "assistant"}
    ],
    "audios": ["mllm_demo_data/1.mp3"]
  },
  {
    "messages": [
      {"content": "<audio>What can you hear?", "role": "user"},
      {"content": "A woman is coughing.", "role": "assistant"}
    ],
    "audios": ["mllm_demo_data/2.wav"]
  },
  {
    "messages": [
      {"content": "<audio>What does the person say?", "role": "user"},
      {"content": "Mister Quiller is the apostle of the middle classes and we are glad to welcome his gospel.", "role": "assistant"}
    ],
    "audios": ["mllm_demo_data/3.flac"]
  }
]
```

</details>
## Lora Fine-Tuning

LoRA SFT can be launched with a single command:

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train configs/minicpmo_2_6_lora_sft.yaml
```

<details>
<summary><b>configs/minicpmo_2_6_lora_sft.yaml</b></summary>

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj

### dataset
dataset: mllm_demo # mllm_demo mllm_video_demo mllm_audio_demo
template: minicpm_o # minicpm_o minicpm_v
cutoff_len: 3072
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/minicpmo_2_6/lora/sft
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
save_only_model: true

### eval
do_eval: false
```

</details>
### Lora Model Export

Export the LoRA-merged model with one command:

```shell
llamafactory-cli export configs/minicpmo_2_6_lora_export.yaml
```

<details>
<summary><b>configs/minicpmo_2_6_lora_export.yaml</b></summary>

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
adapter_name_or_path: saves/minicpmo_2_6/lora/sft
template: minicpm_o # minicpm_o minicpm_v
finetuning_type: lora
trust_remote_code: true

### export
export_dir: models/minicpmo_2_6_lora_sft
export_size: 2
export_device: cpu
export_legacy_format: false
```

</details>
## Full Parameters Fine-Tuning

Full-parameter SFT can be launched with a single command (the DeepSpeed ZeRO-2 config referenced in the YAML is sketched after the details block):

```shell
llamafactory-cli train configs/minicpmo_2_6_full_sft.yaml
```

<details>
<summary><b>configs/minicpmo_2_6_full_sft.yaml</b></summary>

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
trust_remote_code: true
freeze_vision_tower: true
print_param_status: true
flash_attn: fa2

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: configs/deepspeed/ds_z2_config.json

### dataset
dataset: mllm_demo # mllm_demo mllm_video_demo
template: minicpm_o # minicpm_o minicpm_v
cutoff_len: 3072
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/minicpmo_2_6/full/sft
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
save_only_model: true

### eval
do_eval: false
```

</details>
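The full-parameter recipe above points at `configs/deepspeed/ds_z2_config.json`, which is not reproduced in this compare view. A minimal ZeRO-2 sketch is shown below; the `auto` placeholders follow the convention of LLaMA-Factory's bundled DeepSpeed configs (values are filled in from the training arguments), but treat the exact fields as assumptions and adapt them to your cluster before use.

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}
```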
## Inference

### Web UI ChatBox

Refer to the [LLaMA-Factory doc](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#inferring-lora-fine-tuned-models) for more inference usage.

For example, the web chat UI can be started with a single command (an OpenAI-compatible API server can be launched in a similar way; see the sketch at the end of this subsection):

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat configs/minicpmo_2_6_infer.yaml
```

<details>
<summary><b>configs/minicpmo_2_6_infer.yaml</b></summary>

```yaml
model_name_or_path: saves/minicpmo_2_6/full/sft
template: minicpm_o # minicpm_o minicpm_v
infer_backend: huggingface
trust_remote_code: true
```

</details>
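Besides the Gradio web chat, the same inference YAML can back an OpenAI-compatible HTTP endpoint. The sketch below assumes the `llamafactory-cli api` subcommand and the `API_PORT` environment variable described in the upstream LLaMA-Factory examples; verify both against the version you installed.

```shell
# Serve the fine-tuned model behind an OpenAI-compatible API (assumed subcommand)
API_PORT=8000 CUDA_VISIBLE_DEVICES=0 llamafactory-cli api configs/minicpmo_2_6_infer.yaml
```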
### Official Code

You can also run inference with the official code:

<details>
<summary><b>official inference code</b></summary>

```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "saves/minicpmo_2_6/full/sft"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('data/mllm_demo_data/1.jpg').convert('RGB')
question = 'Who are they?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)
```

</details>
@@ -1,333 +0,0 @@

## MiniCPM-Llama3-V 2.5

> Archived at: 2025-01-13

**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

- 🔥 **Leading Performance.**
  MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.

- 💪 **Strong OCR Capabilities.**
  MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.

- 🏆 **Trustworthy Behavior.**
  Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best-level performance within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).

- 🌏 **Multilingual Support.**
  Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean, etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).

- 🚀 **Efficient Deployment.**
  MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.

- 💫 **Easy Usage.**
  MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).

### Evaluation <!-- omit in toc -->

<div align="center">
<img src=../assets/MiniCPM-Llama3-V-2.5-peformance.png width=66% />
</div>

<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.</summary>

| Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB test (en) | MMB test (cn) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| **Proprietary** | | | | | | | | | | | | | |
| Gemini Pro | - | 680 | 74.6 | 88.1 | 62.9 | 2148.9 | 73.6 | 74.3 | 48.9 | 45.8 | 79.9 | 60.4 | - |
| GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.5 | 1771.5 | 77.0 | 74.4 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
| **Open-source** | | | | | | | | | | | | | |
| Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | - | - | 31.7 | - | - | - | - |
| Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 51.6 | 1860.0 | 61.8 | 56.3 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 |
| DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 54.6 | 1765.4 | 73.8 | 71.4 | 38.3 | 36.8 | 77.8 | 54.2 | - |
| Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.2 | **2050.2** | 72.4 | 70.7 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 |
| CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 54.2 | 1736.6 | 65.8 | 55.9 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 |
| TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
| Idefics2 | 8.0B | - | 73.0 | 74.0 | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
| Bunny-LLama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
| LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | - | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
| Phi-3-vision-128k-instruct | 4.2B | 639* | 70.9 | - | - | 1537.5* | - | - | 40.4 | 44.5 | 64.2* | 58.8* | - |
| MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.5 | 1650.2 | 64.1 | 62.6 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 |
| MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 54.5 | 1808.6 | 69.1 | 66.5 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 |
| MiniCPM-Llama3-V 2.5 | 8.5B | **725** | **76.6** | **84.8** | **65.1** | 2024.6 | **77.2** | **74.2** | **45.8** | **54.3** | **86.7** | **63.5** | **89.7** |

* We evaluate the officially released checkpoint by ourselves.

</details>
<div align="center">
|
||||
<img src="../assets/llavabench_compare_3.png" width="100%" />
|
||||
<br>
|
||||
Evaluation results of multilingual LLaVA Bench
|
||||
</div>
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
|
||||
<table align="center" >
|
||||
<p align="center" >
|
||||
<img src="../assets/minicpmv-llama3-v2.5/cases_all.png" />
|
||||
</p>
|
||||
</table>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### Model Zoo
|
||||
|
||||
| Model | Device | Memory |          Description | Download |
|
||||
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
|
||||
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
|
||||
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
|
||||
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
|
||||
@@ -1,299 +0,0 @@

## MiniCPM-V 2.0

> Archived at: 2025-01-13

**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.

- 🔥 **State-of-the-art Performance.**

  MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc.) among models under 7B parameters. It even **outperforms the strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **comparable performance to Gemini Pro in scene-text understanding**, and **state-of-the-art performance on OCRBench** among open-source models.

- 🏆 **Trustworthy Behavior.**

  LMMs are known to suffer from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench.

- 🌟 **High-Resolution Images at Any Aspect Ratio.**

  MiniCPM-V 2.0 can accept **1.8 million pixel (e.g., 1344x1344) images at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf).

- ⚡️ **High Efficiency.**

  MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into far fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference, even when dealing with high-resolution images**.

- 🙌 **Bilingual Support.**

  MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].

### Evaluation <!-- omit in toc -->

<div align="center">
<img src=../assets/minicpmv-2-peformance.png width=66% />
</div>

<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench.</summary>

| Model | Size | TextVQA val | DocVQA test | OCRBench | OpenCompass | MME | MMB dev (en) | MMB dev (zh) | MMMU val | MathVista | LLaVA Bench | Object HalBench |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| **Proprietary models** | | | | | | | | | | | | |
| Gemini Pro Vision | - | 74.6 | 88.1 | 680 | 63.8 | 2148.9 | 75.2 | 74.0 | 48.9 | 45.8 | 79.9 | - |
| GPT-4V | - | 78.0 | 88.4 | 645 | 63.2 | 1771.5 | 75.1 | 75.0 | 53.8 | 47.8 | 93.1 | 86.4 / 92.7 |
| **Open-source models 6B~34B** | | | | | | | | | | | | |
| Yi-VL-6B | 6.7B | 45.5* | 17.1* | 290 | 49.3 | 1915.1 | 68.6 | 68.3 | 40.3 | 28.8 | 51.9 | - |
| Qwen-VL-Chat | 9.6B | 61.5 | 62.6 | 488 | 52.1 | 1860.0 | 60.6 | 56.7 | 37.0 | 33.8 | 67.7 | 56.2 / 80.0 |
| Yi-VL-34B | 34B | 43.4* | 16.9* | 290 | 52.6 | 2050.2 | 71.1 | 71.4 | 45.1 | 30.7 | 62.3 | - |
| DeepSeek-VL-7B | 7.3B | 64.7* | 47.0* | 435 | 55.6 | 1765.4 | 74.1 | 72.8 | 38.3 | 36.8 | 77.8 | - |
| TextMonkey | 9.7B | 64.3 | 66.7 | 558 | - | - | - | - | - | - | - | - |
| CogVLM-Chat | 17.4B | 70.4 | 33.3* | 590 | 52.5 | 1736.6 | 63.7 | 53.8 | 37.3 | 34.7 | 73.9 | 73.6 / 87.4 |
| **Open-source models 1B~3B** | | | | | | | | | | | | |
| DeepSeek-VL-1.3B | 1.7B | 58.4* | 37.9* | 413 | 46.0 | 1531.6 | 64.0 | 61.2 | 33.8 | 29.4 | 51.1 | - |
| MobileVLM V2 | 3.1B | 57.5 | 19.4* | - | - | 1440.5(P) | 63.2 | - | - | - | - | - |
| Mini-Gemini | 2.2B | 56.2 | 34.2* | - | - | 1653.0 | 59.8 | - | 31.7 | - | - | - |
| MiniCPM-V | 2.8B | 60.6 | 38.2 | 366 | 47.6 | 1650.2 | 67.9 | 65.3 | **38.3** | 28.9 | 51.3 | 78.4 / 88.5 |
| **MiniCPM-V 2.0** | 2.8B | **74.1** | **71.9** | **605** | **55.0** | **1808.6** | **69.6** | **68.1** | 38.2 | **38.7** | **69.2** | **85.5 / 92.2** |

* We evaluate the officially released checkpoint by ourselves.

</details>
### Examples <!-- omit in toc -->

<table align="center">
<p align="center">
<img src="../assets/minicpmv2-cases_2.png" width=95%/>
</p>
</table>

We deploy MiniCPM-V 2.0 on end devices. The demo video is a raw screen recording on a Xiaomi 14 Pro, without editing.

<table align="center">
<p align="center">
<img src="../assets/gif_cases/station.gif" width=36%/>
<img src="../assets/gif_cases/london_car.gif" width=36%/>
</p>
</table>

### Model Zoo

| Model | Device | Memory | Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |

### Deployment on Mobile Phone

MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click [MiniCPM-V 2.0](https://github.com/OpenBMB/mlc-MiniCPM) to install the APK.
@@ -1,945 +0,0 @@

## MiniCPM-V 2.6

> Archived at: 2025-01-13

**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

- 🔥 **Leading Performance.**
  MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.

- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability (see the multi-image sketch after this list).

- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.

- 💪 **Strong OCR Capability and Others.**
  MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
  Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.

- 🚀 **Superior Efficiency.**
  In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.

- 💫 **Easy Usage.**
  MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
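As a quick illustration of the multi-image interface mentioned in the list above, the sketch below extends the single-image `model.chat` call from the LLaMA-Factory "Official Code" example earlier in this diff to two images, loading `model` and `tokenizer` the same way but from an openbmb/MiniCPM-V-2_6 checkpoint. The exact `chat` signature can differ between checkpoint revisions, so treat this as a sketch rather than the definitive API.

```python
# Sketch: multi-image conversation with MiniCPM-V 2.6, reusing `model` and
# `tokenizer` objects loaded as in the earlier inference example.
from PIL import Image

image1 = Image.open('image1.jpg').convert('RGB')  # hypothetical local files
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare these two images and describe the differences.'

# The content list may interleave several images with the text prompt.
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```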
### Evaluation <!-- omit in toc -->

<div align="center">
<img src=../assets/radar_final.png width=66% />
</div>

<details>
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench.</summary>

<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Token Density<sup>+</sup></th>
|
||||
<th>OpenCompass</th>
|
||||
<th>MME</th>
|
||||
<th>MMVet</th>
|
||||
<th>OCRBench</th>
|
||||
<th>MMMU val</th>
|
||||
<th>MathVista mini</th>
|
||||
<th>MMB1.1 test</th>
|
||||
<th>AI2D</th>
|
||||
<th>TextVQA val</th>
|
||||
<th>DocVQA test</th>
|
||||
<th>HallusionBench</th>
|
||||
<th>Object HalBench</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="15" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>69.9</td>
|
||||
<td>2328.7</td>
|
||||
<td>69.1</td>
|
||||
<td>736</td>
|
||||
<td>69.2</td>
|
||||
<td>61.3</td>
|
||||
<td>82.2</td>
|
||||
<td>84.6</td>
|
||||
<td>-</td>
|
||||
<td>92.8</td>
|
||||
<td>55.0</td>
|
||||
<td>17.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>750</td>
|
||||
<td>67.9</td>
|
||||
<td>1920.0</td>
|
||||
<td>66.0</td>
|
||||
<td>788</td>
|
||||
<td>65.9</td>
|
||||
<td>61.6</td>
|
||||
<td>78.5</td>
|
||||
<td>80.2</td>
|
||||
<td>-</td>
|
||||
<td>95.2</td>
|
||||
<td>49.9</td>
|
||||
<td>13.8</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>64.4</td>
|
||||
<td>2110.6</td>
|
||||
<td>64.0</td>
|
||||
<td>754</td>
|
||||
<td>60.6</td>
|
||||
<td>57.7</td>
|
||||
<td>73.9</td>
|
||||
<td>79.1</td>
|
||||
<td>73.5</td>
|
||||
<td>86.5</td>
|
||||
<td>45.6</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o mini</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>64.1</td>
|
||||
<td>2003.4</td>
|
||||
<td>66.9</td>
|
||||
<td>785</td>
|
||||
<td>60.0</td>
|
||||
<td>52.4</td>
|
||||
<td>76.0</td>
|
||||
<td>77.8</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>46.1</td>
|
||||
<td>12.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>63.5</td>
|
||||
<td>2070.2</td>
|
||||
<td>67.5</td>
|
||||
<td>656</td>
|
||||
<td>61.7</td>
|
||||
<td>54.7</td>
|
||||
<td>79.8</td>
|
||||
<td>78.6</td>
|
||||
<td>78.0</td>
|
||||
<td>87.2</td>
|
||||
<td>43.9</td>
|
||||
<td>14.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Step-1V</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>59.5</td>
|
||||
<td>2206.4</td>
|
||||
<td>63.3</td>
|
||||
<td>625</td>
|
||||
<td>49.9</td>
|
||||
<td>44.8</td>
|
||||
<td>78.0</td>
|
||||
<td>79.2</td>
|
||||
<td>71.6</td>
|
||||
<td>-</td>
|
||||
<td>48.4</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
|
||||
<td>-</td>
|
||||
<td>784</td>
|
||||
<td>58.3</td>
|
||||
<td>2281.7</td>
|
||||
<td>61.8</td>
|
||||
<td>684</td>
|
||||
<td>52.0</td>
|
||||
<td>43.4</td>
|
||||
<td>74.6</td>
|
||||
<td>75.7</td>
|
||||
<td>79.5</td>
|
||||
<td>93.1</td>
|
||||
<td>41.2</td>
|
||||
<td>13.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="15" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
|
||||
<td>34B</td>
|
||||
<td>157</td>
|
||||
<td>55.0</td>
|
||||
<td>2006.5</td>
|
||||
<td>50.7</td>
|
||||
<td>574</td>
|
||||
<td>48.8</td>
|
||||
<td>40.4</td>
|
||||
<td>77.8</td>
|
||||
<td>78.9</td>
|
||||
<td>69.3</td>
|
||||
<td>-</td>
|
||||
<td>34.8</td>
|
||||
<td>12.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
|
||||
<td>34B</td>
|
||||
<td>157</td>
|
||||
<td>-</td>
|
||||
<td>2141.0</td>
|
||||
<td>59.3</td>
|
||||
<td>518</td>
|
||||
<td>48.0</td>
|
||||
<td>43.3</td>
|
||||
<td>-</td>
|
||||
<td>80.5</td>
|
||||
<td>74.1</td>
|
||||
<td>78.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Cambrian-34B</td>
|
||||
<td>34B</td>
|
||||
<td>1820</td>
|
||||
<td>58.3</td>
|
||||
<td>2049.9</td>
|
||||
<td>53.2</td>
|
||||
<td>591</td>
|
||||
<td>50.4</td>
|
||||
<td>50.3</td>
|
||||
<td>77.8</td>
|
||||
<td>79.5</td>
|
||||
<td>76.7</td>
|
||||
<td>75.5</td>
|
||||
<td>41.6</td>
|
||||
<td>14.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
|
||||
<td>13B</td>
|
||||
<td>784</td>
|
||||
<td>59.1</td>
|
||||
<td>2018.8</td>
|
||||
<td>58.0</td>
|
||||
<td>776</td>
|
||||
<td>46.9</td>
|
||||
<td>51.1</td>
|
||||
<td>67.9</td>
|
||||
<td>71.2</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>45.0</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>706</td>
|
||||
<td>64.1</td>
|
||||
<td>2215.1</td>
|
||||
<td>54.3</td>
|
||||
<td>794</td>
|
||||
<td><strong>51.2</strong></td>
|
||||
<td>58.3</td>
|
||||
<td><strong>79.4</strong></td>
|
||||
<td><strong>83.6</strong></td>
|
||||
<td>77.4</td>
|
||||
<td><strong>91.6</strong></td>
|
||||
<td>45.0</td>
|
||||
<td>21.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
|
||||
<td>8B</td>
|
||||
<td>1882</td>
|
||||
<td>58.8</td>
|
||||
<td>2024.6</td>
|
||||
<td>52.8</td>
|
||||
<td>725</td>
|
||||
<td>45.8</td>
|
||||
<td>54.3</td>
|
||||
<td>72.0</td>
|
||||
<td>78.4</td>
|
||||
<td>76.6</td>
|
||||
<td>84.8</td>
|
||||
<td>42.4</td>
|
||||
<td>10.3</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>2822</strong></td>
|
||||
<td><strong>65.2</strong></td>
|
||||
<td><strong>2348.4</strong>*</td>
|
||||
<td><strong>60.0</strong></td>
|
||||
<td><strong>852</strong>*</td>
|
||||
<td>49.8*</td>
|
||||
<td><strong>60.6</strong></td>
|
||||
<td>78.0</td>
|
||||
<td>82.1</td>
|
||||
<td><strong>80.1</strong></td>
|
||||
<td>90.8</td>
|
||||
<td><strong>48.1</strong>*</td>
|
||||
<td><strong>8.2</strong></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
|
||||
|
||||
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
|
||||
|
||||
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
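
As a concrete check of the definition above, the short sketch below recomputes the token density reported for MiniCPM-V 2.6 from numbers quoted elsewhere in this document (a roughly 1.8M-pixel maximum input such as 1344x1344 encoded into 640 visual tokens); the helper function is ours for illustration and not part of any released API.

```python
def token_density(max_pixels: int, num_visual_tokens: int) -> float:
    """Token density = # pixels at maximum resolution / # visual tokens."""
    return max_pixels / num_visual_tokens

# MiniCPM-V 2.6: a 1344x1344 (~1.8M pixel) image is encoded into 640 visual tokens,
# which reproduces the ~2822 pixels/token figure reported in the table above.
print(token_density(1344 * 1344, 640))  # 2822.4
```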
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Mantis Eval</th>
|
||||
<th>BLINK val</th>
|
||||
<th>Mathverse mv</th>
|
||||
<th>Sciverse mv</th>
|
||||
<th>MIRB</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="7" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>62.7</td>
|
||||
<td>54.6</td>
|
||||
<td>60.3</td>
|
||||
<td>66.9</td>
|
||||
<td>53.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
|
||||
<td>14B</td>
|
||||
<td>66.4</td>
|
||||
<td>52.6</td>
|
||||
<td>32.7</td>
|
||||
<td>30.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="7" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Emu2-Chat</td>
|
||||
<td>37B</td>
|
||||
<td>37.8</td>
|
||||
<td>36.2</td>
|
||||
<td>-</td>
|
||||
<td>27.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CogVLM</td>
|
||||
<td>17B</td>
|
||||
<td>45.2</td>
|
||||
<td>41.1</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VPG-C</td>
|
||||
<td>7B</td>
|
||||
<td>52.4</td>
|
||||
<td>43.1</td>
|
||||
<td>24.3</td>
|
||||
<td>23.1</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VILA 8B</td>
|
||||
<td>8B</td>
|
||||
<td>51.2</td>
|
||||
<td>39.3</td>
|
||||
<td>-</td>
|
||||
<td>36.5</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
|
||||
<td>8B</td>
|
||||
<td>53.1*</td>
|
||||
<td>48.9</td>
|
||||
<td>32.1*</td>
|
||||
<td>-</td>
|
||||
<td>42.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>59.0*</td>
|
||||
<td>50.9</td>
|
||||
<td>30.5*</td>
|
||||
<td>34.4*</td>
|
||||
<td><strong>56.9*</strong></td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>69.1</strong></td>
|
||||
<td><strong>53.0</strong></td>
|
||||
<td><strong>84.9</strong></td>
|
||||
<td><strong>74.9</strong></td>
|
||||
<td>53.8</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
* We evaluate the officially released checkpoint by ourselves.
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th colspan="2">Video-MME</th>
|
||||
<th colspan="5">Video-ChatGPT</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left"></th>
|
||||
<th></th>
|
||||
<th>w/o subs</th>
|
||||
<th>w subs</th>
|
||||
<th>Correctness</th>
|
||||
<th>Detail</th>
|
||||
<th>Context</th>
|
||||
<th>Temporal</th>
|
||||
<th>Consistency</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="9" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>60.0</td>
|
||||
<td>62.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>59.9</td>
|
||||
<td>63.3</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="9" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
|
||||
<td>7B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.39</td>
|
||||
<td>3.29</td>
|
||||
<td>3.92</td>
|
||||
<td>2.60</td>
|
||||
<td>3.12</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
|
||||
<td>34B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.29</td>
|
||||
<td>3.23</td>
|
||||
<td>3.83</td>
|
||||
<td>2.51</td>
|
||||
<td>3.47</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
|
||||
<td>12B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.49</td>
|
||||
<td><strong>3.46</strong></td>
|
||||
<td>3.23</td>
|
||||
<td><strong>2.98</strong></td>
|
||||
<td><strong>3.64</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LongVA</td>
|
||||
<td>7B</td>
|
||||
<td>52.4</td>
|
||||
<td>54.3</td>
|
||||
<td>3.05</td>
|
||||
<td>3.09</td>
|
||||
<td>3.77</td>
|
||||
<td>2.44</td>
|
||||
<td><strong>3.64</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>54.0</td>
|
||||
<td>56.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
|
||||
<td>8B</td>
|
||||
<td>55.8</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
|
||||
<td>32B</td>
|
||||
<td>60.2</td>
|
||||
<td>63.0</td>
|
||||
<td>3.48</td>
|
||||
<td>3.37</td>
|
||||
<td><strong>3.95</strong></td>
|
||||
<td>2.64</td>
|
||||
<td>3.28</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>60.9</strong></td>
|
||||
<td><strong>63.6</strong></td>
|
||||
<td><strong>3.59</strong></td>
|
||||
<td>3.28</td>
|
||||
<td>3.93</td>
|
||||
<td>2.73</td>
|
||||
<td>3.62</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Shot</th>
|
||||
<th>TextVQA val</th>
|
||||
<th>VizWiz test-dev</th>
|
||||
<th>VQAv2 test-dev</th>
|
||||
<th>OK-VQA val</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
|
||||
<td rowspan="3">80B</td>
|
||||
<td>0*</td>
|
||||
<td>35.0</td>
|
||||
<td>31.6</td>
|
||||
<td>56.3</td>
|
||||
<td>40.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>36.5</td>
|
||||
<td>39.6</td>
|
||||
<td>63.1</td>
|
||||
<td><strong>57.4</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>37.3</td>
|
||||
<td>44.8</td>
|
||||
<td>65.6</td>
|
||||
<td>57.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
|
||||
<td rowspan="3">80B</td>
|
||||
<td>0*</td>
|
||||
<td>30.9</td>
|
||||
<td>36.0</td>
|
||||
<td>60.0</td>
|
||||
<td>45.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>34.3</td>
|
||||
<td>40.4</td>
|
||||
<td>63.6</td>
|
||||
<td>52.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>35.7</td>
|
||||
<td>46.1</td>
|
||||
<td>64.8</td>
|
||||
<td>55.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
|
||||
<td rowspan="3">7B</td>
|
||||
<td>0*</td>
|
||||
<td>43.0</td>
|
||||
<td>49.8</td>
|
||||
<td>63.2</td>
|
||||
<td>45.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>45.4</td>
|
||||
<td>51.3</td>
|
||||
<td>64.5</td>
|
||||
<td>46.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>45.6</td>
|
||||
<td>52.2</td>
|
||||
<td>64.7</td>
|
||||
<td>46.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
|
||||
<td rowspan="3">37B</td>
|
||||
<td>0</td>
|
||||
<td>26.4</td>
|
||||
<td>40.4</td>
|
||||
<td>33.5</td>
|
||||
<td>26.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>48.2</td>
|
||||
<td>54.6</td>
|
||||
<td>67.0</td>
|
||||
<td>53.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>49.3</td>
|
||||
<td>54.7</td>
|
||||
<td>67.8</td>
|
||||
<td>54.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
|
||||
<td rowspan="2">30B</td>
|
||||
<td>0</td>
|
||||
<td>26.2</td>
|
||||
<td>40.4</td>
|
||||
<td>48.9</td>
|
||||
<td>26.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>49.3</td>
|
||||
<td>54.7</td>
|
||||
<td><strong>70.9</strong></td>
|
||||
<td>54.1</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
|
||||
<td rowspan="3">8B</td>
|
||||
<td>0</td>
|
||||
<td>43.9</td>
|
||||
<td>33.8</td>
|
||||
<td>45.4</td>
|
||||
<td>23.9</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td>4</td>
|
||||
<td>63.6</td>
|
||||
<td>60.5</td>
|
||||
<td>65.5</td>
|
||||
<td>50.1</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td>8</td>
|
||||
<td><strong>64.6</strong></td>
|
||||
<td><strong>63.4</strong></td>
|
||||
<td>68.2</td>
|
||||
<td>51.4</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
</div>
|
||||
* denotes zero image shot and two additional text shots following Flamingo.
|
||||
|
||||
<sup>+</sup> We evaluate the pretraining ckpt without SFT.
|
||||
</details>
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
|
||||
</div>
|
||||
<details>
|
||||
<summary>Click to view more cases.</summary>
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
|
||||
</div>
|
||||
</details>
|
||||
|
||||
We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro without editing.
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/gif_cases/ai.gif" width=32%/>
|
||||
|
||||
<img src="../assets/gif_cases/beer.gif" width=32%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/gif_cases/ticket.gif" width=32%/>
|
||||
|
||||
<img src="../assets/gif_cases/wfh.gif" width=32%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
|
||||
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
|
||||
</p>
|
||||
</table>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
|
||||
### Multi-turn Conversation
|
||||
|
||||
|
||||
<div align="center">
|
||||
<img src="../assets/airplane.jpeg" width="500px">
|
||||
</div>
|
||||
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
torch.manual_seed(0)
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
image = Image.open('./assets/airplane.jpeg').convert('RGB')
|
||||
|
||||
# First round chat
|
||||
question = "Tell me the model of this aircraft."
|
||||
msgs = [{'role': 'user', 'content': [image, question]}]
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
|
||||
# Second round chat
|
||||
# pass history context of multi-turn conversation
|
||||
msgs.append({"role": "assistant", "content": [answer]})
|
||||
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
|
||||
You could get the following output:
|
||||
|
||||
```
|
||||
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
|
||||
|
||||
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
|
||||
```
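
If you prefer token-by-token output, the same `chat` call can be run in streaming mode. The sketch below is a hedged example: it assumes the remote-code `chat` method accepts `sampling=True` and `stream=True` and then yields text chunks (if your checkpoint's `chat` signature differs, drop these flags), and it reuses `model`, `tokenizer`, and `msgs` from the snippet above.

```python
# Streaming generation (assumption: chat() supports sampling=True / stream=True
# and yields text chunks). `model`, `tokenizer`, `msgs` come from the example above.
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```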
|
||||
|
||||
#### Multi-image Understanding
|
||||
<details>
|
||||
<summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
image1 = Image.open('image1.jpg').convert('RGB')
|
||||
image2 = Image.open('image2.jpg').convert('RGB')
|
||||
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
|
||||
|
||||
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
</details>
|
||||
|
||||
#### Few-shot In-Context-Learning
|
||||
|
||||
<details>
|
||||
<summary> Click to view a Python example of MiniCPM-V 2.6 few-shot in-context learning </summary>
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
question = "production date"
|
||||
image1 = Image.open('example1.jpg').convert('RGB')
|
||||
answer1 = "2023.08.04"
|
||||
image2 = Image.open('example2.jpg').convert('RGB')
|
||||
answer2 = "2007.04.24"
|
||||
image_test = Image.open('test.jpg').convert('RGB')
|
||||
|
||||
msgs = [
|
||||
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
|
||||
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
|
||||
{'role': 'user', 'content': [image_test, question]}
|
||||
]
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
</details>
|
||||
|
||||
#### Video Understanding
|
||||
<details>
|
||||
<summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
from decord import VideoReader, cpu # pip install decord
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number
|
||||
|
||||
def encode_video(video_path):
|
||||
def uniform_sample(l, n):
|
||||
gap = len(l) / n
|
||||
idxs = [int(i * gap + gap / 2) for i in range(n)]
|
||||
return [l[i] for i in idxs]
|
||||
|
||||
vr = VideoReader(video_path, ctx=cpu(0))
|
||||
sample_fps = round(vr.get_avg_fps() / 1) # FPS
|
||||
frame_idx = [i for i in range(0, len(vr), sample_fps)]
|
||||
if len(frame_idx) > MAX_NUM_FRAMES:
|
||||
frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
|
||||
frames = vr.get_batch(frame_idx).asnumpy()
|
||||
frames = [Image.fromarray(v.astype('uint8')) for v in frames]
|
||||
print('num frames:', len(frames))
|
||||
return frames
|
||||
|
||||
video_path="video_test.mp4"
|
||||
frames = encode_video(video_path)
|
||||
question = "Describe the video"
|
||||
msgs = [
|
||||
{'role': 'user', 'content': frames + [question]},
|
||||
]
|
||||
|
||||
# Set decode params for video
|
||||
params = {}
|
||||
params["use_image_id"] = False
|
||||
params["max_slice_nums"] = 2 # 如果cuda OOM且视频分辨率大于448*448可设为1
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer,
|
||||
**params
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
</details>
|
||||
@@ -1,953 +0,0 @@
|
||||
## MiniCPM-V 2.6
|
||||
|
||||
> Archived at: 2025-01-13
|
||||
|
||||
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
|
||||
|
||||
- 🔥 **Leading Performance.**
|
||||
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
|
||||
|
||||
- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
|
||||
|
||||
- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.
|
||||
|
||||
- 💪 **Strong OCR Capability and Others.**
|
||||
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
|
||||
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
|
||||
|
||||
|
||||
- 🚀 **Superior Efficiency.**
|
||||
In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
|
||||
|
||||
- 💫 **Easy Usage.**
|
||||
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
|
||||
|
||||
### Evaluation <!-- omit in toc -->
|
||||
<div align="center">
|
||||
<img src=../assets/radar_final.png width=66% />
|
||||
</div>
|
||||
|
||||
<details>
|
||||
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. </summary>
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Token Density<sup>+</sup></th>
|
||||
<th>OpenCompass</th>
|
||||
<th>MME</th>
|
||||
<th>MMVet</th>
|
||||
<th>OCRBench</th>
|
||||
<th>MMMU val</th>
|
||||
<th>MathVista mini</th>
|
||||
<th>MMB1.1 test</th>
|
||||
<th>AI2D</th>
|
||||
<th>TextVQA val</th>
|
||||
<th>DocVQA test</th>
|
||||
<th>HallusionBench</th>
|
||||
<th>Object HalBench</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="15" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>69.9</td>
|
||||
<td>2328.7</td>
|
||||
<td>69.1</td>
|
||||
<td>736</td>
|
||||
<td>69.2</td>
|
||||
<td>61.3</td>
|
||||
<td>82.2</td>
|
||||
<td>84.6</td>
|
||||
<td>-</td>
|
||||
<td>92.8</td>
|
||||
<td>55.0</td>
|
||||
<td>17.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>750</td>
|
||||
<td>67.9</td>
|
||||
<td>1920.0</td>
|
||||
<td>66.0</td>
|
||||
<td>788</td>
|
||||
<td>65.9</td>
|
||||
<td>61.6</td>
|
||||
<td>78.5</td>
|
||||
<td>80.2</td>
|
||||
<td>-</td>
|
||||
<td>95.2</td>
|
||||
<td>49.9</td>
|
||||
<td>13.8</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>64.4</td>
|
||||
<td>2110.6</td>
|
||||
<td>64.0</td>
|
||||
<td>754</td>
|
||||
<td>60.6</td>
|
||||
<td>57.7</td>
|
||||
<td>73.9</td>
|
||||
<td>79.1</td>
|
||||
<td>73.5</td>
|
||||
<td>86.5</td>
|
||||
<td>45.6</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o mini</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>64.1</td>
|
||||
<td>2003.4</td>
|
||||
<td>66.9</td>
|
||||
<td>785</td>
|
||||
<td>60.0</td>
|
||||
<td>52.4</td>
|
||||
<td>76.0</td>
|
||||
<td>77.8</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>46.1</td>
|
||||
<td>12.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>63.5</td>
|
||||
<td>2070.2</td>
|
||||
<td>67.5</td>
|
||||
<td>656</td>
|
||||
<td>61.7</td>
|
||||
<td>54.7</td>
|
||||
<td>79.8</td>
|
||||
<td>78.6</td>
|
||||
<td>78.0</td>
|
||||
<td>87.2</td>
|
||||
<td>43.9</td>
|
||||
<td>14.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Step-1V</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>59.5</td>
|
||||
<td>2206.4</td>
|
||||
<td>63.3</td>
|
||||
<td>625</td>
|
||||
<td>49.9</td>
|
||||
<td>44.8</td>
|
||||
<td>78.0</td>
|
||||
<td>79.2</td>
|
||||
<td>71.6</td>
|
||||
<td>-</td>
|
||||
<td>48.4</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
|
||||
<td>-</td>
|
||||
<td>784</td>
|
||||
<td>58.3</td>
|
||||
<td>2281.7</td>
|
||||
<td>61.8</td>
|
||||
<td>684</td>
|
||||
<td>52.0</td>
|
||||
<td>43.4</td>
|
||||
<td>74.6</td>
|
||||
<td>75.7</td>
|
||||
<td>79.5</td>
|
||||
<td>93.1</td>
|
||||
<td>41.2</td>
|
||||
<td>13.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="15" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
|
||||
<td>34B</td>
|
||||
<td>157</td>
|
||||
<td>55.0</td>
|
||||
<td>2006.5</td>
|
||||
<td>50.7</td>
|
||||
<td>574</td>
|
||||
<td>48.8</td>
|
||||
<td>40.4</td>
|
||||
<td>77.8</td>
|
||||
<td>78.9</td>
|
||||
<td>69.3</td>
|
||||
<td>-</td>
|
||||
<td>34.8</td>
|
||||
<td>12.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
|
||||
<td>34B</td>
|
||||
<td>157</td>
|
||||
<td>-</td>
|
||||
<td>2141.0</td>
|
||||
<td>59.3</td>
|
||||
<td>518</td>
|
||||
<td>48.0</td>
|
||||
<td>43.3</td>
|
||||
<td>-</td>
|
||||
<td>80.5</td>
|
||||
<td>74.1</td>
|
||||
<td>78.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Cambrian-34B</td>
|
||||
<td>34B</td>
|
||||
<td>1820</td>
|
||||
<td>58.3</td>
|
||||
<td>2049.9</td>
|
||||
<td>53.2</td>
|
||||
<td>591</td>
|
||||
<td>50.4</td>
|
||||
<td>50.3</td>
|
||||
<td>77.8</td>
|
||||
<td>79.5</td>
|
||||
<td>76.7</td>
|
||||
<td>75.5</td>
|
||||
<td>41.6</td>
|
||||
<td>14.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
|
||||
<td>13B</td>
|
||||
<td>784</td>
|
||||
<td>59.1</td>
|
||||
<td>2018.8</td>
|
||||
<td>58.0</td>
|
||||
<td>776</td>
|
||||
<td>46.9</td>
|
||||
<td>51.1</td>
|
||||
<td>67.9</td>
|
||||
<td>71.2</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>45.0</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>706</td>
|
||||
<td>64.1</td>
|
||||
<td>2215.1</td>
|
||||
<td>54.3</td>
|
||||
<td>794</td>
|
||||
<td><strong>51.2</strong></td>
|
||||
<td>58.3</td>
|
||||
<td><strong>79.4</strong></td>
|
||||
<td><strong>83.6</strong></td>
|
||||
<td>77.4</td>
|
||||
<td><strong>91.6</strong></td>
|
||||
<td>45.0</td>
|
||||
<td>21.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
|
||||
<td>8B</td>
|
||||
<td>1882</td>
|
||||
<td>58.8</td>
|
||||
<td>2024.6</td>
|
||||
<td>52.8</td>
|
||||
<td>725</td>
|
||||
<td>45.8</td>
|
||||
<td>54.3</td>
|
||||
<td>72.0</td>
|
||||
<td>78.4</td>
|
||||
<td>76.6</td>
|
||||
<td>84.8</td>
|
||||
<td>42.4</td>
|
||||
<td>10.3</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>2822</strong></td>
|
||||
<td><strong>65.2</strong></td>
|
||||
<td><strong>2348.4</strong>*</td>
|
||||
<td><strong>60.0</strong></td>
|
||||
<td><strong>852</strong>*</td>
|
||||
<td>49.8*</td>
|
||||
<td><strong>60.6</strong></td>
|
||||
<td>78.0</td>
|
||||
<td>82.1</td>
|
||||
<td><strong>80.1</strong></td>
|
||||
<td>90.8</td>
|
||||
<td><strong>48.1</strong>*</td>
|
||||
<td><strong>8.2</strong></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
|
||||
|
||||
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
|
||||
|
||||
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Mantis Eval</th>
|
||||
<th>BLINK val</th>
|
||||
<th>Mathverse mv</th>
|
||||
<th>Sciverse mv</th>
|
||||
<th>MIRB</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="7" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>62.7</td>
|
||||
<td>54.6</td>
|
||||
<td>60.3</td>
|
||||
<td>66.9</td>
|
||||
<td>53.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
|
||||
<td>14B</td>
|
||||
<td>66.4</td>
|
||||
<td>52.6</td>
|
||||
<td>32.7</td>
|
||||
<td>30.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="7" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Emu2-Chat</td>
|
||||
<td>37B</td>
|
||||
<td>37.8</td>
|
||||
<td>36.2</td>
|
||||
<td>-</td>
|
||||
<td>27.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CogVLM</td>
|
||||
<td>17B</td>
|
||||
<td>45.2</td>
|
||||
<td>41.1</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VPG-C</td>
|
||||
<td>7B</td>
|
||||
<td>52.4</td>
|
||||
<td>43.1</td>
|
||||
<td>24.3</td>
|
||||
<td>23.1</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VILA 8B</td>
|
||||
<td>8B</td>
|
||||
<td>51.2</td>
|
||||
<td>39.3</td>
|
||||
<td>-</td>
|
||||
<td>36.5</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
|
||||
<td>8B</td>
|
||||
<td>53.1*</td>
|
||||
<td>48.9</td>
|
||||
<td>32.1*</td>
|
||||
<td>-</td>
|
||||
<td>42.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>59.0*</td>
|
||||
<td>50.9</td>
|
||||
<td>30.5*</td>
|
||||
<td>34.4*</td>
|
||||
<td><strong>56.9*</strong></td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>69.1</strong></td>
|
||||
<td><strong>53.0</strong></td>
|
||||
<td><strong>84.9</strong></td>
|
||||
<td><strong>74.9</strong></td>
|
||||
<td>53.8</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
* We evaluate the officially released checkpoint by ourselves.
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th colspan="2">Video-MME</th>
|
||||
<th colspan="5">Video-ChatGPT</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left"></th>
|
||||
<th></th>
|
||||
<th>w/o subs</th>
|
||||
<th>w subs</th>
|
||||
<th>Correctness</th>
|
||||
<th>Detail</th>
|
||||
<th>Context</th>
|
||||
<th>Temporal</th>
|
||||
<th>Consistency</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="9" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>60.0</td>
|
||||
<td>62.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>59.9</td>
|
||||
<td>63.3</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="9" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
|
||||
<td>7B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.39</td>
|
||||
<td>3.29</td>
|
||||
<td>3.92</td>
|
||||
<td>2.60</td>
|
||||
<td>3.12</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
|
||||
<td>34B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.29</td>
|
||||
<td>3.23</td>
|
||||
<td>3.83</td>
|
||||
<td>2.51</td>
|
||||
<td>3.47</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
|
||||
<td>12B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.49</td>
|
||||
<td><strong>3.46</strong></td>
|
||||
<td>3.23</td>
|
||||
<td><strong>2.98</strong></td>
|
||||
<td><strong>3.64</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LongVA</td>
|
||||
<td>7B</td>
|
||||
<td>52.4</td>
|
||||
<td>54.3</td>
|
||||
<td>3.05</td>
|
||||
<td>3.09</td>
|
||||
<td>3.77</td>
|
||||
<td>2.44</td>
|
||||
<td><strong>3.64</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>54.0</td>
|
||||
<td>56.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
|
||||
<td>8B</td>
|
||||
<td>55.8</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
|
||||
<td>32B</td>
|
||||
<td>60.2</td>
|
||||
<td>63.0</td>
|
||||
<td>3.48</td>
|
||||
<td>3.37</td>
|
||||
<td><strong>3.95</strong></td>
|
||||
<td>2.64</td>
|
||||
<td>3.28</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>60.9</strong></td>
|
||||
<td><strong>63.6</strong></td>
|
||||
<td><strong>3.59</strong></td>
|
||||
<td>3.28</td>
|
||||
<td>3.93</td>
|
||||
<td>2.73</td>
|
||||
<td>3.62</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Shot</th>
|
||||
<th>TextVQA val</th>
|
||||
<th>VizWiz test-dev</th>
|
||||
<th>VQAv2 test-dev</th>
|
||||
<th>OK-VQA val</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
|
||||
<td rowspan="3">80B</td>
|
||||
<td>0*</td>
|
||||
<td>35.0</td>
|
||||
<td>31.6</td>
|
||||
<td>56.3</td>
|
||||
<td>40.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>36.5</td>
|
||||
<td>39.6</td>
|
||||
<td>63.1</td>
|
||||
<td><strong>57.4</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>37.3</td>
|
||||
<td>44.8</td>
|
||||
<td>65.6</td>
|
||||
<td>57.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
|
||||
<td rowspan="3">80B</td>
|
||||
<td>0*</td>
|
||||
<td>30.9</td>
|
||||
<td>36.0</td>
|
||||
<td>60.0</td>
|
||||
<td>45.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>34.3</td>
|
||||
<td>40.4</td>
|
||||
<td>63.6</td>
|
||||
<td>52.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>35.7</td>
|
||||
<td>46.1</td>
|
||||
<td>64.8</td>
|
||||
<td>55.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
|
||||
<td rowspan="3">7B</td>
|
||||
<td>0*</td>
|
||||
<td>43.0</td>
|
||||
<td>49.8</td>
|
||||
<td>63.2</td>
|
||||
<td>45.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>45.4</td>
|
||||
<td>51.3</td>
|
||||
<td>64.5</td>
|
||||
<td>46.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>45.6</td>
|
||||
<td>52.2</td>
|
||||
<td>64.7</td>
|
||||
<td>46.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
|
||||
<td rowspan="3">37B</td>
|
||||
<td>0</td>
|
||||
<td>26.4</td>
|
||||
<td>40.4</td>
|
||||
<td>33.5</td>
|
||||
<td>26.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>48.2</td>
|
||||
<td>54.6</td>
|
||||
<td>67.0</td>
|
||||
<td>53.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>49.3</td>
|
||||
<td>54.7</td>
|
||||
<td>67.8</td>
|
||||
<td>54.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
|
||||
<td rowspan="2">30B</td>
|
||||
<td>0</td>
|
||||
<td>26.2</td>
|
||||
<td>40.4</td>
|
||||
<td>48.9</td>
|
||||
<td>26.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>49.3</td>
|
||||
<td>54.7</td>
|
||||
<td><strong>70.9</strong></td>
|
||||
<td>54.1</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
|
||||
<td rowspan="3">8B</td>
|
||||
<td>0</td>
|
||||
<td>43.9</td>
|
||||
<td>33.8</td>
|
||||
<td>45.4</td>
|
||||
<td>23.9</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td>4</td>
|
||||
<td>63.6</td>
|
||||
<td>60.5</td>
|
||||
<td>65.5</td>
|
||||
<td>50.1</td>
|
||||
</tr>
|
||||
<tr style="background-color: #e6f2ff;">
|
||||
<td>8</td>
|
||||
<td><strong>64.6</strong></td>
|
||||
<td><strong>63.4</strong></td>
|
||||
<td>68.2</td>
|
||||
<td>51.4</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
</div>
|
||||
* denotes zero image shot and two additional text shots following Flamingo.
|
||||
|
||||
<sup>+</sup> We evaluate the pretraining ckpt without SFT.
|
||||
</details>
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
|
||||
</div>
|
||||
<details>
|
||||
<summary>Click to view more cases.</summary>
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
|
||||
</div>
|
||||
</details>
|
||||
|
||||
We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro without editing.
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/gif_cases/ai.gif" width=32%/>
|
||||
|
||||
<img src="../assets/gif_cases/beer.gif" width=32%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/gif_cases/ticket.gif" width=32%/>
|
||||
|
||||
<img src="../assets/gif_cases/wfh.gif" width=32%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
|
||||
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
|
||||
</p>
|
||||
</table>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
|
||||
### Multi-turn Conversation
|
||||
|
||||
|
||||
<div align="center">
|
||||
<img src="../assets/airplane.jpeg" width="500px">
|
||||
</div>
|
||||
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
torch.manual_seed(0)
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
image = Image.open('./assets/airplane.jpeg').convert('RGB')
|
||||
|
||||
# First round chat
|
||||
question = "Tell me the model of this aircraft."
|
||||
msgs = [{'role': 'user', 'content': [image, question]}]
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
|
||||
# Second round chat
|
||||
# pass history context of multi-turn conversation
|
||||
msgs.append({"role": "assistant", "content": [answer]})
|
||||
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
|
||||
You could get the following output:
|
||||
|
||||
```
|
||||
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
|
||||
|
||||
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
|
||||
```
|
||||
|
||||
#### Multi-image Understanding
|
||||
<details>
|
||||
<summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
image1 = Image.open('image1.jpg').convert('RGB')
|
||||
image2 = Image.open('image2.jpg').convert('RGB')
|
||||
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
|
||||
|
||||
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
</details>
|
||||
|
||||
#### Few-shot In-Context-Learning
|
||||
|
||||
<details>
|
||||
<summary> Click to view a Python example of MiniCPM-V 2.6 few-shot in-context learning </summary>
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
question = "production date"
|
||||
image1 = Image.open('example1.jpg').convert('RGB')
|
||||
answer1 = "2023.08.04"
|
||||
image2 = Image.open('example2.jpg').convert('RGB')
|
||||
answer2 = "2007.04.24"
|
||||
image_test = Image.open('test.jpg').convert('RGB')
|
||||
|
||||
msgs = [
|
||||
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
|
||||
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
|
||||
{'role': 'user', 'content': [image_test, question]}
|
||||
]
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
</details>
|
||||
|
||||
#### Video Understanding
|
||||
<details>
|
||||
<summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
|
||||
|
||||
```python
|
||||
import torch
|
||||
from PIL import Image
|
||||
from transformers import AutoModel, AutoTokenizer
|
||||
from decord import VideoReader, cpu # pip install decord
|
||||
|
||||
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
|
||||
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
|
||||
model = model.eval().cuda()
|
||||
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
|
||||
|
||||
MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number
|
||||
|
||||
def encode_video(video_path):
|
||||
def uniform_sample(l, n):
|
||||
gap = len(l) / n
|
||||
idxs = [int(i * gap + gap / 2) for i in range(n)]
|
||||
return [l[i] for i in idxs]
|
||||
|
||||
vr = VideoReader(video_path, ctx=cpu(0))
|
||||
sample_fps = round(vr.get_avg_fps() / 1) # FPS
|
||||
frame_idx = [i for i in range(0, len(vr), sample_fps)]
|
||||
if len(frame_idx) > MAX_NUM_FRAMES:
|
||||
frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
|
||||
frames = vr.get_batch(frame_idx).asnumpy()
|
||||
frames = [Image.fromarray(v.astype('uint8')) for v in frames]
|
||||
print('num frames:', len(frames))
|
||||
return frames
|
||||
|
||||
video_path="video_test.mp4"
|
||||
frames = encode_video(video_path)
|
||||
question = "Describe the video"
|
||||
msgs = [
|
||||
{'role': 'user', 'content': frames + [question]},
|
||||
]
|
||||
|
||||
# Set decode params for video
|
||||
params = {}
|
||||
params["use_image_id"] = False
|
||||
params["max_slice_nums"] = 2 # 如果cuda OOM且视频分辨率大于448*448可设为1
|
||||
|
||||
answer = model.chat(
|
||||
image=None,
|
||||
msgs=msgs,
|
||||
tokenizer=tokenizer,
|
||||
**params
|
||||
)
|
||||
print(answer)
|
||||
```
|
||||
</details>
|
||||
|
||||
### Model Zoo
|
||||
|
||||
| Model | Device | Memory | Description | Download |
|
||||
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
|
||||
| MiniCPM-V 2.6| GPU | 17 GB | Strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
|
||||
| MiniCPM-V 2.6 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
|
||||
| MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
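
For tighter GPU memory budgets, the int4 checkpoint listed above can be loaded through the same `transformers` entry point as the full-precision model. The sketch below is a minimal example under that assumption (same `trust_remote_code` `chat` interface; `example.jpg` is a placeholder path); consult the checkpoint's model card for authoritative loading details.

```python
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Assumption: the int4 repo exposes the same remote-code chat() interface as
# the full-precision model. 'example.jpg' is a placeholder image path.
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': [image, "What is in this image?"]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```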
|
||||
@@ -1,773 +0,0 @@
|
||||
## MiniCPM-V 2.6
|
||||
|
||||
> Archived at: 2025-08-25
|
||||
|
||||
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B with a total of 8B parameters, it delivers a significant performance improvement over MiniCPM-Llama3-V 2.5 and introduces new features for multi-image and video understanding. Key features of MiniCPM-V 2.6 include:
|
||||
|
||||
|
||||
- 🔥 **Leading Performance.**
|
||||
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass (a comprehensive evaluation over 8 popular multimodal benchmarks). **At only 8B parameters, it surpasses widely used proprietary multimodal models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
|
||||
|
||||
- 🖼️ **Multi-image Understanding and In-context Learning.**
|
||||
MiniCPM-V 2.6 also supports **multi-image conversation and reasoning**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows promising in-context learning capability.
|
||||
|
||||
- 🎬 **Video Understanding.**
|
||||
MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing detailed video captions that cover spatial-temporal information. It outperforms proprietary models such as **GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B** on Video-MME both with and without subtitles.
|
||||
|
||||
- 💪 **Strong OCR Capability and Others.**
|
||||
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy multimodal behaviors**, with a significantly lower hallucination rate than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
|
||||
|
||||
- 🚀 **Superior Efficiency.**
|
||||
In addition to its user-friendly size, MiniCPM-V 2.6 also shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8M pixel image, 75% fewer than most models.** This improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
|
||||
|
||||
- 💫 **Easy Usage.**
|
||||
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#vllm-部署-) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#本地-webui-demo-), and (6) online web [demo](http://120.92.209.146:8887/).
|
||||
|
||||
### Evaluation <!-- omit in toc -->
|
||||
<div align="center">
|
||||
<img src=assets/radar_final.png width=90% />
|
||||
</div>
|
||||
|
||||
<details>
|
||||
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench.</summary>
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Token Density<sup>+</sup></th>
|
||||
<th>OpenCompass</th>
|
||||
<th>MME</th>
|
||||
<th>MMVet</th>
|
||||
<th>OCRBench</th>
|
||||
<th>MMMU val</th>
|
||||
<th>MathVista mini</th>
|
||||
<th>MMB1.1 test</th>
|
||||
<th>AI2D</th>
|
||||
<th>TextVQA val</th>
|
||||
<th>DocVQA test</th>
|
||||
<th>HallusionBench</th>
|
||||
<th>Object HalBench</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="15" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>69.9</td>
|
||||
<td>2328.7</td>
|
||||
<td>69.1</td>
|
||||
<td>736</td>
|
||||
<td>69.2</td>
|
||||
<td>61.3</td>
|
||||
<td>82.2</td>
|
||||
<td>84.6</td>
|
||||
<td>-</td>
|
||||
<td>92.8</td>
|
||||
<td>55.0</td>
|
||||
<td>17.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>750</td>
|
||||
<td>67.9</td>
|
||||
<td>1920.0</td>
|
||||
<td>66.0</td>
|
||||
<td>788</td>
|
||||
<td>65.9</td>
|
||||
<td>61.6</td>
|
||||
<td>78.5</td>
|
||||
<td>80.2</td>
|
||||
<td>-</td>
|
||||
<td>95.2</td>
|
||||
<td>49.9</td>
|
||||
<td>13.8</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>64.4</td>
|
||||
<td>2110.6</td>
|
||||
<td>64.0</td>
|
||||
<td>754</td>
|
||||
<td>60.6</td>
|
||||
<td>57.7</td>
|
||||
<td>73.9</td>
|
||||
<td>79.1</td>
|
||||
<td>73.5</td>
|
||||
<td>86.5</td>
|
||||
<td>45.6</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o mini</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>64.1</td>
|
||||
<td>2003.4</td>
|
||||
<td>66.9</td>
|
||||
<td>785</td>
|
||||
<td>60.0</td>
|
||||
<td>52.4</td>
|
||||
<td>76.0</td>
|
||||
<td>77.8</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>46.1</td>
|
||||
<td>12.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>63.5</td>
|
||||
<td>2070.2</td>
|
||||
<td>67.5</td>
|
||||
<td>656</td>
|
||||
<td>61.7</td>
|
||||
<td>54.7</td>
|
||||
<td>79.8</td>
|
||||
<td>78.6</td>
|
||||
<td>78.0</td>
|
||||
<td>87.2</td>
|
||||
<td>43.9</td>
|
||||
<td>14.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Step-1V</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>59.5</td>
|
||||
<td>2206.4</td>
|
||||
<td>63.3</td>
|
||||
<td>625</td>
|
||||
<td>49.9</td>
|
||||
<td>44.8</td>
|
||||
<td>78.0</td>
|
||||
<td>79.2</td>
|
||||
<td>71.6</td>
|
||||
<td>-</td>
|
||||
<td>48.4</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
|
||||
<td>-</td>
|
||||
<td>784</td>
|
||||
<td>58.3</td>
|
||||
<td>2281.7</td>
|
||||
<td>61.8</td>
|
||||
<td>684</td>
|
||||
<td>52.0</td>
|
||||
<td>43.4</td>
|
||||
<td>74.6</td>
|
||||
<td>75.7</td>
|
||||
<td>79.5</td>
|
||||
<td>93.1</td>
|
||||
<td>41.2</td>
|
||||
<td>13.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="15" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
|
||||
<td>34B</td>
|
||||
<td>157</td>
|
||||
<td>55.0</td>
|
||||
<td>2006.5</td>
|
||||
<td>50.7</td>
|
||||
<td>574</td>
|
||||
<td>48.8</td>
|
||||
<td>40.4</td>
|
||||
<td>77.8</td>
|
||||
<td>78.9</td>
|
||||
<td>69.3</td>
|
||||
<td>-</td>
|
||||
<td>34.8</td>
|
||||
<td>12.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
|
||||
<td>34B</td>
|
||||
<td>157</td>
|
||||
<td>-</td>
|
||||
<td>2141</td>
|
||||
<td>59.3</td>
|
||||
<td>518</td>
|
||||
<td>48.0</td>
|
||||
<td>43.3</td>
|
||||
<td>-</td>
|
||||
<td>80.5</td>
|
||||
<td>74.1</td>
|
||||
<td>78.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Cambrian-34B</td>
|
||||
<td>34B</td>
|
||||
<td>1820</td>
|
||||
<td>58.3</td>
|
||||
<td>2049.9</td>
|
||||
<td>53.2</td>
|
||||
<td>591</td>
|
||||
<td>50.4</td>
|
||||
<td>50.3</td>
|
||||
<td>77.8</td>
|
||||
<td>79.5</td>
|
||||
<td>76.7</td>
|
||||
<td>75.5</td>
|
||||
<td>41.6</td>
|
||||
<td>14.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
|
||||
<td>13B</td>
|
||||
<td>784</td>
|
||||
<td>59.1</td>
|
||||
<td>2018.8</td>
|
||||
<td>58.0</td>
|
||||
<td>776</td>
|
||||
<td>46.9</td>
|
||||
<td>51.1</td>
|
||||
<td>67.9</td>
|
||||
<td>71.2</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>45.0</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>706</td>
|
||||
<td>64.1</td>
|
||||
<td>2215.1</td>
|
||||
<td>54.3</td>
|
||||
<td>794</td>
|
||||
<td><strong>51.2</strong></td>
|
||||
<td>58.3</td>
|
||||
<td><strong>79.4</strong></td>
|
||||
<td><strong>83.6</strong></td>
|
||||
<td>77.4</td>
|
||||
<td><strong>91.6</strong></td>
|
||||
<td>45.0</td>
|
||||
<td>21.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
|
||||
<td>8B</td>
|
||||
<td>1882</td>
|
||||
<td>58.8</td>
|
||||
<td>2024.6</td>
|
||||
<td>52.8</td>
|
||||
<td>725</td>
|
||||
<td>45.8</td>
|
||||
<td>54.3</td>
|
||||
<td>72.0</td>
|
||||
<td>78.4</td>
|
||||
<td>76.6</td>
|
||||
<td>84.8</td>
|
||||
<td>42.4</td>
|
||||
<td>10.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>2822</strong></td>
|
||||
<td><strong>65.2</strong></td>
|
||||
<td><strong>2348.4</strong>*</td>
|
||||
<td><strong>60.0</strong></td>
|
||||
<td><strong>852</strong>*</td>
|
||||
<td>49.8*</td>
|
||||
<td><strong>60.6</strong></td>
|
||||
<td>78.0</td>
|
||||
<td>82.1</td>
|
||||
<td><strong>80.1</strong></td>
|
||||
<td>90.8</td>
|
||||
<td><strong>48.1</strong>*</td>
|
||||
<td><strong>8.2</strong></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
* We evaluate these benchmarks using chain-of-thought prompting.
|
||||
|
||||
<sup>+</sup> Token Density: the number of pixels encoded into each visual token at the maximum resolution, i.e., pixels at maximum resolution / number of visual tokens.
|
||||
|
||||
Note: the Token Density of proprietary models is estimated from their API pricing.
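
As a rough worked example of this definition (a sketch only, using the approximate figures quoted earlier in this README: about 1.8 million pixels encoded into 640 visual tokens):

```bash
# Approximate Token Density for MiniCPM-V 2.6 with rounded inputs.
# Integer division; the officially reported value in the table above is 2822.
echo $(( 1800000 / 640 ))   # prints 2812
```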
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view detailed multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv and MIRB.</summary>
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Mantis Eval</th>
|
||||
<th>BLINK val</th>
|
||||
<th>Mathverse mv</th>
|
||||
<th>Sciverse mv</th>
|
||||
<th>MIRB</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="7" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>62.7</td>
|
||||
<td>54.6</td>
|
||||
<td>60.3</td>
|
||||
<td>66.9</td>
|
||||
<td>53.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
|
||||
<td>14B</td>
|
||||
<td>66.4</td>
|
||||
<td>52.6</td>
|
||||
<td>32.7</td>
|
||||
<td>30.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="7" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Emu2-Chat</td>
|
||||
<td>37B</td>
|
||||
<td>37.8</td>
|
||||
<td>36.2</td>
|
||||
<td>-</td>
|
||||
<td>27.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CogVLM</td>
|
||||
<td>17B</td>
|
||||
<td>45.2</td>
|
||||
<td>41.1</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VPG-C</td>
|
||||
<td>7B</td>
|
||||
<td>52.4</td>
|
||||
<td>43.1</td>
|
||||
<td>24.3</td>
|
||||
<td>23.1</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VILA 8B</td>
|
||||
<td>8B</td>
|
||||
<td>51.2</td>
|
||||
<td>39.3</td>
|
||||
<td>-</td>
|
||||
<td>36.5</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
|
||||
<td>8B</td>
|
||||
<td>53.1*</td>
|
||||
<td>48.9</td>
|
||||
<td>32.1*</td>
|
||||
<td>-</td>
|
||||
<td>42.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>59.0*</td>
|
||||
<td>50.9</td>
|
||||
<td>30.5*</td>
|
||||
<td>34.4*</td>
|
||||
<td><strong>56.9*</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>69.1</strong></td>
|
||||
<td><strong>53.0</strong></td>
|
||||
<td><strong>84.9</strong></td>
|
||||
<td><strong>74.9</strong></td>
|
||||
<td>53.8</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
</div>
|
||||
* Results evaluated with the officially released open-source model weights.
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view detailed video results on Video-MME and Video-ChatGPT.</summary>
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th colspan="2">Video-MME</th>
|
||||
<th colspan="5">Video-ChatGPT</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left"></th>
|
||||
<th></th>
|
||||
<th>w/o subs</th>
|
||||
<th>w subs</th>
|
||||
<th>Correctness</th>
|
||||
<th>Detail</th>
|
||||
<th>Context</th>
|
||||
<th>Temporal</th>
|
||||
<th>Consistency</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="9" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>60.0</td>
|
||||
<td>62.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4V</td>
|
||||
<td>-</td>
|
||||
<td>59.9</td>
|
||||
<td>63.3</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="9" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
|
||||
<td>7B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.39</td>
|
||||
<td>3.29</td>
|
||||
<td>3.92</td>
|
||||
<td>2.60</td>
|
||||
<td>3.12</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
|
||||
<td>34B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.29</td>
|
||||
<td>3.23</td>
|
||||
<td>3.83</td>
|
||||
<td>2.51</td>
|
||||
<td>3.47</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
|
||||
<td>12B</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.49</td>
|
||||
<td><strong>3.46</strong></td>
|
||||
<td>3.23</td>
|
||||
<td><strong>2.98</strong></td>
|
||||
<td><strong>3.64</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LongVA</td>
|
||||
<td>7B</td>
|
||||
<td>52.4</td>
|
||||
<td>54.3</td>
|
||||
<td>3.05</td>
|
||||
<td>3.09</td>
|
||||
<td>3.77</td>
|
||||
<td>2.44</td>
|
||||
<td><strong>3.64</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>54.0</td>
|
||||
<td>56.9</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
|
||||
<td>8B</td>
|
||||
<td>55.8</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
|
||||
<td>32B</td>
|
||||
<td>60.2</td>
|
||||
<td>63.0</td>
|
||||
<td>3.48</td>
|
||||
<td>3.37</td>
|
||||
<td><strong>3.95</strong></td>
|
||||
<td>2.64</td>
|
||||
<td>3.28</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>60.9</strong></td>
|
||||
<td><strong>63.6</strong></td>
|
||||
<td><strong>3.59</strong></td>
|
||||
<td>3.28</td>
|
||||
<td>3.93</td>
|
||||
<td>2.73</td>
|
||||
<td>3.62</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view detailed few-shot results on TextVQA, VizWiz, VQAv2 and OK-VQA.</summary>
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Shot</th>
|
||||
<th>TextVQA val</th>
|
||||
<th>VizWiz test-dev</th>
|
||||
<th>VQAv2 test-dev</th>
|
||||
<th>OK-VQA val</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
|
||||
<td rowspan="3">80B</td>
|
||||
<td>0*</td>
|
||||
<td>35.0</td>
|
||||
<td>31.6</td>
|
||||
<td>56.3</td>
|
||||
<td>40.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>36.5</td>
|
||||
<td>39.6</td>
|
||||
<td>63.1</td>
|
||||
<td><strong>57.4</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>37.3</td>
|
||||
<td>44.8</td>
|
||||
<td>65.6</td>
|
||||
<td>57.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
|
||||
<td rowspan="3">80B</td>
|
||||
<td>0*</td>
|
||||
<td>30.9</td>
|
||||
<td>36.0</td>
|
||||
<td>60.0</td>
|
||||
<td>45.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>34.3</td>
|
||||
<td>40.4</td>
|
||||
<td>63.6</td>
|
||||
<td>52.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>35.7</td>
|
||||
<td>46.1</td>
|
||||
<td>64.8</td>
|
||||
<td>55.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
|
||||
<td rowspan="3">7B</td>
|
||||
<td>0*</td>
|
||||
<td>43.0</td>
|
||||
<td>49.8</td>
|
||||
<td>63.2</td>
|
||||
<td>45.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>45.4</td>
|
||||
<td>51.3</td>
|
||||
<td>64.5</td>
|
||||
<td>46.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>45.6</td>
|
||||
<td>52.2</td>
|
||||
<td>64.7</td>
|
||||
<td>46.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
|
||||
<td rowspan="3">37B</td>
|
||||
<td>0</td>
|
||||
<td>26.4</td>
|
||||
<td>40.4</td>
|
||||
<td>33.5</td>
|
||||
<td>26.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>48.2</td>
|
||||
<td>54.6</td>
|
||||
<td>67.0</td>
|
||||
<td>53.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>49.3</td>
|
||||
<td>54.7</td>
|
||||
<td>67.8</td>
|
||||
<td>54.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
|
||||
<td rowspan="2">30B</td>
|
||||
<td>0</td>
|
||||
<td>26.2</td>
|
||||
<td>40.4</td>
|
||||
<td>48.9</td>
|
||||
<td>26.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td>49.3</td>
|
||||
<td>54.7</td>
|
||||
<td><strong>70.9</strong></td>
|
||||
<td>54.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
|
||||
<td rowspan="3">8B</td>
|
||||
<td>0</td>
|
||||
<td>43.9</td>
|
||||
<td>33.8</td>
|
||||
<td>45.4</td>
|
||||
<td>23.9</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>63.6</td>
|
||||
<td>60.5</td>
|
||||
<td>65.5</td>
|
||||
<td>50.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>8</td>
|
||||
<td><strong>64.6</strong></td>
|
||||
<td><strong>63.4</strong></td>
|
||||
<td>68.2</td>
|
||||
<td>51.4</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
</div>
|
||||
* Zero-shot performance is evaluated in the Flamingo style, with zero image shots and two additional text shots.
|
||||
|
||||
<sup>+</sup> We evaluate the pretrained checkpoint (ckpt) without supervised fine-tuning (SFT).
|
||||
</details>
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
|
||||
</div>
|
||||
<details>
|
||||
<summary>Click to view more examples.</summary>
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
|
||||
</div>
|
||||
</details>
|
||||
|
||||
We deploy MiniCPM-V 2.6 on an iPad Pro and record the following demo videos.
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/gif_cases/ai.gif" width=32%/>
|
||||
|
||||
<img src="../assets/gif_cases/beer.gif" width=32%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
|
||||
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
|
||||
</p>
|
||||
</table>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
|
||||
### Model Zoo
|
||||
|
||||
| Model | Device | Memory | Description | Download |
|
||||
|:--------------|:-:|:----------:|:-------------------|:---------------:|
|
||||
| MiniCPM-V 2.6 | GPU | 17 GB | Strong end-side single-image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
|
||||
| MiniCPM-V 2.6 gguf | CPU | 6 GB | GGUF version with lower memory usage and higher inference efficiency. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
|
||||
| MiniCPM-V 2.6 int4 | GPU | 7 GB | int4 quantized version with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
|
||||
@@ -1,556 +0,0 @@
|
||||
## MiniCPM-V 4.0
|
||||
|
||||
> Archived at: 2025-08-25
|
||||
|
||||
**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. The model is built on SigLIP2-400M and MiniCPM4-3B, with 4.1B parameters in total. It inherits the strong single-image, multi-image and video understanding performance of MiniCPM-V 2.6 with largely improved efficiency. Notable features of MiniCPM-V 4.0 include:
|
||||
|
||||
- 🔥 **Leading Visual Capability.**
|
||||
With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2) and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also shows good performance in multi-image understanding and video understanding.
|
||||
|
||||
- 🚀 **Superior Efficiency.**
|
||||
Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it delivers **less than 2 s first-token delay and more than 17 tokens/s decoding on iPhone 16 Pro Max**, without heating problems. It also shows superior throughput under concurrent requests.
|
||||
|
||||
- 💫 **Easy Usage.**
|
||||
MiniCPM-V 4.0 can be easily used in various ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory and a local web demo**. We also open-source an iOS app that runs on iPhone and iPad. Get started easily with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), featuring detailed instructions and practical examples.
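
As one concrete example, a quick way to get started is to grab the Cookbook linked above and browse its deployment guides (only the repository URL is taken from the link; the rest is generic git usage):

```bash
# Clone the MiniCPM-V Cookbook and list the available guides.
git clone https://github.com/OpenSQZ/MiniCPM-V-CookBook.git
cd MiniCPM-V-CookBook && ls
```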
|
||||
|
||||
### Evaluation <!-- omit in toc -->
|
||||
|
||||
<details>
|
||||
<summary>Click to view single image results on OpenCompass. </summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th nowrap="nowrap" align="left">model</th>
|
||||
<th>Size</th>
|
||||
<th>Opencompass</th>
|
||||
<th>OCRBench</th>
|
||||
<th>MathVista</th>
|
||||
<th>HallusionBench</th>
|
||||
<th>MMMU</th>
|
||||
<th>MMVet</th>
|
||||
<th>MMBench V1.1</th>
|
||||
<th>MMStar</th>
|
||||
<th>AI2D</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
|
||||
<td>-</td>
|
||||
<td>63.5</td>
|
||||
<td>656</td>
|
||||
<td>55.2</td>
|
||||
<td>43.9</td>
|
||||
<td>61.7</td>
|
||||
<td>67.5</td>
|
||||
<td>79.8</td>
|
||||
<td>56.0</td>
|
||||
<td>78.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
|
||||
<td>-</td>
|
||||
<td>64.5</td>
|
||||
<td>754</td>
|
||||
<td>58.3</td>
|
||||
<td>45.6</td>
|
||||
<td>60.6</td>
|
||||
<td>64.0</td>
|
||||
<td>73.9</td>
|
||||
<td>59.1</td>
|
||||
<td>79.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
|
||||
<td>-</td>
|
||||
<td>68.9</td>
|
||||
<td>840</td>
|
||||
<td>70.9</td>
|
||||
<td>49.3</td>
|
||||
<td>55.0</td>
|
||||
<td>74.3</td>
|
||||
<td>80.9</td>
|
||||
<td>60.9</td>
|
||||
<td>76.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
|
||||
<td>-</td>
|
||||
<td>70.6</td>
|
||||
<td>798</td>
|
||||
<td>65.3</td>
|
||||
<td>55.5</td>
|
||||
<td>66.4</td>
|
||||
<td>70.1</td>
|
||||
<td>81.7</td>
|
||||
<td>65.1</td>
|
||||
<td>81.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
|
||||
<td>3.8B</td>
|
||||
<td>64.5</td>
|
||||
<td>828</td>
|
||||
<td>61.2</td>
|
||||
<td>46.6</td>
|
||||
<td>51.2</td>
|
||||
<td>60.0</td>
|
||||
<td>76.8</td>
|
||||
<td>56.3</td>
|
||||
<td>81.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
|
||||
<td>3.7B</td>
|
||||
<td>65.1</td>
|
||||
<td>820</td>
|
||||
<td>60.8</td>
|
||||
<td>46.6</td>
|
||||
<td>51.8</td>
|
||||
<td>61.5</td>
|
||||
<td>78.2</td>
|
||||
<td>58.7</td>
|
||||
<td>81.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>70.9</td>
|
||||
<td>888</td>
|
||||
<td>68.1</td>
|
||||
<td>51.9</td>
|
||||
<td>58.0</td>
|
||||
<td>69.7</td>
|
||||
<td>82.2</td>
|
||||
<td>64.1</td>
|
||||
<td>84.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8.1B</td>
|
||||
<td>68.1</td>
|
||||
<td>821</td>
|
||||
<td>64.5</td>
|
||||
<td>49.0</td>
|
||||
<td>56.2</td>
|
||||
<td>62.8</td>
|
||||
<td>82.5</td>
|
||||
<td>63.2</td>
|
||||
<td>84.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
|
||||
<td>8.1B</td>
|
||||
<td>65.2</td>
|
||||
<td>852</td>
|
||||
<td>60.8</td>
|
||||
<td>48.1</td>
|
||||
<td>49.8</td>
|
||||
<td>60.0</td>
|
||||
<td>78.0</td>
|
||||
<td>57.5</td>
|
||||
<td>82.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
|
||||
<td>8.7B</td>
|
||||
<td>70.2</td>
|
||||
<td>889</td>
|
||||
<td>73.3</td>
|
||||
<td>51.1</td>
|
||||
<td>50.9</td>
|
||||
<td>67.2</td>
|
||||
<td>80.6</td>
|
||||
<td>63.3</td>
|
||||
<td>86.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
|
||||
<td>4.1B</td>
|
||||
<td>69.0</td>
|
||||
<td>894</td>
|
||||
<td>66.9</td>
|
||||
<td>50.8</td>
|
||||
<td>51.2</td>
|
||||
<td>68.0</td>
|
||||
<td>79.7</td>
|
||||
<td>62.8</td>
|
||||
<td>82.9</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view single image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench and MM Halbench. </summary>
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th nowrap="nowrap" align="left">model</th>
|
||||
<th>Size</th>
|
||||
<th>ChartQA</th>
|
||||
<th>MME</th>
|
||||
<th>RealWorldQA</th>
|
||||
<th>TextVQA</th>
|
||||
<th>DocVQA</th>
|
||||
<th>MathVision</th>
|
||||
<th>DynaMath</th>
|
||||
<th>WeMath</th>
|
||||
<th colspan="2">Obj Hal</th>
|
||||
<th colspan="2">MM Hal</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td>CHAIRs↓</td>
|
||||
<td>CHAIRi↓</td>
|
||||
<td nowrap="nowrap">score avg@3↑</td>
|
||||
<td nowrap="nowrap">hall rate avg@3↓</td>
|
||||
</tr>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="14" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
|
||||
<td>-</td>
|
||||
<td>78.5</td>
|
||||
<td>1927</td>
|
||||
<td>61.4</td>
|
||||
<td>78.0</td>
|
||||
<td>88.4</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
|
||||
<td>-</td>
|
||||
<td>87.2</td>
|
||||
<td>-</td>
|
||||
<td>67.5</td>
|
||||
<td>78.8</td>
|
||||
<td>93.1</td>
|
||||
<td>41.0</td>
|
||||
<td>31.5</td>
|
||||
<td>50.5</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>45.3</td>
|
||||
<td>47.7</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
|
||||
<td>-</td>
|
||||
<td>90.8</td>
|
||||
<td>-</td>
|
||||
<td>60.1</td>
|
||||
<td>74.1</td>
|
||||
<td>95.2</td>
|
||||
<td>35.6</td>
|
||||
<td>35.7</td>
|
||||
<td>44.0</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="14" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
|
||||
<td>3.8B</td>
|
||||
<td>84.0</td>
|
||||
<td>2157</td>
|
||||
<td>65.4</td>
|
||||
<td>79.3</td>
|
||||
<td>93.9</td>
|
||||
<td>21.9</td>
|
||||
<td>13.2</td>
|
||||
<td>22.9</td>
|
||||
<td>18.3</td>
|
||||
<td>10.8</td>
|
||||
<td>3.9 </td>
|
||||
<td>33.3 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
|
||||
<td>3.7B</td>
|
||||
<td>84.0</td>
|
||||
<td>2338</td>
|
||||
<td>64.3</td>
|
||||
<td>76.8</td>
|
||||
<td>91.6</td>
|
||||
<td>18.4</td>
|
||||
<td>15.2</td>
|
||||
<td>21.2</td>
|
||||
<td>13.7</td>
|
||||
<td>8.7</td>
|
||||
<td>3.2 </td>
|
||||
<td>46.5 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>87.3</td>
|
||||
<td>2347</td>
|
||||
<td>68.5</td>
|
||||
<td>84.9</td>
|
||||
<td>95.7</td>
|
||||
<td>25.4</td>
|
||||
<td>21.8</td>
|
||||
<td>36.2</td>
|
||||
<td>13.3</td>
|
||||
<td>7.9</td>
|
||||
<td>4.1 </td>
|
||||
<td>31.6 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8.1B</td>
|
||||
<td>84.8</td>
|
||||
<td>2344</td>
|
||||
<td>70.1</td>
|
||||
<td>79.1</td>
|
||||
<td>93.0</td>
|
||||
<td>17.0</td>
|
||||
<td>9.4</td>
|
||||
<td>23.5</td>
|
||||
<td>18.3</td>
|
||||
<td>11.6</td>
|
||||
<td>3.6 </td>
|
||||
<td>37.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
|
||||
<td>8.1B</td>
|
||||
<td>79.4</td>
|
||||
<td>2348</td>
|
||||
<td>65.0</td>
|
||||
<td>80.1</td>
|
||||
<td>90.8</td>
|
||||
<td>17.5</td>
|
||||
<td>9.0</td>
|
||||
<td>20.4</td>
|
||||
<td>7.3</td>
|
||||
<td>4.7</td>
|
||||
<td>4.0 </td>
|
||||
<td>29.9 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
|
||||
<td>8.7B</td>
|
||||
<td>86.9</td>
|
||||
<td>2372</td>
|
||||
<td>68.1</td>
|
||||
<td>82.0</td>
|
||||
<td>93.5</td>
|
||||
<td>21.7</td>
|
||||
<td>10.4</td>
|
||||
<td>25.2</td>
|
||||
<td>6.3</td>
|
||||
<td>3.4</td>
|
||||
<td>4.1 </td>
|
||||
<td>31.3 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
|
||||
<td>4.1B</td>
|
||||
<td>84.4</td>
|
||||
<td>2298</td>
|
||||
<td>68.5</td>
|
||||
<td>80.8</td>
|
||||
<td>92.9</td>
|
||||
<td>20.7</td>
|
||||
<td>14.2</td>
|
||||
<td>32.7</td>
|
||||
<td>6.3</td>
|
||||
<td>3.5</td>
|
||||
<td>4.1 </td>
|
||||
<td>29.2 </td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view multi-image and video understanding results on Mantis, Blink and Video-MME. </summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th nowrap="nowrap" align="left">model</th>
|
||||
<th>Size</th>
|
||||
<th>Mantis</th>
|
||||
<th>Blink</th>
|
||||
<th nowrap="nowrap" colspan="2" >Video-MME</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td>wo subs</td>
|
||||
<td>w subs</td>
|
||||
</tr>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
|
||||
<td>-</td>
|
||||
<td>62.7</td>
|
||||
<td>54.6</td>
|
||||
<td>59.9</td>
|
||||
<td>63.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>59.1</td>
|
||||
<td>75.0</td>
|
||||
<td>81.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>68.0</td>
|
||||
<td>71.9</td>
|
||||
<td>77.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
|
||||
<td>3.8B</td>
|
||||
<td>-</td>
|
||||
<td>47.6</td>
|
||||
<td>61.5</td>
|
||||
<td>67.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
|
||||
<td>3.7B</td>
|
||||
<td>62.7</td>
|
||||
<td>50.8</td>
|
||||
<td>62.3</td>
|
||||
<td>63.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>-</td>
|
||||
<td>56.4</td>
|
||||
<td>65.1</td>
|
||||
<td>71.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8.1B</td>
|
||||
<td>67.7</td>
|
||||
<td>54.8</td>
|
||||
<td>64.2</td>
|
||||
<td>66.9</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
|
||||
<td>8.1B</td>
|
||||
<td>69.1</td>
|
||||
<td>53.0</td>
|
||||
<td>60.9</td>
|
||||
<td>63.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
|
||||
<td>8.7B</td>
|
||||
<td>71.9</td>
|
||||
<td>56.7</td>
|
||||
<td>63.9</td>
|
||||
<td>69.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
|
||||
<td>4.1B</td>
|
||||
<td>71.4</td>
|
||||
<td>54.0</td>
|
||||
<td>61.2</td>
|
||||
<td>65.8</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
### Examples
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
|
||||
We deploy MiniCPM-V 4.0 on an iPhone 16 Pro Max with the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md). The demo videos are raw screen recordings without any editing.
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4/iphone_en.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
|
||||
</p>
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4/iphone_cn.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
|
||||
@@ -1,557 +0,0 @@
|
||||
## MiniCPM-V 4.0
|
||||
|
||||
> Archived at: 2025-08-25
|
||||
|
||||
MiniCPM-V 4.0 is the latest model in the MiniCPM-V series. It is built on SigLIP2-400M and MiniCPM4-3B, with 4.1B parameters in total. It continues the strong single-image, multi-image and video understanding performance of MiniCPM-V 2.6 while greatly improving inference efficiency. Notable features of MiniCPM-V 4.0 include:
|
||||
|
||||
- 🔥 **Leading Visual Capability.**
|
||||
MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, outperforming MiniCPM-V 2.6 (8.1B, 65.2), Qwen2.5-VL-3B-Instruct (3.8B, 64.5) and **the widely used proprietary model GPT-4.1-mini-20250414**. It also performs well on multi-image and video understanding tasks.
|
||||
|
||||
- 🚀 **Superior Efficiency.**
|
||||
Optimized for on-device deployment, MiniCPM-V 4.0 **runs smoothly on iPhone 16 Pro Max with a first-token latency as low as 2 seconds and a decoding speed of 17.9 tokens/s**, without heating problems. It also delivers leading throughput under concurrent requests.
|
||||
|
||||
- 💫 **Easy Usage.**
|
||||
MiniCPM-V 4.0 can be used in many ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory and a local web demo**. We also open-source an iOS app that runs on iPhone and iPad. Get started with our well-structured **[Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook)**, which covers detailed deployment guides and practical examples.
|
||||
|
||||
|
||||
### Evaluation <!-- omit in toc -->
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view single-image results on OpenCompass.</summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th nowrap="nowrap" align="left">model</th>
|
||||
<th>Size</th>
|
||||
<th>Opencompass</th>
|
||||
<th>OCRBench</th>
|
||||
<th>MathVista</th>
|
||||
<th>HallusionBench</th>
|
||||
<th>MMMU</th>
|
||||
<th>MMVet</th>
|
||||
<th>MMBench V1.1</th>
|
||||
<th>MMStar</th>
|
||||
<th>AI2D</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
|
||||
<td>-</td>
|
||||
<td>63.5</td>
|
||||
<td>656</td>
|
||||
<td>55.2</td>
|
||||
<td>43.9</td>
|
||||
<td>61.7</td>
|
||||
<td>67.5</td>
|
||||
<td>79.8</td>
|
||||
<td>56.0</td>
|
||||
<td>78.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
|
||||
<td>-</td>
|
||||
<td>64.5</td>
|
||||
<td>754</td>
|
||||
<td>58.3</td>
|
||||
<td>45.6</td>
|
||||
<td>60.6</td>
|
||||
<td>64.0</td>
|
||||
<td>73.9</td>
|
||||
<td>59.1</td>
|
||||
<td>79.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
|
||||
<td>-</td>
|
||||
<td>68.9</td>
|
||||
<td>840</td>
|
||||
<td>70.9</td>
|
||||
<td>49.3</td>
|
||||
<td>55.0</td>
|
||||
<td>74.3</td>
|
||||
<td>80.9</td>
|
||||
<td>60.9</td>
|
||||
<td>76.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
|
||||
<td>-</td>
|
||||
<td>70.6</td>
|
||||
<td>798</td>
|
||||
<td>65.3</td>
|
||||
<td>55.5</td>
|
||||
<td>66.4</td>
|
||||
<td>70.1</td>
|
||||
<td>81.7</td>
|
||||
<td>65.1</td>
|
||||
<td>81.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
|
||||
<td>3.8B</td>
|
||||
<td>64.5</td>
|
||||
<td>828</td>
|
||||
<td>61.2</td>
|
||||
<td>46.6</td>
|
||||
<td>51.2</td>
|
||||
<td>60.0</td>
|
||||
<td>76.8</td>
|
||||
<td>56.3</td>
|
||||
<td>81.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
|
||||
<td>3.7B</td>
|
||||
<td>65.1</td>
|
||||
<td>820</td>
|
||||
<td>60.8</td>
|
||||
<td>46.6</td>
|
||||
<td>51.8</td>
|
||||
<td>61.5</td>
|
||||
<td>78.2</td>
|
||||
<td>58.7</td>
|
||||
<td>81.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>70.9</td>
|
||||
<td>888</td>
|
||||
<td>68.1</td>
|
||||
<td>51.9</td>
|
||||
<td>58.0</td>
|
||||
<td>69.7</td>
|
||||
<td>82.2</td>
|
||||
<td>64.1</td>
|
||||
<td>84.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8.1B</td>
|
||||
<td>68.1</td>
|
||||
<td>821</td>
|
||||
<td>64.5</td>
|
||||
<td>49.0</td>
|
||||
<td>56.2</td>
|
||||
<td>62.8</td>
|
||||
<td>82.5</td>
|
||||
<td>63.2</td>
|
||||
<td>84.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
|
||||
<td>8.1B</td>
|
||||
<td>65.2</td>
|
||||
<td>852</td>
|
||||
<td>60.8</td>
|
||||
<td>48.1</td>
|
||||
<td>49.8</td>
|
||||
<td>60.0</td>
|
||||
<td>78.0</td>
|
||||
<td>57.5</td>
|
||||
<td>82.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
|
||||
<td>8.7B</td>
|
||||
<td>70.2</td>
|
||||
<td>889</td>
|
||||
<td>73.3</td>
|
||||
<td>51.1</td>
|
||||
<td>50.9</td>
|
||||
<td>67.2</td>
|
||||
<td>80.6</td>
|
||||
<td>63.3</td>
|
||||
<td>86.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
|
||||
<td>4.1B</td>
|
||||
<td>69.0</td>
|
||||
<td>894</td>
|
||||
<td>66.9</td>
|
||||
<td>50.8</td>
|
||||
<td>51.2</td>
|
||||
<td>68.0</td>
|
||||
<td>79.7</td>
|
||||
<td>62.8</td>
|
||||
<td>82.9</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view results on chart understanding, document understanding, math reasoning and hallucination benchmarks.</summary>
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th nowrap="nowrap" align="left">model</th>
|
||||
<th>Size</th>
|
||||
<th>ChartQA</th>
|
||||
<th>MME</th>
|
||||
<th>RealWorldQA</th>
|
||||
<th>TextVQA</th>
|
||||
<th>DocVQA</th>
|
||||
<th>MathVision</th>
|
||||
<th>DynaMath</th>
|
||||
<th>WeMath</th>
|
||||
<th colspan="2">Obj Hal</th>
|
||||
<th colspan="2">MM Hal</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td>CHAIRs↓</td>
|
||||
<td>CHAIRi↓</td>
|
||||
<td nowrap="nowrap">score avg@3↑</td>
|
||||
<td nowrap="nowrap">hall rate avg@3↓</td>
|
||||
</tr>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="14" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
|
||||
<td>-</td>
|
||||
<td>78.5</td>
|
||||
<td>1927</td>
|
||||
<td>61.4</td>
|
||||
<td>78.0</td>
|
||||
<td>88.4</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
|
||||
<td>-</td>
|
||||
<td>87.2</td>
|
||||
<td>-</td>
|
||||
<td>67.5</td>
|
||||
<td>78.8</td>
|
||||
<td>93.1</td>
|
||||
<td>41.0</td>
|
||||
<td>31.5</td>
|
||||
<td>50.5</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>45.3</td>
|
||||
<td>47.7</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
|
||||
<td>-</td>
|
||||
<td>90.8</td>
|
||||
<td>-</td>
|
||||
<td>60.1</td>
|
||||
<td>74.1</td>
|
||||
<td>95.2</td>
|
||||
<td>35.6</td>
|
||||
<td>35.7</td>
|
||||
<td>44.0</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="14" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
|
||||
<td>3.8B</td>
|
||||
<td>84.0</td>
|
||||
<td>2157</td>
|
||||
<td>65.4</td>
|
||||
<td>79.3</td>
|
||||
<td>93.9</td>
|
||||
<td>21.9</td>
|
||||
<td>13.2</td>
|
||||
<td>22.9</td>
|
||||
<td>18.3</td>
|
||||
<td>10.8</td>
|
||||
<td>3.9 </td>
|
||||
<td>33.3 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
|
||||
<td>3.7B</td>
|
||||
<td>84.0</td>
|
||||
<td>2338</td>
|
||||
<td>64.3</td>
|
||||
<td>76.8</td>
|
||||
<td>91.6</td>
|
||||
<td>18.4</td>
|
||||
<td>15.2</td>
|
||||
<td>21.2</td>
|
||||
<td>13.7</td>
|
||||
<td>8.7</td>
|
||||
<td>3.2 </td>
|
||||
<td>46.5 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>87.3</td>
|
||||
<td>2347</td>
|
||||
<td>68.5</td>
|
||||
<td>84.9</td>
|
||||
<td>95.7</td>
|
||||
<td>25.4</td>
|
||||
<td>21.8</td>
|
||||
<td>36.2</td>
|
||||
<td>13.3</td>
|
||||
<td>7.9</td>
|
||||
<td>4.1 </td>
|
||||
<td>31.6 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8.1B</td>
|
||||
<td>84.8</td>
|
||||
<td>2344</td>
|
||||
<td>70.1</td>
|
||||
<td>79.1</td>
|
||||
<td>93.0</td>
|
||||
<td>17.0</td>
|
||||
<td>9.4</td>
|
||||
<td>23.5</td>
|
||||
<td>18.3</td>
|
||||
<td>11.6</td>
|
||||
<td>3.6 </td>
|
||||
<td>37.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
|
||||
<td>8.1B</td>
|
||||
<td>79.4</td>
|
||||
<td>2348</td>
|
||||
<td>65.0</td>
|
||||
<td>80.1</td>
|
||||
<td>90.8</td>
|
||||
<td>17.5</td>
|
||||
<td>9.0</td>
|
||||
<td>20.4</td>
|
||||
<td>7.3</td>
|
||||
<td>4.7</td>
|
||||
<td>4.0 </td>
|
||||
<td>29.9 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
|
||||
<td>8.7B</td>
|
||||
<td>86.9</td>
|
||||
<td>2372</td>
|
||||
<td>68.1</td>
|
||||
<td>82.0</td>
|
||||
<td>93.5</td>
|
||||
<td>21.7</td>
|
||||
<td>10.4</td>
|
||||
<td>25.2</td>
|
||||
<td>6.3</td>
|
||||
<td>3.4</td>
|
||||
<td>4.1 </td>
|
||||
<td>31.3 </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
|
||||
<td>4.1B</td>
|
||||
<td>84.4</td>
|
||||
<td>2298</td>
|
||||
<td>68.5</td>
|
||||
<td>80.8</td>
|
||||
<td>92.9</td>
|
||||
<td>20.7</td>
|
||||
<td>14.2</td>
|
||||
<td>32.7</td>
|
||||
<td>6.3</td>
|
||||
<td>3.5</td>
|
||||
<td>4.1 </td>
|
||||
<td>29.2 </td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view multi-image and video understanding results.</summary>
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th nowrap="nowrap" align="left">model</th>
|
||||
<th>Size</th>
|
||||
<th>Mantis</th>
|
||||
<th>Blink</th>
|
||||
<th nowrap="nowrap" colspan="2" >Video-MME</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td></td>
|
||||
<td>wo subs</td>
|
||||
<td>w subs</td>
|
||||
</tr>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
|
||||
<td>-</td>
|
||||
<td>62.7</td>
|
||||
<td>54.6</td>
|
||||
<td>59.9</td>
|
||||
<td>63.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>59.1</td>
|
||||
<td>75.0</td>
|
||||
<td>81.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>68.0</td>
|
||||
<td>71.9</td>
|
||||
<td>77.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
|
||||
<td>3.8B</td>
|
||||
<td>-</td>
|
||||
<td>47.6</td>
|
||||
<td>61.5</td>
|
||||
<td>67.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
|
||||
<td>3.7B</td>
|
||||
<td>62.7</td>
|
||||
<td>50.8</td>
|
||||
<td>62.3</td>
|
||||
<td>63.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>-</td>
|
||||
<td>56.4</td>
|
||||
<td>65.1</td>
|
||||
<td>71.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8.1B</td>
|
||||
<td>67.7</td>
|
||||
<td>54.8</td>
|
||||
<td>64.2</td>
|
||||
<td>66.9</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
|
||||
<td>8.1B</td>
|
||||
<td>69.1</td>
|
||||
<td>53.0</td>
|
||||
<td>60.9</td>
|
||||
<td>63.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
|
||||
<td>8.7B</td>
|
||||
<td>71.9</td>
|
||||
<td>56.7</td>
|
||||
<td>63.9</td>
|
||||
<td>69.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
|
||||
<td>4.1B</td>
|
||||
<td>71.4</td>
|
||||
<td>54.0</td>
|
||||
<td>61.2</td>
|
||||
<td>65.8</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
### Examples
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
|
||||
|
||||
We deploy MiniCPM-V 4.0 on an iPhone 16 Pro Max with the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md) and record the following screen captures; the videos have not been sped up or otherwise edited:
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4/iphone_en.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
|
||||
</p>
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4/iphone_cn.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
|
||||
</p>
|
||||
</table>
|
||||
@@ -1,6 +1,6 @@
|
||||
<div align="center">
|
||||
<img src="../assets/wechat-QR.jpeg" width="60%"/>
|
||||
<img src="../assets/minicpm-v25.png" width="60%"/>
|
||||
|
||||
<p> 扫码加入「MiniCPM-o 交流群」 </p>
|
||||
<p> Scan the QR code to join the "MiniCPM-o Discussion Group" </p>
|
||||
<p> 扫码加入「MiniCPM-V 交流群」 </p>
|
||||
<p> Scan the QR code to join the "MiniCPM-V Discussion Group" </p>
|
||||
</div>
|
||||
|
||||
@@ -1,369 +1,6 @@
|
||||
# Evaluation
|
||||
|
||||
## MiniCPM-o 2.6
|
||||
|
||||
### opencompass
|
||||
First, enter the `vlmevalkit` directory and install all dependencies:
|
||||
```bash
|
||||
cd vlmevalkit
|
||||
pip install --upgrade pip
|
||||
pip install -e .
|
||||
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
|
||||
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
|
||||
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
```
|
||||
<br />
|
||||
|
||||
Then, run `scripts/run_inference.sh`, which receives two input parameters in sequence: `MODELNAME` and `DATALIST`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference:
|
||||
```bash
|
||||
chmod +x ./scripts/run_inference.sh
|
||||
./scripts/run_inference.sh $MODELNAME $DATALIST
|
||||
```
|
||||
<br />
|
||||
|
||||
The five available choices for `MODELNAME` are listed in `vlmeval/config.py`:
|
||||
```bash
|
||||
minicpm_series = {
|
||||
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
|
||||
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
|
||||
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
|
||||
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
|
||||
'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
|
||||
}
|
||||
```
|
||||
<br />
|
||||
|
||||
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating multiple datasets at a time, separate the dataset names with spaces and wrap the whole list in quotation marks:
|
||||
```bash
|
||||
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
|
||||
```
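
For example, a small smoke-test run with a model name from the config above and two benchmarks might look like this (dataset names taken from the lists used elsewhere in this guide):

```bash
./scripts/run_inference.sh MiniCPM-o-2_6 "MMVet OCRBench"
```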
|
||||
<br />
|
||||
|
||||
When the benchmark requires GPT series model for scoring, please specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in the `.env` file.
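
A minimal `.env` sketch for that case (the variable names are the ones mentioned above; the values are placeholders):

```bash
OPENAI_API_BASE=https://api.openai.com/v1   # or your proxy endpoint
OPENAI_API_KEY=sk-your-key-here
```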
|
||||
In order to reproduce the results on OpenCompass benchmarks together with ChartQA and MME, which are displayed in the table on the homepage (columns between OCRBench and HallusionBench), you need to run the script according to the following settings:
|
||||
```bash
|
||||
# Please note that we use different prompts for the perception and reasoning sets of MME. While evaluating on the reasoning subset, CoT is required, so you need to manually modify the judgment condition of the use_cot function in vlmeval/vlm/minicpm_v.py
|
||||
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
|
||||
```
|
||||
<br />
|
||||
|
||||
### vqadataset
|
||||
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
|
||||
```bash
|
||||
cd vqaeval
|
||||
pip install -r requirements.txt
|
||||
mkdir downloads
|
||||
```
|
||||
<br />
|
||||
|
||||
Download the datasets from the following links and place them in the specified directories:
|
||||
###### TextVQA
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir TextVQA && cd TextVQA
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
|
||||
unzip train_val_images.zip && rm train_val_images.zip
|
||||
mv train_val_images/train_images . && rm -rf train_val_images
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
###### DocVQA / DocVQATest
|
||||
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
|
||||
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
|
||||
# Move the spdocvqa_images.tar.gz and spdocvqa_qas.zip to DocVQA directory
|
||||
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
|
||||
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
|
||||
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
|
||||
cd ../..
|
||||
```
|
||||
<br />
|
||||
|
||||
The `downloads` directory should be organized according to the following structure:
|
||||
```bash
|
||||
downloads
|
||||
├── TextVQA
|
||||
│ ├── train_images
|
||||
│ │ ├── ...
|
||||
│ ├── TextVQA_0.5.1_val.json
|
||||
├── DocVQA
|
||||
│ ├── spdocvqa_images
|
||||
│ │ ├── ...
|
||||
│ ├── val_v1.0_withQT.json
|
||||
│ ├── test_v1.0.json
|
||||
```
|
||||
<br />
|
||||
|
||||
Modify the parameters in `shell/run_inference.sh` and run inference:
|
||||
|
||||
```bash
|
||||
chmod +x ./shell/run_inference.sh
|
||||
./shell/run_inference.sh
|
||||
```
|
||||
<br />
|
||||
|
||||
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of some major parameters are listed as follows.
|
||||
For `MiniCPM-o-2_6`, set `model_name` to `minicpmo26`:
|
||||
```bash
|
||||
# path to images and their corresponding questions
|
||||
# TextVQA
|
||||
--textVQA_image_dir
|
||||
--textVQA_ann_path
|
||||
# DocVQA
|
||||
--docVQA_image_dir
|
||||
--docVQA_ann_path
|
||||
# DocVQATest
|
||||
--docVQATest_image_dir
|
||||
--docVQATest_ann_path
|
||||
|
||||
# whether to eval on certain task
|
||||
--eval_textVQA
|
||||
--eval_docVQA
|
||||
--eval_docVQATest
|
||||
--eval_all
|
||||
|
||||
# model name and model path
|
||||
--model_name
|
||||
--model_path
|
||||
# load model from ckpt
|
||||
--ckpt
|
||||
# the way the model processes input data, "interleave" represents interleaved image-text form, while "old" represents non-interleaved.
|
||||
--generate_method
|
||||
|
||||
--batchsize
|
||||
|
||||
# path to save the outputs
|
||||
--answer_path
|
||||
```
|
||||
<br />
|
||||
|
||||
While evaluating on different tasks, parameters need to be set as follows:
|
||||
###### TextVQA
|
||||
```bash
|
||||
--eval_textVQA
|
||||
--textVQA_image_dir ./downloads/TextVQA/train_images
|
||||
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
|
||||
```
|
||||
|
||||
###### DocVQA
|
||||
```bash
|
||||
--eval_docVQA
|
||||
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
|
||||
```
|
||||
|
||||
###### DocVQATest
|
||||
```bash
|
||||
--eval_docVQATest
|
||||
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
|
||||
```
|
||||
|
||||
<br />
|
||||
|
||||
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` represents the path to the original output json, `output_file_path` represents the path to the transformed json:
|
||||
```bash
|
||||
chmod +x ./shell/run_transform.sh
|
||||
./shell/run_transform.sh
|
||||
```
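
A sketch of how the two paths described above might be filled in inside `shell/run_transform.sh` (the variable names come from the description; the file names are placeholders):

```bash
input_file_path=./answers/docvqa_test_answers.json
output_file_path=./answers/docvqa_test_submission.json
```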
|
||||
|
||||
<br />
|
||||
|
||||
## MiniCPM-V 2.6
|
||||
|
||||
<details>
|
||||
<summary>Expand</summary>
|
||||
|
||||
### opencompass
|
||||
First, enter the `vlmevalkit` directory and install all dependencies:
|
||||
```bash
|
||||
cd vlmevalkit
|
||||
pip install --upgrade pip
|
||||
pip install -e .
|
||||
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
|
||||
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
|
||||
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
```
|
||||
<br />
|
||||
|
||||
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference, and `MODE` represents evaluation mode:
|
||||
```bash
|
||||
chmod +x ./scripts/run_inference.sh
|
||||
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
|
||||
```
|
||||
<br />
|
||||
|
||||
The four available choices for `MODELNAME` are listed in `vlmeval/config.py`:
|
||||
```bash
|
||||
minicpm_series = {
|
||||
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
|
||||
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
|
||||
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
|
||||
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
|
||||
}
|
||||
```
|
||||
<br />
|
||||
|
||||
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. Separate the names of different datasets with spaces and add quotation marks at both ends:
|
||||
```bash
|
||||
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
|
||||
```
|
||||
<br />
|
||||
|
||||
While scoring on each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`. In order to reproduce the results in the table displayed on the homepage (columns between MME and HallusionBench), you need to run the script according to the following settings:
|
||||
```bash
|
||||
# without CoT
|
||||
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
|
||||
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
|
||||
# with CoT
|
||||
# While running the CoT version of MME, you need to modify the 'use_cot' function in vlmeval/vlm/minicpm_v.py and add MME to the branch that returns True.
|
||||
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
|
||||
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
|
||||
```
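
For a quicker inference-only pass on a single benchmark, `MODE=infer` can be used as described above, for example:

```bash
./scripts/run_inference.sh MiniCPM-V-2_6 MMVet infer
```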
|
||||
<br />
|
||||
|
||||
### vqadataset
|
||||
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
|
||||
```bash
|
||||
cd vqaeval
|
||||
pip install -r requirements.txt
|
||||
mkdir downloads
|
||||
```
|
||||
<br />
|
||||
|
||||
Download the datasets from the following links and place them in the specified directories:
|
||||
###### TextVQA
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir TextVQA && cd TextVQA
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
|
||||
unzip train_val_images.zip && rm train_val_images.zip
|
||||
mv train_val_images/train_images . && rm -rf train_val_images
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
###### DocVQA / DocVQATest
|
||||
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
|
||||
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
|
||||
# Move the spdocvqa_images.tar.gz and spdocvqa_qas.zip to DocVQA directory
|
||||
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
|
||||
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
|
||||
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
|
||||
cd ../..
|
||||
```
|
||||
<br />
|
||||
|
||||
The `downloads` directory should be organized according to the following structure:
|
||||
```bash
|
||||
downloads
|
||||
├── TextVQA
|
||||
│ ├── train_images
|
||||
│ │ ├── ...
|
||||
│ ├── TextVQA_0.5.1_val.json
|
||||
├── DocVQA
|
||||
│ ├── spdocvqa_images
|
||||
│ │ ├── ...
|
||||
│ ├── val_v1.0_withQT.json
|
||||
│ ├── test_v1.0.json
|
||||
```
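A quick, optional sanity check that the layout matches the tree above:

```bash
ls downloads/TextVQA downloads/DocVQA
```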
|
||||
<br />
|
||||
|
||||
Modify the parameters in `shell/run_inference.sh` and run inference:
|
||||
|
||||
```bash
|
||||
chmod +x ./shell/run_inference.sh
|
||||
./shell/run_inference.sh
|
||||
```
|
||||
<br />
|
||||
|
||||
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of some major parameters are listed as follows.
|
||||
For `MiniCPM-V-2_6`, set `model_name` to `minicpmv26`:
|
||||
```bash
|
||||
# path to images and their corresponding questions
|
||||
# TextVQA
|
||||
--textVQA_image_dir
|
||||
--textVQA_ann_path
|
||||
# DocVQA
|
||||
--docVQA_image_dir
|
||||
--docVQA_ann_path
|
||||
# DocVQATest
|
||||
--docVQATest_image_dir
|
||||
--docVQATest_ann_path
|
||||
|
||||
# whether to eval on certain task
|
||||
--eval_textVQA
|
||||
--eval_docVQA
|
||||
--eval_docVQATest
|
||||
--eval_all
|
||||
|
||||
# model name and model path
|
||||
--model_name
|
||||
--model_path
|
||||
# load model from ckpt
|
||||
--ckpt
|
||||
# the way the model processes input data, "interleave" represents interleaved image-text form, while "old" represents non-interleaved.
|
||||
--generate_method
|
||||
|
||||
--batchsize
|
||||
|
||||
# path to save the outputs
|
||||
--answer_path
|
||||
```
|
||||
<br />
|
||||
|
||||
When evaluating the different tasks, set the parameters as follows:
|
||||
###### TextVQA
|
||||
```bash
|
||||
--eval_textVQA
|
||||
--textVQA_image_dir ./downloads/TextVQA/train_images
|
||||
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
|
||||
```
|
||||
|
||||
###### DocVQA
|
||||
```bash
|
||||
--eval_docVQA
|
||||
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
|
||||
```
|
||||
|
||||
###### DocVQATest
|
||||
```bash
|
||||
--eval_docVQATest
|
||||
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
|
||||
```
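Putting the pieces together, the argument block inside `shell/run_inference.sh` for evaluating MiniCPM-V-2_6 on all three tasks might look like the sketch below. The `--model_path` and `--answer_path` values are illustrative (point them at your own checkpoint and output directory), `--generate_method interleave` uses the interleaved image-text form described above, and a batch size of 1 is the safest default:

```bash
--model_name minicpmv26
--model_path openbmb/MiniCPM-V-2_6
--generate_method interleave
--batchsize 1
--eval_all
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
--answer_path ./outputs
```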
|
||||
|
||||
<br />
|
||||
|
||||
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` represents the path to the original output json, `output_file_path` represents the path to the transformed json:
|
||||
```bash
|
||||
chmod +x ./shell/run_transform.sh
|
||||
./shell/run_transform.sh
|
||||
```
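`input_file_path` and `output_file_path` are set inside `shell/run_transform.sh`. The values below only illustrate the expected shape (shown as shell variables; match however the script in your checkout defines them, and point them at your own inference output and a writable location):

```bash
input_file_path=./outputs/answers_docVQATest.json
output_file_path=./outputs/docVQATest_submission.json
```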
|
||||
|
||||
</details>
|
||||
|
||||
<br />
|
||||
|
||||
## MiniCPM-Llama3-V-2_5
|
||||
|
||||
<details>
|
||||
<summary>Expand</summary>
|
||||
|
||||
### opencompass
|
||||
## opencompass
|
||||
First, enter the `vlmevalkit` directory and install all dependencies:
|
||||
```bash
|
||||
cd vlmevalkit
|
||||
@@ -371,10 +8,10 @@ pip install -r requirements.txt
|
||||
```
|
||||
<br />
|
||||
|
||||
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference, and `MODE` represents evaluation mode:
|
||||
Then, run `script/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference, and `MODE` represents evaluation mode:
|
||||
```bash
|
||||
chmod +x ./scripts/run_inference.sh
|
||||
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
|
||||
chmod +x ./script/run_inference.sh
|
||||
./script/run_inference.sh $MODELNAME $DATALIST $MODE
|
||||
```
|
||||
<br />
|
||||
|
||||
@@ -397,27 +34,27 @@ $DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
|
||||
To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`. To reproduce the results in the table displayed on the homepage (columns from MME to RealWorldQA), run the script with the following settings:
|
||||
```bash
|
||||
# run on all 7 datasets
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all
|
||||
|
||||
# The following are instructions for running on a single dataset
|
||||
# MME
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
|
||||
# MMBench_TEST_EN
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
|
||||
# MMBench_TEST_CN
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
|
||||
# MMMU_DEV_VAL
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
|
||||
# MathVista_MINI
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
|
||||
# LLaVABench
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
|
||||
# RealWorldQA
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
|
||||
```
|
||||
<br />
|
||||
|
||||
### vqadataset
|
||||
## vqadataset
|
||||
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
|
||||
```bash
|
||||
cd vqaeval
|
||||
@@ -475,8 +112,7 @@ chmod +x ./shell/run_inference.sh
|
||||
```
|
||||
<br />
|
||||
|
||||
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of some major parameters are listed as follows.
|
||||
For `MiniCPM-Llama3-V-2_5`, set `model_name` to `minicpmv`:
|
||||
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of some major parameters are listed as follows:
|
||||
```bash
|
||||
# path to images and their corresponding questions
|
||||
# TextVQA
|
||||
@@ -539,5 +175,3 @@ For the DocVQATest task, in order to upload the inference results to the [offici
|
||||
chmod +x ./shell/run_transform.sh
|
||||
./shell/run_transform.sh
|
||||
```
|
||||
|
||||
</details>
|
||||
@@ -1,365 +1,6 @@
|
||||
# Evaluation
|
||||
|
||||
## MiniCPM-o 2.6
|
||||
|
||||
### opencompass
|
||||
First, enter the `vlmevalkit` directory and install the necessary dependencies:
|
||||
```bash
|
||||
cd vlmevalkit
|
||||
pip install --upgrade pip
|
||||
pip install -e .
|
||||
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
|
||||
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
|
||||
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
rm *.whl
|
||||
```
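An optional sanity check that the pinned CUDA 11.8 wheels were picked up correctly:

```bash
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"
```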
|
||||
<br />
|
||||
|
||||
Then, run `scripts/run_inference.sh`, which receives two input parameters in sequence: `MODELNAME` and `DATALIST`. `MODELNAME` is the name of the model and `DATALIST` is the target dataset(s).
|
||||
```bash
|
||||
chmod +x ./scripts/run_inference.sh
|
||||
./scripts/run_inference.sh $MODELNAME $DATALIST
|
||||
```
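For example, to run MiniCPM-o-2_6 on MMMU_DEV_VAL alone:

```bash
./scripts/run_inference.sh MiniCPM-o-2_6 MMMU_DEV_VAL
```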
|
||||
<br />
|
||||
|
||||
There are five available choices for `MODELNAME`, listed in `vlmeval/config.py`:
|
||||
```bash
|
||||
minicpm_series = {
|
||||
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
|
||||
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
|
||||
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
|
||||
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
|
||||
'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
|
||||
}
|
||||
```
|
||||
<br />
|
||||
|
||||
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. To evaluate several datasets at once, separate their names with spaces and wrap the whole list in quotation marks:
|
||||
```bash
|
||||
$DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST"
|
||||
```
|
||||
<br />
|
||||
|
||||
When a benchmark requires a GPT-series model for scoring, specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in the `.env` file beforehand.
To reproduce the results on the OpenCompass datasets as well as ChartQA and MME shown in the table on the homepage (columns from OCRBench to HallusionBench), run the script with the following settings:
|
||||
```bash
|
||||
# Note that we use different prompts for the MME perception and reasoning subsets. Evaluating the reasoning subset requires CoT, so you need to manually modify the condition of the use_cot function in vlmeval/vlm/minicpm_v.py
|
||||
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
|
||||
```
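Benchmarks judged by GPT-series models read the endpoint and key from the `.env` file in the VLMEvalKit root; a minimal example with placeholder values:

```bash
# .env (placeholder values — substitute your own endpoint and key)
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx
```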
|
||||
<br />
|
||||
|
||||
### vqadataset
|
||||
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
|
||||
```bash
|
||||
cd vqaeval
|
||||
pip install -r requirements.txt
|
||||
mkdir downloads
|
||||
```
|
||||
<br />
|
||||
|
||||
Then, download the datasets from the following links and place them in the specified directories:
|
||||
###### TextVQA
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir TextVQA && cd TextVQA
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
|
||||
unzip train_val_images.zip && rm train_val_images.zip
|
||||
mv train_val_images/train_images . && rm -rf train_val_images
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
###### DocVQA / DocVQATest
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
|
||||
# Download the Images and Annotations under Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
|
||||
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
|
||||
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
|
||||
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
|
||||
cd ../..
|
||||
```
|
||||
<br />
|
||||
|
||||
The `downloads` directory should be organized according to the following structure:
|
||||
```bash
|
||||
downloads
|
||||
├── TextVQA
|
||||
│ ├── train_images
|
||||
│ │ ├── ...
|
||||
│ ├── TextVQA_0.5.1_val.json
|
||||
├── DocVQA
|
||||
│ ├── spdocvqa_images
|
||||
│ │ ├── ...
|
||||
│ ├── val_v1.0_withQT.json
|
||||
│ ├── test_v1.0.json
|
||||
```
|
||||
<br />
|
||||
|
||||
After preparing the datasets, modify the parameters in `shell/run_inference.sh` and run inference:
|
||||
|
||||
```bash
|
||||
chmod +x ./shell/run_inference.sh
|
||||
./shell/run_inference.sh
|
||||
```
|
||||
<br />
|
||||
|
||||
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-o-2_6`, set `model_name` to `minicpmo26`:
|
||||
```bash
|
||||
# paths to the TextVQA images and their corresponding questions
--textVQA_image_dir
--textVQA_ann_path
# paths to the DocVQA images and their corresponding questions
--docVQA_image_dir
--docVQA_ann_path
# paths to the DocVQATest images and their corresponding questions
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate a certain task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path (the model is loaded from the specified path)
--model_name
--model_path
# load the model from a checkpoint
--ckpt
# how the model processes input data: interleave means interleaved image-text form, old means non-interleaved
--generate_method
# batch size for inference; 1 is recommended
--batchsize

# path to save the outputs
--answer_path
|
||||
```
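For instance, the model-related arguments for MiniCPM-o-2_6 would typically be set as follows (the batch size of 1 follows the recommendation above):

```bash
--model_name minicpmo26
--model_path openbmb/MiniCPM-o-2_6
--batchsize 1
```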
|
||||
<br />
|
||||
|
||||
The parameters for evaluating the three tasks are set as follows:
|
||||
###### TextVQA
|
||||
```bash
|
||||
--eval_textVQA
|
||||
--textVQA_image_dir ./downloads/TextVQA/train_images
|
||||
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
|
||||
```
|
||||
|
||||
###### DocVQA
|
||||
```bash
|
||||
--eval_docVQA
|
||||
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
|
||||
```
|
||||
|
||||
###### DocVQATest
|
||||
```bash
|
||||
--eval_docVQATest
|
||||
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
|
||||
```
|
||||
<br />
|
||||
|
||||
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, you also need to run `shell/run_transform.sh` to convert the format after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path where the converted json will be saved:
|
||||
```bash
|
||||
chmod +x ./shell/run_transform.sh
|
||||
./shell/run_transform.sh
|
||||
```
|
||||
|
||||
<br />
|
||||
|
||||
## MiniCPM-V 2.6
|
||||
|
||||
<details>
|
||||
<summary>Expand</summary>
|
||||
|
||||
### opencompass
|
||||
First, enter the `vlmevalkit` directory and install the necessary dependencies:
|
||||
```bash
|
||||
cd vlmevalkit
|
||||
pip install --upgrade pip
|
||||
pip install -e .
|
||||
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
|
||||
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
|
||||
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
||||
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
||||
rm *.whl
|
||||
```
|
||||
<br />
|
||||
|
||||
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the name of the model, `DATALIST` the target dataset(s), and `MODE` the evaluation mode.
|
||||
```bash
|
||||
chmod +x ./scripts/run_inference.sh
|
||||
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
|
||||
```
|
||||
<br />
|
||||
|
||||
There are four available choices for `MODELNAME`, listed in `vlmeval/config.py`:
|
||||
```bash
|
||||
minicpm_series = {
|
||||
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
|
||||
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
|
||||
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
|
||||
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
|
||||
}
|
||||
```
|
||||
<br />
|
||||
|
||||
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. Separate the names of different datasets with spaces and wrap the list in quotation marks:
|
||||
```bash
|
||||
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
|
||||
```
|
||||
<br />
|
||||
|
||||
To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`.
To reproduce the results in the table displayed on the homepage (columns from MME to HallusionBench), run the script with the following settings:
|
||||
```bash
|
||||
# without CoT
|
||||
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
|
||||
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
|
||||
# with CoT. When running the CoT version of MME, modify the 'use_cot' function in vlmeval/vlm/minicpm_v.py and add MME to the branch that returns True.
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
|
||||
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
|
||||
```
|
||||
<br />
|
||||
|
||||
### vqadataset
|
||||
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
|
||||
```bash
|
||||
cd vqaeval
|
||||
pip install -r requirements.txt
|
||||
mkdir downloads
|
||||
```
|
||||
<br />
|
||||
|
||||
Then, download the datasets from the following links and place them in the specified directories:
|
||||
###### TextVQA
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir TextVQA && cd TextVQA
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
|
||||
unzip train_val_images.zip && rm train_val_images.zip
|
||||
mv train_val_images/train_images . && rm -rf train_val_images
|
||||
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
|
||||
cd ../..
|
||||
```
|
||||
|
||||
###### DocVQA / DocVQATest
|
||||
```bash
|
||||
cd downloads
|
||||
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
|
||||
# Download the Images and Annotations under Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
|
||||
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
|
||||
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
|
||||
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
|
||||
cd ../..
|
||||
```
|
||||
<br />
|
||||
|
||||
The `downloads` directory should be organized according to the following structure:
|
||||
```bash
|
||||
downloads
|
||||
├── TextVQA
|
||||
│ ├── train_images
|
||||
│ │ ├── ...
|
||||
│ ├── TextVQA_0.5.1_val.json
|
||||
├── DocVQA
|
||||
│ ├── spdocvqa_images
|
||||
│ │ ├── ...
|
||||
│ ├── val_v1.0_withQT.json
|
||||
│ ├── test_v1.0.json
|
||||
```
|
||||
<br />
|
||||
|
||||
After preparing the datasets, modify the parameters in `shell/run_inference.sh` and run inference:
|
||||
|
||||
```bash
|
||||
chmod +x ./shell/run_inference.sh
|
||||
./shell/run_inference.sh
|
||||
```
|
||||
<br />
|
||||
|
||||
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-V-2_6`, set `model_name` to `minicpmv26`:
|
||||
```bash
|
||||
# paths to the TextVQA images and their corresponding questions
--textVQA_image_dir
--textVQA_ann_path
# paths to the DocVQA images and their corresponding questions
--docVQA_image_dir
--docVQA_ann_path
# paths to the DocVQATest images and their corresponding questions
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate a certain task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path (the model is loaded from the specified path)
--model_name
--model_path
# load the model from a checkpoint
--ckpt
# how the model processes input data: interleave means interleaved image-text form, old means non-interleaved
--generate_method
# batch size for inference; 1 is recommended
--batchsize

# path to save the outputs
--answer_path
|
||||
```
|
||||
<br />
|
||||
|
||||
The parameters for evaluating the three tasks are set as follows:
|
||||
###### TextVQA
|
||||
```bash
|
||||
--eval_textVQA
|
||||
--textVQA_image_dir ./downloads/TextVQA/train_images
|
||||
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
|
||||
```
|
||||
|
||||
###### DocVQA
|
||||
```bash
|
||||
--eval_docVQA
|
||||
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
|
||||
```
|
||||
|
||||
###### DocVQATest
|
||||
```bash
|
||||
--eval_docVQATest
|
||||
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
|
||||
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
|
||||
```
|
||||
<br />
|
||||
|
||||
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, you also need to run `shell/run_transform.sh` to convert the format after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path where the converted json will be saved:
|
||||
```bash
|
||||
chmod +x ./shell/run_transform.sh
|
||||
./shell/run_transform.sh
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<br />
|
||||
|
||||
## MiniCPM-Llama3-V-2_5
|
||||
|
||||
<details>
|
||||
<summary>Expand</summary>
|
||||
|
||||
### opencompass
|
||||
## opencompass
|
||||
First, enter the `vlmevalkit` directory and install the necessary dependencies:
|
||||
```bash
|
||||
cd vlmevalkit
|
||||
@@ -367,10 +8,10 @@ pip install -r requirements.txt
|
||||
```
|
||||
<br />
|
||||
|
||||
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the name of the model, `DATALIST` the target dataset(s), and `MODE` the evaluation mode.
Then, run `script/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the name of the model, `DATALIST` the target dataset(s), and `MODE` the evaluation mode.
|
||||
```bash
|
||||
chmod +x ./scripts/run_inference.sh
|
||||
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
|
||||
chmod +x ./script/run_inference.sh
|
||||
./script/run_inference.sh $MODELNAME $DATALIST $MODE
|
||||
```
|
||||
<br />
|
||||
|
||||
@@ -394,27 +35,27 @@ $DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
|
||||
To reproduce the results in the table displayed on the homepage (columns from MME to RealWorldQA), run the script with the following settings:
|
||||
```bash
|
||||
# run all 7 datasets at once
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all
|
||||
|
||||
# The following are commands for running a single dataset
|
||||
# MME
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
|
||||
# MMBench_TEST_EN
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
|
||||
# MMBench_TEST_CN
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
|
||||
# MMMU_DEV_VAL
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
|
||||
# MathVista_MINI
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
|
||||
# LLaVABench
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
|
||||
# RealWorldQA
|
||||
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
|
||||
./script/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
|
||||
```
|
||||
<br />
|
||||
|
||||
### vqadataset
|
||||
## vqadataset
|
||||
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
|
||||
```bash
|
||||
cd vqaeval
|
||||
@@ -471,8 +112,7 @@ chmod +x ./shell/run_inference.sh
|
||||
```
|
||||
<br />
|
||||
|
||||
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-Llama3-V-2_5`, set `model_name` to `minicpmv`:
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows:
|
||||
```bash
|
||||
# paths to the TextVQA images and their corresponding questions
|
||||
--textVQA_image_dir
|
||||
@@ -533,5 +173,3 @@ chmod +x ./shell/run_inference.sh
|
||||
chmod +x ./shell/run_transform.sh
|
||||
./shell/run_transform.sh
|
||||
```
|
||||
|
||||
</details>
|
||||
@@ -1,28 +0,0 @@
|
||||
# .env file, place it under $VLMEvalKit
|
||||
# API keys for proprietary VLMs
|
||||
# QwenVL APIs
|
||||
DASHSCOPE_API_KEY=
|
||||
# Gemini w. Google Cloud Backends
|
||||
GOOGLE_API_KEY=
|
||||
# OpenAI API
|
||||
OPENAI_API_KEY=
|
||||
OPENAI_API_BASE=
|
||||
# StepAI API
|
||||
STEPAI_API_KEY=
|
||||
# REKA API
|
||||
REKA_API_KEY=
|
||||
# GLMV API
|
||||
GLMV_API_KEY=
|
||||
# CongRong API
|
||||
CW_API_BASE=
|
||||
CW_API_KEY=
|
||||
# SenseChat-V API
|
||||
SENSECHAT_AK=
|
||||
SENSECHAT_SK=
|
||||
# Hunyuan-Vision API
|
||||
HUNYUAN_SECRET_KEY=
|
||||
HUNYUAN_SECRET_ID=
|
||||
# LMDeploy API
|
||||
LMDEPLOY_API_BASE=
|
||||
# You can set a proxy for evaluation; API calls made during the evaluation stage will go through this proxy
|
||||
EVAL_PROXY=
|
||||
@@ -1,30 +1,33 @@
|
||||
decord; platform_machine != 'arm64'
|
||||
eva-decord; platform_machine == 'arm64'
|
||||
gradio
|
||||
einops
|
||||
gradio==4.15.0
|
||||
huggingface_hub
|
||||
imageio
|
||||
matplotlib
|
||||
numpy
|
||||
numpy>=1.23.4
|
||||
omegaconf
|
||||
openai
|
||||
openai==1.3.5
|
||||
opencv-python>=4.4.0.46
|
||||
openpyxl
|
||||
pandas
|
||||
pandas>=1.5.3
|
||||
pillow
|
||||
portalocker
|
||||
protobuf
|
||||
pycocoevalcap
|
||||
python-dotenv
|
||||
requests
|
||||
rich
|
||||
seaborn
|
||||
sentencepiece
|
||||
setuptools
|
||||
sty
|
||||
tabulate
|
||||
tiktoken
|
||||
timeout-decorator
|
||||
torch
|
||||
tqdm
|
||||
transformers
|
||||
typing_extensions
|
||||
typing_extensions==4.7.1
|
||||
validators
|
||||
visual_genome
|
||||
xlsxwriter
|
||||
Pillow==10.1.0
|
||||
sentencepiece==0.1.99
|
||||
transformers==4.40.0
|
||||
torch==1.13.1
|
||||
torchvision
|
||||
|
||||
@@ -1,11 +0,0 @@
|
||||
docutils==0.18.1
|
||||
modelindex
|
||||
myst-parser
|
||||
-e git+https://github.com/open-compass/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
|
||||
sphinx==6.1.3
|
||||
sphinx-copybutton
|
||||
sphinx-design
|
||||
sphinx-notfound-page
|
||||
sphinx-tabs
|
||||
sphinxcontrib-jquery
|
||||
tabulate
|
||||
@@ -1,422 +1,147 @@
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
|
||||
from vlmeval.config import supported_VLM
|
||||
from vlmeval.dataset.video_dataset_config import supported_video_datasets
|
||||
from vlmeval.dataset import build_dataset
|
||||
from vlmeval.inference import infer_data_job
|
||||
from vlmeval.inference_video import infer_data_job_video
|
||||
from vlmeval.inference_mt import infer_data_job_mt
|
||||
from vlmeval.smp import *
|
||||
from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer
|
||||
|
||||
|
||||
def build_model_from_config(cfg, model_name):
|
||||
import vlmeval.api
|
||||
import vlmeval.vlm
|
||||
config = cp.deepcopy(cfg[model_name])
|
||||
if config == {}:
|
||||
return supported_VLM[model_name]()
|
||||
assert 'class' in config
|
||||
cls_name = config.pop('class')
|
||||
if hasattr(vlmeval.api, cls_name):
|
||||
return getattr(vlmeval.api, cls_name)(**config)
|
||||
elif hasattr(vlmeval.vlm, cls_name):
|
||||
return getattr(vlmeval.vlm, cls_name)(**config)
|
||||
else:
|
||||
raise ValueError(f'Class {cls_name} is not supported in `vlmeval.api` or `vlmeval.vlm`')
|
||||
|
||||
|
||||
def build_dataset_from_config(cfg, dataset_name):
|
||||
import vlmeval.dataset
|
||||
import inspect
|
||||
config = cp.deepcopy(cfg[dataset_name])
|
||||
if config == {}:
|
||||
return supported_video_datasets[dataset_name]()
|
||||
assert 'class' in config
|
||||
cls_name = config.pop('class')
|
||||
if hasattr(vlmeval.dataset, cls_name):
|
||||
cls = getattr(vlmeval.dataset, cls_name)
|
||||
sig = inspect.signature(cls.__init__)
|
||||
valid_params = {k: v for k, v in config.items() if k in sig.parameters}
|
||||
if valid_params.get('fps', 0) > 0 and valid_params.get('nframe', 0) > 0:
|
||||
raise ValueError('fps and nframe should not be set at the same time')
|
||||
if valid_params.get('fps', 0) <= 0 and valid_params.get('nframe', 0) <= 0:
|
||||
raise ValueError('fps and nframe should be set at least one valid value')
|
||||
return cls(**valid_params)
|
||||
else:
|
||||
raise ValueError(f'Class {cls_name} is not supported in `vlmeval.dataset`')
|
||||
from vlmeval.evaluate import *
|
||||
from vlmeval.inference import infer_data_job
|
||||
from vlmeval.config import supported_VLM
|
||||
from vlmeval.utils import dataset_URLs, DATASET_TYPE, abbr2full, MMMU_result_transfer
|
||||
|
||||
|
||||
def parse_args():
|
||||
help_msg = """\
|
||||
You can launch the evaluation by setting either --data and --model or --config.
|
||||
|
||||
--data and --model:
|
||||
Each Arg should be a list of strings, specifying the names of datasets and models.
|
||||
To find all supported model names, please refer to the `vlmeval/config.py` or check the output of the command \
|
||||
`vlmutil mlist all` in the terminal (you should first have vlmeval installed).
|
||||
To find all supported dataset names, please refer to the `vlmeval/dataset/__init__.py` file. The python script \
|
||||
to print all supported dataset names is as follows:
|
||||
```python
|
||||
from vlmeval.dataset import SUPPORTED_DATASETS
|
||||
print(SUPPORTED_DATASETS)
|
||||
```
|
||||
or you can check the output of the command `vlmutil dlist all` in the terminal.
|
||||
To find all supported video dataset default settings, please refer to the \
|
||||
`vlmeval/dataset/video_dataset_config.py` file.
|
||||
|
||||
--config:
|
||||
Launch the evaluation by specifying the path to the config json file. Sample Json Content:
|
||||
```json
|
||||
{
|
||||
"model": {
|
||||
"GPT4o_20240806_T00_HIGH": {
|
||||
"class": "GPT4V",
|
||||
"model": "gpt-4o-2024-08-06",
|
||||
"temperature": 0,
|
||||
"img_detail": "high"
|
||||
},
|
||||
"GPT4o_20240806_T10_Low": {
|
||||
"class": "GPT4V",
|
||||
"model": "gpt-4o-2024-08-06",
|
||||
"temperature": 1.0,
|
||||
"img_detail": "low"
|
||||
},
|
||||
"GPT4o_20241120": {}
|
||||
},
|
||||
"data": {
|
||||
"MME-RealWorld-Lite": {
|
||||
"class": "MMERealWorld",
|
||||
"dataset": "MME-RealWorld-Lite"
|
||||
},
|
||||
"MMBench_DEV_EN_V11": {
|
||||
"class": "ImageMCQDataset",
|
||||
"dataset": "MMBench_DEV_EN_V11"
|
||||
},
|
||||
"MMBench_Video_8frame_nopack": {},
|
||||
"Video-MME_16frame_subs": {
|
||||
"class": "VideoMME",
|
||||
"dataset": "Video-MME",
|
||||
"nframe": 16,
|
||||
"use_subtitle": true,
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
Currently, only `model` and `data` are supported fields. The content of each field is a dictionary.
|
||||
For `model`, the key is the name of the model, and the value is a dictionary containing the following keys:
|
||||
- `class`: The class name of the model, which should be a class in `vlmeval.vlm` or `vlmeval.api`.
|
||||
- Other keys are specific to the model, please refer to the corresponding class.
|
||||
- Tip: The defined model in the `supported_VLM` of `vlmeval/config.py` can be used as a shortcut.
|
||||
For `data`, the key is the name of the dataset (should be the same as the `dataset` field in most cases, \
|
||||
except for video datasets), and the value is a dictionary containing the following keys:
|
||||
- `class`: The class name of the dataset, which should be a class in `vlmeval.dataset`.
|
||||
- `dataset`: The name of the dataset, which should be a string that is accepted by the `dataset` argument of the \
|
||||
corresponding class.
|
||||
- Other keys are specific to the dataset, please refer to the corresponding class.
|
||||
- Tip: The defined dataset in the `supported_video_datasets` of `vlmeval/dataset/video_dataset_config.py` \
|
||||
can be used as a shortcut.
|
||||
|
||||
The keys in the `model` and `data` fields will be used for naming the prediction files and evaluation results.
|
||||
When launching with `--config`, args for API VLMs, such as `--retry`, `--verbose`, will be ignored.
|
||||
"""
|
||||
parser = argparse.ArgumentParser(description=help_msg, formatter_class=argparse.RawTextHelpFormatter)
|
||||
# Essential Args, Setting the Names of Datasets and Models
|
||||
parser.add_argument('--data', type=str, nargs='+', help='Names of Datasets')
|
||||
parser.add_argument('--model', type=str, nargs='+', help='Names of Models')
|
||||
parser.add_argument('--config', type=str, help='Path to the Config Json File')
|
||||
# Work Dir
|
||||
parser.add_argument('--work-dir', type=str, default='./outputs', help='select the output directory')
|
||||
# Infer + Eval or Infer Only
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--data', type=str, nargs='+', required=True)
|
||||
parser.add_argument('--model', type=str, nargs='+', required=True)
|
||||
parser.add_argument('--work-dir', type=str, default='.', help='select the output directory')
|
||||
parser.add_argument('--mode', type=str, default='all', choices=['all', 'infer'])
|
||||
# API Kwargs, Apply to API VLMs and Judge API LLMs
|
||||
parser.add_argument('--api_nproc', type=int, default=4, help='Parallel API calling')
|
||||
parser.add_argument('--nproc', type=int, default=4, help='Parallel API calling')
|
||||
parser.add_argument('--retry', type=int, default=None, help='retry numbers for API VLMs')
|
||||
# Explicitly Set the Judge Model
|
||||
parser.add_argument('--judge', type=str, default=None)
|
||||
# Logging Utils
|
||||
parser.add_argument('--verbose', action='store_true')
|
||||
# Configuration for Resume
|
||||
# Ignore: will not rerun failed VLM inference
|
||||
parser.add_argument('--ignore', action='store_true', help='Ignore failed indices. ')
|
||||
# Reuse: will reuse the existing prediction files
|
||||
parser.add_argument('--reuse', action='store_true')
|
||||
|
||||
parser.add_argument('--verbose', action='store_true')
|
||||
parser.add_argument('--rerun', action='store_true')
|
||||
args = parser.parse_args()
|
||||
return args
|
||||
|
||||
|
||||
def main():
|
||||
logger = get_logger('RUN')
|
||||
rank, world_size = get_rank_and_world_size()
|
||||
|
||||
args = parse_args()
|
||||
use_config, cfg = False, None
|
||||
if args.config is not None:
|
||||
assert args.data is None and args.model is None, '--data and --model should not be set when using --config'
|
||||
use_config, cfg = True, load(args.config)
|
||||
args.model = list(cfg['model'].keys())
|
||||
args.data = list(cfg['data'].keys())
|
||||
else:
|
||||
assert len(args.data), '--data should be a list of data files'
|
||||
assert len(args.data), '--data should be a list of data files'
|
||||
|
||||
if rank == 0:
|
||||
if not args.reuse:
|
||||
logger.warning('--reuse is not set, will not reuse previous (before one day) temporary files')
|
||||
else:
|
||||
logger.warning('--reuse is set, will reuse the latest prediction & temporary pickle files')
|
||||
|
||||
if 'MMEVAL_ROOT' in os.environ:
|
||||
args.work_dir = os.environ['MMEVAL_ROOT']
|
||||
|
||||
if not use_config:
|
||||
if args.retry is not None:
|
||||
for k, v in supported_VLM.items():
|
||||
if hasattr(v, 'keywords') and 'retry' in v.keywords and args.retry is not None:
|
||||
if hasattr(v, 'keywords') and 'retry' in v.keywords:
|
||||
v.keywords['retry'] = args.retry
|
||||
supported_VLM[k] = v
|
||||
if hasattr(v, 'keywords') and 'verbose' in v.keywords and args.verbose is not None:
|
||||
if hasattr(v, 'keywords') and 'verbose' in v.keywords:
|
||||
v.keywords['verbose'] = args.verbose
|
||||
supported_VLM[k] = v
|
||||
|
||||
rank, world_size = get_rank_and_world_size()
|
||||
if world_size > 1:
|
||||
local_rank = os.environ.get('LOCAL_RANK', 0)
|
||||
torch.cuda.set_device(int(local_rank))
|
||||
dist.init_process_group(
|
||||
backend='nccl',
|
||||
timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
|
||||
)
|
||||
dist.init_process_group(backend='nccl', timeout=datetime.timedelta(seconds=10800))
|
||||
|
||||
for _, model_name in enumerate(args.model):
|
||||
model = None
|
||||
date, commit_id = timestr('day'), githash(digits=8)
|
||||
eval_id = f"T{date}_G{commit_id}"
|
||||
|
||||
pred_root = osp.join(args.work_dir, model_name, eval_id)
|
||||
pred_root_meta = osp.join(args.work_dir, model_name)
|
||||
os.makedirs(pred_root_meta, exist_ok=True)
|
||||
|
||||
prev_pred_roots = ls(osp.join(args.work_dir, model_name), mode='dir')
|
||||
if len(prev_pred_roots) and args.reuse:
|
||||
prev_pred_roots.sort()
|
||||
|
||||
if not osp.exists(pred_root):
|
||||
os.makedirs(pred_root, exist_ok=True)
|
||||
|
||||
if use_config:
|
||||
model = build_model_from_config(cfg['model'], model_name)
|
||||
pred_root = osp.join(args.work_dir, model_name)
|
||||
os.makedirs(pred_root, exist_ok=True)
|
||||
|
||||
for _, dataset_name in enumerate(args.data):
|
||||
try:
|
||||
result_file_base = f'{model_name}_{dataset_name}.xlsx'
|
||||
custom_flag = False
|
||||
|
||||
if use_config:
|
||||
if world_size > 1:
|
||||
if rank == 0:
|
||||
dataset = build_dataset_from_config(cfg['data'], dataset_name)
|
||||
dist.barrier()
|
||||
dataset = build_dataset_from_config(cfg['data'], dataset_name)
|
||||
if dataset is None:
|
||||
logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
|
||||
continue
|
||||
if dataset_name not in dataset_URLs:
|
||||
dataset_name = abbr2full(dataset_name)
|
||||
|
||||
if dataset_name not in dataset_URLs:
|
||||
logger.warning(f'Dataset {dataset_name} is not officially supported. ')
|
||||
file_path = osp.join(LMUDataRoot(), f'{dataset_name}.tsv')
|
||||
if not osp.exists(file_path):
|
||||
logger.error(f'Cannot find the local dataset {dataset_name}. ')
|
||||
continue
|
||||
else:
|
||||
dataset_kwargs = {}
|
||||
if dataset_name in ['MMLongBench_DOC', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI']:
|
||||
dataset_kwargs['model'] = model_name
|
||||
custom_flag = True
|
||||
|
||||
# If distributed, first build the dataset on the main process for doing preparation works
|
||||
if world_size > 1:
|
||||
if rank == 0:
|
||||
dataset = build_dataset(dataset_name, **dataset_kwargs)
|
||||
dist.barrier()
|
||||
result_file = f'{pred_root}/{model_name}_{dataset_name}.xlsx'
|
||||
if osp.exists(result_file) and args.rerun:
|
||||
os.system(f'rm {pred_root}/{model_name}_{dataset_name}_*')
|
||||
|
||||
dataset = build_dataset(dataset_name, **dataset_kwargs)
|
||||
if dataset is None:
|
||||
logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
|
||||
continue
|
||||
if model is None:
|
||||
model = model_name # which is only a name
|
||||
|
||||
# Handling Multi-Turn Dataset
|
||||
if dataset.TYPE == 'MT':
|
||||
result_file_base = result_file_base.replace('.xlsx', '.tsv')
|
||||
model = infer_data_job(
|
||||
model,
|
||||
work_dir=pred_root,
|
||||
model_name=model_name,
|
||||
dataset_name=dataset_name,
|
||||
verbose=args.verbose,
|
||||
api_nproc=args.nproc,
|
||||
ignore_failed=args.ignore)
|
||||
|
||||
result_file = osp.join(pred_root, result_file_base)
|
||||
if rank == 0:
|
||||
if dataset_name in ['MMMU_TEST']:
|
||||
result_json = MMMU_result_transfer(result_file)
|
||||
logger.info(f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}') # noqa: E501
|
||||
continue
|
||||
|
||||
# Reuse the previous prediction file if exists
|
||||
if rank == 0 and len(prev_pred_roots):
|
||||
prev_result_file = None
|
||||
prev_pkl_file_list = []
|
||||
for root in prev_pred_roots[::-1]:
|
||||
if osp.exists(osp.join(root, result_file_base)):
|
||||
prev_result_file = osp.join(root, result_file_base)
|
||||
break
|
||||
elif commit_id in root and len(ls(root)) and root != pred_root:
|
||||
temp_files = ls(root, match=[dataset_name, '.pkl'])
|
||||
if len(temp_files):
|
||||
prev_pkl_file_list.extend(temp_files)
|
||||
break
|
||||
if not args.reuse:
|
||||
prev_result_file = None
|
||||
prev_pkl_file_list = []
|
||||
if prev_result_file is not None:
|
||||
logger.warning(
|
||||
f'--reuse is set, will reuse the prediction file {prev_result_file}.')
|
||||
if prev_result_file != result_file:
|
||||
shutil.copy(prev_result_file, result_file)
|
||||
elif len(prev_pkl_file_list):
|
||||
for fname in prev_pkl_file_list:
|
||||
target_path = osp.join(pred_root, osp.basename(fname))
|
||||
if not osp.exists(target_path):
|
||||
shutil.copy(fname, target_path)
|
||||
logger.info(f'--reuse is set, will reuse the prediction pickle file {fname}.')
|
||||
else:
|
||||
logger.warning(f'File already exists: {target_path}')
|
||||
if dataset_name in [
|
||||
'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN'
|
||||
'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
|
||||
]:
|
||||
if not MMBenchOfficialServer(dataset_name):
|
||||
logger.error(
|
||||
f'Can not evaluate {dataset_name} on non-official servers, '
|
||||
'will skip the evaluation. '
|
||||
)
|
||||
continue
|
||||
|
||||
if world_size > 1:
|
||||
dist.barrier()
|
||||
judge_kwargs = {
|
||||
'nproc': args.nproc,
|
||||
'verbose': args.verbose,
|
||||
}
|
||||
if args.retry is not None:
|
||||
judge_kwargs['retry'] = args.retry
|
||||
if args.judge is not None:
|
||||
judge_kwargs['model'] = args.judge
|
||||
else:
|
||||
if DATASET_TYPE(dataset_name) in ['multi-choice', 'Y/N']:
|
||||
judge_kwargs['model'] = 'chatgpt-0613'
|
||||
elif listinstr(['MMVet', 'MathVista', 'LLaVABench'], dataset_name):
|
||||
judge_kwargs['model'] = 'gpt-4-turbo'
|
||||
if 'OPENAI_API_KEY_JUDGE' in os.environ and len(os.environ['OPENAI_API_KEY_JUDGE']):
|
||||
judge_kwargs['key'] = os.environ['OPENAI_API_KEY_JUDGE']
|
||||
if 'OPENAI_API_BASE_JUDGE' in os.environ and len(os.environ['OPENAI_API_BASE_JUDGE']):
|
||||
judge_kwargs['api_base'] = os.environ['OPENAI_API_BASE_JUDGE']
|
||||
|
||||
if model is None:
|
||||
model = model_name # which is only a name
|
||||
|
||||
# Perform the Inference
|
||||
if dataset.MODALITY == 'VIDEO':
|
||||
model = infer_data_job_video(
|
||||
model,
|
||||
work_dir=pred_root,
|
||||
model_name=model_name,
|
||||
dataset=dataset,
|
||||
result_file_name=result_file_base,
|
||||
verbose=args.verbose,
|
||||
api_nproc=args.api_nproc)
|
||||
elif dataset.TYPE == 'MT':
|
||||
model = infer_data_job_mt(
|
||||
model,
|
||||
work_dir=pred_root,
|
||||
model_name=model_name,
|
||||
dataset=dataset,
|
||||
verbose=args.verbose,
|
||||
api_nproc=args.api_nproc,
|
||||
ignore_failed=args.ignore)
|
||||
if rank == 0 and args.mode == 'all':
|
||||
if DATASET_TYPE(dataset_name) == 'multi-choice':
|
||||
dataset_name = 'default' if custom_flag else dataset_name
|
||||
multiple_choice_eval(
|
||||
result_file,
|
||||
dataset=dataset_name,
|
||||
**judge_kwargs)
|
||||
elif DATASET_TYPE(dataset_name) == 'Y/N':
|
||||
YOrN_eval(
|
||||
result_file,
|
||||
dataset=dataset_name,
|
||||
**judge_kwargs)
|
||||
elif DATASET_TYPE(dataset_name) == 'Caption':
|
||||
COCO_eval(result_file)
|
||||
elif dataset_name == 'MMVet':
|
||||
MMVet_eval(result_file, **judge_kwargs)
|
||||
elif dataset_name == 'OCRBench':
|
||||
OCRBench_eval(result_file)
|
||||
elif listinstr(['OCRVQA', 'TextVQA', 'ChartQA', 'DocVQA', 'InfoVQA'], dataset_name):
|
||||
VQAEval(result_file, dataset_name)
|
||||
elif listinstr(['MathVista'], dataset_name):
|
||||
MathVista_eval(result_file, **judge_kwargs)
|
||||
elif listinstr(['LLaVABench'], dataset_name):
|
||||
LLaVABench_eval(result_file, **judge_kwargs)
|
||||
else:
|
||||
model = infer_data_job(
|
||||
model,
|
||||
work_dir=pred_root,
|
||||
model_name=model_name,
|
||||
dataset=dataset,
|
||||
verbose=args.verbose,
|
||||
api_nproc=args.api_nproc,
|
||||
ignore_failed=args.ignore)
|
||||
|
||||
# Set the judge kwargs first before evaluation or dumping
|
||||
|
||||
judge_kwargs = {
|
||||
'nproc': args.api_nproc,
|
||||
'verbose': args.verbose,
|
||||
'retry': args.retry if args.retry is not None else 3
|
||||
}
|
||||
|
||||
if args.retry is not None:
|
||||
judge_kwargs['retry'] = args.retry
|
||||
if args.judge is not None:
|
||||
judge_kwargs['model'] = args.judge
|
||||
else:
|
||||
if dataset.TYPE in ['MCQ', 'Y/N']:
|
||||
judge_kwargs['model'] = 'chatgpt-0125'
|
||||
elif listinstr(['MMVet', 'LLaVABench', 'MMBench-Video'], dataset_name):
|
||||
judge_kwargs['model'] = 'gpt-4-turbo'
|
||||
elif listinstr(['MathVista', 'MathVerse', 'MathVision', 'DynaMath', 'VL-RewardBench', 'WeMath', 'LogicVista'], dataset_name): # noqa: E501
|
||||
judge_kwargs['model'] = 'gpt-4o-mini'
|
||||
elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'SLIDEVQA', 'MIA-Bench', 'WildVision'], dataset_name): # noqa: E501
|
||||
judge_kwargs['model'] = 'gpt-4o'
|
||||
|
||||
if rank == 0:
|
||||
logger.info(judge_kwargs)
|
||||
|
||||
if world_size > 1:
|
||||
dist.barrier()
|
||||
|
||||
# Only Rank 0 handles the evaluation part
|
||||
if rank == 0:
|
||||
# Prepare Submission Files for MMMU_TEST AND MMT-Bench_ALL
|
||||
if dataset_name in ['MMMU_TEST']:
|
||||
result_json = MMMU_result_transfer(result_file)
|
||||
logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
|
||||
f'json file saved in {result_json}')
|
||||
continue
|
||||
elif 'MMT-Bench_ALL' in dataset_name:
|
||||
submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
|
||||
logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
|
||||
f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
|
||||
f'submission file saved in {submission_file}')
|
||||
continue
|
||||
|
||||
# Skip the evaluation part if only infer
|
||||
if args.mode == 'infer':
|
||||
continue
|
||||
|
||||
# Skip the evaluation part if the dataset evaluation is not supported or annotations are missing
|
||||
if 'MLLMGuard_DS' in dataset_name:
|
||||
logger.info('The evaluation of MLLMGuard_DS is not supported yet. ')
|
||||
continue
|
||||
elif 'AesBench_TEST' == dataset_name:
|
||||
logger.info(f'The results are saved in {result_file}. '
|
||||
f'Please send it to the AesBench Team via huangyipo@hotmail.com.')
|
||||
continue
|
||||
elif dataset_name in ['DocVQA_TEST', 'InfoVQA_TEST', 'Q-Bench1_TEST', 'A-Bench_TEST']:
|
||||
logger.info(f'{dataset_name} is a test split without ground-truth. '
|
||||
'Thus only the inference part is supported for those datasets. ')
|
||||
continue
|
||||
elif dataset_name in [
|
||||
'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN',
|
||||
'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
|
||||
] and not MMBenchOfficialServer(dataset_name):
|
||||
logger.error(
|
||||
f'Can not evaluate {dataset_name} on non-official servers, will skip the evaluation.')
|
||||
continue
|
||||
|
||||
# Setup the proxy for the evaluation
|
||||
eval_proxy = os.environ.get('EVAL_PROXY', None)
|
||||
old_proxy = os.environ.get('HTTP_PROXY', '')
|
||||
if eval_proxy is not None:
|
||||
proxy_set(eval_proxy)
|
||||
|
||||
# Perform the Evaluation
|
||||
eval_results = dataset.evaluate(result_file, **judge_kwargs)
|
||||
# Display Evaluation Results in Terminal
|
||||
if eval_results is not None:
|
||||
assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
|
||||
logger.info(f'The evaluation of model {model_name} x dataset {dataset_name} has finished! ')
|
||||
logger.info('Evaluation Results:')
|
||||
if isinstance(eval_results, dict):
|
||||
logger.info('\n' + json.dumps(eval_results, indent=4))
|
||||
elif isinstance(eval_results, pd.DataFrame):
|
||||
if len(eval_results) < len(eval_results.columns):
|
||||
eval_results = eval_results.T
|
||||
logger.info('\n' + tabulate(eval_results))
|
||||
|
||||
# Restore the proxy
|
||||
if eval_proxy is not None:
|
||||
proxy_set(old_proxy)
|
||||
|
||||
# Create the symbolic links for the prediction files
|
||||
files = os.listdir(pred_root)
|
||||
files = [x for x in files if (f'{model_name}_{dataset_name}' in x or "status.json" in x)]
|
||||
for f in files:
|
||||
cwd = os.getcwd()
|
||||
file_addr = osp.join(cwd, pred_root, f)
|
||||
link_addr = osp.join(cwd, pred_root_meta, f)
|
||||
if osp.exists(link_addr) or osp.islink(link_addr):
|
||||
os.remove(link_addr)
|
||||
os.symlink(file_addr, link_addr)
|
||||
|
||||
except Exception as e:
|
||||
logger.exception(f'Model {model_name} x Dataset {dataset_name} combination failed: {e}, '
|
||||
'skipping this combination.')
|
||||
continue
|
||||
|
||||
if world_size > 1:
|
||||
dist.barrier()
|
||||
|
||||
if world_size > 1:
|
||||
dist.destroy_process_group()
|
||||
logger.error(f'Dataset {dataset_name} is not handled by evaluator, will be skipped. ')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
||||
31
eval_mm/vlmevalkit/script/run_inference.sh
Normal file
@@ -0,0 +1,31 @@
|
||||
export PATH=/usr/local/cuda/bin:$PATH
|
||||
|
||||
export HF_ENDPOINT=https://hf-mirror.com
|
||||
export OMP_NUM_THREADS=1
|
||||
export timestamp=`date +"%Y%m%d%H%M%S"`
|
||||
export OLD_VERSION='False'
|
||||
export PYTHONPATH=$(dirname $SELF_DIR):$PYTHONPATH
|
||||
|
||||
# gpu consumed
|
||||
# fp16 17-18G
|
||||
# int4 7-8G
|
||||
|
||||
# model to be used
|
||||
# Example: MODELNAME=MiniCPM-Llama3-V-2_5
|
||||
MODELNAME=$1
|
||||
# datasets to be tested
|
||||
# Example: DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
|
||||
DATALIST=$2
|
||||
# test mode, all or infer
|
||||
MODE=$3
|
||||
|
||||
echo "Starting inference with model $MODELNAME on datasets $DATALIST"
|
||||
# run on multi gpus with torchrun command
|
||||
# remember to run twice, the first run may fail
|
||||
torchrun --nproc_per_node=8 run.py --data $DATALIST --model $MODELNAME --mode $MODE
|
||||
torchrun --nproc_per_node=8 run.py --data $DATALIST --model $MODELNAME --mode $MODE
|
||||
# run on single gpu with python command
|
||||
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode $MODE
|
||||
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode $MODE
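# Example usage (values taken from the comments above; run from eval_mm/vlmevalkit):
#   ./script/run_inference.sh MiniCPM-Llama3-V-2_5 "POPE ScienceQA_TEST ChartQA_TEST" all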
|
||||
|
||||
|
||||
@@ -1,41 +0,0 @@
|
||||
export PATH=/usr/local/cuda/bin:$PATH
|
||||
|
||||
export HF_ENDPOINT=https://hf-mirror.com
|
||||
export OMP_NUM_THREADS=1
|
||||
export timestamp=`date +"%Y%m%d%H%M%S"`
|
||||
export OLD_VERSION='False'
|
||||
export PYTHONPATH=$(dirname $SELF_DIR):$PYTHONPATH
|
||||
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
|
||||
|
||||
# gpu consumed
|
||||
# fp16 17-18G
|
||||
# int4 7-8G
|
||||
|
||||
# model to be used
|
||||
# Example: MODELNAME=MiniCPM-o-2_6
|
||||
MODELNAME=$1
|
||||
# datasets to be tested
|
||||
# Example: DATALIST=MMMU_DEV_VAL
|
||||
DATALIST=$2
|
||||
|
||||
# run on multi gpus with torchrun command
|
||||
# remember to run twice, the first run may fail
|
||||
for DATASET in $DATALIST; do
|
||||
echo "Starting inference with model $MODELNAME on dataset $DATASET"
|
||||
torchrun --master_port 29500 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse
|
||||
torchrun --master_port 29501 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse
|
||||
|
||||
# for benchmarks which require gpt for scoring, you need to specify OPENAI_API_BASE and OPENAI_API_KEY in .env file
|
||||
if [[ "$DATASET" == *"MMBench_TEST"*]]; then
|
||||
echo "Skipping evaluation for dataset $DATASET"
|
||||
else
|
||||
echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
|
||||
python run.py --data $DATASET --model $MODELNAME --nproc 16 --verbose
|
||||
fi
|
||||
done
|
||||
|
||||
# run on single gpu with python command
|
||||
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
|
||||
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
|
||||
# echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
|
||||
# python run.py --data $DATASET --model $MODELNAME --nproc 16 --verbose
|
||||
@@ -1,122 +0,0 @@
|
||||
import re
|
||||
import sys
|
||||
from os.path import exists
|
||||
from setuptools import find_packages, setup
|
||||
|
||||
|
||||
def parse_requirements(fname='requirements.txt', with_version=True):
|
||||
"""Parse the package dependencies listed in a requirements file but strips
|
||||
specific versioning information.
|
||||
|
||||
Args:
|
||||
fname (str): path to requirements file
|
||||
with_version (bool, default=False): if True include version specs
|
||||
|
||||
Returns:
|
||||
List[str]: list of requirements items
|
||||
|
||||
CommandLine:
|
||||
python -c "import setup; print(setup.parse_requirements())"
|
||||
"""
|
||||
|
||||
require_fpath = fname
|
||||
|
||||
def parse_line(line):
|
||||
"""Parse information from a line in a requirements text file."""
|
||||
if line.startswith('-r '):
|
||||
# Allow specifying requirements in other files
|
||||
target = line.split(' ')[1]
|
||||
for info in parse_require_file(target):
|
||||
yield info
|
||||
else:
|
||||
info = {'line': line}
|
||||
if line.startswith('-e '):
|
||||
info['package'] = line.split('#egg=')[1]
|
||||
elif '@git+' in line:
|
||||
info['package'] = line
|
||||
else:
|
||||
# Remove versioning from the package
|
||||
pat = '(' + '|'.join(['>=', '==', '>']) + ')'
|
||||
parts = re.split(pat, line, maxsplit=1)
|
||||
parts = [p.strip() for p in parts]
|
||||
|
||||
info['package'] = parts[0]
|
||||
if len(parts) > 1:
|
||||
op, rest = parts[1:]
|
||||
if ';' in rest:
|
||||
# Handle platform specific dependencies
|
||||
# http://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
|
||||
version, platform_deps = map(str.strip,
|
||||
rest.split(';'))
|
||||
info['platform_deps'] = platform_deps
|
||||
else:
|
||||
version = rest # NOQA
|
||||
info['version'] = (op, version)
|
||||
yield info
|
||||
|
||||
def parse_require_file(fpath):
|
||||
with open(fpath, 'r') as f:
|
||||
for line in f.readlines():
|
||||
line = line.strip()
|
||||
if line and not line.startswith('#'):
|
||||
for info in parse_line(line):
|
||||
yield info
|
||||
|
||||
def gen_packages_items():
|
||||
if exists(require_fpath):
|
||||
for info in parse_require_file(require_fpath):
|
||||
parts = [info['package']]
|
||||
if with_version and 'version' in info:
|
||||
parts.extend(info['version'])
|
||||
if not sys.version.startswith('3.4'):
|
||||
# apparently package_deps are broken in 3.4
|
||||
platform_deps = info.get('platform_deps')
|
||||
if platform_deps is not None:
|
||||
parts.append(';' + platform_deps)
|
||||
item = ''.join(parts)
|
||||
yield item
|
||||
|
||||
packages = list(gen_packages_items())
|
||||
return packages
|
||||
|
||||
|
||||
with open('README.md') as f:
|
||||
readme = f.read()
|
||||
|
||||
|
||||
def do_setup():
|
||||
setup(
|
||||
name='vlmeval',
|
||||
version='0.1.0',
|
||||
description='OpenCompass VLM Evaluation Kit',
|
||||
author='Haodong Duan',
|
||||
author_email='dhd.efz@gmail.com',
|
||||
maintainer='Haodong Duan',
|
||||
maintainer_email='dhd.efz@gmail.com',
|
||||
long_description=readme,
|
||||
long_description_content_type='text/markdown',
|
||||
cmdclass={},
|
||||
install_requires=parse_requirements('requirements.txt'),
|
||||
setup_requires=[],
|
||||
python_requires='>=3.7.0',
|
||||
packages=find_packages(exclude=[
|
||||
'test*',
|
||||
'paper_test*',
|
||||
]),
|
||||
keywords=['AI', 'NLP', 'in-context learning'],
|
||||
entry_points={
|
||||
'console_scripts': ['vlmutil = vlmeval:cli']
|
||||
},
|
||||
classifiers=[
|
||||
'Programming Language :: Python :: 3.7',
|
||||
'Programming Language :: Python :: 3.8',
|
||||
'Programming Language :: Python :: 3.9',
|
||||
'Programming Language :: Python :: 3.10',
|
||||
'Intended Audience :: Developers',
|
||||
'Intended Audience :: Education',
|
||||
'Intended Audience :: Science/Research',
|
||||
])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
do_setup()
|
||||
@@ -5,12 +5,9 @@ except ImportError:
|
||||
|
||||
from .smp import *
|
||||
from .api import *
|
||||
from .dataset import *
|
||||
from .evaluate import *
|
||||
from .utils import *
|
||||
from .vlm import *
|
||||
from .config import *
|
||||
from .tools import cli
|
||||
|
||||
load_env()
|
||||
|
||||
__version__ = '0.2rc1'
|
||||
|
||||
@@ -1,5 +1,6 @@
|
||||
from .gpt import OpenAIWrapper, GPT4V
|
||||
from .gpt_int import OpenAIWrapperInternal, GPT4V_Internal
|
||||
|
||||
__all__ = [
|
||||
'OpenAIWrapper', 'GPT4V',
|
||||
'OpenAIWrapper', 'OpenAIWrapperInternal', 'GPT4V', 'GPT4V_Internal'
|
||||
]
|
||||
|
||||
@@ -3,7 +3,7 @@ import random as rd
|
||||
from abc import abstractmethod
|
||||
import os.path as osp
|
||||
import copy as cp
|
||||
from ..smp import get_logger, parse_file, concat_images_vlmeval, LMUDataRoot, md5, decode_base64_to_image_file
|
||||
from ..smp import get_logger, parse_file
|
||||
|
||||
|
||||
class BaseAPI:
|
||||
@@ -62,22 +62,12 @@ class BaseAPI:
|
||||
Returns:
|
||||
bool: If the API model is working, return True, else return False.
|
||||
"""
|
||||
self.old_timeout = None
|
||||
if hasattr(self, 'timeout'):
|
||||
self.old_timeout = self.timeout
|
||||
self.timeout = 120
|
||||
|
||||
retry = 5
|
||||
retry = 3
|
||||
while retry > 0:
|
||||
ret = self.generate('hello')
|
||||
if ret is not None and ret != '' and self.fail_msg not in ret:
|
||||
if self.old_timeout is not None:
|
||||
self.timeout = self.old_timeout
|
||||
return True
|
||||
retry -= 1
|
||||
|
||||
if self.old_timeout is not None:
|
||||
self.timeout = self.old_timeout
|
||||
return False
|
||||
|
||||
def check_content(self, msgs):
|
||||
@@ -137,82 +127,6 @@ class BaseAPI:
|
||||
else:
|
||||
return None
|
||||
|
||||
# May exceed the context windows size, so try with different turn numbers.
|
||||
def chat_inner(self, inputs, **kwargs):
|
||||
_ = kwargs.pop('dataset', None)
|
||||
while len(inputs):
|
||||
try:
|
||||
return self.generate_inner(inputs, **kwargs)
|
||||
except Exception as e:
|
||||
if self.verbose:
|
||||
self.logger.info(f'{type(e)}: {e}')
|
||||
inputs = inputs[1:]
|
||||
while len(inputs) and inputs[0]['role'] != 'user':
|
||||
inputs = inputs[1:]
|
||||
continue
|
||||
return -1, self.fail_msg + ': ' + 'Failed with all possible conversation turns.', None
|
||||
|
||||
def chat(self, messages, **kwargs1):
|
||||
"""The main function for multi-turn chatting. Will call `chat_inner` with the preprocessed input messages."""
|
||||
assert hasattr(self, 'chat_inner'), 'The API model should has the `chat_inner` method. '
|
||||
for msg in messages:
|
||||
assert isinstance(msg, dict) and 'role' in msg and 'content' in msg, msg
|
||||
assert self.check_content(msg['content']) in ['str', 'dict', 'liststr', 'listdict'], msg
|
||||
msg['content'] = self.preproc_content(msg['content'])
|
||||
# merge kwargs
|
||||
kwargs = cp.deepcopy(self.default_kwargs)
|
||||
kwargs.update(kwargs1)
|
||||
|
||||
answer = None
|
||||
# a very small random delay [0s - 0.5s]
|
||||
T = rd.random() * 0.5
|
||||
time.sleep(T)
|
||||
|
||||
assert messages[-1]['role'] == 'user'
|
||||
|
||||
for i in range(self.retry):
|
||||
try:
|
||||
ret_code, answer, log = self.chat_inner(messages, **kwargs)
|
||||
if ret_code == 0 and self.fail_msg not in answer and answer != '':
|
||||
if self.verbose:
|
||||
print(answer)
|
||||
return answer
|
||||
elif self.verbose:
|
||||
if not isinstance(log, str):
|
||||
try:
|
||||
log = log.text
|
||||
except Exception as e:
|
||||
self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
|
||||
self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
|
||||
except Exception as err:
|
||||
if self.verbose:
|
||||
self.logger.error(f'An error occurred during try {i}: ')
|
||||
self.logger.error(f'{type(err)}: {err}')
|
||||
# delay before each retry
|
||||
T = rd.random() * self.wait * 2
|
||||
time.sleep(T)
|
||||
|
||||
return self.fail_msg if answer in ['', None] else answer
|
||||
|
||||
def preprocess_message_with_role(self, message):
|
||||
system_prompt = ''
|
||||
new_message = []
|
||||
|
||||
for data in message:
|
||||
assert isinstance(data, dict)
|
||||
role = data.pop('role', 'user')
|
||||
if role == 'system':
|
||||
system_prompt += data['value'] + '\n'
|
||||
else:
|
||||
new_message.append(data)
|
||||
|
||||
if system_prompt != '':
|
||||
if self.system_prompt is None:
|
||||
self.system_prompt = system_prompt
|
||||
else:
|
||||
self.system_prompt += '\n' + system_prompt
|
||||
return new_message
|
||||
|
||||
def generate(self, message, **kwargs1):
|
||||
"""The main function to generate the answer. Will call `generate_inner` with the preprocessed input messages.
|
||||
|
||||
@@ -222,9 +136,6 @@ class BaseAPI:
|
||||
Returns:
|
||||
str: The generated answer of the Failed Message if failed to obtain answer.
|
||||
"""
|
||||
if self.check_content(message) == 'listdict':
|
||||
message = self.preprocess_message_with_role(message)
|
||||
|
||||
assert self.check_content(message) in ['str', 'dict', 'liststr', 'listdict'], f'Invalid input type: {message}'
|
||||
message = self.preproc_content(message)
|
||||
assert message is not None and self.check_content(message) == 'listdict'
|
||||
@@ -251,20 +162,20 @@ class BaseAPI:
|
||||
if not isinstance(log, str):
|
||||
try:
|
||||
log = log.text
|
||||
except Exception as e:
|
||||
self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
|
||||
except:
|
||||
self.logger.warning(f'Failed to parse {log} as an http response. ')
|
||||
self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
|
||||
except Exception as err:
|
||||
if self.verbose:
|
||||
self.logger.error(f'An error occurred during try {i}: ')
|
||||
self.logger.error(f'{type(err)}: {err}')
|
||||
self.logger.error(f'An error occurred during try {i}:')
|
||||
self.logger.error(err)
|
||||
# delay before each retry
|
||||
T = rd.random() * self.wait * 2
|
||||
time.sleep(T)
|
||||
|
||||
return self.fail_msg if answer in ['', None] else answer
|
||||
|
||||
def message_to_promptimg(self, message, dataset=None):
|
||||
def message_to_promptimg(self, message):
|
||||
assert not self.INTERLEAVE
|
||||
model_name = self.__class__.__name__
|
||||
import warnings
|
||||
@@ -280,10 +191,5 @@ class BaseAPI:
|
||||
image = [x['value'] for x in message if x['type'] == 'image'][0]
|
||||
else:
|
||||
prompt = '\n'.join([x['value'] if x['type'] == 'text' else '<image>' for x in message])
|
||||
if dataset == 'BLINK':
|
||||
image = concat_images_vlmeval(
|
||||
[x['value'] for x in message if x['type'] == 'image'],
|
||||
target_size=512)
|
||||
else:
|
||||
image = [x['value'] for x in message if x['type'] == 'image'][0]
|
||||
image = [x['value'] for x in message if x['type'] == 'image'][0]
|
||||
return prompt, image
|
||||
|
||||
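The last hunk above strips the `dataset` argument and the BLINK-specific image concatenation from `message_to_promptimg`, leaving a plain flattening of an interleaved message into one prompt string plus the first image. A minimal sketch of that flattening outside the class (the message format is taken from the surrounding code):

def message_to_promptimg(message):
    # message: list of dicts like {'type': 'text' | 'image', 'value': ...}
    prompt = '\n'.join(
        x['value'] if x['type'] == 'text' else '<image>' for x in message
    )
    images = [x['value'] for x in message if x['type'] == 'image']
    image = images[0] if images else None
    return prompt, image


msg = [
    dict(type='text', value='Describe the picture.'),
    dict(type='image', value='/path/to/cat.jpg'),
]
print(message_to_promptimg(msg))
# ('Describe the picture.\n<image>', '/path/to/cat.jpg')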
@@ -10,18 +10,18 @@ APIBASES = {
|
||||
|
||||
def GPT_context_window(model):
|
||||
length_map = {
|
||||
'gpt-4': 8192,
|
||||
'gpt-4-0613': 8192,
|
||||
'gpt-4-turbo-preview': 128000,
|
||||
'gpt-4-1106-preview': 128000,
|
||||
'gpt-4-0125-preview': 128000,
|
||||
'gpt-4-vision-preview': 128000,
|
||||
'gpt-4-turbo': 128000,
|
||||
'gpt-4-turbo-2024-04-09': 128000,
|
||||
'gpt-3.5-turbo': 16385,
|
||||
'gpt-3.5-turbo-0125': 16385,
|
||||
'gpt-4': 8192,
|
||||
'gpt-4-32k': 32768,
|
||||
'gpt-4-0613': 8192,
|
||||
'gpt-4-32k-0613': 32768,
|
||||
'gpt-3.5-turbo-1106': 16385,
|
||||
'gpt-3.5-turbo': 4096,
|
||||
'gpt-3.5-turbo-16k': 16385,
|
||||
'gpt-3.5-turbo-instruct': 4096,
|
||||
'gpt-3.5-turbo-0613': 4096,
|
||||
'gpt-3.5-turbo-16k-0613': 16385,
|
||||
}
|
||||
if model in length_map:
|
||||
return length_map[model]
|
||||
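`GPT_context_window` maps a model name to its context length; the older code path (and `gpt_int.py` below) uses it to clamp `max_tokens` so that prompt plus completion fit inside the window. A hedged sketch of that clamping with made-up numbers:

def clamp_max_tokens(max_tokens, context_window, prompt_tokens):
    """Shrink the completion budget so prompt + completion fit in the window."""
    budget = min(max_tokens, context_window - prompt_tokens)
    if 0 < budget <= 100:
        print('Less than 100 tokens left, may exceed the context window.')
    if budget <= 0:
        raise ValueError('Input string longer than context window.')
    return budget


# gpt-4 has an 8192-token window in the table above; the prompt length is assumed.
print(clamp_max_tokens(max_tokens=1024, context_window=8192, prompt_tokens=7900))  # 292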
@@ -38,7 +38,7 @@ class OpenAIWrapper(BaseAPI):
|
||||
retry: int = 5,
|
||||
wait: int = 5,
|
||||
key: str = None,
|
||||
verbose: bool = False,
|
||||
verbose: bool = True,
|
||||
system_prompt: str = None,
|
||||
temperature: float = 0,
|
||||
timeout: int = 60,
|
||||
@@ -46,7 +46,6 @@ class OpenAIWrapper(BaseAPI):
|
||||
max_tokens: int = 1024,
|
||||
img_size: int = 512,
|
||||
img_detail: str = 'low',
|
||||
use_azure: bool = False,
|
||||
**kwargs):
|
||||
|
||||
self.model = model
|
||||
@@ -54,43 +53,19 @@ class OpenAIWrapper(BaseAPI):
|
||||
self.fail_msg = 'Failed to obtain answer via API. '
|
||||
self.max_tokens = max_tokens
|
||||
self.temperature = temperature
|
||||
self.use_azure = use_azure
|
||||
|
||||
if 'step' in model:
|
||||
if 'step-1v' in model:
|
||||
env_key = os.environ.get('STEPAI_API_KEY', '')
|
||||
if key is None:
|
||||
key = env_key
|
||||
elif 'yi-vision' in model:
|
||||
env_key = os.environ.get('YI_API_KEY', '')
|
||||
if key is None:
|
||||
key = env_key
|
||||
elif 'internvl2-pro' in model:
|
||||
env_key = os.environ.get('InternVL2_PRO_KEY', '')
|
||||
if key is None:
|
||||
key = env_key
|
||||
elif 'abab' in model:
|
||||
env_key = os.environ.get('MiniMax_API_KEY', '')
|
||||
if key is None:
|
||||
key = env_key
|
||||
else:
|
||||
if use_azure:
|
||||
env_key = os.environ.get('AZURE_OPENAI_API_KEY', None)
|
||||
assert env_key is not None, 'Please set the environment variable AZURE_OPENAI_API_KEY. '
|
||||
|
||||
if key is None:
|
||||
key = env_key
|
||||
assert isinstance(key, str), (
|
||||
'Please set the environment variable AZURE_OPENAI_API_KEY to your openai key. '
|
||||
)
|
||||
else:
|
||||
env_key = os.environ.get('OPENAI_API_KEY', '')
|
||||
if key is None:
|
||||
key = env_key
|
||||
assert isinstance(key, str) and key.startswith('sk-'), (
|
||||
f'Illegal openai_key {key}. '
|
||||
'Please set the environment variable OPENAI_API_KEY to your openai key. '
|
||||
)
|
||||
|
||||
env_key = os.environ.get('OPENAI_API_KEY', '')
|
||||
if key is None:
|
||||
key = env_key
|
||||
assert isinstance(key, str) and key.startswith('sk-'), (
|
||||
f'Illegal openai_key {key}. '
|
||||
'Please set the environment variable OPENAI_API_KEY to your openai key. '
|
||||
)
|
||||
self.key = key
|
||||
assert img_size > 0 or img_size == -1
|
||||
self.img_size = img_size
|
||||
@@ -100,46 +75,30 @@ class OpenAIWrapper(BaseAPI):
|
||||
|
||||
super().__init__(wait=wait, retry=retry, system_prompt=system_prompt, verbose=verbose, **kwargs)
|
||||
|
||||
if use_azure:
|
||||
api_base_template = (
|
||||
'{endpoint}openai/deployments/{deployment_name}/chat/completions?api-version={api_version}'
|
||||
)
|
||||
endpoint = os.getenv('AZURE_OPENAI_ENDPOINT', None)
|
||||
assert endpoint is not None, 'Please set the environment variable AZURE_OPENAI_ENDPOINT. '
|
||||
deployment_name = os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME', None)
|
||||
assert deployment_name is not None, 'Please set the environment variable AZURE_OPENAI_DEPLOYMENT_NAME. '
|
||||
api_version = os.getenv('OPENAI_API_VERSION', None)
|
||||
assert api_version is not None, 'Please set the environment variable OPENAI_API_VERSION. '
|
||||
|
||||
self.api_base = api_base_template.format(
|
||||
endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
|
||||
deployment_name=os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME'),
|
||||
api_version=os.getenv('OPENAI_API_VERSION')
|
||||
)
|
||||
else:
|
||||
if api_base is None:
|
||||
if 'OPENAI_API_BASE' in os.environ and os.environ['OPENAI_API_BASE'] != '':
|
||||
self.logger.info('Environment variable OPENAI_API_BASE is set. Will use it as api_base. ')
|
||||
api_base = os.environ['OPENAI_API_BASE']
|
||||
else:
|
||||
api_base = 'OFFICIAL'
|
||||
|
||||
assert api_base is not None
|
||||
|
||||
if api_base in APIBASES:
|
||||
self.api_base = APIBASES[api_base]
|
||||
elif api_base.startswith('http'):
|
||||
self.api_base = api_base
|
||||
if api_base is None:
|
||||
if 'OPENAI_API_BASE' in os.environ and os.environ['OPENAI_API_BASE'] != '':
|
||||
self.logger.error('Environment variable OPENAI_API_BASE is set. Will use it as api_base. ')
|
||||
api_base = os.environ['OPENAI_API_BASE']
|
||||
else:
|
||||
self.logger.error('Unknown API Base. ')
|
||||
raise NotImplementedError
|
||||
api_base = 'OFFICIAL'
|
||||
|
||||
assert api_base is not None
|
||||
|
||||
if api_base in APIBASES:
|
||||
self.api_base = APIBASES[api_base]
|
||||
elif api_base.startswith('http'):
|
||||
self.api_base = api_base
|
||||
else:
|
||||
self.logger.error('Unknown API Base. ')
|
||||
sys.exit(-1)
|
||||
self.logger.info(f'Using API Base: {self.api_base}; API Key: {self.key}')
|
||||
|
||||
# inputs can be a lvl-2 nested list: [content1, content2, content3, ...]
|
||||
# content can be a string or a list of image & text
|
||||
def prepare_itlist(self, inputs):
|
||||
assert np.all([isinstance(x, dict) for x in inputs])
|
||||
def prepare_inputs(self, inputs):
|
||||
input_msgs = []
|
||||
if self.system_prompt is not None:
|
||||
input_msgs.append(dict(role='system', content=self.system_prompt))
|
||||
has_images = np.sum([x['type'] == 'image' for x in inputs])
|
||||
if has_images:
|
||||
content_list = []
|
||||
@@ -152,24 +111,11 @@ class OpenAIWrapper(BaseAPI):
|
||||
b64 = encode_image_to_base64(img, target_size=self.img_size)
|
||||
img_struct = dict(url=f'data:image/jpeg;base64,{b64}', detail=self.img_detail)
|
||||
content_list.append(dict(type='image_url', image_url=img_struct))
|
||||
input_msgs.append(dict(role='user', content=content_list))
|
||||
else:
|
||||
assert all([x['type'] == 'text' for x in inputs])
|
||||
text = '\n'.join([x['value'] for x in inputs])
|
||||
content_list = [dict(type='text', text=text)]
|
||||
return content_list
|
||||
|
||||
def prepare_inputs(self, inputs):
|
||||
input_msgs = []
|
||||
if self.system_prompt is not None:
|
||||
input_msgs.append(dict(role='system', content=self.system_prompt))
|
||||
assert isinstance(inputs, list) and isinstance(inputs[0], dict)
|
||||
assert np.all(['type' in x for x in inputs]) or np.all(['role' in x for x in inputs]), inputs
|
||||
if 'role' in inputs[0]:
|
||||
assert inputs[-1]['role'] == 'user', inputs[-1]
|
||||
for item in inputs:
|
||||
input_msgs.append(dict(role=item['role'], content=self.prepare_itlist(item['content'])))
|
||||
else:
|
||||
input_msgs.append(dict(role='user', content=self.prepare_itlist(inputs)))
|
||||
input_msgs.append(dict(role='user', content=text))
|
||||
return input_msgs
|
||||
|
||||
def generate_inner(self, inputs, **kwargs) -> str:
|
||||
@@ -177,24 +123,17 @@ class OpenAIWrapper(BaseAPI):
|
||||
temperature = kwargs.pop('temperature', self.temperature)
|
||||
max_tokens = kwargs.pop('max_tokens', self.max_tokens)
|
||||
|
||||
# context_window = GPT_context_window(self.model)
|
||||
# new_max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
|
||||
# if 0 < new_max_tokens <= 100 and new_max_tokens < max_tokens:
|
||||
# self.logger.warning(
|
||||
# 'Less than 100 tokens left, '
|
||||
# 'may exceed the context window with some additional meta symbols. '
|
||||
# )
|
||||
# if new_max_tokens <= 0:
|
||||
# return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
|
||||
# max_tokens = new_max_tokens
|
||||
context_window = GPT_context_window(self.model)
|
||||
max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
|
||||
if 0 < max_tokens <= 100:
|
||||
self.logger.warning(
|
||||
'Less than 100 tokens left, '
|
||||
'may exceed the context window with some additional meta symbols. '
|
||||
)
|
||||
if max_tokens <= 0:
|
||||
return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
|
||||
|
||||
# Will send a plain HTTP request when using Azure; unclear how to use the openai client for it
|
||||
if self.use_azure:
|
||||
headers = {'Content-Type': 'application/json', 'api-key': self.key}
|
||||
elif 'internvl2-pro' in self.model:
|
||||
headers = {'Content-Type': 'application/json', 'Authorization': self.key}
|
||||
else:
|
||||
headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {self.key}'}
|
||||
headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {self.key}'}
|
||||
payload = dict(
|
||||
model=self.model,
|
||||
messages=input_msgs,
|
||||
@@ -202,62 +141,34 @@ class OpenAIWrapper(BaseAPI):
|
||||
n=1,
|
||||
temperature=temperature,
|
||||
**kwargs)
|
||||
response = requests.post(
|
||||
self.api_base,
|
||||
headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
|
||||
response = requests.post(self.api_base, headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
|
||||
ret_code = response.status_code
|
||||
ret_code = 0 if (200 <= int(ret_code) < 300) else ret_code
|
||||
answer = self.fail_msg
|
||||
try:
|
||||
resp_struct = json.loads(response.text)
|
||||
answer = resp_struct['choices'][0]['message']['content'].strip()
|
||||
except Exception as err:
|
||||
if self.verbose:
|
||||
self.logger.error(f'{type(err)}: {err}')
|
||||
self.logger.error(response.text if hasattr(response, 'text') else response)
|
||||
|
||||
except:
|
||||
pass
|
||||
return ret_code, answer, response
|
||||
|
||||
def get_image_token_len(self, img_path, detail='low'):
|
||||
import math
|
||||
if detail == 'low':
|
||||
return 85
|
||||
|
||||
im = Image.open(img_path)
|
||||
height, width = im.size
|
||||
if width > 1024 or height > 1024:
|
||||
if width > height:
|
||||
height = int(height * 1024 / width)
|
||||
width = 1024
|
||||
else:
|
||||
width = int(width * 1024 / height)
|
||||
height = 1024
|
||||
|
||||
h = math.ceil(height / 512)
|
||||
w = math.ceil(width / 512)
|
||||
total = 85 + 170 * h * w
|
||||
return total
|
||||
|
||||
def get_token_len(self, inputs) -> int:
|
||||
import tiktoken
|
||||
try:
|
||||
enc = tiktoken.encoding_for_model(self.model)
|
||||
except Exception as err:
|
||||
if 'gpt' in self.model.lower():
|
||||
if self.verbose:
|
||||
self.logger.warning(f'{type(err)}: {err}')
|
||||
enc = tiktoken.encoding_for_model('gpt-4')
|
||||
else:
|
||||
return 0
|
||||
except:
|
||||
enc = tiktoken.encoding_for_model('gpt-4')
|
||||
assert isinstance(inputs, list)
|
||||
tot = 0
|
||||
for item in inputs:
|
||||
if 'role' in item:
|
||||
tot += self.get_token_len(item['content'])
|
||||
elif item['type'] == 'text':
|
||||
if item['type'] == 'text':
|
||||
tot += len(enc.encode(item['value']))
|
||||
elif item['type'] == 'image':
|
||||
tot += self.get_image_token_len(item['value'], detail=self.img_detail)
|
||||
tot += 85
|
||||
if self.img_detail == 'high':
|
||||
img = Image.open(item['value'])
|
||||
npatch = np.ceil(img.size[0] / 512) * np.ceil(img.size[1] / 512)
|
||||
tot += npatch * 170
|
||||
return tot
|
||||
|
||||
|
||||
|
||||
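`get_image_token_len` above follows the usual GPT-4V token accounting: 85 tokens for a low-detail image, and for high detail 85 plus 170 per 512x512 tile after the longer side is capped at 1024. A small worked example of the same arithmetic (pure math, no image I/O; width and height are passed explicitly here):

import math


def image_tokens(width, height, detail='low'):
    if detail == 'low':
        return 85
    # Cap the longer side at 1024 while keeping the aspect ratio, as above.
    if width > 1024 or height > 1024:
        if width > height:
            height = int(height * 1024 / width)
            width = 1024
        else:
            width = int(width * 1024 / height)
            height = 1024
    tiles = math.ceil(height / 512) * math.ceil(width / 512)
    return 85 + 170 * tiles


print(image_tokens(2048, 1536, detail='high'))  # resized to 1024x768 -> 4 tiles -> 765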
eval_mm/vlmevalkit/vlmeval/api/gpt_int.py (new file, 90 lines)
@@ -0,0 +1,90 @@
|
||||
import json
|
||||
import warnings
|
||||
import requests
|
||||
from ..smp import *
|
||||
from .gpt import GPT_context_window, OpenAIWrapper
|
||||
|
||||
url = 'http://ecs.sv.us.alles-apin.openxlab.org.cn/v1/openai/v2/text/chat'
|
||||
headers = {
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
|
||||
class OpenAIWrapperInternal(OpenAIWrapper):
|
||||
|
||||
is_api: bool = True
|
||||
|
||||
def __init__(self,
|
||||
model: str = 'gpt-3.5-turbo-0613',
|
||||
retry: int = 5,
|
||||
wait: int = 3,
|
||||
verbose: bool = True,
|
||||
system_prompt: str = None,
|
||||
temperature: float = 0,
|
||||
timeout: int = 60,
|
||||
max_tokens: int = 1024,
|
||||
img_size: int = 512,
|
||||
img_detail: str = 'low',
|
||||
**kwargs):
|
||||
|
||||
self.model = model
|
||||
if 'KEYS' in os.environ and osp.exists(os.environ['KEYS']):
|
||||
keys = load(os.environ['KEYS'])
|
||||
headers['alles-apin-token'] = keys.get('alles-apin-token', '')
|
||||
elif 'ALLES' in os.environ:
|
||||
headers['alles-apin-token'] = os.environ['ALLES']
|
||||
self.headers = headers
|
||||
self.temperature = temperature
|
||||
self.timeout = timeout
|
||||
self.max_tokens = max_tokens
|
||||
|
||||
assert img_size > 0 or img_size == -1
|
||||
self.img_size = img_size
|
||||
assert img_detail in ['high', 'low']
|
||||
self.img_detail = img_detail
|
||||
|
||||
super(OpenAIWrapper, self).__init__(
|
||||
wait=wait, retry=retry, system_prompt=system_prompt, verbose=verbose, **kwargs)
|
||||
|
||||
def generate_inner(self, inputs, **kwargs) -> str:
|
||||
input_msgs = self.prepare_inputs(inputs)
|
||||
|
||||
temperature = kwargs.pop('temperature', self.temperature)
|
||||
max_tokens = kwargs.pop('max_tokens', self.max_tokens)
|
||||
|
||||
# Hold out 100 tokens as a buffer
|
||||
context_window = GPT_context_window(self.model)
|
||||
max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
|
||||
if 0 < max_tokens <= 100:
|
||||
print('Less than 100 tokens left, may exceed the context window with some additional meta symbols. ')
|
||||
if max_tokens <= 0:
|
||||
return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
|
||||
|
||||
payload = dict(
|
||||
model=self.model,
|
||||
messages=input_msgs,
|
||||
max_tokens=max_tokens,
|
||||
n=1,
|
||||
stop=None,
|
||||
timeout=self.timeout,
|
||||
temperature=temperature,
|
||||
**kwargs)
|
||||
|
||||
response = requests.post(url, headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
|
||||
ret_code = response.status_code
|
||||
ret_code = 0 if (200 <= int(ret_code) < 300) else ret_code
|
||||
|
||||
answer = self.fail_msg
|
||||
try:
|
||||
resp_struct = json.loads(response.text)
|
||||
assert resp_struct['msg'] == 'ok' and resp_struct['msgCode'] == '10000', resp_struct
|
||||
answer = resp_struct['data']['choices'][0]['message']['content'].strip()
|
||||
except:
|
||||
pass
|
||||
return ret_code, answer, response
|
||||
|
||||
|
||||
class GPT4V_Internal(OpenAIWrapperInternal):
|
||||
|
||||
def generate(self, message, dataset=None):
|
||||
return super(GPT4V_Internal, self).generate(message)
|
||||
@@ -2,19 +2,18 @@ from vlmeval.vlm import *
from vlmeval.api import *
from functools import partial

minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
    'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
ungrouped = {
    'MiniCPM-V':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5':partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
}

supported_VLM = {}

model_groups = [
    minicpm_series
    ungrouped
]

for grp in model_groups:
    supported_VLM.update(grp)
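The config hunk above registers each MiniCPM variant as a `functools.partial` over its wrapper class and merges the groups into `supported_VLM`, so models are only instantiated when requested by name. A minimal sketch of the same registry pattern with a stand-in class (the real wrapper classes come from `vlmeval.vlm`):

from functools import partial


class FakeVLM:
    def __init__(self, model_path):
        self.model_path = model_path


# name -> zero-argument constructor, mirroring supported_VLM above
ungrouped = {
    'MiniCPM-V-2': partial(FakeVLM, model_path='openbmb/MiniCPM-V-2'),
}

supported_VLM = {}
for grp in [ungrouped]:
    supported_VLM.update(grp)

model = supported_VLM['MiniCPM-V-2']()  # instantiated lazily, only when selected
print(model.model_path)  # openbmb/MiniCPM-V-2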
@@ -1,237 +0,0 @@
|
||||
import warnings
|
||||
|
||||
from .image_base import img_root_map, ImageBaseDataset
|
||||
from .image_caption import ImageCaptionDataset
|
||||
from .image_yorn import ImageYORNDataset
|
||||
from .image_mcq import (
|
||||
ImageMCQDataset, MMMUDataset, CustomMCQDataset, MUIRDataset, GMAIMMBenchDataset, MMERealWorld, HRBenchDataset,
|
||||
NaturalBenchDataset
|
||||
)
|
||||
from .image_mt import MMDUDataset
|
||||
from .image_vqa import (
|
||||
ImageVQADataset, MathVision, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
|
||||
CustomVQADataset, CRPE, MathVerse, OlympiadBench, QSpatial, VizWiz, MMNIAH, WeMath, LogicVista
|
||||
)
|
||||
|
||||
from .image_ccocr import CCOCRDataset
|
||||
from .text_mcq import CustomTextMCQDataset, TextMCQDataset
|
||||
|
||||
from .vcr import VCRDataset
|
||||
from .mmlongbench import MMLongBench
|
||||
from .dude import DUDE
|
||||
from .slidevqa import SlideVQA
|
||||
from .vl_rewardbench import VLRewardBench
|
||||
|
||||
from .mmbench_video import MMBenchVideo
|
||||
from .videomme import VideoMME
|
||||
from .mvbench import MVBench, MVBench_MP4
|
||||
from .mlvu import MLVU, MLVU_MCQ, MLVU_OpenEnded
|
||||
from .tempcompass import TempCompass, TempCompass_Captioning, TempCompass_MCQ, TempCompass_YorN
|
||||
from .longvideobench import LongVideoBench
|
||||
from .video_concat_dataset import ConcatVideoDataset
|
||||
from .mmgenbench import MMGenBench
|
||||
from .cgbench import CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
|
||||
|
||||
from .miabench import MIABench
|
||||
from .cmmmu import CMMMU
|
||||
from .wildvision import WildVision
|
||||
from .mmmath import MMMath
|
||||
from .dynamath import Dynamath
|
||||
from .utils import *
|
||||
from .video_dataset_config import *
|
||||
from ..smp import *
|
||||
|
||||
|
||||
class ConcatDataset(ImageBaseDataset):
|
||||
# This dataset takes multiple dataset names as input and aggregate them into a single dataset.
|
||||
# Each single dataset should not have a field named `SUB_DATASET`
|
||||
|
||||
DATASET_SETS = {
|
||||
'MMMB': ['MMMB_ar', 'MMMB_cn', 'MMMB_en', 'MMMB_pt', 'MMMB_ru', 'MMMB_tr'],
|
||||
'MTL_MMBench_DEV': [
|
||||
'MMBench_dev_ar', 'MMBench_dev_cn', 'MMBench_dev_en',
|
||||
'MMBench_dev_pt', 'MMBench_dev_ru', 'MMBench_dev_tr'
|
||||
]
|
||||
}
|
||||
|
||||
def __init__(self, dataset):
|
||||
datasets = self.DATASET_SETS[dataset]
|
||||
self.dataset_map = {}
|
||||
# The name of the compilation
|
||||
self.dataset_name = dataset
|
||||
self.datasets = datasets
|
||||
for dname in datasets:
|
||||
dataset = build_dataset(dname)
|
||||
assert dataset is not None, dataset
|
||||
self.dataset_map[dname] = dataset
|
||||
TYPES = [x.TYPE for x in self.dataset_map.values()]
|
||||
MODALITIES = [x.MODALITY for x in self.dataset_map.values()]
|
||||
assert np.all([x == TYPES[0] for x in TYPES]), (datasets, TYPES)
|
||||
assert np.all([x == MODALITIES[0] for x in MODALITIES]), (datasets, MODALITIES)
|
||||
self.TYPE = TYPES[0]
|
||||
self.MODALITY = MODALITIES[0]
|
||||
data_all = []
|
||||
for dname in datasets:
|
||||
data = self.dataset_map[dname].data
|
||||
data['SUB_DATASET'] = [dname] * len(data)
|
||||
data_new = localize_df(data, dname, nproc=16)
|
||||
data_all.append(data_new)
|
||||
|
||||
data = pd.concat(data_all)
|
||||
data['original_index'] = data.pop('index')
|
||||
data['index'] = np.arange(len(data))
|
||||
self.data = data
|
||||
|
||||
def build_prompt(self, line):
|
||||
if isinstance(line, int):
|
||||
line = self.data.iloc[line]
|
||||
idx = line['original_index']
|
||||
dname = line['SUB_DATASET']
|
||||
org_data = self.dataset_map[dname].data
|
||||
org_line = cp.deepcopy(org_data[org_data['index'] == idx]).iloc[0]
|
||||
return self.dataset_map[dname].build_prompt(org_line)
|
||||
|
||||
def dump_image(self, line):
|
||||
# Assert all images are pre-dumped
|
||||
assert 'image' not in line
|
||||
assert 'image_path' in line
|
||||
tgt_path = toliststr(line['image_path'])
|
||||
return tgt_path
|
||||
|
||||
@classmethod
|
||||
def supported_datasets(cls):
|
||||
return list(cls.DATASET_SETS)
|
||||
|
||||
def evaluate(self, eval_file, **judge_kwargs):
|
||||
suffix = eval_file.split('.')[-1]
|
||||
# First, split the eval_file by dataset
|
||||
data_all = load(eval_file)
|
||||
for dname in self.datasets:
|
||||
tgt = eval_file.replace(self.dataset_name, dname)
|
||||
data_sub = data_all[data_all['SUB_DATASET'] == dname]
|
||||
data_sub.pop('index')
|
||||
data_sub['index'] = data_sub.pop('original_index')
|
||||
data_sub.pop('SUB_DATASET')
|
||||
dump(data_sub, tgt)
|
||||
# Then, evaluate each dataset separately
|
||||
results_all = []
|
||||
for dname in self.datasets:
|
||||
tgt = eval_file.replace(self.dataset_name, dname)
|
||||
res = self.dataset_map[dname].evaluate(tgt, **judge_kwargs)
|
||||
assert isinstance(res, pd.DataFrame)
|
||||
res['DATASET'] = [dname] * len(res)
|
||||
results_all.append(res)
|
||||
result = pd.concat(results_all)
|
||||
score_file = eval_file.replace(f'.{suffix}', '_acc.csv')
|
||||
dump(result, score_file)
|
||||
return result
|
||||
|
||||
|
||||
# Add new supported dataset class here
|
||||
IMAGE_DATASET = [
|
||||
ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset, MathVision,
|
||||
MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
|
||||
MMLongBench, VCRDataset, MMDUDataset, DUDE, SlideVQA, MUIRDataset, CCOCRDataset,
|
||||
GMAIMMBenchDataset, MMERealWorld, HRBenchDataset, CRPE, MathVerse, NaturalBenchDataset,
|
||||
MIABench, OlympiadBench, WildVision, MMMath, QSpatial, Dynamath, MMGenBench, VizWiz, MMNIAH,
|
||||
CMMMU, VLRewardBench, WeMath, LogicVista
|
||||
]
|
||||
|
||||
VIDEO_DATASET = [
|
||||
MMBenchVideo, VideoMME, MVBench, MVBench_MP4, LongVideoBench,
|
||||
MLVU, MLVU_MCQ, MLVU_OpenEnded,
|
||||
TempCompass, TempCompass_MCQ, TempCompass_Captioning, TempCompass_YorN,
|
||||
CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
|
||||
]
|
||||
|
||||
TEXT_DATASET = [
|
||||
TextMCQDataset
|
||||
]
|
||||
|
||||
CUSTOM_DATASET = [
|
||||
CustomMCQDataset, CustomVQADataset, CustomTextMCQDataset
|
||||
]
|
||||
|
||||
DATASET_COLLECTION = [ConcatDataset, ConcatVideoDataset]
|
||||
|
||||
DATASET_CLASSES = IMAGE_DATASET + VIDEO_DATASET + TEXT_DATASET + CUSTOM_DATASET + DATASET_COLLECTION
|
||||
SUPPORTED_DATASETS = []
|
||||
for DATASET_CLS in DATASET_CLASSES:
|
||||
SUPPORTED_DATASETS.extend(DATASET_CLS.supported_datasets())
|
||||
|
||||
|
||||
def DATASET_TYPE(dataset, *, default: str = 'MCQ') -> str:
|
||||
for cls in DATASET_CLASSES:
|
||||
if dataset in cls.supported_datasets():
|
||||
if hasattr(cls, 'TYPE'):
|
||||
return cls.TYPE
|
||||
# Have to add specific routine to handle ConcatDataset
|
||||
if dataset in ConcatDataset.DATASET_SETS:
|
||||
dataset_list = ConcatDataset.DATASET_SETS[dataset]
|
||||
TYPES = [DATASET_TYPE(dname) for dname in dataset_list]
|
||||
assert np.all([x == TYPES[0] for x in TYPES]), (dataset_list, TYPES)
|
||||
return TYPES[0]
|
||||
|
||||
if 'openended' in dataset.lower():
|
||||
return 'VQA'
|
||||
warnings.warn(f'Dataset {dataset} is a custom one and not annotated as `openended`, will treat as {default}. ')
|
||||
return default
|
||||
|
||||
|
||||
def DATASET_MODALITY(dataset, *, default: str = 'IMAGE') -> str:
|
||||
if dataset is None:
|
||||
warnings.warn(f'Dataset is not specified, will treat modality as {default}. ')
|
||||
return default
|
||||
for cls in DATASET_CLASSES:
|
||||
if dataset in cls.supported_datasets():
|
||||
if hasattr(cls, 'MODALITY'):
|
||||
return cls.MODALITY
|
||||
# Have to add specific routine to handle ConcatDataset
|
||||
if dataset in ConcatDataset.DATASET_SETS:
|
||||
dataset_list = ConcatDataset.DATASET_SETS[dataset]
|
||||
MODALITIES = [DATASET_MODALITY(dname) for dname in dataset_list]
|
||||
assert np.all([x == MODALITIES[0] for x in MODALITIES]), (dataset_list, MODALITIES)
|
||||
return MODALITIES[0]
|
||||
|
||||
if 'VIDEO' in dataset.lower():
|
||||
return 'VIDEO'
|
||||
elif 'IMAGE' in dataset.lower():
|
||||
return 'IMAGE'
|
||||
warnings.warn(f'Dataset {dataset} is a custom one, will treat modality as {default}. ')
|
||||
return default
|
||||
|
||||
|
||||
def build_dataset(dataset_name, **kwargs):
|
||||
for cls in DATASET_CLASSES:
|
||||
if dataset_name in supported_video_datasets:
|
||||
return supported_video_datasets[dataset_name](**kwargs)
|
||||
elif dataset_name in cls.supported_datasets():
|
||||
return cls(dataset=dataset_name, **kwargs)
|
||||
|
||||
warnings.warn(f'Dataset {dataset_name} is not officially supported. ')
|
||||
|
||||
data_file = osp.join(LMUDataRoot(), f'{dataset_name}.tsv')
|
||||
if not osp.exists(data_file):
|
||||
warnings.warn(f'Data file {data_file} does not exist. Dataset building failed. ')
|
||||
return None
|
||||
|
||||
data = load(data_file)
|
||||
if 'question' not in [x.lower() for x in data.columns]:
|
||||
warnings.warn(f'Data file {data_file} does not have a `question` column. Dataset building failed. ')
|
||||
return None
|
||||
|
||||
if 'A' in data and 'B' in data:
|
||||
if 'image' in data or 'image_path' in data:
|
||||
warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom MCQ dataset. ')
|
||||
return CustomMCQDataset(dataset=dataset_name, **kwargs)
|
||||
else:
|
||||
warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom Text MCQ dataset. ')
|
||||
return CustomTextMCQDataset(dataset=dataset_name, **kwargs)
|
||||
else:
|
||||
warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom VQA dataset. ')
|
||||
return CustomVQADataset(dataset=dataset_name, **kwargs)
|
||||
|
||||
|
||||
__all__ = [
|
||||
'build_dataset', 'img_root_map', 'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
|
||||
] + [cls.__name__ for cls in DATASET_CLASSES]
|
||||
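The removed `vlmeval/dataset/__init__.py` above resolves a dataset name to a class through each class's `supported_datasets()`, falling back to custom MCQ/VQA datasets built from a local TSV. A hedged usage sketch; it assumes a working VLMEvalKit installation, a downloaded TSV under `LMUDataRoot()`, and that `MMBench_DEV_EN` is among the supported names:

# Assumed usage of the registry above; not runnable without vlmeval installed
# and the corresponding dataset TSV available locally.
from vlmeval.dataset import build_dataset, DATASET_TYPE

dataset = build_dataset('MMBench_DEV_EN')  # resolved via supported_datasets()
print(DATASET_TYPE('MMBench_DEV_EN'))      # expected to report an MCQ-type dataset
print(len(dataset.data))                   # number of rows in the underlying table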
@@ -1,354 +0,0 @@
|
||||
from .image_base import ImageBaseDataset
|
||||
import random
|
||||
from collections import Counter
|
||||
import os
|
||||
import re
|
||||
import tempfile
|
||||
from ..smp import *
|
||||
|
||||
|
||||
def get_multi_choice_prediction(response, all_choices, index2ans):
|
||||
for char in [',', '.', '!', '?', ';', ':', "'"]:
|
||||
response = response.strip(char)
|
||||
response = " " + response + " " # add space to avoid partial match
|
||||
|
||||
candidates = []
|
||||
|
||||
for choice in all_choices: # (A) (B) (C) (D)
|
||||
# Add the choice to candidates each time it appears in the response
|
||||
candidates.extend([choice for _ in range(response.count(f'({choice})'))])
|
||||
|
||||
if len(candidates) == 0:
|
||||
for choice in all_choices: # A B C D
|
||||
# Similarly, add the choice for each occurrence
|
||||
candidates.extend([choice for _ in range(response.count(f'{choice}'))])
|
||||
|
||||
if len(candidates) == 0 and len(response.split()) >= 1:
|
||||
for index, ans in index2ans.items():
|
||||
# Add index for each occurrence of ans in response
|
||||
candidates.extend([index for _ in range(response.count(ans))])
|
||||
|
||||
# if none of the above yields candidates, check whether any full option text appears in the response
|
||||
if len(candidates) == 0 and len(response.split()) >= 1:
|
||||
for index, ans in index2ans.items():
|
||||
if ans in response:
|
||||
candidates.append(index)
|
||||
# index_ans = False # it's content ans.
|
||||
|
||||
if len(candidates) == 0: # still not get answer, randomly choose one.
|
||||
return random.choice(all_choices)
|
||||
# return ''
|
||||
else:
|
||||
# Count the occurrence of each candidate
|
||||
candidate_counts = Counter(candidates)
|
||||
|
||||
# Select the most frequent candidates
|
||||
max_count = max(candidate_counts.values())
|
||||
most_frequent_candidates = [c for c in all_choices if candidate_counts.get(c, 0) == max_count]
|
||||
|
||||
# Combine the most frequent candidates in ABCD order
|
||||
return ''.join(most_frequent_candidates)
|
||||
|
||||
|
||||
def extract_numbers(string):
|
||||
# Pattern for numbers with Chinese commas
|
||||
pattern_commas = r'-?\d{1,3}(?:,\d{3})+'
|
||||
# Pattern for scientific notation
|
||||
pattern_scientific = r'-?\d+(?:\.\d+)?[eE][+-]?\d+'
|
||||
# Pattern for simple numbers without Chinese commas
|
||||
pattern_simple = r'-?(?:\d+\.\d+|\.\d+|\d+)(?![eE][+-]?\d+)(?!,\d)'
|
||||
|
||||
# Extract numbers with Chinese commas
|
||||
numbers_with_commas = re.findall(pattern_commas, string)
|
||||
# Extract numbers in scientific notation
|
||||
numbers_scientific = re.findall(pattern_scientific, string)
|
||||
# Extract simple numbers without Chinese commas
|
||||
numbers_simple = re.findall(pattern_simple, string)
|
||||
|
||||
# Combine all extracted numbers
|
||||
all_numbers = numbers_with_commas + numbers_scientific + numbers_simple
|
||||
return all_numbers
|
||||
|
||||
|
||||
def check_is_number(string):
|
||||
try:
|
||||
float(string.replace(',', ''))
|
||||
return True
|
||||
except ValueError:
|
||||
# check if there's comma inside
|
||||
return False
|
||||
|
||||
|
||||
def count_letters(string):
|
||||
return sum(c.isalpha() and 'a' <= c <= 'z' or 'A' <= c <= 'Z' for c in string)
|
||||
|
||||
|
||||
def normalize_str(string, answer):
|
||||
# check if characters in the string
|
||||
|
||||
# if the string is a number, convert it to a numeric value.
|
||||
if string is None:
|
||||
return [string]
|
||||
string = string.strip()
|
||||
|
||||
is_number = check_is_number(string)
|
||||
|
||||
if is_number:
|
||||
string = string.replace(',', '')
|
||||
string = float(string)
|
||||
# leave 2 decimal
|
||||
string = round(string, 2)
|
||||
return [string]
|
||||
else: # it's likely to be a string
|
||||
if len(string) > len(answer) + 20 or count_letters(string) > count_letters(answer) + 2:
|
||||
return []
|
||||
return [string]
|
||||
|
||||
|
||||
def get_fill_blank_prediction(response, answer):
|
||||
"""get the prediction from the generated response,
|
||||
return a list of predicted strings or numbers"""
|
||||
|
||||
def get_key_subresponses(response):
|
||||
response = response.strip("。").strip()
|
||||
sub_responses = re.split(r'。|\n', response)
|
||||
indicators_of_keys = ['是', '为', '所以', '等于', '方案', '选择',
|
||||
'正确答案', '因此', '最后', '答案', '结果']
|
||||
key_responses = []
|
||||
for index, resp in enumerate(sub_responses):
|
||||
# for the last sub-response, also accept an equation (the entire response can be a single sentence with an equation)
|
||||
if index == len(sub_responses) - 1:
|
||||
indicators_of_keys.extend(['='])
|
||||
shortest_key_response = None
|
||||
# the shortest response that may contain the answer (tail part of the response)
|
||||
for indicator in indicators_of_keys:
|
||||
if indicator in resp:
|
||||
if not shortest_key_response:
|
||||
shortest_key_response = resp.split(indicator)[-1].strip()
|
||||
else:
|
||||
if len(resp.split(indicator)[-1].strip()) < len(shortest_key_response):
|
||||
shortest_key_response = resp.split(indicator)[-1].strip()
|
||||
|
||||
if shortest_key_response:
|
||||
# and it's not trivial
|
||||
if shortest_key_response.strip() not in [":", ",", ".", "!", "?", ";", ":", "'"]:
|
||||
key_responses.append(shortest_key_response)
|
||||
if len(key_responses) == 0: # did not find any
|
||||
return [response]
|
||||
return key_responses
|
||||
|
||||
key_responses = get_key_subresponses(response)
|
||||
|
||||
pred_list = key_responses.copy() # keep the original string response
|
||||
for resp in key_responses:
|
||||
pred_list.extend(extract_numbers(resp))
|
||||
|
||||
tmp_pred_list = []
|
||||
for i in range(len(pred_list)):
|
||||
tmp_pred_list.extend(normalize_str(pred_list[i], answer))
|
||||
pred_list = tmp_pred_list
|
||||
|
||||
# remove duplicates
|
||||
pred_list = list(set(pred_list))
|
||||
|
||||
return pred_list
|
||||
|
||||
|
||||
def get_TF_prediction(response):
|
||||
"""get the prediction from the generated response,
|
||||
return a list of predicted strings or numbers"""
|
||||
|
||||
def get_key_subresponses(response):
|
||||
response = response.strip("。").strip()
|
||||
sub_responses = re.split(r'。|\n', response)
|
||||
indicators_of_keys = ['是', '为', '所以', '判断',
|
||||
'陈述', '说法', '表达', '答案', '结果']
|
||||
key_responses = []
|
||||
for index, resp in enumerate(sub_responses):
|
||||
shortest_key_response = None
|
||||
# the shortest response that may contain the answer (tail part of the response)
|
||||
for indicator in indicators_of_keys:
|
||||
if indicator in resp:
|
||||
if not shortest_key_response:
|
||||
shortest_key_response = resp.split(indicator)[-1].strip()
|
||||
else:
|
||||
if len(resp.split(indicator)[-1].strip()) < len(shortest_key_response):
|
||||
shortest_key_response = resp.split(indicator)[-1].strip()
|
||||
|
||||
if shortest_key_response:
|
||||
# and it's not trivial
|
||||
if shortest_key_response.strip() not in [":", ",", ".", "!", "?", ";", ":", "'"]:
|
||||
key_responses.append(shortest_key_response)
|
||||
if len(key_responses) == 0: # did not find any
|
||||
return [response]
|
||||
return key_responses
|
||||
|
||||
key_responses = get_key_subresponses(response)
|
||||
|
||||
pred_list = key_responses.copy() # keep the original string response
|
||||
# remove duplicates
|
||||
pred_list = list(set(pred_list))
|
||||
|
||||
return pred_list
|
||||
|
||||
|
||||
class CMMMU(ImageBaseDataset):
|
||||
TYPE = 'VQA'
|
||||
|
||||
DATASET_URL = {
|
||||
'CMMMU_VAL': 'https://opencompass.openxlab.space/utils/VLMEval/CMMMU_VAL.tsv'
|
||||
}
|
||||
|
||||
DATASET_MD5 = {
|
||||
'CMMMU_VAL': 'b4727e2fce2415bf646379e60c11a726'
|
||||
}
|
||||
|
||||
def dump_image(self, line):
|
||||
os.makedirs(self.img_root, exist_ok=True)
|
||||
|
||||
tgt_path_z = []
|
||||
if isinstance(line['image'], list):
|
||||
for i in range(len(line['image'])):
|
||||
tgt_path = osp.join(self.img_root, f"{line['index']}--{i + 1}.jpg")
|
||||
if not read_ok(tgt_path):
|
||||
decode_base64_to_image_file(line['image'][i], tgt_path)
|
||||
tgt_path_z.append(tgt_path)
|
||||
else:
|
||||
tgt_path = osp.join(self.img_root, f"{line['index']}.jpg")
|
||||
if not read_ok(tgt_path):
|
||||
decode_base64_to_image_file(line['image'], tgt_path)
|
||||
tgt_path_z.append(tgt_path)
|
||||
return tgt_path_z
|
||||
|
||||
@classmethod
|
||||
def evaluate(self, eval_file, **judge_kwargs):
|
||||
|
||||
suffix = eval_file.split('.')[-1]
|
||||
result_file = eval_file.replace(f'.{suffix}', '_acc.csv')
|
||||
|
||||
if not osp.exists(result_file):
|
||||
data = load(eval_file)
|
||||
assert 'answer' in data and 'prediction' in data
|
||||
data['prediction'] = [str(x) for x in data['prediction']]
|
||||
data['answer'] = [str(x) for x in data['answer']]
|
||||
|
||||
correct_count = 0
|
||||
correct_category = {
|
||||
'技术与工程': [0, 0],
|
||||
'科学': [0, 0],
|
||||
'健康与医学': [0, 0],
|
||||
'商业': [0, 0],
|
||||
'艺术与设计': [0, 0],
|
||||
'人文社会科学': [0, 0],
|
||||
}
|
||||
|
||||
for i in tqdm(data.iterrows()):
|
||||
line = i[1]
|
||||
correct_category[line['category']][0] += 1
|
||||
|
||||
# Options
|
||||
if line['type'] == '选择':
|
||||
index2ans = {
|
||||
'A': line['option1'],
|
||||
'B': line['option2'],
|
||||
'C': line['option3'],
|
||||
'D': line['option4']
|
||||
}
|
||||
fact_option = get_multi_choice_prediction(line['prediction'], ['A', 'B', 'C', 'D'], index2ans)
|
||||
if fact_option == line['answer']:
|
||||
correct_count += 1
|
||||
correct_category[line['category']][1] += 1
|
||||
|
||||
# Binary
|
||||
elif line['type'] == '判断':
|
||||
positive_keywords = ['正确', '对', '准确', '肯定', '对的']
|
||||
negative_keywords = ['不对', '错误', '不正确', '不准确', '不合适', '否定', '错的', '错']
|
||||
ambiguous_keywords = ['对错', '是否正确', '否正确', '或者', '是否', '正确性', '对不']
|
||||
|
||||
def judge_similarity(pred_list, positive_keywords, negative_keywords):
|
||||
positive_count = 0
|
||||
negative_count = 0
|
||||
|
||||
for pred in pred_list:
|
||||
if any(pos_word in pred for pos_word in positive_keywords):
|
||||
positive_count += 1
|
||||
elif any(neg_word in pred for neg_word in negative_keywords):
|
||||
negative_count += 1
|
||||
|
||||
if positive_count > negative_count:
|
||||
return "对"
|
||||
elif negative_count > positive_count:
|
||||
return "错"
|
||||
else:
|
||||
return random.choice(['对', '错'])
|
||||
|
||||
answer = get_TF_prediction(line['prediction'])
|
||||
answer = [word for word in answer if not any(ambiguous in word for ambiguous in ambiguous_keywords)]
|
||||
fact_answer = judge_similarity(answer, positive_keywords, negative_keywords)
|
||||
if fact_answer == line['answer']:
|
||||
correct_count += 1
|
||||
correct_category[line['category']][1] += 1
|
||||
|
||||
# Fill the Blank
|
||||
else:
|
||||
norm_answers = normalize_str(line['answer'], line['answer'])
|
||||
predicted_answer = get_fill_blank_prediction(line['prediction'], line['answer'])
|
||||
|
||||
for pred in predicted_answer:
|
||||
# already normalized
|
||||
if isinstance(pred, str): # if it's a string, then find if ans in the pred_i
|
||||
for norm_ans in norm_answers:
|
||||
# only see if the string answer in the string pred
|
||||
# print(norm_ans, pred)
|
||||
if isinstance(norm_ans, str) and norm_ans in pred:
|
||||
correct_count += 1
|
||||
correct_category[line['category']][1] += 1
|
||||
else: # it's a number
|
||||
if pred in norm_answers:
|
||||
correct_count += 1
|
||||
correct_category[line['category']][1] += 1
|
||||
|
||||
accuracyz = {}
|
||||
accuracyz['总准确率'] = correct_count / len(data)
|
||||
for i in correct_category.keys():
|
||||
accuracyz[i] = correct_category[i][1] / correct_category[i][0]
|
||||
|
||||
accuracyz = d2df(accuracyz)
|
||||
accuracyz.round(10)
|
||||
dump(accuracyz, result_file)
|
||||
|
||||
result = pd.read_csv(result_file)
|
||||
return result
|
||||
|
||||
def build_prompt(self, line):
|
||||
if line['type'] == '选择':
|
||||
tgt_path = self.dump_image(line)
|
||||
question = line['question']
|
||||
options_prompt = 'Options:\n'
|
||||
|
||||
for i in [['A', '1'], ['B', '2'], ['C', '3'], ['D', '4']]:
|
||||
options_prompt += i[0] + '. ' + line['option' + i[1]] + '\n'
|
||||
|
||||
prompt = (f'问题: {question}\n' + options_prompt
|
||||
+ '请回答上述多项选择题,并选出正确选项。这些题目可能包括单选和多选题型。如果所提供的信息不足以确定一个明确的答案,那么请根据可用的数据和你的判断来选择最可能正确的选项。')
|
||||
|
||||
msgs = []
|
||||
if isinstance(tgt_path, list):
|
||||
msgs.extend([dict(type='image', value=p) for p in tgt_path])
|
||||
else:
|
||||
msgs = [dict(type='image', value=tgt_path)]
|
||||
msgs.append(dict(type='text', value=prompt))
|
||||
|
||||
return msgs
|
||||
|
||||
elif line['type'] == '判断':
|
||||
msgs = super().build_prompt(line)
|
||||
assert msgs[-1]['type'] == 'text'
|
||||
msgs[-1]['value'] += '\n请回答上述判断题,并根据题目描述和所给的信息来判断问题中陈述的对错。如果信息不完整或不足以作出绝对判断,请运用你的逻辑推理和现有信息来做出最可能的判断。'
|
||||
return msgs
|
||||
|
||||
else:
|
||||
msgs = super().build_prompt(line)
|
||||
assert msgs[-1]['type'] == 'text'
|
||||
msgs[-1]['value'] += '\n请回答上述填空题,并根据题目的要求和所提供的信息来给出最恰当的答案。如果信息不足以确切回答,那么请依据现有的数据和你的推理能力来填写最合理的答案。'
|
||||
return msgs
|
||||
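The removed CMMMU evaluator above extracts a choice letter from a free-form response by counting '(A)'-style markers, then bare letters, then option texts, breaking ties by frequency and falling back to a random choice. A simplified, self-contained version of that extraction on a toy response (it mirrors the helper above in spirit, not line by line):

import random
from collections import Counter


def pick_choice(response, all_choices, index2ans):
    """Simplified variant of get_multi_choice_prediction above."""
    candidates = []
    for choice in all_choices:                 # prefer explicit '(A)' markers
        candidates.extend([choice] * response.count(f'({choice})'))
    if not candidates:
        for choice in all_choices:             # then bare letters
            candidates.extend([choice] * response.count(choice))
    if not candidates:
        for index, ans in index2ans.items():   # then the option text itself
            candidates.extend([index] * response.count(ans))
    if not candidates:
        return random.choice(all_choices)
    counts = Counter(candidates)
    top = max(counts.values())
    return ''.join(c for c in all_choices if counts.get(c, 0) == top)


print(pick_choice('答案是 (B),因为蓝色符合题意。', ['A', 'B', 'C', 'D'],
                  {'A': '红色', 'B': '蓝色', 'C': '绿色', 'D': '黄色'}))  # B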
@@ -1,211 +0,0 @@
|
||||
import math
|
||||
from typing import List
|
||||
|
||||
from .utils.judge_util import build_judge
|
||||
from .image_base import ImageBaseDataset
|
||||
from .mmlongbench import concat_images, MMLongBench_auxeval, anls_compute
|
||||
from ..smp import *
|
||||
|
||||
|
||||
FAIL_MSG = 'Failed to obtain answer via API.'
|
||||
|
||||
|
||||
def DUDE_acc(result_file):
|
||||
data = load(result_file)
|
||||
overall_score = 0.0
|
||||
score_list = list()
|
||||
for i in range(len(data)):
|
||||
item = data.iloc[i]
|
||||
if isinstance(item['answer'], float) and math.isnan(item['answer']):
|
||||
item['answer'] = 'Not answerable'
|
||||
|
||||
item['answer'] = item['answer'].lower()
|
||||
item['pred'] = item['pred'].lower()
|
||||
score = anls_compute(item['answer'], item['pred'])
|
||||
score_list.append(score)
|
||||
overall_score += score
|
||||
|
||||
data['score'] = score_list
|
||||
dump(data, result_file)
|
||||
|
||||
res = dict()
|
||||
res['category'], res['num'], res['avg_score'] = ['anls'], [len(data)], [overall_score / len(data)]
|
||||
res = pd.DataFrame(res)
|
||||
return res
|
||||
|
||||
|
||||
class DUDE(ImageBaseDataset):
|
||||
|
||||
TYPE = 'VQA'
|
||||
|
||||
DATASET_URL = {
|
||||
'DUDE': 'https://opencompass.openxlab.space/utils/VLMEval/DUDE.tsv',
|
||||
'DUDE_MINI': 'https://opencompass.openxlab.space/utils/VLMEval/DUDE_MINI.tsv',
|
||||
}
|
||||
DATASET_MD5 = {
|
||||
'DUDE': '130d860d08206e1e407cd77150c10d88',
|
||||
'DUDE_MINI': 'e0c0d998114f0cca7516d12039d2b538',
|
||||
}
|
||||
|
||||
SUPPORTED_MODELS = {
|
||||
'GPT4': (1, 1),
|
||||
'GPT4V': (1, 1),
|
||||
'GPT4V_HIGH': (1, 1),
|
||||
'GPT4o': (1, 1),
|
||||
'GPT4o_HIGH': (1, 1),
|
||||
'GPT4o_MINI': (1, 1),
|
||||
'XComposer2d5': (1, -1),
|
||||
'XComposer2_4KHD': (1, -1),
|
||||
'MiniCPM-Llama3-V-2_5': (1, 5),
|
||||
'InternVL-Chat-V1-5': (5, 2),
|
||||
}
|
||||
|
||||
def __init__(self, dataset, **kwargs):
|
||||
self.model_list = list(self.SUPPORTED_MODELS.keys())
|
||||
model_name = kwargs['model']
|
||||
if not listinstr(self.model_list, model_name):
|
||||
raise AssertionError("{} doesn't support the evaluation on DUDE.".format(model_name))
|
||||
super(DUDE, self).__init__(dataset)
|
||||
|
||||
self.is_api = True if listinstr(['GPT4'], model_name) else False
|
||||
self.max_pages = 120
|
||||
concat_num, column_num = self.SUPPORTED_MODELS.get(model_name)
|
||||
self.concat_num = concat_num
|
||||
self.column_num = column_num
|
||||
|
||||
def prepare_tsv(self, url, file_md5=None):
|
||||
data_root = LMUDataRoot()
|
||||
os.makedirs(data_root, exist_ok=True)
|
||||
file_name = url.split('/')[-1]
|
||||
data_path = osp.join(data_root, file_name)
|
||||
if osp.exists(data_path) and (file_md5 is None or md5(data_path) == file_md5):
|
||||
pass
|
||||
else:
|
||||
warnings.warn('The dataset tsv is not downloaded')
|
||||
download_file(url, data_path)
|
||||
return load(data_path)
|
||||
|
||||
def dump_image(self, origin_line):
|
||||
os.makedirs(self.img_root, exist_ok=True)
|
||||
try:
|
||||
import fitz
|
||||
except Exception as e:
|
||||
logging.critical(f'{type(e)}: {e}')
|
||||
logging.critical('Please use `pip install pymupdf` to parse PDF files.')
|
||||
|
||||
line = origin_line.copy()
|
||||
if not isinstance(line['image_path'], List):
|
||||
line['image_path'] = [line['image_path']]
|
||||
line['image_path'] = line['image_path'][:self.max_pages]
|
||||
skip_pdf_parse = True
|
||||
for im_name in line['image_path']:
|
||||
path = osp.join(self.img_root, im_name)
|
||||
if not read_ok(path):
|
||||
skip_pdf_parse = False
|
||||
break
|
||||
|
||||
# Just for compatibility with the zipped loop: zip(line['image'], line['image_path'])
|
||||
if skip_pdf_parse:
|
||||
line['image'] = line['image_path']
|
||||
else:
|
||||
pdf_data = base64.b64decode(line['image'])
|
||||
pdf_file = io.BytesIO(pdf_data)
|
||||
encoded_images = []
|
||||
with fitz.open(stream=pdf_file, filetype='pdf') as doc:
|
||||
doc = doc[:self.max_pages]
|
||||
for page in doc:
|
||||
image = page.get_pixmap(dpi=144)
|
||||
image_file = io.BytesIO(image.tobytes(output='png'))
|
||||
image = Image.open(image_file)
|
||||
encoded_image = encode_image_to_base64(image)
|
||||
encoded_images.append(encoded_image)
|
||||
line['image'] = encoded_images
|
||||
print('process {}'.format(line['doc_id']))
|
||||
|
||||
if 'image' in line:
|
||||
if isinstance(line['image'], list):
|
||||
tgt_path = []
|
||||
assert 'image_path' in line
|
||||
for img, im_name in zip(line['image'], line['image_path']):
|
||||
path = osp.join(self.img_root, im_name)
|
||||
if not read_ok(path):
|
||||
decode_base64_to_image_file(img, path)
|
||||
tgt_path.append(path)
|
||||
else:
|
||||
tgt_path = osp.join(self.img_root, f"{line['index']}.jpg")
|
||||
if not read_ok(tgt_path):
|
||||
decode_base64_to_image_file(line['image'], tgt_path)
|
||||
tgt_path = [tgt_path]
|
||||
else:
|
||||
assert 'image_path' in line
|
||||
tgt_path = toliststr(line['image_path'])
|
||||
|
||||
if self.concat_num > 0 and not self.is_api:
|
||||
concatenated_images = concat_images(tgt_path, max_concat=self.concat_num, column_num=self.column_num)
|
||||
|
||||
old_tgt_path = tgt_path
|
||||
assert isinstance(old_tgt_path, list)
|
||||
if self.column_num != -1:
|
||||
tgt_path = [
|
||||
'_'.join(old_tgt_path[0].split('_')[:-1]) + '_concat{}_{}.jpg'.format(self.concat_num, i)
|
||||
for i in range(len(concatenated_images))
|
||||
]
|
||||
else:
|
||||
tgt_path = ['_'.join(old_tgt_path[0].split('_')[:-1]) + '_concat_all.jpg']
|
||||
|
||||
for path, concatenated_image in zip(tgt_path, concatenated_images):
|
||||
if not read_ok(path):
|
||||
decode_base64_to_image_file(encode_image_to_base64(concatenated_image), path)
|
||||
num_images, image_size = len(old_tgt_path), concatenated_image.size
|
||||
print('concat {} images to a new one with size {}. save at {}'.format(num_images, image_size, path))
|
||||
return tgt_path
|
||||
|
||||
@classmethod
|
||||
def evaluate(self, eval_file, **judge_kwargs):
|
||||
logger = get_logger('Evaluation')
|
||||
model = judge_kwargs['model']
|
||||
|
||||
suffix = eval_file.split('.')[-1]
|
||||
storage = eval_file.replace(f'.{suffix}', f'_{model}.xlsx')
|
||||
tmp_file = eval_file.replace(f'.{suffix}', f'_{model}.pkl')
|
||||
|
||||
if osp.exists(storage):
|
||||
logger.warning(f'GPT scoring file {storage} already exists, will reuse it in DUDE_eval. ')
|
||||
else:
|
||||
data = load(eval_file)
|
||||
model = build_judge(max_tokens=128, **judge_kwargs)
|
||||
lt = len(data)
|
||||
lines = [data.iloc[i] for i in range(lt)]
|
||||
tups = [(model, line) for line in lines]
|
||||
indices = [line['index'] for line in lines]
|
||||
|
||||
ans = {}
|
||||
if osp.exists(tmp_file):
|
||||
ans = load(tmp_file)
|
||||
tups = [x for x, i in zip(tups, indices) if i not in ans]
|
||||
indices = [i for i in indices if i not in ans]
|
||||
|
||||
if len(indices):
|
||||
new_results = list()
|
||||
for model, line in tqdm(tups):
|
||||
res = MMLongBench_auxeval(model, line)
|
||||
new_results.append(res)
|
||||
|
||||
log_map, res_map, pred_map = {}, {}, {}
|
||||
all_inds = [line['index'] for line in lines]
|
||||
for k, v in zip(all_inds, new_results):
|
||||
log_map[k] = v['log']
|
||||
res_map[k] = v['res']
|
||||
pred_map[k] = v['pred']
|
||||
data['res'] = [res_map[idx] for idx in data['index']]
|
||||
data['log'] = [log_map[idx] for idx in data['index']]
|
||||
data['pred'] = [pred_map[idx] for idx in data['index']]
|
||||
dump(data, storage)
|
||||
|
||||
score = DUDE_acc(storage)
|
||||
score_pth = storage.replace('.xlsx', '_score.csv')
|
||||
|
||||
dump(score, score_pth)
|
||||
logger.info(f'DUDE successfully finished evaluating {eval_file}, results saved in {score_pth}')
|
||||
logger.info('Score: ')
|
||||
logger.info(score)
|
||||
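`DUDE_acc` above scores each prediction with `anls_compute` imported from `mmlongbench` and reports the average. Below is a generic sketch of ANLS (Average Normalized Levenshtein Similarity) as used by DocVQA-style benchmarks; the 0.5 threshold is the standard formulation and may differ in detail from `anls_compute`:

def levenshtein(a, b):
    # Classic single-row dynamic-programming edit distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]


def anls(gt, pred, threshold=0.5):
    """1 - normalized edit distance, floored to 0 below the threshold."""
    if len(gt) == 0 and len(pred) == 0:
        return 1.0
    dist = levenshtein(gt.lower(), pred.lower())
    score = 1 - dist / max(len(gt), len(pred))
    return score if score >= threshold else 0.0


print(anls('not answerable', 'not answerable'))   # 1.0
print(round(anls('42 dollars', '42 dollar'), 2))  # 0.9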
@@ -1,240 +0,0 @@
|
||||
import re
|
||||
import json
|
||||
import sympy as sp
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sympy import simplify, Eq, sympify, Pow, pi
|
||||
from sympy.parsing.latex import parse_latex
|
||||
import sys
|
||||
import math
|
||||
import os
|
||||
import os.path as osp
|
||||
import argparse
|
||||
|
||||
from .image_base import ImageBaseDataset
|
||||
from .utils import build_judge
|
||||
from ..utils import track_progress_rich
|
||||
from ..smp import load, dump, d2df, toliststr
|
||||
|
||||
|
||||
def preprocess(str1):
|
||||
if 0 <= str1.find("{") < str1.rfind("}"):
|
||||
str1 = str1[str1.find("{"): str1.rfind("}") + 1]
|
||||
str2 = str1.replace("\\", "")
|
||||
str2 = str2.replace("\\n", "\n")
|
||||
return str2
|
||||
|
||||
|
||||
def transfer(str1):
|
||||
if "\u03c0" in str1:
|
||||
strs = str1.split('\u03c0')
|
||||
str1 = strs[0]
|
||||
return float(str1) * np.pi
|
||||
else:
|
||||
return float(str1)
|
||||
|
||||
|
||||
def parse_answer(answer, answer_type="multiple choice"):
|
||||
if answer_type == "float":
|
||||
if answer.isdigit():
|
||||
return True, float(answer)
|
||||
else:
|
||||
parts = answer.split(' ')
|
||||
answer = parts[0]
|
||||
try:
|
||||
answer = transfer(answer)
|
||||
return True, answer
|
||||
except:
|
||||
return False, None
|
||||
elif answer_type == "multiple choice":
|
||||
if len(answer) == 1:
|
||||
return True, answer.upper()
|
||||
else:
|
||||
in_flag = [ch in answer.upper() for ch in 'ABCDE']
|
||||
if sum(in_flag) == 1:
|
||||
for ch in 'ABCDE':
|
||||
if ch in answer.upper():
|
||||
return True, ch
|
||||
return False, None
|
||||
else:
|
||||
return True, answer
|
||||
|
||||
|
||||
def DynaMath_auxeval(model, line):
|
||||
pred = line['prediction']
|
||||
pred = preprocess(pred)
|
||||
|
||||
succeed, short_answer = None, None
|
||||
try:
|
||||
dj = json.loads(pred, strict=False)
|
||||
short_answer = dj.get("short answer")
|
||||
assert short_answer is not None
|
||||
succeed, short_answer = parse_answer(short_answer, answer_type=line['answer_type'])
|
||||
assert succeed
|
||||
except:
|
||||
# Failed to parse the JSON, use an auxiliary LLM to get the short answer
|
||||
if line['answer_type'] == 'multiple choice':
|
||||
inst = "Output the corresponing choice option, such as 'A', 'B', 'C', 'D', in a single line."
|
||||
elif line['answer_type'] == 'float':
|
||||
inst = "Output a three-digit floating-point number in a single line."
|
||||
else:
|
||||
inst = (
|
||||
"Output a short answer in a single line. Any float numbers in the answer "
|
||||
"should be formatted as three-digit floating-point numbers."
|
||||
)
|
||||
|
||||
prompt = f"Free-form answer: {pred}\nInstruction: {inst}"
|
||||
response = pred
|
||||
succeed, short_answer = parse_answer(response, line['answer_type'])
|
||||
if not succeed:
|
||||
response = model.generate(prompt)
|
||||
succeed, short_answer = parse_answer(response, line['answer_type'])
|
||||
|
||||
if line['answer_type'] == 'float':
|
||||
if succeed:
|
||||
diff = float(short_answer) - float(line['answer'])
|
||||
if abs(diff) <= 0.001:
|
||||
return dict(parse=True, extracted=short_answer, correct=True)
|
||||
else:
|
||||
return dict(parse=True, extracted=short_answer, correct=False)
|
||||
else:
|
||||
return dict(parse=False, extracted=None, correct=False)
|
||||
elif line['answer_type'] == 'multiple choice':
|
||||
if succeed:
|
||||
return dict(parse=True, extracted=short_answer, correct=(short_answer == line['answer']))
|
||||
else:
|
||||
if line['answer'] in pred[:3].upper():
|
||||
return dict(parse=False, extracted=None, correct=True)
|
||||
else:
|
||||
return dict(parse=False, extracted=None, correct=False)
|
||||
else:
|
||||
if succeed:
|
||||
return dict(parse=True, extracted=short_answer, correct=(short_answer.lower() in line['answer'].lower()))
|
||||
else:
|
||||
return dict(parse=False, extracted=None, correct=(short_answer.lower() in line['answer'].lower()))
|
||||
|
||||
|
||||
class Dynamath(ImageBaseDataset):
|
||||
|
||||
TYPE = 'VQA'
|
||||
DATASET_URL = {'DynaMath': 'https://opencompass.openxlab.space/utils/VLMEval/DynaMath.tsv'}
|
||||
DATASET_MD5 = {'DynaMath': 'b8425ad9a7114571fc9366e013699494'}
|
||||
GUIDE = """
|
||||
## Answer Instruction Please provide an answer to the question outlined above. Your response should adhere \
|
||||
to the following JSON format, which includes two keys: 'solution' and 'short answer'. The 'solution' key can contain \
|
||||
detailed steps needed to solve the question, and the 'short answer' key should provide a concise response. {INST}
|
||||
|
||||
Example of expected JSON response format:
|
||||
|
||||
"""
|
||||
EXAMPLE = {
|
||||
"solution": "[Detailed step-by-step explanation]",
|
||||
"short answer": "[Concise Answer]"
|
||||
}
|
||||
TEXT_EXAMPLE = json.dumps(EXAMPLE, indent=4)
|
||||
|
||||
# Given one data record, return the built prompt (a multi-modal message), can override
|
||||
def build_prompt(self, line):
|
||||
if isinstance(line, int):
|
||||
line = self.data.iloc[line]
|
||||
|
||||
if self.meta_only:
|
||||
tgt_path = toliststr(line['image_path'])
|
||||
else:
|
||||
tgt_path = self.dump_image(line)
|
||||
|
||||
prompt = f"## Question\n {line['question']}"
|
||||
if line['answer_type'] == 'multiple choice':
|
||||
inst = "Provide the corresponing choice option in the 'short answer' key, such as 'A', 'B', 'C', or 'D'."
|
||||
elif line['answer_type'] == 'float':
|
||||
inst = "Format the answer as a three-digit floating-point number and provide it in the 'short answer' key."
|
||||
else:
|
||||
inst = "Float numbers in the answer should be formatted as three-digit floating-point numbers."
|
||||
|
||||
prompt = prompt + self.GUIDE.format(INST=inst) + self.TEXT_EXAMPLE
|
||||
|
||||
msgs = []
|
||||
if isinstance(tgt_path, list):
|
||||
msgs.extend([dict(type='image', value=p) for p in tgt_path])
|
||||
else:
|
||||
msgs = [dict(type='image', value=tgt_path)]
|
||||
msgs.append(dict(type='text', value=prompt))
|
||||
return msgs
|
||||
|
||||
    def evaluate(self, eval_file, **judge_kwargs):
        judge_name = judge_kwargs.pop('model', 'gpt-4o-mini')

        model = build_judge(model=judge_name, **judge_kwargs)
        suffix = eval_file.split('.')[-1]

        storage = eval_file.replace(f'.{suffix}', f'_{judge_name}.xlsx')
        score_file = eval_file.replace(f'.{suffix}', f'_{judge_name}_score.csv')
        tmp_file = eval_file.replace(f'.{suffix}', f'_{judge_name}.pkl')
        nproc = judge_kwargs.pop('nproc', 6)

        res = load(tmp_file) if os.path.exists(tmp_file) else {}
        res = {k: v for k, v in res.items() if v is not None}

        model.system_prompt = """\
You are a helpful assistant that helps me to format free-form answers into a short answer according to the instruction.
"""
        if not osp.exists(storage):
            data = load(eval_file)
            lt = len(data)
            payloads = [dict(model=model, line=data.iloc[i]) for i in range(lt) if data.iloc[i]['index'] not in res]
            keys = [idx for idx in data['index'] if idx not in res]

            if len(keys):
                results = track_progress_rich(DynaMath_auxeval, payloads, nproc=nproc, save=tmp_file, keys=keys)
                for k, r in zip(keys, results):
                    res[k] = r

            data['parse'] = [res[idx]['parse'] for idx in data['index']]
            data['extracted'] = [res[idx]['extracted'] for idx in data['index']]
            data['correct'] = [res[idx]['correct'] for idx in data['index']]
            dump(data, storage)

        data = load(storage)
        # Calculate Average Accuracy
        score_avg = {}
        score_avg['Overall'] = np.mean(data['correct'])

        subs = set(data['subject'])
        for sub in subs:
            data_sub = data[data['subject'] == sub]
            score_avg[f'Subject-{sub}'] = np.mean(data_sub['correct'])

        lvls = set(data['knowledge_level'])
        for lvl in lvls:
            data_lvl = data[data['knowledge_level'] == lvl]
            score_avg[f'Level-{lvl}'] = np.mean(data_lvl['correct'])

        # Calculate the Worst Case Accuracy: a question counts as correct only if all of its variants are correct
        score_worst = {}
        data_worst = data[data['varid'] == 1]
        qid2corr = {qid: True for qid in data_worst['qid']}
        lt = len(data)
        for i in range(lt):
            item = data.iloc[i]
            qid2corr[item['qid']] *= item['correct']
        data_worst['correct'] = [qid2corr[idx] for idx in data_worst['qid']]
        score_worst['Overall'] = np.mean(data_worst['correct'])

        subs = set(data_worst['subject'])
        for sub in subs:
            data_sub = data_worst[data_worst['subject'] == sub]
            score_worst[f'Subject-{sub}'] = np.mean(data_sub['correct'])

        lvls = set(data_worst['knowledge_level'])
        for lvl in lvls:
            data_lvl = data_worst[data_worst['knowledge_level'] == lvl]
            score_worst[f'Level-{lvl}'] = np.mean(data_lvl['correct'])

        d1 = {'Setting': 'Average'}
        d1.update(score_avg)
        d2 = {'Setting': 'Worst Case'}
        d2.update(score_worst)
        score = pd.concat([d2df(d1), d2df(d2)], ignore_index=True)

        dump(score, score_file)
        return score
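For intuition, a quick toy illustration of the worst-case aggregation above (hypothetical data, independent of the real TSV): a question is counted as correct only if every variant sharing its qid is correct.

import pandas as pd

toy = pd.DataFrame({
    'qid':     [1, 1, 2, 2],
    'varid':   [1, 2, 1, 2],
    'correct': [True, False, True, True],
})
worst = toy[toy['varid'] == 1].copy()
qid2corr = {qid: True for qid in worst['qid']}
for _, row in toy.iterrows():
    qid2corr[row['qid']] &= bool(row['correct'])
worst['correct'] = [qid2corr[qid] for qid in worst['qid']]
print(worst['correct'].mean())  # 0.5 -- only qid 2 is correct on all variants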
@@ -1,172 +0,0 @@
import pandas as pd
from abc import abstractmethod
from ..smp import *


def img_root_map(dataset):
    if 'MM_NIAH' in dataset:
        return 'MMNIAH'
    if 'CRPE' in dataset:
        return 'CRPE'
    if 'OCRVQA' in dataset:
        return 'OCRVQA'
    if 'COCO_VAL' == dataset:
        return 'COCO'
    if 'MMMU' in dataset:
        return 'MMMU'
    if "QSpatial" in dataset:
        return "QSpatial"

    mmbench_root_map = {
        'MMBench_DEV_EN': 'MMBench', 'MMBench_TEST_EN': 'MMBench',
        'MMBench_DEV_CN': 'MMBench', 'MMBench_TEST_CN': 'MMBench',
        'MMBench': 'MMBench', 'MMBench_CN': 'MMBench',
        'MMBench_DEV_EN_V11': 'MMBench_V11', 'MMBench_TEST_EN_V11': 'MMBench_V11',
        'MMBench_DEV_CN_V11': 'MMBench_V11', 'MMBench_TEST_CN_V11': 'MMBench_V11',
        'MMBench_V11': 'MMBench', 'MMBench_CN_V11': 'MMBench',
    }
    if dataset in mmbench_root_map:
        return mmbench_root_map[dataset]
    return dataset

class ImageBaseDataset:

    MODALITY = 'IMAGE'
    DATASET_URL = {}
    DATASET_MD5 = {}

    def __init__(self, dataset='MMBench', skip_noimg=True):
        ROOT = LMUDataRoot()
        # You can override this variable to save image files to a different directory
        self.dataset_name = dataset
        self.img_root = osp.join(ROOT, 'images', img_root_map(dataset))

        data = self.load_data(dataset)
        self.skip_noimg = skip_noimg
        if skip_noimg and 'image' in data:
            data = data[~pd.isna(data['image'])]

        data['index'] = [str(x) for x in data['index']]

        self.meta_only = True

        # The image field can store the base64 encoded image or another question index (for saving space)
        if 'image' in data:
            data['image'] = [str(x) for x in data['image']]
            image_map = {x: y for x, y in zip(data['index'], data['image'])}
            for k in image_map:
                if len(image_map[k]) <= 64:
                    idx = image_map[k]
                    assert idx in image_map and len(image_map[idx]) > 64
                    image_map[k] = image_map[idx]

            images = [toliststr(image_map[k]) for k in data['index']]
            data['image'] = [x[0] if len(x) == 1 else x for x in images]
            self.meta_only = False

        if 'image_path' in data:
            paths = [toliststr(x) for x in data['image_path']]
            data['image_path'] = [x[0] if len(x) == 1 else x for x in paths]

        if np.all([istype(x, int) for x in data['index']]):
            data['index'] = [int(x) for x in data['index']]

        self.data = data
        self.post_build(dataset)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return dict(self.data.iloc[idx])

    def prepare_tsv(self, url, file_md5=None):
        data_root = LMUDataRoot()
        os.makedirs(data_root, exist_ok=True)
        update_flag = False
        file_name = url.split('/')[-1]
        data_path = osp.join(data_root, file_name)
        if osp.exists(data_path) and (file_md5 is None or md5(data_path) == file_md5):
            pass
        else:
            warnings.warn('The dataset tsv is not downloaded')
            download_file(url, data_path)
            update_flag = True

        if file_size(data_path, 'GB') > 1:
            local_path = data_path.replace('.tsv', '_local.tsv')
            if not osp.exists(local_path) or os.environ.get('FORCE_LOCAL', None) or update_flag:
                from ..tools import LOCALIZE
                LOCALIZE(data_path, local_path)
            data_path = local_path
        return load(data_path)

    def dump_image(self, line):
        os.makedirs(self.img_root, exist_ok=True)

        if 'image' in line:
            if isinstance(line['image'], list):
                tgt_path = []
                assert 'image_path' in line
                for img, im_name in zip(line['image'], line['image_path']):
                    path = osp.join(self.img_root, im_name)
                    if not read_ok(path):
                        decode_base64_to_image_file(img, path)
                    tgt_path.append(path)
            else:
                tgt_path = osp.join(self.img_root, f"{line['index']}.jpg")
                if not read_ok(tgt_path):
                    decode_base64_to_image_file(line['image'], tgt_path)
                tgt_path = [tgt_path]
        else:
            assert 'image_path' in line
            tgt_path = toliststr(line['image_path'])

        return tgt_path

    def display(self, line):
        if isinstance(line, int):
            line = self.data.iloc[line]
        assert isinstance(line, pd.Series) or isinstance(line, dict)
        mmqa_display(line)

    # Return a list of dataset names that are supported by this class, can override
    @classmethod
    def supported_datasets(cls):
        return list(cls.DATASET_URL)

    # Given the dataset name, return the dataset as a pandas dataframe, can override
    def load_data(self, dataset):
        url = self.DATASET_URL[dataset]
        file_md5 = self.DATASET_MD5[dataset] if dataset in self.DATASET_MD5 else None
        return self.prepare_tsv(url, file_md5)

    # Post-build hook, will be called after the dataset is built, can override
    def post_build(self, dataset):
        pass

    # Given one data record, return the built prompt (a multi-modal message), can override
    def build_prompt(self, line):
        if isinstance(line, int):
            line = self.data.iloc[line]

        if self.meta_only:
            tgt_path = toliststr(line['image_path'])
        else:
            tgt_path = self.dump_image(line)

        question = line['question']

        msgs = []
        if isinstance(tgt_path, list):
            msgs.extend([dict(type='image', value=p) for p in tgt_path])
        else:
            msgs = [dict(type='image', value=tgt_path)]
        msgs.append(dict(type='text', value=question))
        return msgs

    # Given the prediction file, return the evaluation results in the format of a dictionary or pandas dataframe
    @abstractmethod
    def evaluate(self, eval_file, **judge_kwargs):
        pass
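The base class above is extended by registering a TSV URL/MD5 and overriding its hooks (load_data, post_build, build_prompt, evaluate). A minimal sketch of a hypothetical subclass follows; the dataset name, URL, and MD5 are placeholders, and it reuses helpers such as load and np that the file already pulls in from ..smp.

class MyToyDataset(ImageBaseDataset):

    TYPE = 'VQA'
    DATASET_URL = {'MyToy': 'https://example.com/MyToy.tsv'}  # placeholder URL
    DATASET_MD5 = {'MyToy': '0' * 32}                         # placeholder MD5

    # Reuse the default build_prompt (image + question); only scoring is custom
    def evaluate(self, eval_file, **judge_kwargs):
        data = load(eval_file)
        # Toy exact-match scoring between the prediction and answer columns
        data['correct'] = [str(p).strip() == str(a).strip()
                           for p, a in zip(data['prediction'], data['answer'])]
        return dict(Overall=float(np.mean(data['correct'])))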
@@ -1,75 +0,0 @@
from .image_base import ImageBaseDataset
from ..smp import *


class COCO_Caption_Scorer():
    def __init__(self, ref, gt):
        from pycocoevalcap.bleu.bleu import Bleu
        from pycocoevalcap.rouge.rouge import Rouge
        from pycocoevalcap.cider.cider import Cider

        self.ref = ref
        self.gt = gt
        print('setting up scorers...')
        self.scorers = [
            (Bleu(4), ['Bleu_1', 'Bleu_2', 'Bleu_3', 'Bleu_4']),
            (Rouge(), 'ROUGE_L'),
            (Cider(), 'CIDEr'),
        ]

    def compute_scores(self):
        total_scores = {}
        for scorer, method in self.scorers:
            print('computing %s score...' % (scorer.method()))
            score, scores = scorer.compute_score(self.gt, self.ref)
            if isinstance(method, list):
                for sc, scs, m in zip(score, scores, method):
                    print('%s: %0.3f' % (m, sc * 100))
                total_scores['Bleu'] = [x * 100 for x in score]
            else:
                print('%s: %0.3f' % (method, score * 100))
                total_scores[method] = score * 100

        print('*****DONE*****')
        for key, value in total_scores.items():
            print('{}:{}'.format(key, value))
        return total_scores


class ImageCaptionDataset(ImageBaseDataset):

    TYPE = 'Caption'

    DATASET_URL = {
        'COCO_VAL': 'https://opencompass.openxlab.space/utils/VLMEval/COCO_VAL.tsv',
    }

    DATASET_MD5 = {
        'COCO_VAL': '72a5079dead060269ac222c5aa5128af',
    }

    def load_data(self, dataset):
        data = super().load_data(dataset)
        if 'question' not in data:
            data['question'] = [(
                'Please describe this image in general. Directly provide the description, '
                'do not include prefix like "This image depicts". '
            )] * len(data)
        return data

    # It returns a dictionary of scores
    @classmethod
    def evaluate(cls, eval_file, **kwargs):
        data = load(eval_file)
        lt = len(data)
        lines = [data.iloc[i] for i in range(lt)]
        ref, gt = {}, {}
        for i, line in enumerate(lines):
            ref[str(i)] = [str(line['prediction'])]
            gt[str(i)] = eval(line['answer'])

        scorer = COCO_Caption_Scorer(ref, gt)
        coco_caption_score_dict = scorer.compute_scores()
        score_pth = eval_file.replace('.xlsx', '_score.json')
        dump(coco_caption_score_dict, score_pth)
        return coco_caption_score_dict
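For reference, the scorer consumes plain dictionaries mapping a sample id to a list of caption strings, so it can be exercised in isolation. A small usage sketch with made-up captions, assuming pycocoevalcap is installed:

# Hypothetical data: one prediction against two reference captions
ref = {'0': ['a black cat sitting on a wooden table']}           # model predictions
gt = {'0': ['a cat sits on a table', 'a black cat on a table']}  # reference captions
scorer = COCO_Caption_Scorer(ref, gt)
scores = scorer.compute_scores()  # dict with 'Bleu' (list of 4), 'ROUGE_L', 'CIDEr'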