175 Commits

Author SHA1 Message Date
qianyu chen
e41152f89c Update trainer.py 2025-09-12 15:53:48 +08:00
tc-mb
c821cbd7c8 rm ide file
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
2025-09-02 16:14:14 +08:00
yiranyyu
a846468195 update readme 2025-09-02 12:16:14 +08:00
yiranyyu
f8da52c35c update readme 2025-09-02 12:15:25 +08:00
yiranyyu
67afdeb934 update readme 2025-09-01 17:51:49 +08:00
yiranyyu
3cde81287d Merge branch 'main' of https://github.com/OpenBMB/MiniCPM-o 2025-08-31 22:33:56 +08:00
yiranyyu
e45524cbf7 update readme 2025-08-31 22:33:14 +08:00
Yuan Yao
0d8b90df97 Update README.md 2025-08-31 11:24:58 +08:00
Yuan Yao
d16875b120 Update README.md 2025-08-30 10:52:45 +08:00
Yuan Yao
1c89161d65 Update README.md 2025-08-30 10:26:52 +08:00
YuzaChongyi
da79d55ad4 update readme (#986)
Co-authored-by: wangchongyi <>
2025-08-30 00:02:35 +08:00
YuzaChongyi
b9a95ee0ea update readme (#985)
Co-authored-by: wangchongyi <>
2025-08-29 23:58:10 +08:00
YuzaChongyi
02c68764d4 update readme (#984)
Co-authored-by: wangchongyi <>
2025-08-29 23:52:31 +08:00
tc-mb
509e934a59 update video link
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
2025-08-29 01:00:45 +08:00
Yuan Yao
3d050a5dd4 Update README.md 2025-08-27 11:47:38 +08:00
Yuan Yao
d01532f89c Update README.md 2025-08-27 11:47:00 +08:00
tc-mb
bffc715128 update awq 2025-08-26 22:38:54 +08:00
YuzaChongyi
af96e66e01 update readme (#966)
Co-authored-by: wangchongyi <>
2025-08-26 18:58:47 +08:00
yiranyyu
eb072b30a0 update readme 2025-08-26 18:34:29 +08:00
yiranyyu
16a79219cb update readme 2025-08-26 18:33:00 +08:00
yiranyyu
663d96c887 update readme 2025-08-26 18:31:16 +08:00
yiranyyu
1dcb4e2fee update readme 2025-08-26 17:39:41 +08:00
yiranyyu
fe7b3d27de update readme 2025-08-26 17:35:17 +08:00
yiranyyu
9d0531b236 update readme 2025-08-26 17:29:42 +08:00
yiranyyu
5443a7c4d7 update readme 2025-08-26 17:27:04 +08:00
yiranyyu
fcecab8045 update readme 2025-08-26 17:21:27 +08:00
YuzaChongyi
06e220c8f4 add minicpm-v-4.5 (#963)
Co-authored-by: wangchongyi <>
2025-08-26 05:20:58 +08:00
tc-mb
2ef22c138e update qr png 2025-08-20 17:41:33 +08:00
tc-mb
51f3f36614 add modelbest license to minicpm-o repo 2025-08-12 14:14:46 +08:00
tc-mb
03111d5c5b change quan doc to cookbook 2025-08-06 16:25:53 +08:00
tc-mb
4f7eba0c29 Merge pull request #947 from ZMXJJ/minicpmv-4
Update README
2025-08-06 14:55:20 +08:00
Dennis Huang
3acd3f9891 Update README 2025-08-06 14:31:54 +08:00
tc-mb
d828902a98 update news 2025-08-06 00:21:09 +08:00
tc-mb
8438ec2147 fix png in readme 2025-08-06 00:18:31 +08:00
tc-mb
b91fff3ea8 update readme 2025-08-05 22:26:23 +08:00
tc-mb
e2559a5ca2 public MiniCPM-V 4.0 2025-08-05 22:19:54 +08:00
tc-mb
8185ac321d add gif 2025-08-05 21:40:15 +08:00
yiranyyu
539e70177c Add Cookbook 2025-08-01 01:18:47 +08:00
yiranyyu
6e8f1d7a66 Add Cookbook 2025-08-01 01:18:14 +08:00
yiranyyu
50214bfa52 Add Cookbook 2025-08-01 01:16:19 +08:00
tc-mb
2d9919ac69 Update README_zh.md 2025-06-30 11:08:15 +08:00
tc-mb
48c0611a3f Update README.md 2025-06-30 11:08:00 +08:00
tc-mb
afc3b105bd Update README.md 2025-06-25 21:17:24 +08:00
tc-mb
732f5e62e4 Update README_zh.md 2025-06-25 20:23:57 +08:00
tc-mb
949fc4e843 Update README.md 2025-06-25 20:22:42 +08:00
tc-mb
ebb1a5e0a7 Update README_zh.md 2025-06-25 17:22:07 +08:00
tc-mb
7084bbfa9f Update README.md 2025-06-25 17:21:22 +08:00
yiranyyu
523fb11263 Update README 2025-06-25 11:40:07 +08:00
yiranyyu
b2b2b7bd70 Update README 2025-06-20 14:32:01 +08:00
YuzaChongyi
0234793a3b add join us link (#929)
Co-authored-by: wangchongyi <>
2025-06-17 23:13:16 +08:00
tc-mb
4b5828acb1 Update README.md 2025-06-12 16:24:45 +08:00
tc-mb
11ca385133 Add files via upload 2025-06-12 10:52:27 +08:00
tc-mb
fdfa37e768 Merge pull request #920 from ZMXJJ/patch-1
Update requirements_o2.6.txt to resolve TypeError during chatbot startup.
2025-05-15 15:35:13 +08:00
Dennis Huang
587306e3a6 Update requirements_o2.6.txt to resolve TypeError during chatbot demo startup
### Summary
- Resolved a `TypeError` encountered during the startup of the ChatBot Demo.
- Added version constraints for `pydantic` in the `requirements.txt` file to ensure dependency stability.

### Details
1. **Background**:
   - During the startup of the ChatBot Demo, a `TypeError` occurred due to incompatible versions of certain dependencies.
   - Investigation revealed that the issue was related to the `pydantic` library, where specific features were not functioning correctly in certain versions.

2. **Solution**:
   - Added version constraints for `pydantic` in the `requirements.txt` file to ensure compatibility.
   - Update:
     ```
     pydantic==2.10.6
     ```
   - By specifying the version, we avoid runtime errors caused by inconsistent dependency versions.

3. **Testing Verification**:
   - Reinstalled dependencies locally and restarted the ChatBot Demo to confirm the issue was resolved.
   - Verified that all functionalities work as expected without any additional errors.
2025-05-14 20:13:23 +08:00
yiranyyu
62327ab934 Update README 2025-03-03 19:54:53 +08:00
yiranyyu
8bfe84fa7e Update README 2025-03-03 19:50:09 +08:00
Boke Syo
f27b88951c Update README.md 2025-02-21 16:49:14 +08:00
Bokai Xu
784c46ba45 update default voice 2025-02-20 18:44:51 +08:00
Cui Junbo
8b1edaeffd Merge pull request #771 from lihytotoro/main
Modify eval_mm for MiniCPM-o 2.6
2025-02-17 12:26:07 +08:00
Zhangchi Feng
d364570bff update docs of llamafactory (#825) 2025-02-11 10:11:42 +08:00
Alphi
47283856a3 Update vllm example in ReadMe (#819)
* Update README.md

* Update README_zh.md
2025-02-08 17:52:37 +08:00
yiranyyu
06be4aa3d2 Update README 2025-01-27 23:32:33 +08:00
yiranyyu
166d987e48 Update README 2025-01-26 14:37:27 +08:00
yiranyyu
65932999c4 Update README 2025-01-24 21:23:23 +08:00
yiranyyu
94aad96afd Update README 2025-01-24 21:18:59 +08:00
Boke Syo
7770da6253 Update model_server.py 2025-01-24 16:32:26 +08:00
Boke Syo
e3415db52b Update README.md 2025-01-24 11:10:00 +08:00
yiranyyu
4d2de6f90f update README 2025-01-23 23:38:19 +08:00
YuzaChongyi
ae0df6cbb5 update readme (#790) 2025-01-23 16:07:25 +08:00
Hongji Zhu
56517f7a27 update readme web demo 2025-01-22 12:56:39 +08:00
Zhangchi Feng
1850a6b825 Update audio part of Llamafactory docs (#756)
* add llamafactory examples

* tiny fix

* update doc about inference

* update audio part of llamafactory
2025-01-22 12:45:09 +08:00
Cui Junbo
9cf46a3a2a Merge pull request #777 from linhaojia13/main
Update VITA-1.5 Results in README.md
2025-01-22 12:10:21 +08:00
Cui Junbo
3ed9416cd9 Merge pull request #779 from wangxiongts/main
Add VITA-1.5 asr results
2025-01-22 12:10:04 +08:00
Xiong Wang
1a3f00479f Update README.md 2025-01-22 11:55:38 +08:00
林豪佳
8df2c5e786 Update vita1.5 in README.md 2025-01-22 11:44:34 +08:00
flyingmrwang
a946b052c0 Update README 2025-01-22 11:15:50 +08:00
Poppy Xu
d8f382e157 Modify eval_mm for MiniCPM-o 2.6 2025-01-21 15:34:54 +08:00
Hongji Zhu
ec68cefc17 update req for chatbot demo 2025-01-21 00:15:21 +08:00
Hongji Zhu
3ad07c583a update local chatbot demo 2025-01-20 16:02:56 +08:00
tc-mb
0b9fa42dd6 Merge pull request #763 from bokesyo/patch-1
Add detailed description in speech mode (en version)
2025-01-20 12:52:04 +08:00
Boke Syo
f8f4998a49 Update README.md 2025-01-20 12:21:20 +08:00
yiranyyu
332ed62161 Update README 2025-01-19 21:42:35 +08:00
yiranyyu
e29c918333 Update README 2025-01-19 16:57:28 +08:00
yiranyyu
639cfc9412 Update README 2025-01-19 12:40:12 +08:00
yiranyyu
03d2412085 Update README 2025-01-19 12:29:01 +08:00
YuzaChongyi
4a011f46d1 Update README_zh.md 2025-01-18 11:16:03 +08:00
YuzaChongyi
2421a033fb Update README_zh.md 2025-01-17 17:42:04 +08:00
YuzaChongyi
2c093c2ab3 Update README.md 2025-01-17 17:40:37 +08:00
YuzaChongyi
ea1b4acc86 update news (#741) 2025-01-17 15:27:59 +08:00
Ikko Eltociear Ashimine
c4117b72a6 docs: update README.md (#735)
streaminig -> streaming
2025-01-17 00:30:32 +08:00
Hongji Zhu
4814f78a98 fix local web demo no permission for camera/mic 2025-01-16 22:48:50 +08:00
yiranyyu
62c5e88a17 update README 2025-01-16 22:33:54 +08:00
yiranyyu
a84fd37adc Merge branch 'main' of https://github.com/openbmb/minicpm-o 2025-01-16 22:29:17 +08:00
yiranyyu
10bee02ce5 update README 2025-01-16 22:29:14 +08:00
YuzaChongyi
5520d436ed add model init in Multimodal Live Streaming code (#733) 2025-01-16 20:20:58 +08:00
YuzaChongyi
2c0e7ff051 Update requirements.txt 2025-01-16 17:44:45 +08:00
YuzaChongyi
880095e28c add requirements for webdemo (#731) 2025-01-16 15:47:45 +08:00
yiranyyu
cddeb03396 Update README 2025-01-16 15:00:33 +08:00
YuzaChongyi
58f7328e7a Update requirements_o2.6.txt 2025-01-16 14:52:48 +08:00
YuzaChongyi
fadf8c398a remove unused doc (#729) 2025-01-16 13:44:41 +08:00
YuzaChongyi
25e0d84d2f Update README_zh.md 2025-01-16 11:52:40 +08:00
YuzaChongyi
fd8674cc72 Update README.md 2025-01-16 11:50:44 +08:00
Zhangchi Feng
cd0972c7a1 Best Practice with LLaMA-Factory (#711)
* add llamafactory examples

* tiny fix

* update doc about inference
2025-01-16 09:50:18 +08:00
yiranyyu
f2d4d4b130 Update README 2025-01-16 09:37:56 +08:00
yiranyyu
434e3874d3 Merge branch 'main' of https://github.com/openbmb/minicpm-o 2025-01-15 23:11:47 +08:00
yiranyyu
8c19a59cf8 Update README 2025-01-15 23:11:43 +08:00
Cui Junbo
28f1ed2925 Update README_zh.md 2025-01-15 22:46:35 +08:00
Cui Junbo
e81bf19555 Update README.md 2025-01-15 22:45:53 +08:00
yiranyyu
613803f6f1 Merge branch 'main' of https://github.com/openbmb/minicpm-o 2025-01-15 18:02:23 +08:00
yiranyyu
e1e04af112 Update README 2025-01-15 18:02:17 +08:00
Hongji Zhu
40a54bb0e3 update readme; demo model server add args for model path 2025-01-15 17:07:49 +08:00
yiranyyu
b178622f73 Update README 2025-01-15 16:57:47 +08:00
yiranyyu
7d160d7aeb Update README 2025-01-15 16:55:22 +08:00
Alexandra Hotti
4977fec2ff Update requirements.txt (#713)
Fixed typo in requirements
2025-01-15 14:40:46 +08:00
Hongji Zhu
f09ffe355a Update README_zh.md 2025-01-15 00:02:48 +08:00
Hongji Zhu
3c2f729530 update demo port 2025-01-14 23:57:57 +08:00
Hongji Zhu
f0d641b578 Update README.md 2025-01-14 22:52:59 +08:00
YuzaChongyi
ce0955c0f4 Update README.md 2025-01-14 21:59:41 +08:00
YuzaChongyi
5cfd89090e Update README_zh.md 2025-01-14 21:59:10 +08:00
YuzaChongyi
cbe7ade404 Update README_zh.md 2025-01-14 21:57:17 +08:00
YuzaChongyi
62034f183f Update README.md 2025-01-14 21:55:37 +08:00
yiranyyu
2cece543fa Merge branch 'main' of https://github.com/openbmb/minicpm-o 2025-01-14 21:20:23 +08:00
yiranyyu
1c51a220f0 Update Demo 2025-01-14 21:20:10 +08:00
YuzaChongyi
516777e462 Update README_zh.md 2025-01-14 21:08:14 +08:00
YuzaChongyi
5e719efab0 Update README.md 2025-01-14 21:07:44 +08:00
yiranyyu
495a810f87 Update README format 2025-01-14 21:06:19 +08:00
yiranyyu
806a1015d8 Update README format 2025-01-14 21:04:58 +08:00
yiranyyu
6ce232a06c Update README format 2025-01-14 16:50:43 +08:00
yiranyyu
b78137435a Update README format 2025-01-14 15:51:00 +08:00
yiranyyu
85a21c8dc7 Update README format 2025-01-14 15:50:02 +08:00
yiranyyu
088622f7be Update README format 2025-01-14 15:49:27 +08:00
yiranyyu
07afc8e39a Update README format 2025-01-14 15:48:19 +08:00
yiranyyu
53c0174797 Update to MiniCPM-o 2.6 2025-01-14 15:33:44 +08:00
qianyu chen
b75a362dd6 Merge pull request #708 from BUAADreamer/main
[Feature] Support video sft and fix some training bugs
2025-01-14 15:16:08 +08:00
fzc8578
4a087a8aec tiny fix 2025-01-11 13:58:19 +08:00
fzc8578
c5e82b1bc7 support video sft and auto save and load all files 2025-01-11 13:50:36 +08:00
Tianyu Yu
8464c94a7b Update dataset.py
Fix typo in a parameter name.
2024-10-22 16:08:52 +08:00
阿丹(adan)
2ab9fa7913 Update wechat.md 2024-10-18 17:37:01 +08:00
阿丹(adan)
96c9e25287 Add files via upload 2024-10-18 17:36:26 +08:00
Hongji Zhu
8ff6cc0ed0 update readme demo url 2024-10-16 16:36:50 +08:00
Cui Junbo
a209258d85 Merge pull request #484 from JamePeng/main
Update streamlit implementation for MiniCPM-V 2.6
2024-09-14 01:36:37 +08:00
LDLINGLINGLING
7bcca75e29 Update wechat.md 2024-09-12 13:58:54 +08:00
LDLINGLINGLING
fd938af276 Update wechat.md 2024-09-12 13:58:24 +08:00
LDLINGLINGLING
7ec8b3eca4 Update WeChat entry 2024-09-12 13:57:47 +08:00
LDLINGLINGLING
0cda63b309 Update wechat.md 2024-09-12 13:53:48 +08:00
LDLINGLINGLING
f48f790d69 Add files via upload 2024-09-12 13:53:25 +08:00
Cui Junbo
c541f1044e Merge pull request #543 from lihytotoro/main
Modify eval_mm for MiniCPM-V 2.6
2024-09-01 00:51:28 +08:00
LDLINGLINGLING
e56f2373f2 Update wechat.md 2024-08-31 16:18:24 +08:00
LDLINGLINGLING
38c5495e1e Update wechat.md 2024-08-31 16:17:49 +08:00
LDLINGLINGLING
fa25b3f20f Update WeChat entry 2024-08-31 16:16:55 +08:00
Haoyu Li
59224808a1 Modify eval_mm for MiniCPM-V 2.6 2024-08-30 18:18:22 +00:00
JamePeng
b109c67478 fixed missing import 2024-08-27 20:52:52 +08:00
JamePeng
344ddc2cb1 Optimize video frame sampling logic
- Replaced manual index calculation with `np.linspace` for improved efficiency and readability.
- Reduced computation overhead by utilizing NumPy's vectorized operations for generating evenly spaced frame indices.
2024-08-27 04:40:48 +08:00
JamePeng
3745c3316a Update web_demo_streamlit-minicpmv2_6.py 2024-08-27 04:20:48 +08:00
JamePeng
dc5f809253 Update web_demo_streamlit-minicpmv2_6.py
1. Avoid using 'None' string when `user_text` is empty.
2. Added `st.spinner` to display a loading message during AI content generation.
2024-08-27 04:03:35 +08:00
JamePeng
48ed792ab8 Merge branch 'OpenBMB:main' into main 2024-08-27 03:39:30 +08:00
yiranyyu
ab1141ee45 update readme 2024-08-25 18:08:59 +08:00
yiranyyu
421c6d7838 update readme 2024-08-25 18:06:15 +08:00
yiranyyu
512d5a8bb0 update readme 2024-08-25 17:11:43 +08:00
LDLINGLINGLING
1c31c6aa78 Update wechat.md 2024-08-22 13:44:10 +08:00
LDLINGLINGLING
93aad9f29f Add files via upload 2024-08-22 13:43:45 +08:00
yiranyyu
c19087cd13 update readme 2024-08-21 11:02:58 +08:00
yiranyyu
0fadd70c9e Merge branch 'main' of https://github.com/OpenBMB/MiniCPM-V 2024-08-19 22:35:57 +08:00
yiranyyu
4b13c46dbb update readme 2024-08-19 22:35:54 +08:00
JamePeng
c9f5cd4b00 Update streamlit implementation for MiniCPM-V 2.6
Compared with the Streamlit implementation for 2.5, this implementation makes better use of the new multimodal capabilities of 2.6:
1. The application supports uploading and processing text, a single image, multiple images, and videos, handling each input type according to the mode the user selects.

2. Video frame extraction and encoding: in video mode, frames are extracted from the uploaded video with the decord library and uniformly sampled so the model can process them and generate responses. Variables and annotations are more detailed and clearer, making the code easier to learn from and use.

3. File upload and processing: users can upload pictures and videos, which are handled according to the mode (a single picture in single-image mode, several pictures in multi-image mode, and sampled frames in video mode), and they can switch back and forth between media types.

4. Tip: you can run `streamlit run ./web_demo_streamlit-minicpmv2_6.py --server.maxUploadSize 1024`
to raise the maximum upload size to 1024 MB or more; Streamlit's default 200 MB limit on the file_uploader component may be insufficient for video-based interactions. Adjust the size based on your GPU memory usage.
2024-08-19 00:45:33 +08:00
JamePeng
259d54ed0a Update web_demo_streamlit-2_5.py
Fixed a bug where imagefile was not initialized when no image was passed in
2024-08-18 23:53:07 +08:00
LDLINGLINGLING
0e4ec319cf Update wechat.md 2024-08-16 16:16:35 +08:00
LDLINGLINGLING
b01d8e4adb Added the entry image for WeChat group 26 2024-08-16 16:16:03 +08:00
LDLINGLINGLING
970cea7d60 Added the entry image for WeChat group 25 2024-08-16 16:14:35 +08:00
yiranyyu
338892394f update readme 2024-08-15 12:08:47 +08:00
yiranyyu
5553046db7 update readme 2024-08-15 12:06:40 +08:00
yiranyyu
30b2446b0f update readme 2024-08-15 12:05:31 +08:00
qianyu chen
cd64150b51 update finetune for multi-image sft (#462) 2024-08-15 11:24:50 +08:00
tc-mb
825abf10e2 Update README_zh.md 2024-08-15 11:04:58 +08:00
tc-mb
ee458ad848 Update README.md 2024-08-15 10:59:28 +08:00
292 changed files with 43554 additions and 6362 deletions

(IDE settings file deleted, -5 lines)
{
"githubPullRequests.ignoredPullRequestBranches": [
"main"
]
}

MiniCPM Model License.md (new file, +41 lines)
Version 1.0, June 5, 2024
© 2024 OpenBMB. All rights reserved.
## Part One: Preamble
We are open-sourcing the entire series of the globally leading MiniCPM edge-side large language models, including the flagship edge-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the world's most powerful edge-side multimodal MiniCPM-V series. The aforementioned weights are completely open for all academic research, and commercial use is also allowed after filling out a registration questionnaire. Community use of the MiniCPM series models must comply with Apache 2.0 and the "MiniCPM Model Community License Agreement."
Therefore, you and the MiniCPM development team agree to the following "MiniCPM Model Community License Agreement":
## Part Two: Licensing and Redistribution
#### 1. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license to use, copy, distribute, reproduce, create derivative works from, and modify MiniCPM materials in accordance with OpenBMB's intellectual property rights or other rights in the MiniCPM materials.
#### 2. Distribution and Redistribution
- If you distribute or provide MiniCPM series model materials (or any derivative works thereof), or use any product or service of them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you create, train, fine-tune, or improve an AI model using the MiniCPM series models, the model must include "MiniCPM" in its name.
- You must retain the following attribution statement in all distributed MiniCPM-related materials: "MiniCPM is licensed under the MiniCPM Model Community License, © OpenBMB Platforms, Inc. All rights reserved."
- Your use of MiniCPM materials must comply with applicable laws and regulations and the "MiniCPM Model Community License Agreement," which is incorporated into this agreement by reference.
- You may not use MiniCPM series models or their outputs and results to improve any other large language models (other than MiniCPM or its derivatives).
#### 3. Additional Commercial Terms
If you or your affiliates' services or products deploy the model on no more than 5,000 edge-side devices, or provide applications with fewer than 1 million daily active users (DAU), you may apply to OpenBMB for permission and, after completing the registration questionnaire, may be allowed to use it commercially free of charge. Otherwise, please email cpm@modelbest.cn to apply for authorization, which OpenBMB may grant at its discretion; until written authorization is given, you may not exercise any commercial rights under this agreement.
#### 4. Usage-based Restrictions
The restrictions set forth in Appendix A are considered usage-based restrictions. Therefore, you may not use the model or its derivatives for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with its terms. Usage includes creating any content, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You must require all users of the model or its derivatives to comply with the terms of this section.
## Part Three: Other Terms
#### 5. Trademarks and Related
This license does not grant you the right to use OpenBMB, OpenBMB Intelligence, MiniCPM trademarks, trade names, logos, or otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.
#### 6. Disclaimer
Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplemental materials "as is," without any warranty or condition, express or implied, including but not limited to all express and implied warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivatives, and supplemental materials, and assume any risks associated with exercising the permissions under this license.
## Appendix A: Usage Restrictions
You agree not to use the model or its derivatives for:
- Any use that violates applicable national or international laws or regulations or infringes upon the legal rights and interests of any third party;
- Any military purposes;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiable false information and/or content with the intent to harm others;
- Generating or disseminating inappropriate content that is subject to applicable regulatory requirements;
- Unauthorized generation or dissemination of personally identifiable information, or unreasonable use thereof;
- Defamation, demeaning, or otherwise harassing others;
- Fully automated decision-making that adversely affects individuals' legal rights or creates or modifies binding, enforceable obligations;
- Any use intended to or having the effect of discriminating or harming individuals or groups based on online or offline social behaviors or known or predicted personal characteristics;
- Exploiting the vulnerabilities of specific groups due to their age, social, physical, or psychological characteristics, in a manner that materially distorts the behavior of group members, leading to or likely leading to physical or psychological harm to the person or others;
- Any use intended to or having the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

(new file, +43 lines; the Chinese-language version of the MiniCPM Model License, translated to English)

Version 1.0, June 5, 2024
Copyright © 2024 OpenBMB
## Part One: Preamble
We are open-sourcing the entire series of the globally leading MiniCPM edge-side models, including the flagship edge-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the globally leading edge-side multimodal MiniCPM-V series. The above weights are completely open for all academic research, and commercial use is also permitted after completing a registration questionnaire. Community use of the MiniCPM series models must comply with Apache 2.0 and the "MiniCPM Model Community License Agreement".
You and the MiniCPM development team therefore agree to the following "MiniCPM Model Commercial License Agreement":
## Part Two: Licensing and Redistribution
#### 1. Grant of Rights
You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license, under OpenBMB's intellectual property or other rights in the MiniCPM materials, to use, copy, distribute, reproduce, create derivative works from, and modify the MiniCPM materials.
#### 2. Distribution and Redistribution
- If you distribute or provide the MiniCPM series model materials (or any derivative works thereof), or a product or service that uses any of them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you use the MiniCPM series models to create, train, fine-tune, or improve an AI model, that model must include "MiniCPM" in its name.
- You must retain the following attribution statement in all distributed MiniCPM-related materials: "面壁MiniCPM is licensed under the MiniCPM Model Community License, Copyright © 面壁智能 Platforms, Inc. All rights reserved."
- Your use of the MiniCPM materials must comply with applicable laws and regulations and with the "MiniCPM Model Community License Agreement", which is incorporated into this agreement by reference.
- You may not use the MiniCPM series models or their outputs and results to improve any other large language model (other than MiniCPM or its derivatives).
#### 3. Additional Commercial Terms
If you or your affiliates' services or products deploy the model on no more than 5,000 edge-side devices, or provide applications with fewer than 1 million daily active users (DAU), you may apply to 面壁智能 (ModelBest) directly and, after completing the registration questionnaire, may be allowed to use it commercially free of charge. Otherwise, please email cpm@modelbest.cn to apply for authorization; we may decide at our own discretion whether to authorize, and the term and scope of any authorization. Until we grant written authorization, you have no right to exercise any commercial rights, nor may you use the model for any commercial purpose.
#### 4. Usage-based Restrictions
The restrictions set out in Appendix A are considered usage-based restrictions. You therefore may not use the model or its derivative works for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with the license. Use includes creating any content, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You shall require all users of the model or its derivative works to comply with the terms of this section.
## Part Three: Other Terms
#### 5. Trademarks and Related Rights
This license does not grant you the right to use the OpenBMB, 面壁智能, or MiniCPM trademarks, trade names, or logos, or to otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.
#### 6. Disclaimer
Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplemental materials "as is", without warranty or condition of any kind, express or implied, including but not limited to all express and implied warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivative works, and supplemental materials, and you assume any risks arising from exercising rights under this license.
## Appendix A: Usage Restrictions
You agree not to use the model or its derivative works for:
- Any manner that violates applicable national or international laws or regulations, or infringes the lawful rights and interests of any third party;
- Any military purpose;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiably false information and/or content with the intent to harm others;
- Generating or disseminating inappropriate content that is subject to applicable regulatory requirements;
- Generating or disseminating personally identifiable information without authorization, or making unreasonable use of it;
- Defaming, demeaning, or otherwise harassing others;
- Fully automated decision-making that adversely affects an individual's legal rights, or creates or modifies binding, enforceable obligations;
- Any use intended to discriminate against or harm, or having the effect of discriminating against or harming, individuals or groups on the basis of online or offline social behavior or known or predicted personal characteristics;
- Exploiting the vulnerabilities of a specific group due to its age or social, physical, or psychological characteristics, in a manner that materially distorts the behavior of its members and causes or is likely to cause physical or psychological harm to them or others;
- Any use intended to discriminate against, or having the effect of discriminating against, individuals or groups on the basis of legally protected characteristics or categories.

README.md (+2780 lines): file diff suppressed because it is too large.
Two additional file diffs suppressed because they are too large (file names not shown).

Binary assets changed (contents not shown): assets/MiniCPM-o.png (new, 373 KiB), assets/discord.png (new, 272 B), assets/join.png (new, 868 B), assets/radar.jpg (new, 842 KiB), assets/wechat-QR.jpeg (new, 12 KiB), assets/wechat.png (new, 245 B), plus numerous unnamed image assets added and removed (ranging from a few hundred bytes to 22 MiB).

assets/logo.html (new file, +3 lines)
<span style="color:#56A7DA; font-size: 10em; font-weight: bold;">
MiniCPM-<span>o</span>
</span>

(new file, +23 lines)
# MiniCPM-V Best Practices
**MiniCPM-V** is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image, video and text as inputs and provide high-quality text output, aiming to achieve **strong performance and efficient deployment**. The most notable models in this series currently include MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.6. The following sections provide detailed tutorials and guidance for each version of the MiniCPM-V models.
## MiniCPM-V 2.6
MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single image, multi-image and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet** in single image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Thanks to its superior token density, MiniCPM-V 2.6 is the first model to support real-time video understanding on end-side devices such as the iPad.
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/GeHMwLMa0i2FhUkV0f6cz3HWnV1)
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/YvsPwnPwWiqUjlkmW0scQ76TnBb)
## MiniCPM-Llama3-V 2.5
MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0.
* [Quantization Tutorial](https://modelbest.feishu.cn/wiki/Kc7ywV4X1ipSaAkuPFOc9SFun8b)
* [Training Tutorial](https://modelbest.feishu.cn/wiki/UpSiw63o9iGDhIklmwScX4a6nhW)
* [End-side Deployment](https://modelbest.feishu.cn/wiki/Lwr9wpOQdinr6AkLzHrc9LlgnJD)
* [Deployment Tutorial](https://modelbest.feishu.cn/wiki/LTOKw3Hz7il9kGkCLX9czsennKe)
* [HD Decoding Tutorial](https://modelbest.feishu.cn/wiki/Ug8iwdXfhiHVsDk2gGEco6xnnVg)
* [Model Structure](https://modelbest.feishu.cn/wiki/ACtAw9bOgiBQ9lkWyafcvtVEnQf)

(new file, +22 lines; the Chinese-language counterpart of the best-practices page above, translated to English)

# MiniCPM-V Best Practices
**MiniCPM-V** is a series of edge-side multimodal large models for vision-language understanding. The models take image and text inputs and provide high-quality text output. Since February 2024 we have released five model versions in pursuit of **leading performance and efficient deployment**. The currently most notable models in the series include:
## MiniCPM-V 2.6
The latest and best-performing model in the MiniCPM-V series. With 8B total parameters, it **surpasses GPT-4V** in single-image, multi-image, and video understanding. In single-image understanding it outperforms commercial closed-source models such as **GPT-4o mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet**, and it further refines MiniCPM-Llama3-V 2.5's OCR, trustworthy behavior, multilingual support, edge-side deployment, and many other features. Thanks to its leading visual token density, MiniCPM-V 2.6 is the first multimodal large model to support real-time video understanding on edge devices such as the iPad.
* [Deployment tutorial](https://modelbest.feishu.cn/wiki/LZxLwp4Lzi29vXklYLFchwN5nCf)
* [Training tutorial](https://modelbest.feishu.cn/wiki/HvfLwYzlIihqzXkmeCdczs6onmd)
* [Quantization tutorial](https://modelbest.feishu.cn/wiki/PAsHw6N6xiEy0DkJWpJcIocRnz9)
## MiniCPM-Llama3-V 2.5
MiniCPM-Llama3-V 2.5 is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. Its performance improves significantly over MiniCPM-V 2.0.
* [Quantization tutorial](https://modelbest.feishu.cn/wiki/O0KTwQV5piUPzTkRXl9cSFyHnQb)
* [Training tutorial](https://modelbest.feishu.cn/wiki/MPkPwvONEiZm3BkWMnyc83Tin4d)
* [Edge-side deployment](https://modelbest.feishu.cn/wiki/CZZJw1EDGitSSZka664cZwbWnrb)
* [Deployment tutorial](https://modelbest.feishu.cn/wiki/BcHIwjOLGihJXCkkSdMc2WhbnZf)
* [HD decoding tutorial](https://modelbest.feishu.cn/wiki/L0ajwm8VAiiPY6kDZfJce3B7nRg)
* [Model structure](https://modelbest.feishu.cn/wiki/X15nwGzqpioxlikbi2RcXDpJnjd)

(new file, +446 lines)
# Best Practice with LLaMA-Factory
## Contents <!-- omit in toc -->
- [Supported Models](#supported-models)
- [LLaMA-Factory Installation](#llama-factory-installation)
- [Dataset Preparation](#dataset-preparation)
  - [Image Dataset](#image-dataset)
  - [Video Dataset](#video-dataset)
  - [Audio Dataset](#audio-dataset)
- [LoRA Fine-Tuning](#lora-fine-tuning)
- [Full-Parameter Fine-Tuning](#full-parameter-fine-tuning)
- [Inference](#inference)
## Supported Models
* [openbmb/MiniCPM-V-4](https://huggingface.co/openbmb/MiniCPM-V-4)
* [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6)
* [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)
## LLaMA-Factory Installation
You can install LLaMA-Factory with the commands below.
```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,minicpm_v]"
mkdir configs # let's put all yaml files here
```
## Dataset Preparation
Refer to [data/dataset_info.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) to register your customised dataset, as sketched below. Let's use the three existing demo datasets `mllm_demo`, `mllm_video_demo`, and `mllm_audio_demo` as examples (audio is only for MiniCPM-o-2.6).
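As a rough sketch of what a registration entry looks like (the field names mirror the existing `mllm_demo` entry in the linked file; `my_mllm_dataset` and its file name are hypothetical, and the exact schema may vary across LLaMA-Factory versions):
```json
{
  "my_mllm_dataset": {
    "file_name": "my_mllm_dataset.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```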
### Image Dataset
Refer to image sft demo data: [data/mllm_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_demo.json)
<details>
<summary>
<b>data/mllm_demo.json</b>
</summary>
```json
[
{
"messages": [
{
"content": "<image>Who are they?",
"role": "user"
},
{
"content": "They're Kane and Gretzka from Bayern Munich.",
"role": "assistant"
},
{
"content": "What are they doing?",
"role": "user"
},
{
"content": "They are celebrating on the soccer field.",
"role": "assistant"
}
],
"images": [
"mllm_demo_data/1.jpg"
]
},
{
"messages": [
{
"content": "<image>Who is he?",
"role": "user"
},
{
"content": "He's Thomas Muller from Bayern Munich.",
"role": "assistant"
},
{
"content": "Why is he on the ground?",
"role": "user"
},
{
"content": "Because he's sliding on his knees to celebrate.",
"role": "assistant"
}
],
"images": [
"mllm_demo_data/2.jpg"
]
},
{
"messages": [
{
"content": "<image>Please describe this image",
"role": "user"
},
{
"content": "Chinese astronaut Gui Haichao is giving a speech.",
"role": "assistant"
},
{
"content": "What has he accomplished?",
"role": "user"
},
{
"content": "He was appointed to be a payload specialist on Shenzhou 16 mission in June 2022, thus becoming the first Chinese civilian of Group 3 in space on 30 May 2023. He is responsible for the on-orbit operation of space science experimental payloads.",
"role": "assistant"
}
],
"images": [
"mllm_demo_data/3.jpg"
]
}
]
```
</details>
### Video Dataset
Refer to video sft demo data: [data/mllm_video_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_video_demo.json)
<details>
<summary>
<b>data/mllm_video_demo.json</b>
</summary>
```json
[
{
"messages": [
{
"content": "<video>Why is this video funny?",
"role": "user"
},
{
"content": "Because a baby is reading, and he is so cute!",
"role": "assistant"
}
],
"videos": [
"mllm_demo_data/1.mp4"
]
},
{
"messages": [
{
"content": "<video>What is she doing?",
"role": "user"
},
{
"content": "She is cooking.",
"role": "assistant"
}
],
"videos": [
"mllm_demo_data/2.avi"
]
},
{
"messages": [
{
"content": "<video>What's in the video?",
"role": "user"
},
{
"content": "A baby is playing in the living room.",
"role": "assistant"
}
],
"videos": [
"mllm_demo_data/3.mp4"
]
}
]
```
</details>
### Audio Dataset
Refer to audio sft demo data: [data/mllm_audio_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_audio_demo.json)
<details>
<summary>
<b>data/mllm_audio_demo.json</b>
</summary>
```json
[
{
"messages": [
{
"content": "<audio>What's that sound?",
"role": "user"
},
{
"content": "It is the sound of glass shattering.",
"role": "assistant"
}
],
"audios": [
"mllm_demo_data/1.mp3"
]
},
{
"messages": [
{
"content": "<audio>What can you hear?",
"role": "user"
},
{
"content": "A woman is coughing.",
"role": "assistant"
}
],
"audios": [
"mllm_demo_data/2.wav"
]
},
{
"messages": [
{
"content": "<audio>What does the person say?",
"role": "user"
},
{
"content": "Mister Quiller is the apostle of the middle classes and we are glad to welcome his gospel.",
"role": "assistant"
}
],
"audios": [
"mllm_demo_data/3.flac"
]
}
]
```
</details>
## LoRA Fine-Tuning
We can run LoRA SFT with a single command:
```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train configs/minicpmo_2_6_lora_sft.yaml
```
<details>
<summary>
<b>configs/minicpmo_2_6_lora_sft.yaml</b>
</summary>
```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj
### dataset
dataset: mllm_demo # mllm_demo mllm_video_demo mllm_audio_demo
template: minicpm_o # minicpm_o minicpm_v
cutoff_len: 3072
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpmo_2_6/lora/sft
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
save_only_model: true
### eval
do_eval: false
```
</details>
### LoRA Model Export
Export the fine-tuned LoRA model with one command:
```shell
llamafactory-cli export configs/minicpmo_2_6_lora_export.yaml
```
<details>
<summary>
<b>configs/minicpmo_2_6_lora_export.yaml</b>
</summary>
```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
adapter_name_or_path: saves/minicpmo_2_6/lora/sft
template: minicpm_o # minicpm_o minicpm_v
finetuning_type: lora
trust_remote_code: true
### export
export_dir: models/minicpmo_2_6_lora_sft
export_size: 2
export_device: cpu
export_legacy_format: false
```
</details>
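The exported directory can then be loaded like any other checkpoint. A minimal sketch (the path comes from `export_dir` in the config above, and the loading pattern follows the official inference code later on this page):
```python
import torch
from transformers import AutoModel, AutoTokenizer

# export_dir from configs/minicpmo_2_6_lora_export.yaml above
model_id = "models/minicpmo_2_6_lora_sft"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```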
## Full-Parameter Fine-Tuning
We can run full-parameter SFT with a single command:
```shell
llamafactory-cli train configs/minicpmo_2_6_full_sft.yaml
```
<details>
<summary>
<b>configs/minicpmo_2_6_full_sft.yaml</b>
</summary>
```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
trust_remote_code: true
freeze_vision_tower: true
print_param_status: true
flash_attn: fa2
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: configs/deepspeed/ds_z2_config.json
### dataset
dataset: mllm_demo # mllm_demo mllm_video_demo
template: minicpm_o # minicpm_o minicpm_v
cutoff_len: 3072
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/minicpmo_2_6/full/sft
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_total_limit: 10
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
save_only_model: true
### eval
do_eval: false
```
</details>
## Inference
### Web UI ChatBox
Refer to the [LLaMA-Factory docs](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#inferring-lora-fine-tuned-models) for more inference options.
For example, we can launch a web chat with a single command:
```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat configs/minicpmo_2_6_infer.yaml
```
<details>
<summary>
<b>configs/minicpmo_2_6_infer.yaml</b>
</summary>
```yaml
model_name_or_path: saves/minicpmo_2_6/full/sft
template: minicpm_o # minicpm_o minicpm_v
infer_backend: huggingface
trust_remote_code: true
```
</details>
### Official Code
You can also run inference with the official code:
<details>
<summary>
<b>official inference code</b>
</summary>
```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model_id = "saves/minicpmo_2_6/full/sft"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
image = Image.open('data/mllm_demo_data/1.jpg').convert('RGB')
question = 'Who are they?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(res)
```
</details>

(new file, +333 lines)
## MiniCPM-Llama3-V 2.5
> Archived at: 2025-01-13
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
- 🔥 **Leading Performance.**
MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.
- 💪 **Strong OCR Capabilities.**
MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.
- 🏆 **Trustworthy Behavior.**
Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best-level performance within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).
- 🌏 **Multilingual Support.**
Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
- 🚀 **Efficient Deployment.**
MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.
- 💫 **Easy Usage.**
MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).
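As a minimal sketch of the plain Hugging Face usage path above (the `model.chat` call mirrors the official inference code shown elsewhere on this page; `example.jpg` is a placeholder, and exact arguments may differ by version):
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')  # placeholder image path
msgs = [{'role': 'user', 'content': 'What is in the image?'}]
res = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)
print(res)
```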
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/MiniCPM-Llama3-V-2.5-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>OCRBench</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>Open-Compass</th>
<th>MME</th>
<th>MMB test (en)</th>
<th>MMB test (cn)</th>
<th>MMMU val</th>
<th>Math-Vista</th>
<th>LLaVA Bench</th>
<th>RealWorld QA</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro</td>
<td>-</td>
<td>680</td>
<td>74.6</td>
<td>88.1</td>
<td>62.9</td>
<td>2148.9</td>
<td>73.6</td>
<td>74.3</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>60.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td>
<td>-</td>
<td>645</td>
<td>78.0</td>
<td>88.4</td>
<td>63.5</td>
<td>1771.5</td>
<td>77.0</td>
<td>74.4</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>63.0</td>
<td>86.4</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini</td>
<td>2.2B</td>
<td>-</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>1653.0</td>
<td>-</td>
<td>-</td>
<td>31.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
<td>9.6B</td>
<td>488</td>
<td>61.5</td>
<td>62.6</td>
<td>51.6</td>
<td>1860.0</td>
<td>61.8</td>
<td>56.3</td>
<td>37.0</td>
<td>33.8</td>
<td>67.7</td>
<td>49.3</td>
<td>56.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">DeepSeek-VL-7B</td>
<td>7.3B</td>
<td>435</td>
<td>64.7*</td>
<td>47.0*</td>
<td>54.6</td>
<td>1765.4</td>
<td>73.8</td>
<td>71.4</td>
<td>38.3</td>
<td>36.8</td>
<td>77.8</td>
<td>54.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Yi-VL-34B</td>
<td>34B</td>
<td>290</td>
<td>43.4*</td>
<td>16.9*</td>
<td>52.2</td>
<td><strong>2050.2</strong></td>
<td>72.4</td>
<td>70.7</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>54.8</td>
<td>79.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM-Chat</td>
<td>17.4B</td>
<td>590</td>
<td>70.4</td>
<td>33.3*</td>
<td>54.2</td>
<td>1736.6</td>
<td>65.8</td>
<td>55.9</td>
<td>37.3</td>
<td>34.7</td>
<td>73.9</td>
<td>60.3</td>
<td>73.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">TextMonkey</td>
<td>9.7B</td>
<td>558</td>
<td>64.3</td>
<td>66.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Idefics2</td>
<td>8.0B</td>
<td>-</td>
<td>73.0</td>
<td>74.0</td>
<td>57.2</td>
<td>1847.6</td>
<td>75.7</td>
<td>68.6</td>
<td>45.2</td>
<td>52.2</td>
<td>49.1</td>
<td>60.7</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>54.3</td>
<td>1920.3</td>
<td>77.0</td>
<td>73.9</td>
<td>41.3</td>
<td>31.5</td>
<td>61.2</td>
<td>58.8</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td>
<td>8.4B</td>
<td>-</td>
<td>-</td>
<td>78.2</td>
<td>-</td>
<td>1971.5</td>
<td>-</td>
<td>-</td>
<td>41.7</td>
<td>37.5</td>
<td>80.1</td>
<td>60.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Phi-3-vision-128k-instruct</td>
<td>4.2B</td>
<td>639*</td>
<td>70.9</td>
<td>-</td>
<td>-</td>
<td>1537.5*</td>
<td>-</td>
<td>-</td>
<td>40.4</td>
<td>44.5</td>
<td>64.2*</td>
<td>58.8*</td>
<td>-</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 1.0</td>
<td>2.8B</td>
<td>366</td>
<td>60.6</td>
<td>38.2</td>
<td>47.5</td>
<td>1650.2</td>
<td>64.1</td>
<td>62.6</td>
<td>38.3</td>
<td>28.9</td>
<td>51.3</td>
<td>51.2</td>
<td>78.4</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.0</td>
<td>2.8B</td>
<td>605</td>
<td>74.1</td>
<td>71.9</td>
<td>54.5</td>
<td>1808.6</td>
<td>69.1</td>
<td>66.5</td>
<td>38.2</td>
<td>38.7</td>
<td>69.2</td>
<td>55.8</td>
<td>85.5</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td>
<td>8.5B</td>
<td><strong>725</strong></td>
<td><strong>76.6</strong></td>
<td><strong>84.8</strong></td>
<td><strong>65.1</strong></td>
<td>2024.6</td>
<td><strong>77.2</strong></td>
<td><strong>74.2</strong></td>
<td><strong>45.8</strong></td>
<td><strong>54.3</strong></td>
<td><strong>86.7</strong></td>
<td><strong>63.5</strong></td>
<td><strong>89.7</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoint by ourselves.
</details>
<div align="center">
<img src="../assets/llavabench_compare_3.png" width="100%" />
<br>
Evaluation results of multilingual LLaVA Bench
</div>
### Examples <!-- omit in toc -->
<table align="center" >
<p align="center" >
<img src="../assets/minicpmv-llama3-v2.5/cases_all.png" />
</p>
</table>
### Model Zoo
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) &nbsp;&nbsp;[<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |

docs/minicpm_v2.md (new file, +299 lines)
## MiniCPM-V 2.0
> Archived at: 2025-01-13
**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built based on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0 has several notable features.
- 🔥 **State-of-the-art Performance.**
MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc) among models under 7B parameters. It even **outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **comparable performance to Gemini Pro in scene-text understanding**, and **state-of-the-art performance on OCRBench** among open-source models.
- 🏆 **Trustworthy Behavior.**
LMMs are known for suffering from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench.
- 🌟 **High-Resolution Images at Any Aspect Ratio.**
MiniCPM-V 2.0 can accept **1.8 million pixels (e.g., 1344x1344) images at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf).
- ⚡️ **High Efficiency.**
MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference even when dealing with high-resolution images**.
- 🙌 **Bilingual Support.**
MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/minicpmv-2-peformance.png width=66% />
</div>
<details>
<summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>OCRBench</th>
<th>OpenCompass</th>
<th nowrap="nowrap" >MME</th>
<th>MMB dev(en)</th>
<th>MMB dev(zh)</th>
<th>MMMU val</th>
<th>MathVista</th>
<th>LLaVA Bench</th>
<th nowrap="nowrap">Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="12" align="left"><strong>Proprietary models</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini Pro Vision</td>
<td>- </td>
<td>74.6</td>
<td>88.1</td>
<td>680</td>
<td>63.8</td>
<td>2148.9</td>
<td>75.2</td>
<td>74.0</td>
<td>48.9</td>
<td>45.8</td>
<td>79.9</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>- </td>
<td>78.0</td>
<td>88.4</td>
<td>645</td>
<td>63.2</td>
<td>1771.5</td>
<td>75.1</td>
<td>75.0</td>
<td>53.8</td>
<td>47.8</td>
<td>93.1</td>
<td>86.4 / 92.7</td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 6B~34B</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-6B</td>
<td align="right" >6.7B</td>
<td>45.5*</td>
<td>17.1*</td>
<td>290</td>
<td>49.3</td>
<td>1915.1 </td>
<td>68.6 </td>
<td>68.3 </td>
<td>40.3 </td>
<td>28.8 </td>
<td>51.9 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Qwen-VL-Chat</td>
<td align="right" >9.6B</td>
<td>61.5</td>
<td>62.6</td>
<td>488 </td>
<td>52.1 </td>
<td>1860.0 </td>
<td>60.6 </td>
<td>56.7 </td>
<td>37.0 </td>
<td>33.8 </td>
<td>67.7 </td>
<td>56.2 / 80.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Yi-VL-34B</td>
<td align="right" >34B</td>
<td>43.4*</td>
<td>16.9*</td>
<td>290</td>
<td>52.6 </td>
<td>2050.2</td>
<td>71.1</td>
<td>71.4</td>
<td>45.1</td>
<td>30.7</td>
<td>62.3</td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-7B</td>
<td align="right" >7.3B</td>
<td>64.7*</td>
<td>47.0* </td>
<td>435</td>
<td>55.6 </td>
<td>1765.4 </td>
<td>74.1 </td>
<td>72.8 </td>
<td>38.3 </td>
<td>36.8</td>
<td>77.8 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >TextMonkey</td>
<td align="right" >9.7B</td>
<td>64.3</td>
<td>66.7 </td>
<td>558</td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>- </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >CogVLM-Chat</td>
<td align="right" >17.4B</td>
<td>70.4</td>
<td>33.3*</td>
<td>590 </td>
<td>52.5 </td>
<td>1736.6 </td>
<td>63.7 </td>
<td>53.8 </td>
<td>37.3 </td>
<td>34.7 </td>
<td>73.9 </td>
<td>73.6 / 87.4 </td>
</tr>
<tr>
<td colspan="12" align="left"><strong>Open-source models 1B~3B </strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >DeepSeek-VL-1.3B</td>
<td align="right" >1.7B</td>
<td>58.4*</td>
<td>37.9*</td>
<td>413</td>
<td>46.0 </td>
<td>1531.6 </td>
<td>64.0 </td>
<td>61.2 </td>
<td>33.8 </td>
<td>29.4 </td>
<td>51.1 </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MobileVLM V2</td>
<td align="right" >3.1B</td>
<td>57.5</td>
<td>19.4*</td>
<td>-</td>
<td>-</td>
<td>1440.5(P) </td>
<td>63.2 </td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >Mini-Gemini</td>
<td align="right" >2.2B</td>
<td>56.2</td>
<td>34.2*</td>
<td>-</td>
<td>-</td>
<td>1653.0 </td>
<td>59.8 </td>
<td>- </td>
<td>31.7 </td>
<td>-</td>
<td>- </td>
<td>- </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" >MiniCPM-V</td>
<td align="right" >2.8B </td>
<td>60.6</td>
<td>38.2 </td>
<td>366</td>
<td>47.6</td>
<td>1650.2 </td>
<td>67.9 </td>
<td>65.3 </td>
<td><strong>38.3</strong></td>
<td>28.9</td>
<td>51.3 </td>
<td>78.4 / 88.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left" ><strong>MiniCPM-V 2.0</strong></td>
<td align="right" >2.8B </td>
<td><strong>74.1</strong></td>
<td><strong>71.9</strong> </td>
<td><strong>605</strong></td>
<td><strong>55.0</strong></td>
<td><strong>1808.6</strong> </td>
<td><strong>69.6</strong> </td>
<td><strong>68.1</strong> </td>
<td>38.2 </td>
<td><strong>38.7</strong></td>
<td><strong>69.2</strong> </td>
<td><strong>85.5 / 92.2 </strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoint by ourselves.
</details>
### Examples <!-- omit in toc -->
<table align="center">
<p align="center">
<img src="../assets/minicpmv2-cases_2.png" width=95%/>
</p>
</table>
We deploy MiniCPM-V 2.0 on end devices. The demo video is a raw screen recording on a Xiaomi 14 Pro, without editing.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/station.gif" width=36%/>
<img src="../assets/gif_cases/london_car.gif" width=36%/>
</p>
</table>
### Model Zoo
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) &nbsp;&nbsp; [<img src="../assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
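For quick reference, the sketch below shows roughly how MiniCPM-V 2.0 is loaded and queried with Hugging Face Transformers (a minimal sketch following the model card; `example.jpg` is a placeholder path, and the `chat` signature may differ between checkpoint revisions):
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='cuda', dtype=torch.bfloat16)  # assumes a GPU with bf16 support
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)
model.eval()

image = Image.open('example.jpg').convert('RGB')  # placeholder image path
msgs = [{'role': 'user', 'content': 'What is in the image?'}]

# MiniCPM-V 2.0's chat interface returns the answer plus the updated context.
res, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7
)
print(res)
```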
### Deployment on Mobile Phone
MiniCPM-V 2.0 can be deployed on mobile phones running Android. 🚀 Click [MiniCPM-V 2.0](https://github.com/OpenBMB/mlc-MiniCPM) to install the APK.

docs/minicpm_v2dot6.md (new file, 945 lines)
## MiniCPM-V 2.6
> Archived at: 2025-01-13
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
- 🔥 **Leading Performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.
- 💪 **Strong OCR Capability and Others.**
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
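As a concrete instance of option (2) above, the int4 checkpoint loads through the same Transformers interface; a minimal sketch, assuming a single CUDA GPU and a placeholder image path:
```python
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# The int4 checkpoint uses the same chat interface as the bf16 model but at
# roughly 7 GB of GPU memory instead of 17 GB; no explicit .cuda() is needed.
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6-int4', trust_remote_code=True)
model.eval()

image = Image.open('example.jpg').convert('RGB')  # placeholder image path
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```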
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/radar_final.png width=66% />
</div>
<details>
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>MME</th>
<th>MMVet</th>
<th>OCRBench</th>
<th>MMMU val</th>
<th>MathVista mini</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>HallusionBench</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="15" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o</td>
<td>-</td>
<td>1088</td>
<td>69.9</td>
<td>2328.7</td>
<td>69.1</td>
<td>736</td>
<td>69.2</td>
<td>61.3</td>
<td>82.2</td>
<td>84.6</td>
<td>-</td>
<td>92.8</td>
<td>55.0</td>
<td>17.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>1920.0</td>
<td>66.0</td>
<td>788</td>
<td>65.9</td>
<td>61.6</td>
<td>78.5</td>
<td>80.2</td>
<td>-</td>
<td>95.2</td>
<td>49.9</td>
<td>13.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>2110.6</td>
<td>64.0</td>
<td>754</td>
<td>60.6</td>
<td>57.7</td>
<td>73.9</td>
<td>79.1</td>
<td>73.5</td>
<td>86.5</td>
<td>45.6</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o mini</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>2003.4</td>
<td>66.9</td>
<td>785</td>
<td>60.0</td>
<td>52.4</td>
<td>76.0</td>
<td>77.8</td>
<td>-</td>
<td>-</td>
<td>46.1</td>
<td>12.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>1088</td>
<td>63.5</td>
<td>2070.2</td>
<td>67.5</td>
<td>656</td>
<td>61.7</td>
<td>54.7</td>
<td>79.8</td>
<td>78.6</td>
<td>78.0</td>
<td>87.2</td>
<td>43.9</td>
<td>14.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Step-1V</td>
<td>-</td>
<td>-</td>
<td>59.5</td>
<td>2206.4</td>
<td>63.3</td>
<td>625</td>
<td>49.9</td>
<td>44.8</td>
<td>78.0</td>
<td>79.2</td>
<td>71.6</td>
<td>-</td>
<td>48.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
<td>-</td>
<td>784</td>
<td>58.3</td>
<td>2281.7</td>
<td>61.8</td>
<td>684</td>
<td>52.0</td>
<td>43.4</td>
<td>74.6</td>
<td>75.7</td>
<td>79.5</td>
<td>93.1</td>
<td>41.2</td>
<td>13.4</td>
</tr>
<tr>
<td colspan="15" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
<td>34B</td>
<td>157</td>
<td>55.0</td>
<td>2006.5</td>
<td>50.7</td>
<td>574</td>
<td>48.8</td>
<td>40.4</td>
<td>77.8</td>
<td>78.9</td>
<td>69.3</td>
<td>-</td>
<td>34.8</td>
<td>12.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
<td>34B</td>
<td>157</td>
<td>-</td>
<td>2141.0</td>
<td>59.3</td>
<td>518</td>
<td>48.0</td>
<td>43.3</td>
<td>-</td>
<td>80.5</td>
<td>74.1</td>
<td>78.9</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td>1820</td>
<td>58.3</td>
<td>2049.9</td>
<td>53.2</td>
<td>591</td>
<td>50.4</td>
<td>50.3</td>
<td>77.8</td>
<td>79.5</td>
<td>76.7</td>
<td>75.5</td>
<td>41.6</td>
<td>14.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>2018.8</td>
<td>58.0</td>
<td>776</td>
<td>46.9</td>
<td>51.1</td>
<td>67.9</td>
<td>71.2</td>
<td>-</td>
<td>-</td>
<td>45.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>706</td>
<td>64.1</td>
<td>2215.1</td>
<td>54.3</td>
<td>794</td>
<td><strong>51.2</strong></td>
<td>58.3</td>
<td><strong>79.4</strong></td>
<td><strong>83.6</strong></td>
<td>77.4</td>
<td><strong>91.6</strong></td>
<td>45.0</td>
<td>21.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
<td>8B</td>
<td>1882</td>
<td>58.8</td>
<td>2024.6</td>
<td>52.8</td>
<td>725</td>
<td>45.8</td>
<td>54.3</td>
<td>72.0</td>
<td>78.4</td>
<td>76.6</td>
<td>84.8</td>
<td>42.4</td>
<td>10.3</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>65.2</strong></td>
<td><strong>2348.4</strong>*</td>
<td><strong>60.0</strong></td>
<td><strong>852</strong>*</td>
<td>49.8*</td>
<td><strong>60.6</strong></td>
<td>78.0</td>
<td>82.1</td>
<td><strong>80.1</strong></td>
<td>90.8</td>
<td><strong>48.1</strong>*</td>
<td><strong>8.2</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate this benchmark using chain-of-thought prompting; for MME, we apply it only to the Cognition set.
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
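To make the definition concrete, the arithmetic behind the MiniCPM-V 2.6 entry in the table works out as follows (using the 1344x1344 maximum resolution cited above):
```python
# Token Density = # pixels at maximum resolution / # visual tokens
max_pixels = 1344 * 1344            # 1,806,336 pixels, i.e. ~1.8M
num_visual_tokens = 640             # visual tokens produced for such an image
print(round(max_pixels / num_visual_tokens))  # -> 2822, matching the table entry
```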
</details>
<details>
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Mantis Eval</th>
<th>BLINK val</th>
<th>Mathverse mv</th>
<th>Sciverse mv</th>
<th>MIRB</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>60.3</td>
<td>66.9</td>
<td>53.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
<td>14B</td>
<td>66.4</td>
<td>52.6</td>
<td>32.7</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td colspan="7" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Emu2-Chat</td>
<td>37B</td>
<td>37.8</td>
<td>36.2</td>
<td>-</td>
<td>27.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM</td>
<td>17B</td>
<td>45.2</td>
<td>41.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VPG-C</td>
<td>7B</td>
<td>52.4</td>
<td>43.1</td>
<td>24.3</td>
<td>23.1</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA 8B</td>
<td>8B</td>
<td>51.2</td>
<td>39.3</td>
<td>-</td>
<td>36.5</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>53.1*</td>
<td>48.9</td>
<td>32.1*</td>
<td>-</td>
<td>42.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>59.0*</td>
<td>50.9</td>
<td>30.5*</td>
<td>34.4*</td>
<td><strong>56.9*</strong></td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>69.1</strong></td>
<td><strong>53.0</strong></td>
<td><strong>84.9</strong></td>
<td><strong>74.9</strong></td>
<td>53.8</td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoints ourselves.
</details>
<details>
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th colspan="2">Video-MME</th>
<th colspan="5">Video-ChatGPT</th>
</tr>
<tr>
<th align="left"></th>
<th></th>
<th>w/o subs</th>
<th>w subs</th>
<th>Correctness</th>
<th>Detail</th>
<th>Context</th>
<th>Temporal</th>
<th>Consistency</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="9" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>60.0</td>
<td>62.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>59.9</td>
<td>63.3</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
<td>7B</td>
<td>-</td>
<td>-</td>
<td>3.39</td>
<td>3.29</td>
<td>3.92</td>
<td>2.60</td>
<td>3.12</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
<td>34B</td>
<td>-</td>
<td>-</td>
<td>3.29</td>
<td>3.23</td>
<td>3.83</td>
<td>2.51</td>
<td>3.47</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
<td>12B</td>
<td>-</td>
<td>-</td>
<td>3.49</td>
<td><strong>3.46</strong></td>
<td>3.23</td>
<td><strong>2.98</strong></td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>52.4</td>
<td>54.3</td>
<td>3.05</td>
<td>3.09</td>
<td>3.77</td>
<td>2.44</td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>54.0</td>
<td>56.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>55.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
<td>32B</td>
<td>60.2</td>
<td>63.0</td>
<td>3.48</td>
<td>3.37</td>
<td><strong>3.95</strong></td>
<td>2.64</td>
<td>3.28</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>60.9</strong></td>
<td><strong>63.6</strong></td>
<td><strong>3.59</strong></td>
<td>3.28</td>
<td>3.93</td>
<td>2.73</td>
<td>3.62</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Shot</th>
<th>TextVQA val</th>
<th>VizWiz test-dev</th>
<th>VQAv2 test-dev</th>
<th>OK-VQA val</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>35.0</td>
<td>31.6</td>
<td>56.3</td>
<td>40.6</td>
</tr>
<tr>
<td>4</td>
<td>36.5</td>
<td>39.6</td>
<td>63.1</td>
<td><strong>57.4</strong></td>
</tr>
<tr>
<td>8</td>
<td>37.3</td>
<td>44.8</td>
<td>65.6</td>
<td>57.5</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>30.9</td>
<td>36.0</td>
<td>60.0</td>
<td>45.2</td>
</tr>
<tr>
<td>4</td>
<td>34.3</td>
<td>40.4</td>
<td>63.6</td>
<td>52.4</td>
</tr>
<tr>
<td>8</td>
<td>35.7</td>
<td>46.1</td>
<td>64.8</td>
<td>55.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
<td rowspan="3">7B</td>
<td>0*</td>
<td>43.0</td>
<td>49.8</td>
<td>63.2</td>
<td>45.5</td>
</tr>
<tr>
<td>4</td>
<td>45.4</td>
<td>51.3</td>
<td>64.5</td>
<td>46.5</td>
</tr>
<tr>
<td>8</td>
<td>45.6</td>
<td>52.2</td>
<td>64.7</td>
<td>46.6</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
<td rowspan="3">37B</td>
<td>0</td>
<td>26.4</td>
<td>40.4</td>
<td>33.5</td>
<td>26.7</td>
</tr>
<tr>
<td>4</td>
<td>48.2</td>
<td>54.6</td>
<td>67.0</td>
<td>53.2</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td>67.8</td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
<td rowspan="2">30B</td>
<td>0</td>
<td>26.2</td>
<td>40.4</td>
<td>48.9</td>
<td>26.7</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td><strong>70.9</strong></td>
<td>54.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
<td rowspan="3">8B</td>
<td>0</td>
<td>43.9</td>
<td>33.8</td>
<td>45.4</td>
<td>23.9</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>4</td>
<td>63.6</td>
<td>60.5</td>
<td>65.5</td>
<td>50.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>8</td>
<td><strong>64.6</strong></td>
<td><strong>63.4</strong></td>
<td>68.2</td>
<td>51.4</td>
</tr>
</tbody>
</table>
</div>
* denotes zero image shots plus two additional text shots, following Flamingo.
<sup>+</sup> We evaluate the pretraining checkpoint without SFT.
</details>
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more cases.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
</div>
</details>
We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw, unedited screen recording on an iPad Pro.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ai.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/beer.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ticket.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/wfh.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
</p>
</table>
</details>
### Multi-turn Conversation
<div align="center">
<img src="../assets/airplane.jpeg" width="500px">
</div>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
torch.manual_seed(0)
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('./assets/airplane.jpeg').convert('RGB')

# First round chat
question = "Tell me the model of this aircraft."
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

# Second round chat: pass the history context of the multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
You should get output similar to the following:
```
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
```
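If you prefer token-by-token output, the same multi-turn call also supports streaming; a minimal sketch, assuming your checkpoint exposes the `stream=True` option described on the model card:
```python
# Streaming variant of the second-round call above; with stream=True the
# chat method returns a generator of text chunks instead of a full string.
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)
for new_text in res:
    print(new_text, end='', flush=True)
```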
#### Multi-image Understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### Few-shot In-Context-Learning
<details>
<summary> Click to view a Python example of MiniCPM-V 2.6 few-shot in-context learning </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### Video Understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # reduce this if CUDA runs out of memory

def encode_video(video_path):
    # Evenly sample n indices from the list l
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # stride of ~1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]
# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448x448
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
</details>

docs/minicpm_v2dot6_en.md (new file, 953 lines)
## MiniCPM-V 2.6
> Archived at: 2025-01-13
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
- 🔥 **Leading Performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding.
- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles.
- 💪 **Strong OCR Capability and Others.**
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** in English, Chinese, German, French, Italian, Korean, etc.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
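For option (3) above, single-image inference through vLLM follows roughly the pattern below (an illustrative sketch, not an official example: the image placeholder string, `max_model_len`, and sampling values are assumptions to verify against the vLLM documentation for this model):
```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model='openbmb/MiniCPM-V-2_6', trust_remote_code=True, max_model_len=4096)

image = Image.open('example.jpg').convert('RGB')  # placeholder image path
# MiniCPM-V prompts mark the image position with a placeholder token sequence;
# take the exact template from the tokenizer's chat template if unsure.
prompt = '(<image>./</image>)\nWhat is shown in this image?'

outputs = llm.generate(
    {'prompt': prompt, 'multi_modal_data': {'image': image}},
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```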
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=../assets/radar_final.png width=66% />
</div>
<details>
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>MME</th>
<th>MMVet</th>
<th>OCRBench</th>
<th>MMMU val</th>
<th>MathVista mini</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>HallusionBench</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="15" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o</td>
<td>-</td>
<td>1088</td>
<td>69.9</td>
<td>2328.7</td>
<td>69.1</td>
<td>736</td>
<td>69.2</td>
<td>61.3</td>
<td>82.2</td>
<td>84.6</td>
<td>-</td>
<td>92.8</td>
<td>55.0</td>
<td>17.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>1920.0</td>
<td>66.0</td>
<td>788</td>
<td>65.9</td>
<td>61.6</td>
<td>78.5</td>
<td>80.2</td>
<td>-</td>
<td>95.2</td>
<td>49.9</td>
<td>13.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>2110.6</td>
<td>64.0</td>
<td>754</td>
<td>60.6</td>
<td>57.7</td>
<td>73.9</td>
<td>79.1</td>
<td>73.5</td>
<td>86.5</td>
<td>45.6</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o mini</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>2003.4</td>
<td>66.9</td>
<td>785</td>
<td>60.0</td>
<td>52.4</td>
<td>76.0</td>
<td>77.8</td>
<td>-</td>
<td>-</td>
<td>46.1</td>
<td>12.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>1088</td>
<td>63.5</td>
<td>2070.2</td>
<td>67.5</td>
<td>656</td>
<td>61.7</td>
<td>54.7</td>
<td>79.8</td>
<td>78.6</td>
<td>78.0</td>
<td>87.2</td>
<td>43.9</td>
<td>14.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Step-1V</td>
<td>-</td>
<td>-</td>
<td>59.5</td>
<td>2206.4</td>
<td>63.3</td>
<td>625</td>
<td>49.9</td>
<td>44.8</td>
<td>78.0</td>
<td>79.2</td>
<td>71.6</td>
<td>-</td>
<td>48.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
<td>-</td>
<td>784</td>
<td>58.3</td>
<td>2281.7</td>
<td>61.8</td>
<td>684</td>
<td>52.0</td>
<td>43.4</td>
<td>74.6</td>
<td>75.7</td>
<td>79.5</td>
<td>93.1</td>
<td>41.2</td>
<td>13.4</td>
</tr>
<tr>
<td colspan="15" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
<td>34B</td>
<td>157</td>
<td>55.0</td>
<td>2006.5</td>
<td>50.7</td>
<td>574</td>
<td>48.8</td>
<td>40.4</td>
<td>77.8</td>
<td>78.9</td>
<td>69.3</td>
<td>-</td>
<td>34.8</td>
<td>12.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
<td>34B</td>
<td>157</td>
<td>-</td>
<td>2141.0</td>
<td>59.3</td>
<td>518</td>
<td>48.0</td>
<td>43.3</td>
<td>-</td>
<td>80.5</td>
<td>74.1</td>
<td>78.9</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td>1820</td>
<td>58.3</td>
<td>2049.9</td>
<td>53.2</td>
<td>591</td>
<td>50.4</td>
<td>50.3</td>
<td>77.8</td>
<td>79.5</td>
<td>76.7</td>
<td>75.5</td>
<td>41.6</td>
<td>14.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>2018.8</td>
<td>58.0</td>
<td>776</td>
<td>46.9</td>
<td>51.1</td>
<td>67.9</td>
<td>71.2</td>
<td>-</td>
<td>-</td>
<td>45.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>706</td>
<td>64.1</td>
<td>2215.1</td>
<td>54.3</td>
<td>794</td>
<td><strong>51.2</strong></td>
<td>58.3</td>
<td><strong>79.4</strong></td>
<td><strong>83.6</strong></td>
<td>77.4</td>
<td><strong>91.6</strong></td>
<td>45.0</td>
<td>21.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
<td>8B</td>
<td>1882</td>
<td>58.8</td>
<td>2024.6</td>
<td>52.8</td>
<td>725</td>
<td>45.8</td>
<td>54.3</td>
<td>72.0</td>
<td>78.4</td>
<td>76.6</td>
<td>84.8</td>
<td>42.4</td>
<td>10.3</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>65.2</strong></td>
<td><strong>2348.4</strong>*</td>
<td><strong>60.0</strong></td>
<td><strong>852</strong>*</td>
<td>49.8*</td>
<td><strong>60.6</strong></td>
<td>78.0</td>
<td>82.1</td>
<td><strong>80.1</strong></td>
<td>90.8</td>
<td><strong>48.1</strong>*</td>
<td><strong>8.2</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate this benchmark using chain-of-thought prompting; for MME, we apply it only to the Cognition set.
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
</details>
<details>
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Mantis Eval</th>
<th>BLINK val</th>
<th>Mathverse mv</th>
<th>Sciverse mv</th>
<th>MIRB</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>60.3</td>
<td>66.9</td>
<td>53.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
<td>14B</td>
<td>66.4</td>
<td>52.6</td>
<td>32.7</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td colspan="7" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Emu2-Chat</td>
<td>37B</td>
<td>37.8</td>
<td>36.2</td>
<td>-</td>
<td>27.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM</td>
<td>17B</td>
<td>45.2</td>
<td>41.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VPG-C</td>
<td>7B</td>
<td>52.4</td>
<td>43.1</td>
<td>24.3</td>
<td>23.1</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA 8B</td>
<td>8B</td>
<td>51.2</td>
<td>39.3</td>
<td>-</td>
<td>36.5</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>53.1*</td>
<td>48.9</td>
<td>32.1*</td>
<td>-</td>
<td>42.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>59.0*</td>
<td>50.9</td>
<td>30.5*</td>
<td>34.4*</td>
<td><strong>56.9*</strong></td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>69.1</strong></td>
<td><strong>53.0</strong></td>
<td><strong>84.9</strong></td>
<td><strong>74.9</strong></td>
<td>53.8</td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoints ourselves.
</details>
<details>
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th colspan="2">Video-MME</th>
<th colspan="5">Video-ChatGPT</th>
</tr>
<tr>
<th align="left"></th>
<th></th>
<th>w/o subs</th>
<th>w subs</th>
<th>Correctness</th>
<th>Detail</th>
<th>Context</th>
<th>Temporal</th>
<th>Consistency</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="9" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>60.0</td>
<td>62.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>59.9</td>
<td>63.3</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
<td>7B</td>
<td>-</td>
<td>-</td>
<td>3.39</td>
<td>3.29</td>
<td>3.92</td>
<td>2.60</td>
<td>3.12</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
<td>34B</td>
<td>-</td>
<td>-</td>
<td>3.29</td>
<td>3.23</td>
<td>3.83</td>
<td>2.51</td>
<td>3.47</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
<td>12B</td>
<td>-</td>
<td>-</td>
<td>3.49</td>
<td><strong>3.46</strong></td>
<td>3.23</td>
<td><strong>2.98</strong></td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>52.4</td>
<td>54.3</td>
<td>3.05</td>
<td>3.09</td>
<td>3.77</td>
<td>2.44</td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>54.0</td>
<td>56.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>55.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
<td>32B</td>
<td>60.2</td>
<td>63.0</td>
<td>3.48</td>
<td>3.37</td>
<td><strong>3.95</strong></td>
<td>2.64</td>
<td>3.28</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>60.9</strong></td>
<td><strong>63.6</strong></td>
<td><strong>3.59</strong></td>
<td>3.28</td>
<td>3.93</td>
<td>2.73</td>
<td>3.62</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Shot</th>
<th>TextVQA val</th>
<th>VizWiz test-dev</th>
<th>VQAv2 test-dev</th>
<th>OK-VQA val</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>35.0</td>
<td>31.6</td>
<td>56.3</td>
<td>40.6</td>
</tr>
<tr>
<td>4</td>
<td>36.5</td>
<td>39.6</td>
<td>63.1</td>
<td><strong>57.4</strong></td>
</tr>
<tr>
<td>8</td>
<td>37.3</td>
<td>44.8</td>
<td>65.6</td>
<td>57.5</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>30.9</td>
<td>36.0</td>
<td>60.0</td>
<td>45.2</td>
</tr>
<tr>
<td>4</td>
<td>34.3</td>
<td>40.4</td>
<td>63.6</td>
<td>52.4</td>
</tr>
<tr>
<td>8</td>
<td>35.7</td>
<td>46.1</td>
<td>64.8</td>
<td>55.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
<td rowspan="3">7B</td>
<td>0*</td>
<td>43.0</td>
<td>49.8</td>
<td>63.2</td>
<td>45.5</td>
</tr>
<tr>
<td>4</td>
<td>45.4</td>
<td>51.3</td>
<td>64.5</td>
<td>46.5</td>
</tr>
<tr>
<td>8</td>
<td>45.6</td>
<td>52.2</td>
<td>64.7</td>
<td>46.6</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
<td rowspan="3">37B</td>
<td>0</td>
<td>26.4</td>
<td>40.4</td>
<td>33.5</td>
<td>26.7</td>
</tr>
<tr>
<td>4</td>
<td>48.2</td>
<td>54.6</td>
<td>67.0</td>
<td>53.2</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td>67.8</td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
<td rowspan="2">30B</td>
<td>0</td>
<td>26.2</td>
<td>40.4</td>
<td>48.9</td>
<td>26.7</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td><strong>70.9</strong></td>
<td>54.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
<td rowspan="3">8B</td>
<td>0</td>
<td>43.9</td>
<td>33.8</td>
<td>45.4</td>
<td>23.9</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>4</td>
<td>63.6</td>
<td>60.5</td>
<td>65.5</td>
<td>50.1</td>
</tr>
<tr style="background-color: #e6f2ff;">
<td>8</td>
<td><strong>64.6</strong></td>
<td><strong>63.4</strong></td>
<td>68.2</td>
<td>51.4</td>
</tr>
</tbody>
</table>
</div>
* denotes zero image shots plus two additional text shots, following Flamingo.
<sup>+</sup> We evaluate the pretraining checkpoint without SFT.
</details>
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more cases.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
</div>
</details>
We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw, unedited screen recording on an iPad Pro.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ai.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/beer.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ticket.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/wfh.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
</p>
</table>
</details>
### Multi-turn Conversation
<div align="center">
<img src="../assets/airplane.jpeg" width="500px">
</div>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
torch.manual_seed(0)
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('./assets/airplane.jpeg').convert('RGB')

# First round chat
question = "Tell me the model of this aircraft."
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

# Second round chat: pass the history context of the multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
You should get output similar to the following:
```
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
```
#### Multi-image Understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 multi-image understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### Few-shot In-Context-Learning
<details>
<summary> Click to view a Python example of MiniCPM-V 2.6 few-shot in-context learning </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
#### Video Understanding
<details>
<summary> Click to view Python example of MiniCPM-V 2.6 video understanding </summary>
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # reduce this if CUDA runs out of memory

def encode_video(video_path):
    # Evenly sample n indices from the list l
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # stride of ~1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]
# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448x448
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
</details>
### Model Zoo
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
| MiniCPM-V 2.6| GPU | 17 GB | Strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
| MiniCPM-V 2.6 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
| MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |

docs/minicpm_v2dot6_zh.md (new file, 773 lines)
## MiniCPM-V 2.6
> Archived at: 2025-08-25
**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B with 8B parameters in total, it delivers a significant performance improvement over MiniCPM-Llama3-V 2.5 and introduces new features for multi-image and video understanding. Key features of MiniCPM-V 2.6 include:
- 🔥 **Leading Performance.**
MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** in single image understanding.
- 🖼️ **Multi Image Understanding and In-context Learning.**
MiniCPM-V 2.6 also supports **multi-image conversation and reasoning**. It achieves **state-of-the-art results** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows strong in-context learning capability.
- 🎬 **Video Understanding.**
MiniCPM-V 2.6 can also **accept video inputs**, supporting conversation and detailed video descriptions that cover temporal and spatial information. It outperforms proprietary models such as **GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B** on Video-MME both with and without subtitles.
- 💪 **Strong OCR Capability and Others.**
MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy multimodal behavior**, with hallucination rates on Object HalBench significantly lower than GPT-4o and GPT-4V, and supports **multiple languages** including English, Chinese, German, French, Italian, and Korean.
- 🚀 **Superior Efficiency.**
Beyond its user-friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8M-pixel image, 75% fewer than most models**, which improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 supports efficient **real-time video understanding** on end-side devices such as the iPad.
- 💫 **Easy Usage.**
MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) quantized models in 16 sizes, (3) [vLLM](#vllm-部署-) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#本地-webui-demo-), and (6) an online web [demo](http://120.92.209.146:8887/).
### Evaluation <!-- omit in toc -->
<div align="center">
<img src=assets/radar_final.png width=90% />
</div>
<details>
<summary>Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Token Density<sup>+</sup></th>
<th>OpenCompass</th>
<th>MME</th>
<th>MMVet</th>
<th>OCRBench</th>
<th>MMMU val</th>
<th>MathVista mini</th>
<th>MMB1.1 test</th>
<th>AI2D</th>
<th>TextVQA val</th>
<th>DocVQA test</th>
<th>HallusionBench</th>
<th>Object HalBench</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="15" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o</td>
<td>-</td>
<td>1088</td>
<td>69.9</td>
<td>2328.7</td>
<td>69.1</td>
<td>736</td>
<td>69.2</td>
<td>61.3</td>
<td>82.2</td>
<td>84.6</td>
<td>-</td>
<td>92.8</td>
<td>55.0</td>
<td>17.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>750</td>
<td>67.9</td>
<td>1920.0</td>
<td>66.0</td>
<td>788</td>
<td>65.9</td>
<td>61.6</td>
<td>78.5</td>
<td>80.2</td>
<td>-</td>
<td>95.2</td>
<td>49.9</td>
<td>13.8</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
<td>-</td>
<td>-</td>
<td>64.4</td>
<td>2110.6</td>
<td>64.0</td>
<td>754</td>
<td>60.6</td>
<td>57.7</td>
<td>73.9</td>
<td>79.1</td>
<td>73.5</td>
<td>86.5</td>
<td>45.6</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o mini</td>
<td>-</td>
<td>1088</td>
<td>64.1</td>
<td>2003.4</td>
<td>66.9</td>
<td>785</td>
<td>60.0</td>
<td>52.4</td>
<td>76.0</td>
<td>77.8</td>
<td>-</td>
<td>-</td>
<td>46.1</td>
<td>12.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>1088</td>
<td>63.5</td>
<td>2070.2</td>
<td>67.5</td>
<td>656</td>
<td>61.7</td>
<td>54.7</td>
<td>79.8</td>
<td>78.6</td>
<td>78.0</td>
<td>87.2</td>
<td>43.9</td>
<td>14.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Step-1V</td>
<td>-</td>
<td>-</td>
<td>59.5</td>
<td>2206.4</td>
<td>63.3</td>
<td>625</td>
<td>49.9</td>
<td>44.8</td>
<td>78.0</td>
<td>79.2</td>
<td>71.6</td>
<td>-</td>
<td>48.4</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen-VL-Max</td>
<td>-</td>
<td>784</td>
<td>58.3</td>
<td>2281.7</td>
<td>61.8</td>
<td>684</td>
<td>52.0</td>
<td>43.4</td>
<td>74.6</td>
<td>75.7</td>
<td>79.5</td>
<td>93.1</td>
<td>41.2</td>
<td>13.4</td>
</tr>
<tr>
<td colspan="15" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Yi-34B</td>
<td>34B</td>
<td>157</td>
<td>55.0</td>
<td>2006.5</td>
<td>50.7</td>
<td>574</td>
<td>48.8</td>
<td>40.4</td>
<td>77.8</td>
<td>78.9</td>
<td>69.3</td>
<td>-</td>
<td>34.8</td>
<td>12.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Mini-Gemini-HD-34B</td>
<td>34B</td>
<td>157</td>
<td>-</td>
<td>2141</td>
<td>59.3</td>
<td>518</td>
<td>48.0</td>
<td>43.3</td>
<td>-</td>
<td>80.5</td>
<td>74.1</td>
<td>78.9</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Cambrian-34B</td>
<td>34B</td>
<td>1820</td>
<td>58.3</td>
<td>2049.9</td>
<td>53.2</td>
<td>591</td>
<td>50.4</td>
<td>50.3</td>
<td>77.8</td>
<td>79.5</td>
<td>76.7</td>
<td>75.5</td>
<td>41.6</td>
<td>14.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
<td>13B</td>
<td>784</td>
<td>59.1</td>
<td>2018.8</td>
<td>58.0</td>
<td>776</td>
<td>46.9</td>
<td>51.1</td>
<td>67.9</td>
<td>71.2</td>
<td>-</td>
<td>-</td>
<td>45.0</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>706</td>
<td>64.1</td>
<td>2215.1</td>
<td>54.3</td>
<td>794</td>
<td><strong>51.2</strong></td>
<td>58.3</td>
<td><strong>79.4</strong></td>
<td><strong>83.6</strong></td>
<td>77.4</td>
<td><strong>91.6</strong></td>
<td>45.0</td>
<td>21.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-Llama-V 2.5</td>
<td>8B</td>
<td>1882</td>
<td>58.8</td>
<td>2024.6</td>
<td>52.8</td>
<td>725</td>
<td>45.8</td>
<td>54.3</td>
<td>72.0</td>
<td>78.4</td>
<td>76.6</td>
<td>84.8</td>
<td>42.4</td>
<td>10.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>2822</strong></td>
<td><strong>65.2</strong></td>
<td><strong>2348.4</strong>*</td>
<td><strong>60.0</strong></td>
<td><strong>852</strong>*</td>
<td>49.8*</td>
<td><strong>60.6</strong></td>
<td>78.0</td>
<td>82.1</td>
<td><strong>80.1</strong></td>
<td>90.8</td>
<td><strong>48.1</strong>*</td>
<td><strong>8.2</strong></td>
</tr>
</tbody>
</table>
</div>
* We evaluate these benchmarks using chain-of-thought prompting.
<sup>+</sup> Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, token density is estimated from the image-encoding pricing strategy of the official API.
</details>
<details>
<summary>Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Mantis Eval</th>
<th>BLINK val</th>
<th>Mathverse mv</th>
<th>Sciverse mv</th>
<th>MIRB</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="7" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>60.3</td>
<td>66.9</td>
<td>53.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave-14B</td>
<td>14B</td>
<td>66.4</td>
<td>52.6</td>
<td>32.7</td>
<td>30.2</td>
<td>-</td>
</tr>
<tr>
<td colspan="7" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Emu2-Chat</td>
<td>37B</td>
<td>37.8</td>
<td>36.2</td>
<td>-</td>
<td>27.2</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM</td>
<td>17B</td>
<td>45.2</td>
<td>41.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VPG-C</td>
<td>7B</td>
<td>52.4</td>
<td>43.1</td>
<td>24.3</td>
<td>23.1</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">VILA 8B</td>
<td>8B</td>
<td>51.2</td>
<td>39.3</td>
<td>-</td>
<td>36.5</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>53.1*</td>
<td>48.9</td>
<td>32.1*</td>
<td>-</td>
<td>42.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>59.0*</td>
<td>50.9</td>
<td>30.5*</td>
<td>34.4*</td>
<td><strong>56.9*</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>69.1</strong></td>
<td><strong>53.0</strong></td>
<td><strong>84.9</strong></td>
<td><strong>74.9</strong></td>
<td>53.8</td>
</tr>
</tbody>
</table>
</div>
* We evaluate the officially released checkpoints ourselves.
</details>
<details>
<summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th colspan="2">Video-MME</th>
<th colspan="5">Video-ChatGPT</th>
</tr>
<tr>
<th align="left"></th>
<th></th>
<th>w/o subs</th>
<th>w subs</th>
<th>Correctness</th>
<th>Detail</th>
<th>Context</th>
<th>Temporal</th>
<th>Consistency</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="9" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet</td>
<td>-</td>
<td>60.0</td>
<td>62.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4V</td>
<td>-</td>
<td>59.9</td>
<td>63.3</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="9" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-7B</td>
<td>7B</td>
<td>-</td>
<td>-</td>
<td>3.39</td>
<td>3.29</td>
<td>3.92</td>
<td>2.60</td>
<td>3.12</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-34B</td>
<td>34B</td>
<td>-</td>
<td>-</td>
<td>3.29</td>
<td>3.23</td>
<td>3.83</td>
<td>2.51</td>
<td>3.47</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">CogVLM2-Video</td>
<td>12B</td>
<td>-</td>
<td>-</td>
<td>3.49</td>
<td><strong>3.46</strong></td>
<td>3.23</td>
<td><strong>2.98</strong></td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LongVA</td>
<td>7B</td>
<td>52.4</td>
<td>54.3</td>
<td>3.05</td>
<td>3.09</td>
<td>3.77</td>
<td>2.44</td>
<td><strong>3.64</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2-8B</td>
<td>8B</td>
<td>54.0</td>
<td>56.9</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternLM-XComposer-2.5</td>
<td>8B</td>
<td>55.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">LLaVA-NeXT-Video</td>
<td>32B</td>
<td>60.2</td>
<td>63.0</td>
<td>3.48</td>
<td>3.37</td>
<td><strong>3.95</strong></td>
<td>2.64</td>
<td>3.28</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
<td>8B</td>
<td><strong>60.9</strong></td>
<td><strong>63.6</strong></td>
<td><strong>3.59</strong></td>
<td>3.28</td>
<td>3.93</td>
<td>2.73</td>
<td>3.62</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, and OK-VQA.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th align="left">Model</th>
<th>Size</th>
<th>Shot</th>
<th>TextVQA val</th>
<th>VizWiz test-dev</th>
<th>VQAv2 test-dev</th>
<th>OK-VQA val</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Flamingo</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>35.0</td>
<td>31.6</td>
<td>56.3</td>
<td>40.6</td>
</tr>
<tr>
<td>4</td>
<td>36.5</td>
<td>39.6</td>
<td>63.1</td>
<td><strong>57.4</strong></td>
</tr>
<tr>
<td>8</td>
<td>37.3</td>
<td>44.8</td>
<td>65.6</td>
<td>57.5</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">IDEFICS</td>
<td rowspan="3">80B</td>
<td>0*</td>
<td>30.9</td>
<td>36.0</td>
<td>60.0</td>
<td>45.2</td>
</tr>
<tr>
<td>4</td>
<td>34.3</td>
<td>40.4</td>
<td>63.6</td>
<td>52.4</td>
</tr>
<tr>
<td>8</td>
<td>35.7</td>
<td>46.1</td>
<td>64.8</td>
<td>55.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">OmniCorpus</td>
<td rowspan="3">7B</td>
<td>0*</td>
<td>43.0</td>
<td>49.8</td>
<td>63.2</td>
<td>45.5</td>
</tr>
<tr>
<td>4</td>
<td>45.4</td>
<td>51.3</td>
<td>64.5</td>
<td>46.5</td>
</tr>
<tr>
<td>8</td>
<td>45.6</td>
<td>52.2</td>
<td>64.7</td>
<td>46.6</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">Emu2</td>
<td rowspan="3">37B</td>
<td>0</td>
<td>26.4</td>
<td>40.4</td>
<td>33.5</td>
<td>26.7</td>
</tr>
<tr>
<td>4</td>
<td>48.2</td>
<td>54.6</td>
<td>67.0</td>
<td>53.2</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td>67.8</td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="2">MM1</td>
<td rowspan="2">30B</td>
<td>0</td>
<td>26.2</td>
<td>40.4</td>
<td>48.9</td>
<td>26.7</td>
</tr>
<tr>
<td>8</td>
<td>49.3</td>
<td>54.7</td>
<td><strong>70.9</strong></td>
<td>54.1</td>
</tr>
<tr>
<td align="left" nowrap="nowrap" rowspan="3">MiniCPM-V 2.6<sup>+</sup></td>
<td rowspan="3">8B</td>
<td>0</td>
<td>43.9</td>
<td>33.8</td>
<td>45.4</td>
<td>23.9</td>
</tr>
<tr>
<td>4</td>
<td>63.6</td>
<td>60.5</td>
<td>65.5</td>
<td>50.1</td>
</tr>
<tr>
<td>8</td>
<td><strong>64.6</strong></td>
<td><strong>63.4</strong></td>
<td>68.2</td>
<td>51.4</td>
</tr>
</tbody>
</table>
</div>
* Zero-shot performance is evaluated following the Flamingo protocol: zero image shots and two additional text shots.
<sup>+</sup> We evaluate the pretrained checkpoint (ckpt) without supervised fine-tuning (SFT).
</details>
### Examples <!-- omit in toc -->
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/multi_img-bike.png" alt="Bike" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-menu.png" alt="Menu" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multi_img-code.png" alt="Code" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/ICL-Mem.png" alt="Mem" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-medal.png" alt="medal" style="margin-bottom: 10px;">
</div>
<details>
<summary>Click to view more examples.</summary>
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv2_6/ICL-elec.png" alt="elec" style="margin-bottom: 5px;">
<img src="../assets/minicpmv2_6/multiling-olympic.png" alt="Menu" style="margin-bottom: 10px;">
</div>
</details>
We deploy MiniCPM-V 2.6 on an iPad Pro and record the following demo videos.
<table align="center">
<p align="center">
<img src="../assets/gif_cases/ai.gif" width=32%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/gif_cases/beer.gif" width=32%/>
</p>
</table>
<table align="center">
<p align="center">
<video src="https://github.com/user-attachments/assets/21f4b818-ede1-4822-920e-91281725c830" width="360" /> </video>
<!-- <video src="https://github.com/user-attachments/assets/c835f757-206b-4d9c-8e36-70d67b453628" width="360" /> </video> -->
</p>
</table>
### 模型库
| Model | Device | Memory | &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Description | Download |
|:--------------|:-:|:----------:|:-------------------|:---------------:|
| MiniCPM-V 2.6 | GPU | 17 GB | Strong end-side single-image, multi-image, and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
| MiniCPM-V 2.6 gguf | CPU | 6 GB | The GGUF version, with lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
| MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, with lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) &nbsp;&nbsp; [<img src="./assets/modelscope_logo.png" width="20px"></img>](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
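To fetch any of the checkpoints above ahead of time, the Hugging Face CLI works; a minimal sketch (repo ids are taken from the table, the local directory name is arbitrary):

```bash
pip install -U "huggingface_hub[cli]"
# Download the full-precision GPU checkpoint to a local directory.
huggingface-cli download openbmb/MiniCPM-V-2_6 --local-dir MiniCPM-V-2_6
# The gguf and int4 variants are fetched the same way, e.g. openbmb/MiniCPM-V-2_6-int4.
```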

docs/minicpm_v4_en.md (new file)
@@ -0,0 +1,556 @@
## MiniCPM-V 4.0
> Archived at: 2025-08-25
**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. Built on SigLIP2-400M and MiniCPM4-3B, with 4.1B parameters in total, it inherits the strong single-image, multi-image, and video understanding of MiniCPM-V 2.6 while greatly improving efficiency. Notable features of MiniCPM-V 4.0 include:
- 🔥 **Leading Visual Capability.**
With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2) and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also shows good performance in multi-image understanding and video understanding.
- 🚀 **Superior Efficiency.**
Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it delivers **a first-token delay under 2 s and a decoding speed above 17 tokens/s on iPhone 16 Pro Max**, without overheating. It also shows superior throughput under concurrent requests.
- 💫 **Easy Usage.**
MiniCPM-V 4.0 can be used in various ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and a local web demo**. We also open-source an iOS app that runs on iPhone and iPad. Get started easily with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), featuring detailed instructions and practical examples (see the serving sketch below).
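As one concrete route among the options above, the model can be exposed through vLLM's OpenAI-compatible server. A minimal sketch, assuming the checkpoint is published on Hugging Face as `openbmb/MiniCPM-V-4` (verify the exact repo id before use):

```bash
pip install vllm
# Start an OpenAI-compatible server on the default port 8000.
# --trust-remote-code is required because MiniCPM-V ships custom modeling code.
vllm serve openbmb/MiniCPM-V-4 --trust-remote-code
```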
### Evaluation <!-- omit in toc -->
<details>
<summary>Click to view single-image results on OpenCompass.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista</th>
<th>HallusionBench</th>
<th>MMMU</th>
<th>MMVet</th>
<th>MMBench V1.1</th>
<th>MMStar</th>
<th>AI2D</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>63.5</td>
<td>656</td>
<td>55.2</td>
<td>43.9</td>
<td>61.7</td>
<td>67.5</td>
<td>79.8</td>
<td>56.0</td>
<td>78.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>64.5</td>
<td>754</td>
<td>58.3</td>
<td>45.6</td>
<td>60.6</td>
<td>64.0</td>
<td>73.9</td>
<td>59.1</td>
<td>79.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>68.9</td>
<td>840</td>
<td>70.9</td>
<td>49.3</td>
<td>55.0</td>
<td>74.3</td>
<td>80.9</td>
<td>60.9</td>
<td>76.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>70.6</td>
<td>798</td>
<td>65.3</td>
<td>55.5</td>
<td>66.4</td>
<td>70.1</td>
<td>81.7</td>
<td>65.1</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>64.5</td>
<td>828</td>
<td>61.2</td>
<td>46.6</td>
<td>51.2</td>
<td>60.0</td>
<td>76.8</td>
<td>56.3</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>65.1</td>
<td>820</td>
<td>60.8</td>
<td>46.6</td>
<td>51.8</td>
<td>61.5</td>
<td>78.2</td>
<td>58.7</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>70.9</td>
<td>888</td>
<td>68.1</td>
<td>51.9</td>
<td>58.0</td>
<td>69.7</td>
<td>82.2</td>
<td>64.1</td>
<td>84.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>68.1</td>
<td>821</td>
<td>64.5</td>
<td>49.0</td>
<td>56.2</td>
<td>62.8</td>
<td>82.5</td>
<td>63.2</td>
<td>84.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>65.2</td>
<td>852</td>
<td>60.8</td>
<td>48.1</td>
<td>49.8</td>
<td>60.0</td>
<td>78.0</td>
<td>57.5</td>
<td>82.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>70.2</td>
<td>889</td>
<td>73.3</td>
<td>51.1</td>
<td>50.9</td>
<td>67.2</td>
<td>80.6</td>
<td>63.3</td>
<td>86.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>69.0</td>
<td>894</td>
<td>66.9</td>
<td>50.8</td>
<td>51.2</td>
<td>68.0</td>
<td>79.7</td>
<td>62.8</td>
<td>82.9</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view single-image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench, and MM HalBench.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>ChartQA</th>
<th>MME</th>
<th>RealWorldQA</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>MathVision</th>
<th>DynaMath</th>
<th>WeMath</th>
<th colspan="2">Obj Hal</th>
<th colspan="2">MM Hal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>CHAIRs↓</td>
<td>CHAIRi↓</td>
<td nowrap="nowrap">score avg@3</td>
<td nowrap="nowrap">hall rate avg@3</td>
</tr>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>78.5</td>
<td>1927</td>
<td>61.4</td>
<td>78.0</td>
<td>88.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>87.2</td>
<td>-</td>
<td>67.5</td>
<td>78.8</td>
<td>93.1</td>
<td>41.0</td>
<td>31.5</td>
<td>50.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>45.3</td>
<td>47.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>90.8</td>
<td>-</td>
<td>60.1</td>
<td>74.1</td>
<td>95.2</td>
<td>35.6</td>
<td>35.7</td>
<td>44.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>84.0</td>
<td>2157</td>
<td>65.4</td>
<td>79.3</td>
<td>93.9</td>
<td>21.9</td>
<td>13.2</td>
<td>22.9</td>
<td>18.3</td>
<td>10.8</td>
<td>3.9 </td>
<td>33.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>84.0</td>
<td>2338</td>
<td>64.3</td>
<td>76.8</td>
<td>91.6</td>
<td>18.4</td>
<td>15.2</td>
<td>21.2</td>
<td>13.7</td>
<td>8.7</td>
<td>3.2 </td>
<td>46.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>87.3</td>
<td>2347</td>
<td>68.5</td>
<td>84.9</td>
<td>95.7</td>
<td>25.4</td>
<td>21.8</td>
<td>36.2</td>
<td>13.3</td>
<td>7.9</td>
<td>4.1 </td>
<td>31.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>84.8</td>
<td>2344</td>
<td>70.1</td>
<td>79.1</td>
<td>93.0</td>
<td>17.0</td>
<td>9.4</td>
<td>23.5</td>
<td>18.3</td>
<td>11.6</td>
<td>3.6 </td>
<td>37.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>79.4</td>
<td>2348</td>
<td>65.0</td>
<td>80.1</td>
<td>90.8</td>
<td>17.5</td>
<td>9.0</td>
<td>20.4</td>
<td>7.3</td>
<td>4.7</td>
<td>4.0 </td>
<td>29.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>86.9</td>
<td>2372</td>
<td>68.1</td>
<td>82.0</td>
<td>93.5</td>
<td>21.7</td>
<td>10.4</td>
<td>25.2</td>
<td>6.3</td>
<td>3.4</td>
<td>4.1 </td>
<td>31.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>84.4</td>
<td>2298</td>
<td>68.5</td>
<td>80.8</td>
<td>92.9</td>
<td>20.7</td>
<td>14.2</td>
<td>32.7</td>
<td>6.3</td>
<td>3.5</td>
<td>4.1 </td>
<td>29.2 </td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multi-image and video understanding results on Mantis, Blink and Video-MME. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>Mantis</th>
<th>Blink</th>
<th nowrap="nowrap" colspan="2" >Video-MME</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>w/o subs</td>
<td>w subs</td>
</tr>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>59.9</td>
<td>63.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>-</td>
<td>59.1</td>
<td>75.0</td>
<td>81.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>-</td>
<td>68.0</td>
<td>71.9</td>
<td>77.2</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>-</td>
<td>47.6</td>
<td>61.5</td>
<td>67.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>62.7</td>
<td>50.8</td>
<td>62.3</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>-</td>
<td>56.4</td>
<td>65.1</td>
<td>71.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>67.7</td>
<td>54.8</td>
<td>64.2</td>
<td>66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>69.1</td>
<td>53.0</td>
<td>60.9</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>71.9</td>
<td>56.7</td>
<td>63.9</td>
<td>69.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>71.4</td>
<td>54.0</td>
<td>61.2</td>
<td>65.8</td>
</tr>
</tbody>
</table>
</div>
</details>
### Examples
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
</div>
We deploy MiniCPM-V 4.0 on an iPhone 16 Pro Max with the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md). The demo videos are raw screen recordings without any editing.
<table align="center">
<p align="center">
<img src="../assets/minicpmv4/iphone_en.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
</p>
<p align="center">
<img src="../assets/minicpmv4/iphone_cn.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
</p>
</table>

docs/minicpm_v4_zh.md (new file)
@@ -0,0 +1,557 @@
## MiniCPM-V 4.0
> Archived at: 2025-08-25
MiniCPM-V 4.0 is the latest model in the MiniCPM-V series. Built on SigLIP2-400M and MiniCPM4-3B, with 4.1B parameters in total, it carries over the strong single-image, multi-image, and video understanding of MiniCPM-V 2.6 while greatly improving inference efficiency. Notable features of MiniCPM-V 4.0 include:
- 🔥 **Leading Visual Capability.**
MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, surpassing MiniCPM-V 2.6 (8.1B, 65.2), Qwen2.5-VL-3B-Instruct (3.8B, 64.5), and **the widely used proprietary model GPT-4.1-mini-20250414**. It also performs well on multi-image and video understanding tasks.
- 🚀 **Superior Efficiency.**
Optimized for on-device deployment, MiniCPM-V 4.0 **runs smoothly on iPhone 16 Pro Max, with first-token latency as low as 2 s and a decoding speed of 17.9 tokens/s**, with no overheating. It also delivers leading throughput under concurrent requests.
- 💫 **Easy Usage.**
MiniCPM-V 4.0 supports a variety of inference options, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and a local web demo**. We also open-source an iOS app that runs on iPhone and iPad. See our well-structured open-source [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for detailed deployment guides and real-world examples.
### Evaluation <!-- omit in toc -->
<details>
<summary>Click to view single-image understanding results on OpenCompass.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista</th>
<th>HallusionBench</th>
<th>MMMU</th>
<th>MMVet</th>
<th>MMBench V1.1</th>
<th>MMStar</th>
<th>AI2D</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>63.5</td>
<td>656</td>
<td>55.2</td>
<td>43.9</td>
<td>61.7</td>
<td>67.5</td>
<td>79.8</td>
<td>56.0</td>
<td>78.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>64.5</td>
<td>754</td>
<td>58.3</td>
<td>45.6</td>
<td>60.6</td>
<td>64.0</td>
<td>73.9</td>
<td>59.1</td>
<td>79.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>68.9</td>
<td>840</td>
<td>70.9</td>
<td>49.3</td>
<td>55.0</td>
<td>74.3</td>
<td>80.9</td>
<td>60.9</td>
<td>76.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>70.6</td>
<td>798</td>
<td>65.3</td>
<td>55.5</td>
<td>66.4</td>
<td>70.1</td>
<td>81.7</td>
<td>65.1</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>64.5</td>
<td>828</td>
<td>61.2</td>
<td>46.6</td>
<td>51.2</td>
<td>60.0</td>
<td>76.8</td>
<td>56.3</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>65.1</td>
<td>820</td>
<td>60.8</td>
<td>46.6</td>
<td>51.8</td>
<td>61.5</td>
<td>78.2</td>
<td>58.7</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>70.9</td>
<td>888</td>
<td>68.1</td>
<td>51.9</td>
<td>58.0</td>
<td>69.7</td>
<td>82.2</td>
<td>64.1</td>
<td>84.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>68.1</td>
<td>821</td>
<td>64.5</td>
<td>49.0</td>
<td>56.2</td>
<td>62.8</td>
<td>82.5</td>
<td>63.2</td>
<td>84.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>65.2</td>
<td>852</td>
<td>60.8</td>
<td>48.1</td>
<td>49.8</td>
<td>60.0</td>
<td>78.0</td>
<td>57.5</td>
<td>82.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>70.2</td>
<td>889</td>
<td>73.3</td>
<td>51.1</td>
<td>50.9</td>
<td>67.2</td>
<td>80.6</td>
<td>63.3</td>
<td>86.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>69.0</td>
<td>894</td>
<td>66.9</td>
<td>50.8</td>
<td>51.2</td>
<td>68.0</td>
<td>79.7</td>
<td>62.8</td>
<td>82.9</td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view results on chart understanding, document understanding, math reasoning, hallucination, and related benchmarks.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>ChartQA</th>
<th>MME</th>
<th>RealWorldQA</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>MathVision</th>
<th>DynaMath</th>
<th>WeMath</th>
<th colspan="2">Obj Hal</th>
<th colspan="2">MM Hal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>CHAIRs↓</td>
<td>CHAIRi↓</td>
<td nowrap="nowrap">score avg@3</td>
<td nowrap="nowrap">hall rate avg@3</td>
</tr>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>78.5</td>
<td>1927</td>
<td>61.4</td>
<td>78.0</td>
<td>88.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>87.2</td>
<td>-</td>
<td>67.5</td>
<td>78.8</td>
<td>93.1</td>
<td>41.0</td>
<td>31.5</td>
<td>50.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>45.3</td>
<td>47.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>90.8</td>
<td>-</td>
<td>60.1</td>
<td>74.1</td>
<td>95.2</td>
<td>35.6</td>
<td>35.7</td>
<td>44.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>84.0</td>
<td>2157</td>
<td>65.4</td>
<td>79.3</td>
<td>93.9</td>
<td>21.9</td>
<td>13.2</td>
<td>22.9</td>
<td>18.3</td>
<td>10.8</td>
<td>3.9 </td>
<td>33.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>84.0</td>
<td>2338</td>
<td>64.3</td>
<td>76.8</td>
<td>91.6</td>
<td>18.4</td>
<td>15.2</td>
<td>21.2</td>
<td>13.7</td>
<td>8.7</td>
<td>3.2 </td>
<td>46.5 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>87.3</td>
<td>2347</td>
<td>68.5</td>
<td>84.9</td>
<td>95.7</td>
<td>25.4</td>
<td>21.8</td>
<td>36.2</td>
<td>13.3</td>
<td>7.9</td>
<td>4.1 </td>
<td>31.6 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>84.8</td>
<td>2344</td>
<td>70.1</td>
<td>79.1</td>
<td>93.0</td>
<td>17.0</td>
<td>9.4</td>
<td>23.5</td>
<td>18.3</td>
<td>11.6</td>
<td>3.6 </td>
<td>37.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>79.4</td>
<td>2348</td>
<td>65.0</td>
<td>80.1</td>
<td>90.8</td>
<td>17.5</td>
<td>9.0</td>
<td>20.4</td>
<td>7.3</td>
<td>4.7</td>
<td>4.0 </td>
<td>29.9 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>86.9</td>
<td>2372</td>
<td>68.1</td>
<td>82.0</td>
<td>93.5</td>
<td>21.7</td>
<td>10.4</td>
<td>25.2</td>
<td>6.3</td>
<td>3.4</td>
<td>4.1 </td>
<td>31.3 </td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>84.4</td>
<td>2298</td>
<td>68.5</td>
<td>80.8</td>
<td>92.9</td>
<td>20.7</td>
<td>14.2</td>
<td>32.7</td>
<td>6.3</td>
<td>3.5</td>
<td>4.1 </td>
<td>29.2 </td>
</tr>
</tbody>
</table>
</div>
</details>
<details>
<summary>Click to view multi-image and video understanding results.</summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">model</th>
<th>Size</th>
<th>Mantis</th>
<th>Blink</th>
<th nowrap="nowrap" colspan="2" >Video-MME</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>w/o subs</td>
<td>w subs</td>
</tr>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>59.9</td>
<td>63.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>-</td>
<td>59.1</td>
<td>75.0</td>
<td>81.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>-</td>
<td>68.0</td>
<td>71.9</td>
<td>77.2</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>-</td>
<td>47.6</td>
<td>61.5</td>
<td>67.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>62.7</td>
<td>50.8</td>
<td>62.3</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>-</td>
<td>56.4</td>
<td>65.1</td>
<td>71.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>67.7</td>
<td>54.8</td>
<td>64.2</td>
<td>66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>69.1</td>
<td>53.0</td>
<td>60.9</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>71.9</td>
<td>56.7</td>
<td>63.9</td>
<td>69.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>71.4</td>
<td>54.0</td>
<td>61.2</td>
<td>65.8</td>
</tr>
</tbody>
</table>
</div>
</details>
### Examples
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="../assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
</div>
We deploy MiniCPM-V 4.0 on an iPhone 16 Pro Max with the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md) and record the following screen captures; the videos are unedited (no speed-up or other editing):
<table align="center">
<p align="center">
<img src="../assets/minicpmv4/iphone_en.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_en_information_extraction.gif" width=45%/>
</p>
<p align="center">
<img src="../assets/minicpmv4/iphone_cn.gif" width=45%/>
&nbsp;&nbsp;&nbsp;&nbsp;
<img src="../assets/minicpmv4/iphone_cn_funny_points.gif" width=45%/>
</p>
</table>

@@ -1,6 +1,6 @@
## OmniLMM-12B
> OmniLMM-12B was released early in this project. We recommend using our [recently released models](./README.md) for better performance and efficiency.
> Archived at: 2024-05-19

@@ -1,6 +1,6 @@
<div align="center">
<img src="../assets/wechat-QR.jpeg" width="60%"/>
<p> 扫码加入「MiniCPM-o 交流群」 </p>
<p> Scan the QR code to join the "MiniCPM-o Discussion Group" </p>
</div>

@@ -1,60 +1,56 @@
# Evaluation
## MiniCPM-o 2.6
### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
<br />
Then, run `scripts/run_inference.sh`, which receives two input parameters in sequence: `MODELNAME` and `DATALIST`. `MODELNAME` represents the name of the model, and `DATALIST` represents the datasets used for inference:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST
```
<br />
The five available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```bash
minicpm_series = {
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
}
```
<br />
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating multiple datasets at a time, separate the dataset names with spaces and add quotation marks at both ends:
```bash
$DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />
When a benchmark requires a GPT-series model for scoring, specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in the `.env` file; a sketch is given at the end of this section.
In order to reproduce the results on the OpenCompass benchmarks together with ChartQA and MME, which are displayed in the table on the homepage (columns between OCRBench and HallusionBench), run the script with the following settings:
```bash
# Please note that we use different prompts for the perception and reasoning sets of MME. Evaluating the reasoning subset requires CoT, so you need to manually modify the judgment condition of the use_cot function in vlmeval/vlm/minicpm_v.py
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
```
<br />
### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
@@ -112,7 +108,8 @@ chmod +x ./shell/run_inference.sh
```
<br />
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of some major parameters are listed as follows.
For `MiniCPM-o-2_6`, set `model_name` to `minicpmo26`:
```bash
# path to images and their corresponding questions
# TextVQA
@@ -175,3 +172,372 @@ For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` represents the path to the original output json, `output_file_path` represents the path to the transformed json:
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
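For the GPT-scored benchmarks mentioned above, the credentials live in a `.env` file at the repository root; a minimal sketch (both values are placeholders):

```bash
# .env — placeholder values, replace with your own endpoint and key
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx
```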
<br />
## MiniCPM-V 2.6
<details>
<summary>Expand</summary>
### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
<br />
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference, and `MODE` represents evaluation mode:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />
The four available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```bash
minicpm_series = {
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
}
```
<br />
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. Separate the names of different datasets with spaces and add quotation marks at both ends:
```bash
$DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />
To score each benchmark directly, set `MODE=all`; if only inference results are required, set `MODE=infer`. To reproduce the results in the table displayed on the homepage (columns between MME and HallusionBench), run the script with the following settings:
```bash
# without CoT
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
# with CoT
# While running the CoT version of MME, you need to modify the 'use_cot' function in vlmeval/vlm/minicpm_v.py and add MME to the branch that returns True.
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
```
<br />
### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />
Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```
###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the spdocvqa_images.tar.gz and spdocvqa_qas.zip to DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />
The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│ ├── train_images
│ │ ├── ...
│ ├── TextVQA_0.5.1_val.json
├── DocVQA
│ ├── spdocvqa_images
│ │ ├── ...
│ ├── val_v1.0_withQT.json
│ ├── test_v1.0.json
```
<br />
Modify the parameters in `shell/run_inference.sh` and run inference:
```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of some major parameters are listed as follows.
For `MiniCPM-V-2_6`, set `model_name` to `minicpmv26`:
```bash
# path to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path
# whether to eval on certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all
# model name and model path
--model_name
--model_path
# load model from ckpt
--ckpt
# the way the model processes input data, "interleave" represents interleaved image-text form, while "old" represents non-interleaved.
--generate_method
--batchsize
# path to save the outputs
--answer_path
```
<br />
While evaluating on different tasks, parameters need to be set as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```
###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```
###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
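Putting the flags together, a TextVQA-only run of MiniCPM-V 2.6 inside `shell/run_inference.sh` might combine them as in the sketch below (flag names come from `eval_utils/getargs.py`; the `--answer_path` value is illustrative):

```bash
# hypothetical flag block inside shell/run_inference.sh
--model_name minicpmv26 \
--model_path openbmb/MiniCPM-V-2_6 \
--eval_textVQA \
--textVQA_image_dir ./downloads/TextVQA/train_images \
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json \
--generate_method interleave \
--batchsize 1 \
--answer_path ./answers
```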
<br />
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` represents the path to the original output json, `output_file_path` represents the path to the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
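The two paths are variables inside `shell/run_transform.sh`; a hypothetical configuration (file names are illustrative, not from the repo):

```bash
# inside shell/run_transform.sh — illustrative values
input_file_path=./answers/docVQATest.json
output_file_path=./answers/docVQATest_submission.json
```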
</details>
<br />
## MiniCPM-Llama3-V-2_5
<details>
<summary>Expand</summary>
### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install -r requirements.txt
```
<br />
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference, and `MODE` represents evaluation mode:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />
The three available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```bash
ungrouped = {
'MiniCPM-V':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
'MiniCPM-V-2':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
'MiniCPM-Llama3-V-2_5':partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
}
```
<br />
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating a single dataset, pass the dataset name directly without quotation marks; when evaluating multiple datasets, separate the names with spaces and add quotation marks at both ends:
```bash
$DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
```
<br />
To score each benchmark directly, set `MODE=all`; if only inference results are required, set `MODE=infer`. To reproduce the results in the table displayed on the homepage (columns between MME and RealWorldQA), run the script with the following settings:
```bash
# run on all 7 datasets
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all
# The following are instructions for running on a single dataset
# MME
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
# MMBench_TEST_EN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
# MMBench_TEST_CN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
# MMMU_DEV_VAL
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
# MathVista_MINI
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
# LLaVABench
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
# RealWorldQA
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
```
<br />
### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />
Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```
###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the spdocvqa_images.tar.gz and spdocvqa_qas.zip to DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />
The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│ ├── train_images
│ │ ├── ...
│ ├── TextVQA_0.5.1_val.json
├── DocVQA
│ ├── spdocvqa_images
│ │ ├── ...
│ ├── val_v1.0_withQT.json
│ ├── test_v1.0.json
```
<br />
Modify the parameters in `shell/run_inference.sh` and run inference:
```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />
All optional parameters are listed in `eval_utils/getargs.py`. The meanings of some major parameters are listed as follows.
For `MiniCPM-Llama3-V-2_5`, set `model_name` to `minicpmv`:
```bash
# path to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path
# whether to eval on certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all
# model name and model path
--model_name
--model_path
# load model from ckpt
--ckpt
# the way the model processes input data, "interleave" represents interleaved image-text form, while "old" represents non-interleaved.
--generate_method
--batchsize
# path to save the outputs
--answer_path
```
<br />
While evaluating on different tasks, parameters need to be set as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```
###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```
###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
<br />
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` represents the path to the original output json, `output_file_path` represents the path to the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
</details>

@@ -1,61 +1,57 @@
# Evaluation
## MiniCPM-o 2.6
### opencompass
First, enter the `vlmevalkit` directory and install the necessary dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
rm *.whl
```
<br />
Then, run `scripts/run_inference.sh`, which receives two input parameters in sequence: `MODELNAME` and `DATALIST`. `MODELNAME` is the model name and `DATALIST` is the target dataset list:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST
```
<br />
`MODELNAME` has five choices, listed in `vlmeval/config.py`:
```bash
minicpm_series = {
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
}
```
<br />
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating multiple datasets at a time, separate the dataset names with spaces and add quotation marks at both ends:
```bash
$DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />
When a benchmark requires a GPT-series model for scoring, specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in the `.env` file beforehand.
To reproduce the results on the OpenCompass datasets plus ChartQA and MME shown in the table on the homepage (columns between OCRBench and HallusionBench), run with the following settings:
```bash
# Please note that we use different prompts for the perception and reasoning sets of MME. Evaluating the reasoning subset requires CoT, so you need to manually modify the judgment condition of the use_cot function in vlmeval/vlm/minicpm_v.py
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
```
<br />
### vqadataset
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
```bash
cd vqaeval
@@ -112,7 +108,8 @@ chmod +x ./shell/run_inference.sh
```
<br />
The available parameters are listed in `eval_utils/getargs.py`; the meanings of the major parameters are as follows.
For `MiniCPM-o-2_6`, set `model_name` to `minicpmo26`:
```bash
# paths to the images and questions for TextVQA evaluation
--textVQA_image_dir
@@ -173,3 +170,368 @@ chmod +x ./shell/run_inference.sh
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
<br />
## MiniCPM-V 2.6
<details>
<summary>Expand</summary>
### opencompass
First, enter the `vlmevalkit` directory and install the necessary dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
rm *.whl
```
<br />
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the model name, `DATALIST` the target datasets, and `MODE` the evaluation mode.
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />
`MODELNAME` has four choices, listed in `vlmeval/config.py`:
```bash
minicpm_series = {
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
}
```
<br />
All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. Separate the dataset names with spaces and add quotation marks at both ends:
```bash
$DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />
To score each benchmark directly, set `MODE=all`; if only inference results are required, set `MODE=infer`.
To reproduce the results in the table on the homepage (columns between MME and HallusionBench), run with the following settings:
```bash
# without CoT
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
# with CoT. While running the CoT version of MME, you need to modify the 'use_cot' function in vlmeval/vlm/minicpm_v.py and add MME to the branch that returns True.
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
```
<br />
### vqadataset
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />
Then, download the datasets from the links below and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```
###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />
The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│ ├── train_images
│ │ ├── ...
│ ├── TextVQA_0.5.1_val.json
├── DocVQA
│ ├── spdocvqa_images
│ │ ├── ...
│ ├── val_v1.0_withQT.json
│ ├── test_v1.0.json
```
<br />
After preparing the datasets, modify the parameters in `shell/run_inference.sh` and run inference:
```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />
The available parameters are listed in `eval_utils/getargs.py`; the meanings of the major parameters are as follows.
For `MiniCPM-V-2_6`, set `model_name` to `minicpmv26`:
```bash
# paths to the images and questions for TextVQA evaluation
--textVQA_image_dir
--textVQA_ann_path
# paths to the images and questions for DocVQA evaluation
--docVQA_image_dir
--docVQA_ann_path
# paths to the images and questions for DocVQATest evaluation
--docVQATest_image_dir
--docVQATest_ann_path
# whether to evaluate a given task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all
# model name and model path (the model is loaded from this path)
--model_name
--model_path
# load the model from a checkpoint
--ckpt
# how the model processes input data: "interleave" means interleaved image-text form, "old" means non-interleaved
--generate_method
# batch size at inference time; 1 is recommended
--batchsize
# path to save the outputs
--answer_path
```
<br />
The parameters for the three evaluation tasks are set as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```
###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```
###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
<br />
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` after inference to convert the format. `input_file_path` is the path to the original output JSON and `output_file_path` is a custom path for the transformed JSON:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
</details>
<br />
## MiniCPM-Llama3-V-2_5
<details>
<summary>Expand</summary>
### opencompass
First, enter the `vlmevalkit` directory and install the necessary dependencies:
```bash
cd vlmevalkit
pip install -r requirements.txt
```
<br />
Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the model name, `DATALIST` the target datasets, and `MODE` the evaluation mode.
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />
`MODELNAME` has three choices, listed in `vlmeval/config.py`:
```bash
ungrouped = {
'MiniCPM-V':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
'MiniCPM-V-2':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
'MiniCPM-Llama3-V-2_5':partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
}
```
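Each entry is a `functools.partial` over a model class bound to a Hugging Face path, which is how `run.py` instantiates a model by name. A minimal sketch of the same lookup:
```python
from vlmeval.config import supported_VLM

# Calling the partial loads the weights from the bound model_path.
model = supported_VLM['MiniCPM-Llama3-V-2_5']()
```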
<br />
All available `DATALIST` values are defined in `vlmeval/utils/dataset_config.py`. To evaluate a single dataset, pass the dataset name directly, without quotes; to evaluate several datasets, separate the names with spaces and wrap the whole list in quotes:
```bash
DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
```
<br />
Set `MODE=all` to score each benchmark directly; set `MODE=infer` if you only need the inference results.
To reproduce the results shown in the table on the project page (the columns from MME to RealWorldQA), run with the following settings:
```bash
# Run all 7 datasets at once
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all
# The commands below run a single dataset at a time
# MME
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
# MMBench_TEST_EN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
# MMBench_TEST_CN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
# MMMU_DEV_VAL
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
# MathVista_MINI
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
# LLaVABench
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
# RealWorldQA
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
```
<br />
### vqadataset
First, change into the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />
Then download the datasets from the following locations and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```
###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download the Images and Annotations under Task 1 - Single Page Document Visual Question Answering from https://rrc.cvc.uab.es/?ch=17&com=downloads
# Place the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip in the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />
The `downloads` directory should be organized as follows:
```bash
downloads
├── TextVQA
│ ├── train_images
│ │ ├── ...
│ ├── TextVQA_0.5.1_val.json
├── DocVQA
│ ├── spdocvqa_images
│ │ ├── ...
│ ├── val_v1.0_withQT.json
│ ├── test_v1.0.json
```
<br />
Once the datasets are ready, edit the parameters in `shell/run_inference.sh` and run inference:
```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />
All accepted arguments are defined in `eval_utils/getargs.py`; the main ones are explained below.
For `MiniCPM-Llama3-V-2_5`, `model_name` should be set to `minicpmv`:
```bash
# Paths to all images and questions for the TextVQA evaluation
--textVQA_image_dir
--textVQA_ann_path
# Paths to all images and questions for the DocVQA evaluation
--docVQA_image_dir
--docVQA_ann_path
# Paths to all images and questions for the DocVQATest evaluation
--docVQATest_image_dir
--docVQATest_ann_path
# Whether to evaluate a given task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all
# Model name and model path (the model is loaded from the given path)
--model_name
--model_path
# Load the model from a checkpoint
--ckpt
# How the model consumes its input: interleave for interleaved image-text, old for non-interleaved
--generate_method
# Batch size at inference time; 1 is recommended
--batchsize
# Path where the outputs are saved
--answer_path
```
<br />
The arguments required for each of the three tasks are as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```
###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```
###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
<br />
For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for scoring, `shell/run_transform.sh` must also be run to convert the output format (a sketch of the conversion appears after the first DocVQATest section above). Here `input_file_path` is the path of the raw output json, and `output_file_path` is a path of your choice for the converted json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
</details>

eval_mm/vlmevalkit/.env

@@ -0,0 +1,28 @@
# .env file; place it under $VLMEvalKit
# API keys for proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# LMDeploy API
LMDEPLOY_API_BASE=
# You can set an evaluation-time proxy; API calls made during evaluation will go through it
EVAL_PROXY=

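These keys are pulled into the process environment when vlmeval is imported (`load_env()` in `vlmeval/__init__.py`). A minimal sketch of the same mechanism using python-dotenv, which is already in the kit's requirements:
```python
import os
from dotenv import load_dotenv

# Read .env from the VLMEvalKit root so judge/API wrappers can see the keys.
load_dotenv(".env")
print(os.environ.get("OPENAI_API_BASE", "<unset>"))
```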

@@ -1,33 +1,30 @@
-einops
-gradio==4.15.0
+decord; platform_machine != 'arm64'
+eva-decord; platform_machine == 'arm64'
+gradio
 huggingface_hub
+imageio
 matplotlib
-numpy>=1.23.4
+numpy
 omegaconf
-openai==1.3.5
+openai
 opencv-python>=4.4.0.46
 openpyxl
-pandas>=1.5.3
+pandas
 pillow
 portalocker
 protobuf
+pycocoevalcap
 python-dotenv
 requests
 rich
+seaborn
 sentencepiece
+setuptools
 sty
 tabulate
 tiktoken
 timeout-decorator
+torch
 tqdm
-typing_extensions==4.7.1
+transformers
+typing_extensions
 validators
+visual_genome
 xlsxwriter
-Pillow==10.1.0
-sentencepiece==0.1.99
-transformers==4.40.0
-torch==1.13.1
-torchvision


@@ -0,0 +1,11 @@
docutils==0.18.1
modelindex
myst-parser
-e git+https://github.com/open-compass/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
sphinx==6.1.3
sphinx-copybutton
sphinx-design
sphinx-notfound-page
sphinx-tabs
sphinxcontrib-jquery
tabulate


@@ -1,147 +1,422 @@
 import torch
 import torch.distributed as dist
-from vlmeval.smp import *
-from vlmeval.evaluate import *
-from vlmeval.inference import infer_data_job
 from vlmeval.config import supported_VLM
-from vlmeval.utils import dataset_URLs, DATASET_TYPE, abbr2full, MMMU_result_transfer
+from vlmeval.dataset.video_dataset_config import supported_video_datasets
+from vlmeval.dataset import build_dataset
+from vlmeval.inference import infer_data_job
+from vlmeval.inference_video import infer_data_job_video
+from vlmeval.inference_mt import infer_data_job_mt
+from vlmeval.smp import *
+from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer
+
+
+def build_model_from_config(cfg, model_name):
+    import vlmeval.api
+    import vlmeval.vlm
+    config = cp.deepcopy(cfg[model_name])
+    if config == {}:
+        return supported_VLM[model_name]()
+    assert 'class' in config
+    cls_name = config.pop('class')
+    if hasattr(vlmeval.api, cls_name):
+        return getattr(vlmeval.api, cls_name)(**config)
+    elif hasattr(vlmeval.vlm, cls_name):
+        return getattr(vlmeval.vlm, cls_name)(**config)
+    else:
+        raise ValueError(f'Class {cls_name} is not supported in `vlmeval.api` or `vlmeval.vlm`')
+
+
+def build_dataset_from_config(cfg, dataset_name):
+    import vlmeval.dataset
+    import inspect
+    config = cp.deepcopy(cfg[dataset_name])
+    if config == {}:
+        return supported_video_datasets[dataset_name]()
+    assert 'class' in config
+    cls_name = config.pop('class')
+    if hasattr(vlmeval.dataset, cls_name):
+        cls = getattr(vlmeval.dataset, cls_name)
+        sig = inspect.signature(cls.__init__)
+        valid_params = {k: v for k, v in config.items() if k in sig.parameters}
+        if valid_params.get('fps', 0) > 0 and valid_params.get('nframe', 0) > 0:
+            raise ValueError('fps and nframe should not be set at the same time')
+        if valid_params.get('fps', 0) <= 0 and valid_params.get('nframe', 0) <= 0:
+            raise ValueError('fps and nframe should be set at least one valid value')
+        return cls(**valid_params)
+    else:
+        raise ValueError(f'Class {cls_name} is not supported in `vlmeval.dataset`')
+
+
 def parse_args():
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--data', type=str, nargs='+', required=True)
-    parser.add_argument('--model', type=str, nargs='+', required=True)
-    parser.add_argument('--work-dir', type=str, default='.', help='select the output directory')
+    help_msg = """\
+You can launch the evaluation by setting either --data and --model or --config.
+
+--data and --model:
+    Each Arg should be a list of strings, specifying the names of datasets and models.
+    To find all supported model names, please refer to the `vlmeval/config.py` or check the output of the command \
+`vlmutil mlist all` in the terminal (you should first have vlmeval installed).
+    To find all supported dataset names, please refer to the `vlmeval/dataset/__init__.py` file. The python script \
+to print all supported dataset names is as follows:
+    ```python
+    from vlmeval.dataset import SUPPORTED_DATASETS
+    print(SUPPORTED_DATASETS)
+    ```
+    or you can check the output of the command `vlmutil dlist all` in the terminal.
+    To find all supported video dataset default settings, please refer to the \
+`vlmeval/dataset/video_dataset_config.py` file.
+
+--config:
+    Launch the evaluation by specifying the path to the config json file. Sample Json Content:
+    ```json
+    {
+        "model": {
+            "GPT4o_20240806_T00_HIGH": {
+                "class": "GPT4V",
+                "model": "gpt-4o-2024-08-06",
+                "temperature": 0,
+                "img_detail": "high"
+            },
+            "GPT4o_20240806_T10_Low": {
+                "class": "GPT4V",
+                "model": "gpt-4o-2024-08-06",
+                "temperature": 1.0,
+                "img_detail": "low"
+            },
+            "GPT4o_20241120": {}
+        },
+        "data": {
+            "MME-RealWorld-Lite": {
+                "class": "MMERealWorld",
+                "dataset": "MME-RealWorld-Lite"
+            },
+            "MMBench_DEV_EN_V11": {
+                "class": "ImageMCQDataset",
+                "dataset": "MMBench_DEV_EN_V11"
+            },
+            "MMBench_Video_8frame_nopack": {},
+            "Video-MME_16frame_subs": {
+                "class": "VideoMME",
+                "dataset": "Video-MME",
+                "nframe": 16,
+                "use_subtitle": true,
+            }
+        }
+    }
+    ```
+    Currently, only `model` and `data` are supported fields. The content of each field is a dictionary.
+    For `model`, the key is the name of the model, and the value is a dictionary containing the following keys:
+    - `class`: The class name of the model, which should be a class in `vlmeval.vlm` or `vlmeval.api`.
+    - Other keys are specific to the model, please refer to the corresponding class.
+    - Tip: The defined model in the `supported_VLM` of `vlmeval/config.py` can be used as a shortcut.
+    For `data`, the key is the name of the dataset (should be the same as the `dataset` field in most cases, \
+except for video datasets), and the value is a dictionary containing the following keys:
+    - `class`: The class name of the dataset, which should be a class in `vlmeval.dataset`.
+    - `dataset`: The name of the dataset, which should be a string that is accepted by the `dataset` argument of the \
+corresponding class.
+    - Other keys are specific to the dataset, please refer to the corresponding class.
+    - Tip: The defined dataset in the `supported_video_datasets` of `vlmeval/dataset/video_dataset_config.py` \
+can be used as a shortcut.
+
+    The keys in the `model` and `data` fields will be used for naming the prediction files and evaluation results.
+    When launching with `--config`, args for API VLMs, such as `--retry`, `--verbose`, will be ignored.
+"""
+    parser = argparse.ArgumentParser(description=help_msg, formatter_class=argparse.RawTextHelpFormatter)
+    # Essential Args, Setting the Names of Datasets and Models
+    parser.add_argument('--data', type=str, nargs='+', help='Names of Datasets')
+    parser.add_argument('--model', type=str, nargs='+', help='Names of Models')
+    parser.add_argument('--config', type=str, help='Path to the Config Json File')
+    # Work Dir
+    parser.add_argument('--work-dir', type=str, default='./outputs', help='select the output directory')
+    # Infer + Eval or Infer Only
     parser.add_argument('--mode', type=str, default='all', choices=['all', 'infer'])
-    parser.add_argument('--nproc', type=int, default=4, help='Parallel API calling')
+    # API Kwargs, Apply to API VLMs and Judge API LLMs
+    parser.add_argument('--api_nproc', type=int, default=4, help='Parallel API calling')
     parser.add_argument('--retry', type=int, default=None, help='retry numbers for API VLMs')
+    # Explicitly Set the Judge Model
     parser.add_argument('--judge', type=str, default=None)
-    parser.add_argument('--ignore', action='store_true', help='Ignore failed indices. ')
+    # Logging Utils
     parser.add_argument('--verbose', action='store_true')
-    parser.add_argument('--rerun', action='store_true')
+    # Configuration for Resume
+    # Ignore: will not rerun failed VLM inference
+    parser.add_argument('--ignore', action='store_true', help='Ignore failed indices. ')
+    # Reuse: will reuse the existing prediction files
+    parser.add_argument('--reuse', action='store_true')
     args = parser.parse_args()
     return args


 def main():
     logger = get_logger('RUN')
+    rank, world_size = get_rank_and_world_size()
     args = parse_args()
-    assert len(args.data), '--data should be a list of data files'
+    use_config, cfg = False, None
+    if args.config is not None:
+        assert args.data is None and args.model is None, '--data and --model should not be set when using --config'
+        use_config, cfg = True, load(args.config)
+        args.model = list(cfg['model'].keys())
+        args.data = list(cfg['data'].keys())
+    else:
+        assert len(args.data), '--data should be a list of data files'

-    if args.retry is not None:
+    if rank == 0:
+        if not args.reuse:
+            logger.warning('--reuse is not set, will not reuse previous (before one day) temporary files')
+        else:
+            logger.warning('--reuse is set, will reuse the latest prediction & temporary pickle files')
+
+    if 'MMEVAL_ROOT' in os.environ:
+        args.work_dir = os.environ['MMEVAL_ROOT']
+
+    if not use_config:
         for k, v in supported_VLM.items():
-            if hasattr(v, 'keywords') and 'retry' in v.keywords:
+            if hasattr(v, 'keywords') and 'retry' in v.keywords and args.retry is not None:
                 v.keywords['retry'] = args.retry
                 supported_VLM[k] = v
-            if hasattr(v, 'keywords') and 'verbose' in v.keywords:
+            if hasattr(v, 'keywords') and 'verbose' in v.keywords and args.verbose is not None:
                 v.keywords['verbose'] = args.verbose
                 supported_VLM[k] = v

-    rank, world_size = get_rank_and_world_size()
     if world_size > 1:
         local_rank = os.environ.get('LOCAL_RANK', 0)
         torch.cuda.set_device(int(local_rank))
-        dist.init_process_group(backend='nccl', timeout=datetime.timedelta(seconds=10800))
+        dist.init_process_group(
+            backend='nccl',
+            timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
+        )

     for _, model_name in enumerate(args.model):
         model = None
+        date, commit_id = timestr('day'), githash(digits=8)
+        eval_id = f"T{date}_G{commit_id}"

-        pred_root = osp.join(args.work_dir, model_name)
-        os.makedirs(pred_root, exist_ok=True)
+        pred_root = osp.join(args.work_dir, model_name, eval_id)
+        pred_root_meta = osp.join(args.work_dir, model_name)
+        os.makedirs(pred_root_meta, exist_ok=True)
+
+        prev_pred_roots = ls(osp.join(args.work_dir, model_name), mode='dir')
+        if len(prev_pred_roots) and args.reuse:
+            prev_pred_roots.sort()
+
+        if not osp.exists(pred_root):
+            os.makedirs(pred_root, exist_ok=True)
+
+        if use_config:
+            model = build_model_from_config(cfg['model'], model_name)

         for _, dataset_name in enumerate(args.data):
-            custom_flag = False
-
-            if dataset_name not in dataset_URLs:
-                dataset_name = abbr2full(dataset_name)
-
-            if dataset_name not in dataset_URLs:
-                logger.warning(f'Dataset {dataset_name} is not officially supported. ')
-                file_path = osp.join(LMUDataRoot(), f'{dataset_name}.tsv')
-                if not osp.exists(file_path):
-                    logger.error(f'Cannot find the local dataset {dataset_name}. ')
-                    continue
-                else:
-                    custom_flag = True
-
-            result_file = f'{pred_root}/{model_name}_{dataset_name}.xlsx'
-            if osp.exists(result_file) and args.rerun:
-                os.system(f'rm {pred_root}/{model_name}_{dataset_name}_*')
-
-            if model is None:
-                model = model_name  # which is only a name
-
-            model = infer_data_job(
-                model,
-                work_dir=pred_root,
-                model_name=model_name,
-                dataset_name=dataset_name,
-                verbose=args.verbose,
-                api_nproc=args.nproc,
-                ignore_failed=args.ignore)
-
-            if rank == 0:
-                if dataset_name in ['MMMU_TEST']:
-                    result_json = MMMU_result_transfer(result_file)
-                    logger.info(f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}')  # noqa: E501
-                    continue
-
-            if dataset_name in [
-                'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN'
-                'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
-            ]:
-                if not MMBenchOfficialServer(dataset_name):
-                    logger.error(
-                        f'Can not evaluate {dataset_name} on non-official servers, '
-                        'will skip the evaluation. '
-                    )
-                    continue
-
-            judge_kwargs = {
-                'nproc': args.nproc,
-                'verbose': args.verbose,
-            }
-            if args.retry is not None:
-                judge_kwargs['retry'] = args.retry
-            if args.judge is not None:
-                judge_kwargs['model'] = args.judge
-            else:
-                if DATASET_TYPE(dataset_name) in ['multi-choice', 'Y/N']:
-                    judge_kwargs['model'] = 'chatgpt-0613'
-                elif listinstr(['MMVet', 'MathVista', 'LLaVABench'], dataset_name):
-                    judge_kwargs['model'] = 'gpt-4-turbo'
-            if 'OPENAI_API_KEY_JUDGE' in os.environ and len(os.environ['OPENAI_API_KEY_JUDGE']):
-                judge_kwargs['key'] = os.environ['OPENAI_API_KEY_JUDGE']
-            if 'OPENAI_API_BASE_JUDGE' in os.environ and len(os.environ['OPENAI_API_BASE_JUDGE']):
-                judge_kwargs['api_base'] = os.environ['OPENAI_API_BASE_JUDGE']
-
-            if rank == 0 and args.mode == 'all':
-                if DATASET_TYPE(dataset_name) == 'multi-choice':
-                    dataset_name = 'default' if custom_flag else dataset_name
-                    multiple_choice_eval(
-                        result_file,
-                        dataset=dataset_name,
-                        **judge_kwargs)
-                elif DATASET_TYPE(dataset_name) == 'Y/N':
-                    YOrN_eval(
-                        result_file,
-                        dataset=dataset_name,
-                        **judge_kwargs)
-                elif DATASET_TYPE(dataset_name) == 'Caption':
-                    COCO_eval(result_file)
-                elif dataset_name == 'MMVet':
-                    MMVet_eval(result_file, **judge_kwargs)
-                elif dataset_name == 'OCRBench':
-                    OCRBench_eval(result_file)
-                elif listinstr(['OCRVQA', 'TextVQA', 'ChartQA', 'DocVQA', 'InfoVQA'], dataset_name):
-                    VQAEval(result_file, dataset_name)
-                elif listinstr(['MathVista'], dataset_name):
-                    MathVista_eval(result_file, **judge_kwargs)
-                elif listinstr(['LLaVABench'], dataset_name):
-                    LLaVABench_eval(result_file, **judge_kwargs)
-                else:
-                    logger.error(f'Dataset {dataset_name} is not handled by evaluator, will be skipped. ')
+            try:
+                result_file_base = f'{model_name}_{dataset_name}.xlsx'
+
+                if use_config:
+                    if world_size > 1:
+                        if rank == 0:
+                            dataset = build_dataset_from_config(cfg['data'], dataset_name)
+                        dist.barrier()
+                    dataset = build_dataset_from_config(cfg['data'], dataset_name)
+                    if dataset is None:
+                        logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
+                        continue
+                else:
+                    dataset_kwargs = {}
+                    if dataset_name in ['MMLongBench_DOC', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI']:
+                        dataset_kwargs['model'] = model_name
+
+                    # If distributed, first build the dataset on the main process for doing preparation works
+                    if world_size > 1:
+                        if rank == 0:
+                            dataset = build_dataset(dataset_name, **dataset_kwargs)
+                        dist.barrier()
+
+                    dataset = build_dataset(dataset_name, **dataset_kwargs)
+                    if dataset is None:
+                        logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
+                        continue
+
+                # Handling Multi-Turn Dataset
+                if dataset.TYPE == 'MT':
+                    result_file_base = result_file_base.replace('.xlsx', '.tsv')
+
+                result_file = osp.join(pred_root, result_file_base)
+
+                # Reuse the previous prediction file if exists
+                if rank == 0 and len(prev_pred_roots):
+                    prev_result_file = None
+                    prev_pkl_file_list = []
+                    for root in prev_pred_roots[::-1]:
+                        if osp.exists(osp.join(root, result_file_base)):
+                            prev_result_file = osp.join(root, result_file_base)
+                            break
+                        elif commit_id in root and len(ls(root)) and root != pred_root:
+                            temp_files = ls(root, match=[dataset_name, '.pkl'])
+                            if len(temp_files):
+                                prev_pkl_file_list.extend(temp_files)
+                                break
+                    if not args.reuse:
+                        prev_result_file = None
+                        prev_pkl_file_list = []
+                    if prev_result_file is not None:
+                        logger.warning(
+                            f'--reuse is set, will reuse the prediction file {prev_result_file}.')
+                        if prev_result_file != result_file:
+                            shutil.copy(prev_result_file, result_file)
+                    elif len(prev_pkl_file_list):
+                        for fname in prev_pkl_file_list:
+                            target_path = osp.join(pred_root, osp.basename(fname))
+                            if not osp.exists(target_path):
+                                shutil.copy(fname, target_path)
+                                logger.info(f'--reuse is set, will reuse the prediction pickle file {fname}.')
+                            else:
+                                logger.warning(f'File already exists: {target_path}')
+
+                if world_size > 1:
+                    dist.barrier()
+
+                if model is None:
+                    model = model_name  # which is only a name
+
+                # Perform the Inference
+                if dataset.MODALITY == 'VIDEO':
+                    model = infer_data_job_video(
+                        model,
+                        work_dir=pred_root,
+                        model_name=model_name,
+                        dataset=dataset,
+                        result_file_name=result_file_base,
+                        verbose=args.verbose,
+                        api_nproc=args.api_nproc)
+                elif dataset.TYPE == 'MT':
+                    model = infer_data_job_mt(
+                        model,
+                        work_dir=pred_root,
+                        model_name=model_name,
+                        dataset=dataset,
+                        verbose=args.verbose,
+                        api_nproc=args.api_nproc,
+                        ignore_failed=args.ignore)
+                else:
+                    model = infer_data_job(
+                        model,
+                        work_dir=pred_root,
+                        model_name=model_name,
+                        dataset=dataset,
+                        verbose=args.verbose,
+                        api_nproc=args.api_nproc,
+                        ignore_failed=args.ignore)
+
+                # Set the judge kwargs first before evaluation or dumping
+                judge_kwargs = {
+                    'nproc': args.api_nproc,
+                    'verbose': args.verbose,
+                    'retry': args.retry if args.retry is not None else 3
+                }
+                if args.retry is not None:
+                    judge_kwargs['retry'] = args.retry
+                if args.judge is not None:
+                    judge_kwargs['model'] = args.judge
+                else:
+                    if dataset.TYPE in ['MCQ', 'Y/N']:
+                        judge_kwargs['model'] = 'chatgpt-0125'
+                    elif listinstr(['MMVet', 'LLaVABench', 'MMBench-Video'], dataset_name):
+                        judge_kwargs['model'] = 'gpt-4-turbo'
+                    elif listinstr(['MathVista', 'MathVerse', 'MathVision', 'DynaMath', 'VL-RewardBench', 'WeMath', 'LogicVista'], dataset_name):  # noqa: E501
+                        judge_kwargs['model'] = 'gpt-4o-mini'
+                    elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'SLIDEVQA', 'MIA-Bench', 'WildVision'], dataset_name):  # noqa: E501
+                        judge_kwargs['model'] = 'gpt-4o'
+                if rank == 0:
+                    logger.info(judge_kwargs)
+
+                if world_size > 1:
+                    dist.barrier()
+
+                # Only Rank 0 handles the evaluation part
+                if rank == 0:
+                    # Prepare Submission Files for MMMU_TEST AND MMT-Bench_ALL
+                    if dataset_name in ['MMMU_TEST']:
+                        result_json = MMMU_result_transfer(result_file)
+                        logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
+                                    f'json file saved in {result_json}')
+                        continue
+                    elif 'MMT-Bench_ALL' in dataset_name:
+                        submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
+                        logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
+                                    f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
+                                    f'submission file saved in {submission_file}')
+                        continue
+
+                    # Skip the evaluation part if only infer
+                    if args.mode == 'infer':
+                        continue
+
+                    # Skip the evaluation part if the dataset evaluation is not supported or annotations are missing
+                    if 'MLLMGuard_DS' in dataset_name:
+                        logger.info('The evaluation of MLLMGuard_DS is not supported yet. ')
+                        continue
+                    elif 'AesBench_TEST' == dataset_name:
+                        logger.info(f'The results are saved in {result_file}. '
+                                    f'Please send it to the AesBench Team via huangyipo@hotmail.com.')
+                        continue
+                    elif dataset_name in ['DocVQA_TEST', 'InfoVQA_TEST', 'Q-Bench1_TEST', 'A-Bench_TEST']:
+                        logger.info(f'{dataset_name} is a test split without ground-truth. '
+                                    'Thus only the inference part is supported for those datasets. ')
+                        continue
+                    elif dataset_name in [
+                        'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN',
+                        'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
+                    ] and not MMBenchOfficialServer(dataset_name):
+                        logger.error(
+                            f'Can not evaluate {dataset_name} on non-official servers, will skip the evaluation.')
+                        continue
+
+                    # Setup the proxy for the evaluation
+                    eval_proxy = os.environ.get('EVAL_PROXY', None)
+                    old_proxy = os.environ.get('HTTP_PROXY', '')
+                    if eval_proxy is not None:
+                        proxy_set(eval_proxy)
+
+                    # Perform the Evaluation
+                    eval_results = dataset.evaluate(result_file, **judge_kwargs)
+                    # Display Evaluation Results in Terminal
+                    if eval_results is not None:
+                        assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
+                        logger.info(f'The evaluation of model {model_name} x dataset {dataset_name} has finished! ')
+                        logger.info('Evaluation Results:')
+                        if isinstance(eval_results, dict):
+                            logger.info('\n' + json.dumps(eval_results, indent=4))
+                        elif isinstance(eval_results, pd.DataFrame):
+                            if len(eval_results) < len(eval_results.columns):
+                                eval_results = eval_results.T
+                            logger.info('\n' + tabulate(eval_results))
+
+                    # Restore the proxy
+                    if eval_proxy is not None:
+                        proxy_set(old_proxy)
+
+                    # Create the symbolic links for the prediction files
+                    files = os.listdir(pred_root)
+                    files = [x for x in files if (f'{model_name}_{dataset_name}' in x or "status.json" in x)]
+                    for f in files:
+                        cwd = os.getcwd()
+                        file_addr = osp.join(cwd, pred_root, f)
+                        link_addr = osp.join(cwd, pred_root_meta, f)
+                        if osp.exists(link_addr) or osp.islink(link_addr):
+                            os.remove(link_addr)
+                        os.symlink(file_addr, link_addr)
+
+            except Exception as e:
+                logger.exception(f'Model {model_name} x Dataset {dataset_name} combination failed: {e}, '
+                                 'skipping this combination.')
+                continue
+
+        if world_size > 1:
+            dist.barrier()
+
+    if world_size > 1:
+        dist.destroy_process_group()


 if __name__ == '__main__':


@@ -1,31 +0,0 @@
export PATH=/usr/local/cuda/bin:$PATH
export HF_ENDPOINT=https://hf-mirror.com
export OMP_NUM_THREADS=1
export timestamp=`date +"%Y%m%d%H%M%S"`
export OLD_VERSION='False'
export PYTHONPATH=$(dirname $SELF_DIR):$PYTHONPATH
# gpu consumed
# fp16 17-18G
# int4 7-8G
# model to be used
# Example: MODELNAME=MiniCPM-Llama3-V-2_5
MODELNAME=$1
# datasets to be tested
# Example: DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
DATALIST=$2
# test mode, all or infer
MODE=$3
echo "Starting inference with model $MODELNAME on datasets $DATALIST"
# run on multi gpus with torchrun command
# remember to run twice, the first run may fail
torchrun --nproc_per_node=8 run.py --data $DATALIST --model $MODELNAME --mode $MODE
torchrun --nproc_per_node=8 run.py --data $DATALIST --model $MODELNAME --mode $MODE
# run on single gpu with python command
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode $MODE
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode $MODE
ls


@@ -0,0 +1,41 @@
export PATH=/usr/local/cuda/bin:$PATH
export HF_ENDPOINT=https://hf-mirror.com
export OMP_NUM_THREADS=1
export timestamp=`date +"%Y%m%d%H%M%S"`
export OLD_VERSION='False'
export PYTHONPATH=$(dirname $SELF_DIR):$PYTHONPATH
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
# gpu consumed
# fp16 17-18G
# int4 7-8G
# model to be used
# Example: MODELNAME=MiniCPM-o-2_6
MODELNAME=$1
# datasets to be tested
# Example: DATALIST=MMMU_DEV_VAL
DATALIST=$2
# run on multi gpus with torchrun command
# remember to run twice, the first run may fail
for DATASET in $DATALIST; do
echo "Starting inference with model $MODELNAME on dataset $DATASET"
torchrun --master_port 29500 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse
torchrun --master_port 29501 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse
# for benchmarks which require gpt for scoring, you need to specify OPENAI_API_BASE and OPENAI_API_KEY in .env file
if [[ "$DATASET" == *"MMBench_TEST"*]]; then
echo "Skipping evaluation for dataset $DATASET"
else
echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
python run.py --data $DATASET --model $MODELNAME --nproc 16 --verbose
fi
done
# run on single gpu with python command
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
# echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
# python run.py --data $DATASET --model $MODELNAME --nproc 16 --verbose

eval_mm/vlmevalkit/setup.py

@@ -0,0 +1,122 @@
import re
import sys
from os.path import exists
from setuptools import find_packages, setup
def parse_requirements(fname='requirements.txt', with_version=True):
"""Parse the package dependencies listed in a requirements file but strips
specific versioning information.
Args:
fname (str): path to requirements file
with_version (bool, default=False): if True include version specs
Returns:
List[str]: list of requirements items
CommandLine:
python -c "import setup; print(setup.parse_requirements())"
"""
require_fpath = fname
def parse_line(line):
"""Parse information from a line in a requirements text file."""
if line.startswith('-r '):
# Allow specifying requirements in other files
target = line.split(' ')[1]
for info in parse_require_file(target):
yield info
else:
info = {'line': line}
if line.startswith('-e '):
info['package'] = line.split('#egg=')[1]
elif '@git+' in line:
info['package'] = line
else:
# Remove versioning from the package
pat = '(' + '|'.join(['>=', '==', '>']) + ')'
parts = re.split(pat, line, maxsplit=1)
parts = [p.strip() for p in parts]
info['package'] = parts[0]
if len(parts) > 1:
op, rest = parts[1:]
if ';' in rest:
# Handle platform specific dependencies
# http://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
version, platform_deps = map(str.strip,
rest.split(';'))
info['platform_deps'] = platform_deps
else:
version = rest # NOQA
info['version'] = (op, version)
yield info
def parse_require_file(fpath):
with open(fpath, 'r') as f:
for line in f.readlines():
line = line.strip()
if line and not line.startswith('#'):
for info in parse_line(line):
yield info
def gen_packages_items():
if exists(require_fpath):
for info in parse_require_file(require_fpath):
parts = [info['package']]
if with_version and 'version' in info:
parts.extend(info['version'])
if not sys.version.startswith('3.4'):
# apparently package_deps are broken in 3.4
platform_deps = info.get('platform_deps')
if platform_deps is not None:
parts.append(';' + platform_deps)
item = ''.join(parts)
yield item
packages = list(gen_packages_items())
return packages
with open('README.md') as f:
readme = f.read()
def do_setup():
setup(
name='vlmeval',
version='0.1.0',
description='OpenCompass VLM Evaluation Kit',
author='Haodong Duan',
author_email='dhd.efz@gmail.com',
maintainer='Haodong Duan',
maintainer_email='dhd.efz@gmail.com',
long_description=readme,
long_description_content_type='text/markdown',
cmdclass={},
install_requires=parse_requirements('requirements.txt'),
setup_requires=[],
python_requires='>=3.7.0',
packages=find_packages(exclude=[
'test*',
'paper_test*',
]),
keywords=['AI', 'NLP', 'in-context learning'],
entry_points={
'console_scripts': ['vlmutil = vlmeval:cli']
},
classifiers=[
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Intended Audience :: Developers',
'Intended Audience :: Education',
'Intended Audience :: Science/Research',
])
if __name__ == '__main__':
do_setup()

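For reference, a small demo of what `parse_requirements` yields; the snippet assumes it is run from `eval_mm/vlmevalkit` so that `setup.py` is importable and `README.md` is readable:
```python
# Version specs and platform markers are preserved as plain strings, e.g.
# 'opencv-python>=4.4.0.46' and "decord; platform_machine != 'arm64'".
from setup import parse_requirements

for item in parse_requirements("requirements.txt"):
    print(item)
```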

@@ -5,9 +5,12 @@ except ImportError:
 from .smp import *
 from .api import *
-from .evaluate import *
+from .dataset import *
 from .utils import *
 from .vlm import *
 from .config import *
+from .tools import cli

 load_env()
+
+__version__ = '0.2rc1'


@@ -1,6 +1,5 @@
 from .gpt import OpenAIWrapper, GPT4V
-from .gpt_int import OpenAIWrapperInternal, GPT4V_Internal

 __all__ = [
-    'OpenAIWrapper', 'OpenAIWrapperInternal', 'GPT4V', 'GPT4V_Internal'
+    'OpenAIWrapper', 'GPT4V',
 ]


@@ -3,7 +3,7 @@ import random as rd
 from abc import abstractmethod
 import os.path as osp
 import copy as cp
-from ..smp import get_logger, parse_file
+from ..smp import get_logger, parse_file, concat_images_vlmeval, LMUDataRoot, md5, decode_base64_to_image_file


 class BaseAPI:

@@ -62,12 +62,22 @@ class BaseAPI:
         Returns:
             bool: If the API model is working, return True, else return False.
         """
-        retry = 3
+        self.old_timeout = None
+        if hasattr(self, 'timeout'):
+            self.old_timeout = self.timeout
+            self.timeout = 120
+
+        retry = 5
         while retry > 0:
             ret = self.generate('hello')
             if ret is not None and ret != '' and self.fail_msg not in ret:
+                if self.old_timeout is not None:
+                    self.timeout = self.old_timeout
                 return True
             retry -= 1
+        if self.old_timeout is not None:
+            self.timeout = self.old_timeout
         return False

     def check_content(self, msgs):

@@ -127,6 +137,82 @@ class BaseAPI:
         else:
             return None

+    # May exceed the context windows size, so try with different turn numbers.
+    def chat_inner(self, inputs, **kwargs):
+        _ = kwargs.pop('dataset', None)
+        while len(inputs):
+            try:
+                return self.generate_inner(inputs, **kwargs)
+            except Exception as e:
+                if self.verbose:
+                    self.logger.info(f'{type(e)}: {e}')
+                inputs = inputs[1:]
+                while len(inputs) and inputs[0]['role'] != 'user':
+                    inputs = inputs[1:]
+                continue
+        return -1, self.fail_msg + ': ' + 'Failed with all possible conversation turns.', None
+
+    def chat(self, messages, **kwargs1):
+        """The main function for multi-turn chatting. Will call `chat_inner` with the preprocessed input messages."""
+        assert hasattr(self, 'chat_inner'), 'The API model should has the `chat_inner` method. '
+        for msg in messages:
+            assert isinstance(msg, dict) and 'role' in msg and 'content' in msg, msg
+            assert self.check_content(msg['content']) in ['str', 'dict', 'liststr', 'listdict'], msg
+            msg['content'] = self.preproc_content(msg['content'])
+        # merge kwargs
+        kwargs = cp.deepcopy(self.default_kwargs)
+        kwargs.update(kwargs1)
+
+        answer = None
+        # a very small random delay [0s - 0.5s]
+        T = rd.random() * 0.5
+        time.sleep(T)
+
+        assert messages[-1]['role'] == 'user'
+
+        for i in range(self.retry):
+            try:
+                ret_code, answer, log = self.chat_inner(messages, **kwargs)
+                if ret_code == 0 and self.fail_msg not in answer and answer != '':
+                    if self.verbose:
+                        print(answer)
+                    return answer
+                elif self.verbose:
+                    if not isinstance(log, str):
+                        try:
+                            log = log.text
+                        except Exception as e:
+                            self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
+                    self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
+            except Exception as err:
+                if self.verbose:
+                    self.logger.error(f'An error occured during try {i}: ')
+                    self.logger.error(f'{type(err)}: {err}')
+            # delay before each retry
+            T = rd.random() * self.wait * 2
+            time.sleep(T)
+
+        return self.fail_msg if answer in ['', None] else answer
+
+    def preprocess_message_with_role(self, message):
+        system_prompt = ''
+        new_message = []
+
+        for data in message:
+            assert isinstance(data, dict)
+            role = data.pop('role', 'user')
+            if role == 'system':
+                system_prompt += data['value'] + '\n'
+            else:
+                new_message.append(data)
+
+        if system_prompt != '':
+            if self.system_prompt is None:
+                self.system_prompt = system_prompt
+            else:
+                self.system_prompt += '\n' + system_prompt
+        return new_message
+
     def generate(self, message, **kwargs1):
         """The main function to generate the answer. Will call `generate_inner` with the preprocessed input messages.

@@ -136,6 +222,9 @@ class BaseAPI:
         Returns:
             str: The generated answer of the Failed Message if failed to obtain answer.
         """
+        if self.check_content(message) == 'listdict':
+            message = self.preprocess_message_with_role(message)
+
         assert self.check_content(message) in ['str', 'dict', 'liststr', 'listdict'], f'Invalid input type: {message}'
         message = self.preproc_content(message)
         assert message is not None and self.check_content(message) == 'listdict'

@@ -162,20 +251,20 @@ class BaseAPI:
                     if not isinstance(log, str):
                         try:
                             log = log.text
-                        except:
-                            self.logger.warning(f'Failed to parse {log} as an http response. ')
+                        except Exception as e:
+                            self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
                     self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
             except Exception as err:
                 if self.verbose:
-                    self.logger.error(f'An error occured during try {i}:')
-                    self.logger.error(err)
+                    self.logger.error(f'An error occured during try {i}: ')
+                    self.logger.error(f'{type(err)}: {err}')
             # delay before each retry
             T = rd.random() * self.wait * 2
             time.sleep(T)

         return self.fail_msg if answer in ['', None] else answer

-    def message_to_promptimg(self, message):
+    def message_to_promptimg(self, message, dataset=None):
         assert not self.INTERLEAVE
         model_name = self.__class__.__name__
         import warnings

@@ -191,5 +280,10 @@ class BaseAPI:
             image = [x['value'] for x in message if x['type'] == 'image'][0]
         else:
             prompt = '\n'.join([x['value'] if x['type'] == 'text' else '<image>' for x in message])
-            image = [x['value'] for x in message if x['type'] == 'image'][0]
+            if dataset == 'BLINK':
+                image = concat_images_vlmeval(
+                    [x['value'] for x in message if x['type'] == 'image'],
+                    target_size=512)
+            else:
+                image = [x['value'] for x in message if x['type'] == 'image'][0]
         return prompt, image

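The `chat()` path added above takes OpenAI-style role-tagged turns whose content items follow the kit's `{'type': 'text'|'image', 'value': ...}` convention; on failure, `chat_inner` retries with the oldest turns dropped. A minimal sketch of the expected message shape (the model id is an assumption):
```python
from vlmeval.api import OpenAIWrapper

model = OpenAIWrapper('gpt-4-turbo')  # assumed model id
messages = [
    dict(role='user', content=[dict(type='text', value='Describe this image.'),
                               dict(type='image', value='demo.jpg')]),
    dict(role='assistant', content=[dict(type='text', value='A cat on a sofa.')]),
    dict(role='user', content=[dict(type='text', value='What color is the cat?')]),
]
print(model.chat(messages))
```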

@@ -10,18 +10,18 @@ APIBASES = {
 def GPT_context_window(model):
     length_map = {
-        'gpt-4-1106-preview': 128000,
-        'gpt-4-vision-preview': 128000,
         'gpt-4': 8192,
-        'gpt-4-32k': 32768,
         'gpt-4-0613': 8192,
-        'gpt-4-32k-0613': 32768,
+        'gpt-4-turbo-preview': 128000,
+        'gpt-4-1106-preview': 128000,
+        'gpt-4-0125-preview': 128000,
+        'gpt-4-vision-preview': 128000,
+        'gpt-4-turbo': 128000,
+        'gpt-4-turbo-2024-04-09': 128000,
+        'gpt-3.5-turbo': 16385,
+        'gpt-3.5-turbo-0125': 16385,
         'gpt-3.5-turbo-1106': 16385,
-        'gpt-3.5-turbo': 4096,
-        'gpt-3.5-turbo-16k': 16385,
         'gpt-3.5-turbo-instruct': 4096,
-        'gpt-3.5-turbo-0613': 4096,
-        'gpt-3.5-turbo-16k-0613': 16385,
     }
     if model in length_map:
         return length_map[model]

@@ -38,7 +38,7 @@ class OpenAIWrapper(BaseAPI):
                  retry: int = 5,
                  wait: int = 5,
                  key: str = None,
-                 verbose: bool = True,
+                 verbose: bool = False,
                  system_prompt: str = None,
                  temperature: float = 0,
                  timeout: int = 60,

@@ -46,6 +46,7 @@ class OpenAIWrapper(BaseAPI):
                  max_tokens: int = 1024,
                  img_size: int = 512,
                  img_detail: str = 'low',
+                 use_azure: bool = False,
                  **kwargs):

         self.model = model

@@ -53,19 +54,43 @@ class OpenAIWrapper(BaseAPI):
         self.fail_msg = 'Failed to obtain answer via API. '
         self.max_tokens = max_tokens
         self.temperature = temperature
+        self.use_azure = use_azure

-        if 'step-1v' in model:
+        if 'step' in model:
             env_key = os.environ.get('STEPAI_API_KEY', '')
             if key is None:
                 key = env_key
-        else:
-            env_key = os.environ.get('OPENAI_API_KEY', '')
+        elif 'yi-vision' in model:
+            env_key = os.environ.get('YI_API_KEY', '')
             if key is None:
                 key = env_key
-        assert isinstance(key, str) and key.startswith('sk-'), (
-            f'Illegal openai_key {key}. '
-            'Please set the environment variable OPENAI_API_KEY to your openai key. '
-        )
+        elif 'internvl2-pro' in model:
+            env_key = os.environ.get('InternVL2_PRO_KEY', '')
+            if key is None:
+                key = env_key
+        elif 'abab' in model:
+            env_key = os.environ.get('MiniMax_API_KEY', '')
+            if key is None:
+                key = env_key
+        else:
+            if use_azure:
+                env_key = os.environ.get('AZURE_OPENAI_API_KEY', None)
+                assert env_key is not None, 'Please set the environment variable AZURE_OPENAI_API_KEY. '
+                if key is None:
+                    key = env_key
+                assert isinstance(key, str), (
+                    'Please set the environment variable AZURE_OPENAI_API_KEY to your openai key. '
+                )
+            else:
+                env_key = os.environ.get('OPENAI_API_KEY', '')
+                if key is None:
+                    key = env_key
+                assert isinstance(key, str) and key.startswith('sk-'), (
+                    f'Illegal openai_key {key}. '
+                    'Please set the environment variable OPENAI_API_KEY to your openai key. '
+                )
         self.key = key
         assert img_size > 0 or img_size == -1
         self.img_size = img_size

@@ -75,30 +100,46 @@ class OpenAIWrapper(BaseAPI):
         super().__init__(wait=wait, retry=retry, system_prompt=system_prompt, verbose=verbose, **kwargs)

-        if api_base is None:
-            if 'OPENAI_API_BASE' in os.environ and os.environ['OPENAI_API_BASE'] != '':
-                self.logger.error('Environment variable OPENAI_API_BASE is set. Will use it as api_base. ')
-                api_base = os.environ['OPENAI_API_BASE']
-            else:
-                api_base = 'OFFICIAL'
-
-        assert api_base is not None
-
-        if api_base in APIBASES:
-            self.api_base = APIBASES[api_base]
-        elif api_base.startswith('http'):
-            self.api_base = api_base
-        else:
-            self.logger.error('Unknown API Base. ')
-            sys.exit(-1)
+        if use_azure:
+            api_base_template = (
+                '{endpoint}openai/deployments/{deployment_name}/chat/completions?api-version={api_version}'
+            )
+            endpoint = os.getenv('AZURE_OPENAI_ENDPOINT', None)
+            assert endpoint is not None, 'Please set the environment variable AZURE_OPENAI_ENDPOINT. '
+            deployment_name = os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME', None)
+            assert deployment_name is not None, 'Please set the environment variable AZURE_OPENAI_DEPLOYMENT_NAME. '
+            api_version = os.getenv('OPENAI_API_VERSION', None)
+            assert api_version is not None, 'Please set the environment variable OPENAI_API_VERSION. '
+
+            self.api_base = api_base_template.format(
+                endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
+                deployment_name=os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME'),
+                api_version=os.getenv('OPENAI_API_VERSION')
+            )
+        else:
+            if api_base is None:
+                if 'OPENAI_API_BASE' in os.environ and os.environ['OPENAI_API_BASE'] != '':
+                    self.logger.info('Environment variable OPENAI_API_BASE is set. Will use it as api_base. ')
+                    api_base = os.environ['OPENAI_API_BASE']
+                else:
+                    api_base = 'OFFICIAL'
+
+            assert api_base is not None
+
+            if api_base in APIBASES:
+                self.api_base = APIBASES[api_base]
+            elif api_base.startswith('http'):
+                self.api_base = api_base
+            else:
+                self.logger.error('Unknown API Base. ')
+                raise NotImplementedError

         self.logger.info(f'Using API Base: {self.api_base}; API Key: {self.key}')

     # inputs can be a lvl-2 nested list: [content1, content2, content3, ...]
     # content can be a string or a list of image & text
-    def prepare_inputs(self, inputs):
-        input_msgs = []
-        if self.system_prompt is not None:
-            input_msgs.append(dict(role='system', content=self.system_prompt))
+    def prepare_itlist(self, inputs):
+        assert np.all([isinstance(x, dict) for x in inputs])
         has_images = np.sum([x['type'] == 'image' for x in inputs])
         if has_images:
             content_list = []

@@ -111,11 +152,24 @@ class OpenAIWrapper(BaseAPI):
                     b64 = encode_image_to_base64(img, target_size=self.img_size)
                     img_struct = dict(url=f'data:image/jpeg;base64,{b64}', detail=self.img_detail)
                     content_list.append(dict(type='image_url', image_url=img_struct))
-            input_msgs.append(dict(role='user', content=content_list))
         else:
             assert all([x['type'] == 'text' for x in inputs])
             text = '\n'.join([x['value'] for x in inputs])
-            input_msgs.append(dict(role='user', content=text))
+            content_list = [dict(type='text', text=text)]
+        return content_list
+
+    def prepare_inputs(self, inputs):
+        input_msgs = []
+        if self.system_prompt is not None:
+            input_msgs.append(dict(role='system', content=self.system_prompt))
+        assert isinstance(inputs, list) and isinstance(inputs[0], dict)
+        assert np.all(['type' in x for x in inputs]) or np.all(['role' in x for x in inputs]), inputs
+        if 'role' in inputs[0]:
+            assert inputs[-1]['role'] == 'user', inputs[-1]
+            for item in inputs:
+                input_msgs.append(dict(role=item['role'], content=self.prepare_itlist(item['content'])))
+        else:
+            input_msgs.append(dict(role='user', content=self.prepare_itlist(inputs)))
         return input_msgs

     def generate_inner(self, inputs, **kwargs) -> str:

@@ -123,17 +177,24 @@ class OpenAIWrapper(BaseAPI):
         temperature = kwargs.pop('temperature', self.temperature)
         max_tokens = kwargs.pop('max_tokens', self.max_tokens)

-        context_window = GPT_context_window(self.model)
-        max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
-        if 0 < max_tokens <= 100:
-            self.logger.warning(
-                'Less than 100 tokens left, '
-                'may exceed the context window with some additional meta symbols. '
-            )
-        if max_tokens <= 0:
-            return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
+        # context_window = GPT_context_window(self.model)
+        # new_max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
+        # if 0 < new_max_tokens <= 100 and new_max_tokens < max_tokens:
+        #     self.logger.warning(
+        #         'Less than 100 tokens left, '
+        #         'may exceed the context window with some additional meta symbols. '
+        #     )
+        # if new_max_tokens <= 0:
+        #     return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
+        # max_tokens = new_max_tokens

-        headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {self.key}'}
+        # Will send request if use Azure, dk how to use openai client for it
+        if self.use_azure:
+            headers = {'Content-Type': 'application/json', 'api-key': self.key}
+        elif 'internvl2-pro' in self.model:
+            headers = {'Content-Type': 'application/json', 'Authorization': self.key}
+        else:
+            headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {self.key}'}
         payload = dict(
             model=self.model,
             messages=input_msgs,

@@ -141,34 +202,62 @@ class OpenAIWrapper(BaseAPI):
             n=1,
             temperature=temperature,
             **kwargs)
-        response = requests.post(self.api_base, headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
+        response = requests.post(
+            self.api_base,
+            headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
         ret_code = response.status_code
         ret_code = 0 if (200 <= int(ret_code) < 300) else ret_code
         answer = self.fail_msg
         try:
             resp_struct = json.loads(response.text)
             answer = resp_struct['choices'][0]['message']['content'].strip()
-        except:
-            pass
+        except Exception as err:
+            if self.verbose:
+                self.logger.error(f'{type(err)}: {err}')
+                self.logger.error(response.text if hasattr(response, 'text') else response)
         return ret_code, answer, response

+    def get_image_token_len(self, img_path, detail='low'):
+        import math
+        if detail == 'low':
+            return 85
+
+        im = Image.open(img_path)
+        height, width = im.size
+        if width > 1024 or height > 1024:
+            if width > height:
+                height = int(height * 1024 / width)
+                width = 1024
+            else:
+                width = int(width * 1024 / height)
+                height = 1024
+
+        h = math.ceil(height / 512)
+        w = math.ceil(width / 512)
+        total = 85 + 170 * h * w
+        return total
+
     def get_token_len(self, inputs) -> int:
         import tiktoken
         try:
             enc = tiktoken.encoding_for_model(self.model)
-        except:
-            enc = tiktoken.encoding_for_model('gpt-4')
+        except Exception as err:
+            if 'gpt' in self.model.lower():
+                if self.verbose:
+                    self.logger.warning(f'{type(err)}: {err}')
+                enc = tiktoken.encoding_for_model('gpt-4')
+            else:
+                return 0
         assert isinstance(inputs, list)
         tot = 0
         for item in inputs:
-            if item['type'] == 'text':
+            if 'role' in item:
+                tot += self.get_token_len(item['content'])
+            elif item['type'] == 'text':
                 tot += len(enc.encode(item['value']))
             elif item['type'] == 'image':
-                tot += 85
-                if self.img_detail == 'high':
-                    img = Image.open(item['value'])
-                    npatch = np.ceil(img.size[0] / 512) * np.ceil(img.size[1] / 512)
-                    tot += npatch * 170
+                tot += self.get_image_token_len(item['value'], detail=self.img_detail)
         return tot

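A worked example of the accounting `get_image_token_len` implements: `low` detail costs a flat 85 tokens; `high` detail first caps the long side at 1024 px, tiles the result into 512-px patches, and charges 85 plus 170 per patch. (The function reads `height, width = im.size` although PIL's `size` is `(width, height)`; the swap is harmless here because only the patch-count product is used.) A standalone sketch of the same formula:
```python
import math

def image_tokens(width: int, height: int, detail: str = 'high') -> int:
    # Mirrors get_image_token_len above: flat cost for low detail,
    # per-512px-patch cost after capping the long side at 1024.
    if detail == 'low':
        return 85
    if max(width, height) > 1024:
        scale = 1024 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    patches = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * patches

print(image_tokens(2048, 1024))  # 1024x512 after rescale -> 2 patches -> 425
```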

@@ -1,90 +0,0 @@
import json
import warnings
import requests
from ..smp import *
from .gpt import GPT_context_window, OpenAIWrapper
url = 'http://ecs.sv.us.alles-apin.openxlab.org.cn/v1/openai/v2/text/chat'
headers = {
'Content-Type': 'application/json'
}
class OpenAIWrapperInternal(OpenAIWrapper):
is_api: bool = True
def __init__(self,
model: str = 'gpt-3.5-turbo-0613',
retry: int = 5,
wait: int = 3,
verbose: bool = True,
system_prompt: str = None,
temperature: float = 0,
timeout: int = 60,
max_tokens: int = 1024,
img_size: int = 512,
img_detail: str = 'low',
**kwargs):
self.model = model
if 'KEYS' in os.environ and osp.exists(os.environ['KEYS']):
keys = load(os.environ['KEYS'])
headers['alles-apin-token'] = keys.get('alles-apin-token', '')
elif 'ALLES' in os.environ:
headers['alles-apin-token'] = os.environ['ALLES']
self.headers = headers
self.temperature = temperature
self.timeout = timeout
self.max_tokens = max_tokens
assert img_size > 0 or img_size == -1
self.img_size = img_size
assert img_detail in ['high', 'low']
self.img_detail = img_detail
super(OpenAIWrapper, self).__init__(
wait=wait, retry=retry, system_prompt=system_prompt, verbose=verbose, **kwargs)
def generate_inner(self, inputs, **kwargs) -> str:
input_msgs = self.prepare_inputs(inputs)
temperature = kwargs.pop('temperature', self.temperature)
max_tokens = kwargs.pop('max_tokens', self.max_tokens)
# Held out 100 tokens as buffer
context_window = GPT_context_window(self.model)
max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
if 0 < max_tokens <= 100:
print('Less than 100 tokens left, may exceed the context window with some additional meta symbols. ')
if max_tokens <= 0:
return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
payload = dict(
model=self.model,
messages=input_msgs,
max_tokens=max_tokens,
n=1,
stop=None,
timeout=self.timeout,
temperature=temperature,
**kwargs)
response = requests.post(url, headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
ret_code = response.status_code
ret_code = 0 if (200 <= int(ret_code) < 300) else ret_code
answer = self.fail_msg
try:
resp_struct = json.loads(response.text)
assert resp_struct['msg'] == 'ok' and resp_struct['msgCode'] == '10000', resp_struct
answer = resp_struct['data']['choices'][0]['message']['content'].strip()
except:
pass
return ret_code, answer, response
class GPT4V_Internal(OpenAIWrapperInternal):
def generate(self, message, dataset=None):
return super(GPT4V_Internal, self).generate(message)


@@ -2,18 +2,19 @@ from vlmeval.vlm import *
 from vlmeval.api import *
 from functools import partial

-ungrouped = {
-    'MiniCPM-V':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
-    'MiniCPM-V-2':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
-    'MiniCPM-Llama3-V-2_5':partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
+minicpm_series = {
+    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
+    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
+    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
+    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
+    'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
 }

 supported_VLM = {}

 model_groups = [
-    ungrouped
+    minicpm_series
 ]

 for grp in model_groups:
     supported_VLM.update(grp)


@@ -0,0 +1,237 @@
import warnings
from .image_base import img_root_map, ImageBaseDataset
from .image_caption import ImageCaptionDataset
from .image_yorn import ImageYORNDataset
from .image_mcq import (
ImageMCQDataset, MMMUDataset, CustomMCQDataset, MUIRDataset, GMAIMMBenchDataset, MMERealWorld, HRBenchDataset,
NaturalBenchDataset
)
from .image_mt import MMDUDataset
from .image_vqa import (
ImageVQADataset, MathVision, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
CustomVQADataset, CRPE, MathVerse, OlympiadBench, QSpatial, VizWiz, MMNIAH, WeMath, LogicVista
)
from .image_ccocr import CCOCRDataset
from .text_mcq import CustomTextMCQDataset, TextMCQDataset
from .vcr import VCRDataset
from .mmlongbench import MMLongBench
from .dude import DUDE
from .slidevqa import SlideVQA
from .vl_rewardbench import VLRewardBench
from .mmbench_video import MMBenchVideo
from .videomme import VideoMME
from .mvbench import MVBench, MVBench_MP4
from .mlvu import MLVU, MLVU_MCQ, MLVU_OpenEnded
from .tempcompass import TempCompass, TempCompass_Captioning, TempCompass_MCQ, TempCompass_YorN
from .longvideobench import LongVideoBench
from .video_concat_dataset import ConcatVideoDataset
from .mmgenbench import MMGenBench
from .cgbench import CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
from .miabench import MIABench
from .cmmmu import CMMMU
from .wildvision import WildVision
from .mmmath import MMMath
from .dynamath import Dynamath
from .utils import *
from .video_dataset_config import *
from ..smp import *
class ConcatDataset(ImageBaseDataset):
# This dataset takes multiple dataset names as input and aggregate them into a single dataset.
# Each single dataset should not have a field named `SUB_DATASET`
DATASET_SETS = {
'MMMB': ['MMMB_ar', 'MMMB_cn', 'MMMB_en', 'MMMB_pt', 'MMMB_ru', 'MMMB_tr'],
'MTL_MMBench_DEV': [
'MMBench_dev_ar', 'MMBench_dev_cn', 'MMBench_dev_en',
'MMBench_dev_pt', 'MMBench_dev_ru', 'MMBench_dev_tr'
]
}
def __init__(self, dataset):
datasets = self.DATASET_SETS[dataset]
self.dataset_map = {}
# The name of the compilation
self.dataset_name = dataset
self.datasets = datasets
for dname in datasets:
dataset = build_dataset(dname)
assert dataset is not None, dataset
self.dataset_map[dname] = dataset
TYPES = [x.TYPE for x in self.dataset_map.values()]
MODALITIES = [x.MODALITY for x in self.dataset_map.values()]
assert np.all([x == TYPES[0] for x in TYPES]), (datasets, TYPES)
assert np.all([x == MODALITIES[0] for x in MODALITIES]), (datasets, MODALITIES)
self.TYPE = TYPES[0]
self.MODALITY = MODALITIES[0]
data_all = []
for dname in datasets:
data = self.dataset_map[dname].data
data['SUB_DATASET'] = [dname] * len(data)
data_new = localize_df(data, dname, nproc=16)
data_all.append(data_new)
data = pd.concat(data_all)
data['original_index'] = data.pop('index')
data['index'] = np.arange(len(data))
self.data = data
def build_prompt(self, line):
if isinstance(line, int):
line = self.data.iloc[line]
idx = line['original_index']
dname = line['SUB_DATASET']
org_data = self.dataset_map[dname].data
org_line = cp.deepcopy(org_data[org_data['index'] == idx]).iloc[0]
return self.dataset_map[dname].build_prompt(org_line)
def dump_image(self, line):
# Assert all images are pre-dumped
assert 'image' not in line
assert 'image_path' in line
tgt_path = toliststr(line['image_path'])
return tgt_path
@classmethod
def supported_datasets(cls):
return list(cls.DATASET_SETS)
def evaluate(self, eval_file, **judge_kwargs):
suffix = eval_file.split('.')[-1]
# First, split the eval_file by dataset
data_all = load(eval_file)
for dname in self.datasets:
tgt = eval_file.replace(self.dataset_name, dname)
data_sub = data_all[data_all['SUB_DATASET'] == dname]
data_sub.pop('index')
data_sub['index'] = data_sub.pop('original_index')
data_sub.pop('SUB_DATASET')
dump(data_sub, tgt)
# Then, evaluate each dataset separately
results_all = []
for dname in self.datasets:
tgt = eval_file.replace(self.dataset_name, dname)
res = self.dataset_map[dname].evaluate(tgt, **judge_kwargs)
assert isinstance(res, pd.DataFrame)
res['DATASET'] = [dname] * len(res)
results_all.append(res)
result = pd.concat(results_all)
score_file = eval_file.replace(f'.{suffix}', '_acc.csv')
dump(result, score_file)
return result
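# Illustrative note (not in the original file): for the 'MMMB' collection,
# evaluate() writes one prediction file per sub-dataset (MMMB_ar, ..., MMMB_tr),
# scores each with that sub-dataset's own evaluate(), tags every result row with
# a DATASET column, and concatenates everything into a single *_acc.csv.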
# Add new supported dataset class here
IMAGE_DATASET = [
ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset, MathVision,
MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
MMLongBench, VCRDataset, MMDUDataset, DUDE, SlideVQA, MUIRDataset, CCOCRDataset,
GMAIMMBenchDataset, MMERealWorld, HRBenchDataset, CRPE, MathVerse, NaturalBenchDataset,
MIABench, OlympiadBench, WildVision, MMMath, QSpatial, Dynamath, MMGenBench, VizWiz, MMNIAH,
CMMMU, VLRewardBench, WeMath, LogicVista
]
VIDEO_DATASET = [
MMBenchVideo, VideoMME, MVBench, MVBench_MP4, LongVideoBench,
MLVU, MLVU_MCQ, MLVU_OpenEnded,
TempCompass, TempCompass_MCQ, TempCompass_Captioning, TempCompass_YorN,
CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
]
TEXT_DATASET = [
TextMCQDataset
]
CUSTOM_DATASET = [
CustomMCQDataset, CustomVQADataset, CustomTextMCQDataset
]
DATASET_COLLECTION = [ConcatDataset, ConcatVideoDataset]
DATASET_CLASSES = IMAGE_DATASET + VIDEO_DATASET + TEXT_DATASET + CUSTOM_DATASET + DATASET_COLLECTION
SUPPORTED_DATASETS = []
for DATASET_CLS in DATASET_CLASSES:
SUPPORTED_DATASETS.extend(DATASET_CLS.supported_datasets())
def DATASET_TYPE(dataset, *, default: str = 'MCQ') -> str:
for cls in DATASET_CLASSES:
if dataset in cls.supported_datasets():
if hasattr(cls, 'TYPE'):
return cls.TYPE
# ConcatDataset needs special handling: its TYPE is derived from its sub-datasets
if dataset in ConcatDataset.DATASET_SETS:
dataset_list = ConcatDataset.DATASET_SETS[dataset]
TYPES = [DATASET_TYPE(dname) for dname in dataset_list]
assert np.all([x == TYPES[0] for x in TYPES]), (dataset_list, TYPES)
return TYPES[0]
if 'openended' in dataset.lower():
return 'VQA'
warnings.warn(f'Dataset {dataset} is a custom one and not annotated as `openended`, will treat as {default}. ')
return default
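# Hypothetical resolution examples for DATASET_TYPE (not in the original file;
# the dataset names below are made up):
#   a name registered by one of the classes above resolves to that class's TYPE;
#   DATASET_TYPE('my_openended_set') -> 'VQA'  (name heuristic, no registry hit)
#   DATASET_TYPE('my_custom_set')    -> 'MCQ'  (falls through to the default, with a warning)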
def DATASET_MODALITY(dataset, *, default: str = 'IMAGE') -> str:
if dataset is None:
warnings.warn(f'Dataset is not specified, will treat modality as {default}. ')
return default
for cls in DATASET_CLASSES:
if dataset in cls.supported_datasets():
if hasattr(cls, 'MODALITY'):
return cls.MODALITY
# ConcatDataset needs special handling: its MODALITY is derived from its sub-datasets
if dataset in ConcatDataset.DATASET_SETS:
dataset_list = ConcatDataset.DATASET_SETS[dataset]
MODALITIES = [DATASET_MODALITY(dname) for dname in dataset_list]
assert np.all([x == MODALITIES[0] for x in MODALITIES]), (dataset_list, MODALITIES)
return MODALITIES[0]
if 'video' in dataset.lower():
return 'VIDEO'
elif 'image' in dataset.lower():
return 'IMAGE'
warnings.warn(f'Dataset {dataset} is a custom one, will treat modality as {default}. ')
return default
def build_dataset(dataset_name, **kwargs):
# Registered video datasets take precedence over the dataset-class registry
if dataset_name in supported_video_datasets:
return supported_video_datasets[dataset_name](**kwargs)
for cls in DATASET_CLASSES:
if dataset_name in cls.supported_datasets():
return cls(dataset=dataset_name, **kwargs)
warnings.warn(f'Dataset {dataset_name} is not officially supported. ')
data_file = osp.join(LMUDataRoot(), f'{dataset_name}.tsv')
if not osp.exists(data_file):
warnings.warn(f'Data file {data_file} does not exist. Dataset building failed. ')
return None
data = load(data_file)
if 'question' not in [x.lower() for x in data.columns]:
warnings.warn(f'Data file {data_file} does not have a `question` column. Dataset building failed. ')
return None
if 'A' in data and 'B' in data:
if 'image' in data or 'image_path' in data:
warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom MCQ dataset. ')
return CustomMCQDataset(dataset=dataset_name, **kwargs)
else:
warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom Text MCQ dataset. ')
return CustomTextMCQDataset(dataset=dataset_name, **kwargs)
else:
warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom VQA dataset. ')
return CustomVQADataset(dataset=dataset_name, **kwargs)
__all__ = [
'build_dataset', 'img_root_map', 'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
] + [cls.__name__ for cls in DATASET_CLASSES]
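Taken together, these registries let a caller go from a bare dataset name to a ready dataset object. A minimal sketch of that flow, assuming this file ships as the `vlmeval.dataset` package `__init__` (the import path is an assumption based on the `from ..smp import *` layout; 'MMMB' is one of the collections registered above):

from vlmeval.dataset import build_dataset, DATASET_MODALITY

dataset = build_dataset('MMMB')        # a ConcatDataset over the six MMMB splits
print(dataset.TYPE, dataset.MODALITY)  # shared TYPE / MODALITY of the sub-datasets
msgs = dataset.build_prompt(0)         # routed to the owning sub-dataset's build_prompt
print(DATASET_MODALITY('MMMB'))        # 'IMAGE'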

File diff suppressed because it is too large


@@ -0,0 +1,354 @@
from .image_base import ImageBaseDataset
import random
from collections import Counter
import os
import re
import tempfile
from ..smp import *
def get_multi_choice_prediction(response, all_choices, index2ans):
for char in [',', '.', '!', '?', ';', ':', "'"]:
response = response.strip(char)
response = " " + response + " " # add space to avoid partial match
candidates = []
for choice in all_choices: # (A) (B) (C) (D)
# Add the choice to candidates each time it appears in the response
candidates.extend([choice for _ in range(response.count(f'({choice})'))])
if len(candidates) == 0:
for choice in all_choices: # A B C D
# Similarly, add the choice for each occurrence
candidates.extend([choice for _ in range(response.count(f'{choice}'))])
if len(candidates) == 0 and len(response.split()) >= 1:
for index, ans in index2ans.items():
# Add index for each occurrence of ans in response
candidates.extend([index for _ in range(response.count(ans))])
# if all of the above found nothing, fall back to substring matching of the answer contents
if len(candidates) == 0 and len(response.split()) >= 1:
for index, ans in index2ans.items():
if ans in response:
candidates.append(index)
# index_ans = False # it's content ans.
if len(candidates) == 0: # still no answer found, choose one at random
return random.choice(all_choices)
# return ''
else:
# Count the occurrence of each candidate
candidate_counts = Counter(candidates)
# Select the most frequent candidates
max_count = max(candidate_counts.values())
most_frequent_candidates = [c for c in all_choices if candidate_counts.get(c, 0) == max_count]
# Combine the most frequent candidates in ABCD order
return ''.join(most_frequent_candidates)
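# Illustrative check (not in the original file): repeated mentions are counted,
# and ties are emitted in option order, which suits multi-select answers like 'BC'.
#   get_multi_choice_prediction('先排除(A),在(B)和(C)之间,最终(C)',
#                               ['A', 'B', 'C', 'D'],
#                               {'A': '甲', 'B': '乙', 'C': '丙', 'D': '丁'})  # -> 'C'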
def extract_numbers(string):
# Pattern for numbers with Chinese commas
pattern_commas = r'-?\d{1,3}(?:,\d{3})+'
# Pattern for scientific notation
pattern_scientific = r'-?\d+(?:\.\d+)?[eE][+-]?\d+'
# Pattern for simple numbers without Chinese commas
pattern_simple = r'-?(?:\d+\.\d+|\.\d+|\d+)(?![eE][+-]?\d+)(?!,\d)'
# Extract numbers with Chinese commas
numbers_with_commas = re.findall(pattern_commas, string)
# Extract numbers in scientific notation
numbers_scientific = re.findall(pattern_scientific, string)
# Extract simple numbers without Chinese commas
numbers_simple = re.findall(pattern_simple, string)
# Combine all extracted numbers
all_numbers = numbers_with_commas + numbers_scientific + numbers_simple
return all_numbers
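# Illustrative expectations (not in the original file); note the three patterns
# are applied independently, so comma-grouped or scientific numbers can also
# surface extra fragments via the 'simple' pattern.
#   extract_numbers('答案是0.25')  -> ['0.25']
#   extract_numbers('共 12 个')    -> ['12']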
def check_is_number(string):
try:
float(string.replace(',', ''))
return True
except ValueError:
# not parseable as a plain number
return False
def count_letters(string):
return sum(('a' <= c <= 'z') or ('A' <= c <= 'Z') for c in string)
def normalize_str(string, answer):
# If the string is numeric, normalize it to a rounded float; otherwise keep it
# as free text unless it is far longer than the reference answer.
if string is None:
return [string]
string = string.strip()
is_number = check_is_number(string)
if is_number:
string = string.replace(',', '')
string = float(string)
# keep 2 decimal places
string = round(string, 2)
return [string]
else: # it's likely to be a string
if len(string) > len(answer) + 20 or count_letters(string) > count_letters(answer) + 2:
return []
return [string]
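# Illustrative behaviour (not in the original file):
#   normalize_str('0.256', '0.25')  -> [0.26]    (numeric: comma-stripped, rounded)
#   normalize_str('红色', '红色')    -> ['红色']   (short free text passes through)
#   normalize_str('The answer to this question is clearly the color red', 'red') -> []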
def get_fill_blank_prediction(response, answer):
"""get the prediction from the generated response,
return a list of predicted strings or numbers"""
def get_key_subresponses(response):
response = response.strip("").strip()
sub_responses = re.split(r'。|\n', response)
indicators_of_keys = ['是', '为', '所以', '等于', '方案', '选择',
'正确答案', '因此', '最后', '答案', '结果']
key_responses = []
for index, resp in enumerate(sub_responses):
# for the last sub-response, also accept '=' as an indicator (the whole response may be a single equation)
if index == len(sub_responses) - 1:
indicators_of_keys.extend(['='])
shortest_key_response = None
# the shortest response that may contain the answer (tail part of the response)
for indicator in indicators_of_keys:
if indicator in resp:
if not shortest_key_response:
shortest_key_response = resp.split(indicator)[-1].strip()
else:
if len(resp.split(indicator)[-1].strip()) < len(shortest_key_response):
shortest_key_response = resp.split(indicator)[-1].strip()
if shortest_key_response:
# and it's not trivial
if shortest_key_response.strip() not in [":", ",", ".", "!", "?", ";", ":", "'"]:
key_responses.append(shortest_key_response)
if len(key_responses) == 0: # did not find any
return [response]
return key_responses
key_responses = get_key_subresponses(response)
pred_list = key_responses.copy() # keep the original string response
for resp in key_responses:
pred_list.extend(extract_numbers(resp))
tmp_pred_list = []
for i in range(len(pred_list)):
tmp_pred_list.extend(normalize_str(pred_list[i], answer))
pred_list = tmp_pred_list
# remove duplicates
pred_list = list(set(pred_list))
return pred_list
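# Illustrative trace (not in the original file), assuming the indicator list above:
#   get_fill_blank_prediction('先化简,所以最终结果是0.50', '0.5')
#   keeps the shortest tail following an indicator ('0.50'), appends the numbers
#   extracted from it, and normalizes -> a candidate list containing 0.5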
def get_TF_prediction(response):
"""get the prediction from the generated response,
return a list of predicted strings or numbers"""
def get_key_subresponses(response):
response = response.strip("").strip()
sub_responses = re.split(r'。|\n', response)
indicators_of_keys = ['是', '为', '所以', '判断',
'陈述', '说法', '表达', '答案', '结果']
key_responses = []
for index, resp in enumerate(sub_responses):
shortest_key_response = None
# the shortest response that may contain the answer (tail part of the response)
for indicator in indicators_of_keys:
if indicator in resp:
if not shortest_key_response:
shortest_key_response = resp.split(indicator)[-1].strip()
else:
if len(resp.split(indicator)[-1].strip()) < len(shortest_key_response):
shortest_key_response = resp.split(indicator)[-1].strip()
if shortest_key_response:
# and it's not trivial
if shortest_key_response.strip() not in [":", ",", ".", "!", "?", ";", ":", "'"]:
key_responses.append(shortest_key_response)
if len(key_responses) == 0: # did not find any
return [response]
return key_responses
key_responses = get_key_subresponses(response)
pred_list = key_responses.copy() # keep the original string response
# remove duplicates
pred_list = list(set(pred_list))
return pred_list
class CMMMU(ImageBaseDataset):
TYPE = 'VQA'
DATASET_URL = {
'CMMMU_VAL': 'https://opencompass.openxlab.space/utils/VLMEval/CMMMU_VAL.tsv'
}
DATASET_MD5 = {
'CMMMU_VAL': 'b4727e2fce2415bf646379e60c11a726'
}
def dump_image(self, line):
os.makedirs(self.img_root, exist_ok=True)
tgt_path_z = []
if isinstance(line['image'], list):
for i in range(len(line['image'])):
tgt_path = osp.join(self.img_root, f"{line['index']}--{i + 1}.jpg")
if not read_ok(tgt_path):
decode_base64_to_image_file(line['image'][i], tgt_path)
tgt_path_z.append(tgt_path)
else:
tgt_path = osp.join(self.img_root, f"{line['index']}.jpg")
if not read_ok(tgt_path):
decode_base64_to_image_file(line['image'], tgt_path)
tgt_path_z.append(tgt_path)
return tgt_path_z
@classmethod
def evaluate(cls, eval_file, **judge_kwargs):
suffix = eval_file.split('.')[-1]
result_file = eval_file.replace(f'.{suffix}', '_acc.csv')
if not osp.exists(result_file):
data = load(eval_file)
assert 'answer' in data and 'prediction' in data
data['prediction'] = [str(x) for x in data['prediction']]
data['answer'] = [str(x) for x in data['answer']]
correct_count = 0
correct_category = {
'技术与工程': [0, 0],
'科学': [0, 0],
'健康与医学': [0, 0],
'商业': [0, 0],
'艺术与设计': [0, 0],
'人文社会科学': [0, 0],
}
for _, line in tqdm(data.iterrows(), total=len(data)):
correct_category[line['category']][0] += 1
# Options
if line['type'] == '选择':
index2ans = {
'A': line['option1'],
'B': line['option2'],
'C': line['option3'],
'D': line['option4']
}
fact_option = get_multi_choice_prediction(line['prediction'], ['A', 'B', 'C', 'D'], index2ans)
if fact_option == line['answer']:
correct_count += 1
correct_category[line['category']][1] += 1
# Binary
elif line['type'] == '判断':
positive_keywords = ['正确', '对', '准确', '肯定', '对的']
negative_keywords = ['不对', '错误', '不正确', '不准确', '不合适', '否定', '错的', '错']
ambiguous_keywords = ['对错', '是否正确', '否正确', '或者', '是否', '正确性', '对不']
def judge_similarity(pred_list, positive_keywords, negative_keywords):
positive_count = 0
negative_count = 0
for pred in pred_list:
if any(pos_word in pred for pos_word in positive_keywords):
positive_count += 1
elif any(neg_word in pred for neg_word in negative_keywords):
negative_count += 1
if positive_count > negative_count:
return "对"
elif negative_count > positive_count:
return "错"
else:
return random.choice(['对', '错'])
answer = get_TF_prediction(line['prediction'])
answer = [word for word in answer if not any(ambiguous in word for ambiguous in ambiguous_keywords)]
fact_answer = judge_similarity(answer, positive_keywords, negative_keywords)
if fact_answer == line['answer']:
correct_count += 1
correct_category[line['category']][1] += 1
# Fill-in-the-blank
else:
norm_answers = normalize_str(line['answer'], line['answer'])
predicted_answer = get_fill_blank_prediction(line['prediction'], line['answer'])
for pred in predicted_answer:
# already normalized
if isinstance(pred, str): # if it's a string, then find if ans in the pred_i
for norm_ans in norm_answers:
# only see if the string answer in the string pred
# print(norm_ans, pred)
if isinstance(norm_ans, str) and norm_ans in pred:
correct_count += 1
correct_category[line['category']][1] += 1
else: # it's a number
if pred in norm_answers:
correct_count += 1
correct_category[line['category']][1] += 1
accuracyz = {}
accuracyz['总准确率'] = correct_count / len(data)
for i in correct_category.keys():
accuracyz[i] = correct_category[i][1] / correct_category[i][0]
accuracyz = d2df(accuracyz)
accuracyz = accuracyz.round(10)
dump(accuracyz, result_file)
result = pd.read_csv(result_file)
return result
def build_prompt(self, line):
if line['type'] == '选择':
tgt_path = self.dump_image(line)
question = line['question']
options_prompt = 'Options:\n'
for i in [['A', '1'], ['B', '2'], ['C', '3'], ['D', '4']]:
options_prompt += i[0] + '. ' + line['option' + i[1]] + '\n'
prompt = (f'问题: {question}\n' + options_prompt
+ '请回答上述多项选择题,并选出正确选项。这些题目可能包括单选和多选题型。如果所提供的信息不足以确定一个明确的答案,那么请根据可用的数据和你的判断来选择最可能正确的选项。')
msgs = []
if isinstance(tgt_path, list):
msgs.extend([dict(type='image', value=p) for p in tgt_path])
else:
msgs = [dict(type='image', value=tgt_path)]
msgs.append(dict(type='text', value=prompt))
return msgs
elif line['type'] == '判断':
msgs = super().build_prompt(line)
assert msgs[-1]['type'] == 'text'
msgs[-1]['value'] += '\n请回答上述判断题,并根据题目描述和所给的信息来判断问题中陈述的对错。如果信息不完整或不足以作出绝对判断,请运用你的逻辑推理和现有信息来做出最可能的判断。'
return msgs
else:
msgs = super().build_prompt(line)
assert msgs[-1]['type'] == 'text'
msgs[-1]['value'] += '\n请回答上述填空题,并根据题目的要求和所提供的信息来给出最恰当的答案。如果信息不足以确切回答,那么请依据现有的数据和你的推理能力来填写最合理的答案。'
return msgs
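For orientation, a hedged sketch of how CMMMU_VAL is typically exercised end to end; `build_dataset` comes from the package `__init__` shown earlier, and the prediction-file name is illustrative:

from vlmeval.dataset import build_dataset

dataset = build_dataset('CMMMU_VAL')               # downloads/validates the TSV on first use
msgs = dataset.build_prompt(dataset.data.iloc[0])  # image message(s) + Chinese instruction
# ... run a VLM over msgs for every row, saving 'prediction' alongside the
# 'answer', 'type' and 'category' columns, e.g. as CMMMU_VAL_mymodel.xlsx ...
acc = dataset.evaluate('CMMMU_VAL_mymodel.xlsx')   # per-category accuracy plus '总准确率'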


@@ -0,0 +1,211 @@
import math
from typing import List
from .utils.judge_util import build_judge
from .image_base import ImageBaseDataset
from .mmlongbench import concat_images, MMLongBench_auxeval, anls_compute
from ..smp import *
FAIL_MSG = 'Failed to obtain answer via API.'
def DUDE_acc(result_file):
data = load(result_file)
overall_score = 0.0
score_list = list()
for i in range(len(data)):
item = data.iloc[i]
if isinstance(item['answer'], float) and math.isnan(item['answer']):
item['answer'] = 'Not answerable'
item['answer'] = item['answer'].lower()
item['pred'] = item['pred'].lower()
score = anls_compute(item['answer'], item['pred'])
score_list.append(score)
overall_score += score
data['score'] = score_list
dump(data, result_file)
res = dict()
res['category'], res['num'], res['avg_score'] = ['anls'], [len(data)], [overall_score / len(data)]
res = pd.DataFrame(res)
return res
class DUDE(ImageBaseDataset):
TYPE = 'VQA'
DATASET_URL = {
'DUDE': 'https://opencompass.openxlab.space/utils/VLMEval/DUDE.tsv',
'DUDE_MINI': 'https://opencompass.openxlab.space/utils/VLMEval/DUDE_MINI.tsv',
}
DATASET_MD5 = {
'DUDE': '130d860d08206e1e407cd77150c10d88',
'DUDE_MINI': 'e0c0d998114f0cca7516d12039d2b538',
}
SUPPORTED_MODELS = {
'GPT4': (1, 1),
'GPT4V': (1, 1),
'GPT4V_HIGH': (1, 1),
'GPT4o': (1, 1),
'GPT4o_HIGH': (1, 1),
'GPT4o_MINI': (1, 1),
'XComposer2d5': (1, -1),
'XComposer2_4KHD': (1, -1),
'MiniCPM-Llama3-V-2_5': (1, 5),
'InternVL-Chat-V1-5': (5, 2),
}
def __init__(self, dataset, **kwargs):
self.model_list = list(self.SUPPORTED_MODELS.keys())
model_name = kwargs['model']
if not listinstr(self.model_list, model_name):
raise AssertionError("{} doesn't support the evaluation on DUDE.".format(model_name))
super(DUDE, self).__init__(dataset)
self.is_api = listinstr(['GPT4'], model_name)
self.max_pages = 120
concat_num, column_num = self.SUPPORTED_MODELS.get(model_name)
self.concat_num = concat_num
self.column_num = column_num
def prepare_tsv(self, url, file_md5=None):
data_root = LMUDataRoot()
os.makedirs(data_root, exist_ok=True)
file_name = url.split('/')[-1]
data_path = osp.join(data_root, file_name)
if osp.exists(data_path) and (file_md5 is None or md5(data_path) == file_md5):
pass
else:
warnings.warn('The dataset tsv is missing or outdated, downloading it now. ')
download_file(url, data_path)
return load(data_path)
def dump_image(self, origin_line):
os.makedirs(self.img_root, exist_ok=True)
try:
import fitz
except Exception as e:
logging.critical(f'{type(e)}: {e}')
logging.critical('Please use `pip install pymupdf` to parse PDF files.')
line = origin_line.copy()
if not isinstance(line['image_path'], List):
line['image_path'] = [line['image_path']]
line['image_path'] = line['image_path'][:self.max_pages]
skip_pdf_parse = True
for im_name in line['image_path']:
path = osp.join(self.img_root, im_name)
if not read_ok(path):
skip_pdf_parse = False
break
# Keep line['image'] aligned with line['image_path'] so the zip() loop below stays valid
if skip_pdf_parse:
line['image'] = line['image_path']
else:
pdf_data = base64.b64decode(line['image'])
pdf_file = io.BytesIO(pdf_data)
encoded_images = []
with fitz.open(stream=pdf_file, filetype='pdf') as doc:
doc = doc[:self.max_pages]
for page in doc:
image = page.get_pixmap(dpi=144)
image_file = io.BytesIO(image.tobytes(output='png'))
image = Image.open(image_file)
encoded_image = encode_image_to_base64(image)
encoded_images.append(encoded_image)
line['image'] = encoded_images
print('process {}'.format(line['doc_id']))
if 'image' in line:
if isinstance(line['image'], list):
tgt_path = []
assert 'image_path' in line
for img, im_name in zip(line['image'], line['image_path']):
path = osp.join(self.img_root, im_name)
if not read_ok(path):
decode_base64_to_image_file(img, path)
tgt_path.append(path)
else:
tgt_path = osp.join(self.img_root, f"{line['index']}.jpg")
if not read_ok(tgt_path):
decode_base64_to_image_file(line['image'], tgt_path)
tgt_path = [tgt_path]
else:
assert 'image_path' in line
tgt_path = toliststr(line['image_path'])
if self.concat_num > 0 and not self.is_api:
concatenated_images = concat_images(tgt_path, max_concat=self.concat_num, column_num=self.column_num)
old_tgt_path = tgt_path
assert isinstance(old_tgt_path, list)
if self.column_num != -1:
tgt_path = [
'_'.join(old_tgt_path[0].split('_')[:-1]) + '_concat{}_{}.jpg'.format(self.concat_num, i)
for i in range(len(concatenated_images))
]
else:
tgt_path = ['_'.join(old_tgt_path[0].split('_')[:-1]) + '_concat_all.jpg']
for path, concatenated_image in zip(tgt_path, concatenated_images):
if not read_ok(path):
decode_base64_to_image_file(encode_image_to_base64(concatenated_image), path)
num_images, image_size = len(old_tgt_path), concatenated_image.size
print('Concatenated {} images into one of size {}, saved at {}'.format(num_images, image_size, path))
return tgt_path
@classmethod
def evaluate(cls, eval_file, **judge_kwargs):
logger = get_logger('Evaluation')
model = judge_kwargs['model']
suffix = eval_file.split('.')[-1]
storage = eval_file.replace(f'.{suffix}', f'_{model}.xlsx')
tmp_file = eval_file.replace(f'.{suffix}', f'_{model}.pkl')
if osp.exists(storage):
logger.warning(f'GPT scoring file {storage} already exists, will reuse it in DUDE_eval. ')
else:
data = load(eval_file)
model = build_judge(max_tokens=128, **judge_kwargs)
lt = len(data)
lines = [data.iloc[i] for i in range(lt)]
tups = [(model, line) for line in lines]
indices = [line['index'] for line in lines]
ans = {}
if osp.exists(tmp_file):
ans = load(tmp_file)
tups = [x for x, i in zip(tups, indices) if i not in ans]
indices = [i for i in indices if i not in ans]
if len(indices):
new_results = list()
for model, line in tqdm(tups):
res = MMLongBench_auxeval(model, line)
new_results.append(res)
log_map, res_map, pred_map = {}, {}, {}
# merge fresh results with any answers cached in the tmp file
ans.update(dict(zip(indices, new_results)))
for k, v in ans.items():
log_map[k] = v['log']
res_map[k] = v['res']
pred_map[k] = v['pred']
data['res'] = [res_map[idx] for idx in data['index']]
data['log'] = [log_map[idx] for idx in data['index']]
data['pred'] = [pred_map[idx] for idx in data['index']]
dump(data, storage)
score = DUDE_acc(storage)
score_pth = storage.replace('.xlsx', '_score.csv')
dump(score, score_pth)
logger.info(f'DUDE successfully finished evaluating {eval_file}, results saved in {score_pth}')
logger.info('Score: ')
logger.info(score)
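And a matching sketch for DUDE; unlike most datasets here, the constructor requires a `model` kwarg so it can pick the page-concatenation layout, and `evaluate` drives an LLM judge via `judge_kwargs` (the model and file names below are illustrative):

from vlmeval.dataset import build_dataset

dataset = build_dataset('DUDE_MINI', model='GPT4o')  # (concat_num, column_num) chosen per model
pages = dataset.dump_image(dataset.data.iloc[0])     # renders PDF pages via PyMuPDF, concatenates
# ... generate predictions over the page images, save them as an xlsx ...
result = dataset.evaluate('DUDE_MINI_mymodel.xlsx', model='gpt-4o-mini')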

Some files were not shown because too many files have changed in this diff