Compare commits: qyc-98-4.5 ... main (28 commits)

| Author | SHA1 | Date |
|---|---|---|
|  | 82ae1419eb |  |
|  | 17827f0c57 |  |
|  | 520c2c476b |  |
|  | 0cd0132b43 |  |
|  | 37b4f05f9c |  |
|  | 48d65128fc |  |
|  | d2086b19da |  |
|  | b2d728908b |  |
|  | 6880a27c5f |  |
|  | 4a333deb8c |  |
|  | 90dd7e88a6 |  |
|  | 8c2f41fef5 |  |
|  | 076466dd5a |  |
|  | 592ba7519e |  |
|  | 53bcece5c8 |  |
|  | 168ae8fe46 |  |
|  | 28632248d5 |  |
|  | 74aa48ebeb |  |
|  | 22431c9436 |  |
|  | 6d4da2ee5a |  |
|  | bb0e3c2a92 |  |
|  | 1d3c5f455e |  |
|  | 9d37b1c2a0 |  |
|  | c130da1b4d |  |
|  | 91cf50f813 |  |
|  | 6cb4a3bf82 |  |
|  | af22b8f2ed |  |
|  | 7233ef5473 |  |
LICENSE (2 changes)

@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.

-   Copyright 2024 OpenBMB
+   Copyright OpenBMB

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
@@ -1,41 +0,0 @@

Version 1.0, June 5, 2024
© 2024 OpenBMB. All rights reserved.

## Part One: Preamble

We are open-sourcing the entire series of the globally leading MiniCPM edge-side large language models, including the flagship edge-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the world's most powerful edge-side multimodal models, the MiniCPM-V series. These weights are completely open for all academic research. Commercial use is also allowed after completing a registration questionnaire. Community use of the MiniCPM series models must comply with Apache 2.0 and the "MiniCPM Model Community License Agreement."

Therefore, you and the MiniCPM development team agree to the following "MiniCPM Model Community License Agreement":

## Part Two: Licensing and Redistribution

#### 1. Grant of Rights

You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license to use, copy, distribute, reproduce, create derivative works from, and modify the MiniCPM materials in accordance with OpenBMB's intellectual property or other rights in the MiniCPM materials.

#### 2. Distribution and Redistribution

- If you distribute or provide the MiniCPM series model materials (or any derivative works thereof), or any product or service that uses them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you create, train, fine-tune, or improve an AI model using the MiniCPM series models, that model's name must include "MiniCPM".
- You must retain the following attribution statement in all distributed MiniCPM-related materials: "MiniCPM is licensed under the MiniCPM Model Community License, © OpenBMB Platforms, Inc. All rights reserved."
- Your use of the MiniCPM materials must comply with applicable laws and regulations and with the "MiniCPM Model Community License Agreement," which is incorporated into this agreement by reference.
- You may not use the MiniCPM series models or their outputs and results to improve any other large language model (other than MiniCPM or its derivatives).

#### 3. Additional Commercial Terms

If you or your affiliates' services or products deploy the model on no more than 5,000 edge-side devices, or provide applications with fewer than 1 million daily active users (DAU), you may apply to OpenBMB for permission and, after completing the registration questionnaire, be allowed to use it commercially for free. Otherwise, please email cpm@modelbest.cn to apply for authorization from OpenBMB, which may grant permission at its sole discretion and determine the term and scope of the authorization; until written authorization is granted, you may not exercise any commercial rights under this agreement.

#### 4. Usage-based Restrictions

The restrictions set forth in Appendix A are considered usage-based restrictions. Therefore, you may not use the model or its derivatives for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with its terms. Usage includes creating any content with, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You must require all users of the model or its derivatives to comply with the terms of this section.

## Part Three: Other Terms

#### 5. Trademarks and Related

This license does not grant you the right to use the OpenBMB, OpenBMB Intelligence, or MiniCPM trademarks, trade names, or logos, or to otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.

#### 6. Disclaimer

Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplemental materials "as is," without any warranty or condition, express or implied, including but not limited to all express and implied warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivatives, and the supplemental materials, and you assume any risks associated with exercising the permissions under this license.

## Appendix A: Usage Restrictions

You agree not to use the model or its derivatives for:

- Any use that violates applicable national or international laws or regulations, or infringes upon the legal rights and interests of any third party;
- Any military purpose;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiably false information and/or content with the intent to harm others;
- Generating or disseminating inappropriate content subject to applicable regulatory requirements;
- Generating or disseminating personally identifiable information without authorization, or making unreasonable use of it;
- Defaming, demeaning, or otherwise harassing others;
- Fully automated decision-making that adversely affects an individual's legal rights or creates or modifies binding, enforceable obligations;
- Any use intended to discriminate against or harm, or having the effect of discriminating against or harming, individuals or groups based on online or offline social behavior or known or predicted personal characteristics;
- Exploiting the vulnerabilities of a specific group based on their age or social, physical, or psychological characteristics in a manner that materially distorts the behavior of members of that group, causing or likely to cause physical or psychological harm to them or to others;
- Any use intended to discriminate against, or having the effect of discriminating against, individuals or groups based on legally protected characteristics or categories.
@@ -1,43 +0,0 @@

Version 1.0, June 5, 2024
Copyright © 2024 OpenBMB

## Part One: Preamble

We are open-sourcing the entire series of the globally leading MiniCPM edge-side models, including the flagship edge-side models MiniCPM-2.4B and MiniCPM-1.2B, as well as the globally leading edge-side multimodal models of the MiniCPM-V series. These weights are completely open for all academic research. Commercial use is also permitted after completing a registration questionnaire; community use of the MiniCPM series models must comply with Apache 2.0 and the "MiniCPM Model Community License Agreement."

Therefore, you and the MiniCPM development team agree to the following "MiniCPM Model Commercial License Agreement":

## Part Two: Licensing and Redistribution

#### 1. Grant of Rights

You are granted a non-exclusive, worldwide, non-transferable, royalty-free, limited license to use, copy, distribute, reproduce, create derivative works from, and modify the MiniCPM materials in accordance with OpenBMB's intellectual property or other rights in the MiniCPM materials.

#### 2. Distribution and Redistribution

- If you distribute or provide the MiniCPM series model materials (or any derivative works thereof), or any product or service that uses them, you must (A) provide a copy of this agreement; and (B) prominently display "Built with 面壁MiniCPM" on the relevant website, user interface, blog post, about page, or product documentation. If you use the MiniCPM series models to create, train, fine-tune, or improve an AI model, that model's name must include "MiniCPM".
- You must retain the following attribution statement in all distributed MiniCPM-related materials: "面壁MiniCPM is licensed under the MiniCPM Model Community License, Copyright © 面壁智能 Platforms, Inc. All rights reserved."
- Your use of the MiniCPM materials must comply with applicable laws and regulations and with the "MiniCPM Model Community License Agreement," which is incorporated into this agreement by reference.
- You may not use the MiniCPM series models or their outputs and results to improve any other large language model (other than MiniCPM or its derivatives).

#### 3. Additional Commercial Terms

If you or your affiliates' services or products deploy the model on edge-side devices, with no more than 5,000 deployed devices, or provide applications with fewer than 1 million daily active users (DAU), you may apply directly to 面壁智能 for permission and, after completing the registration questionnaire, be allowed free commercial use. Otherwise, please email cpm@modelbest.cn to apply for authorization from 面壁智能; we may decide at our sole discretion whether to grant authorization, and determine its term and scope. Until we grant written authorization, you have no right to exercise any commercial rights and may not use the model for any commercial purpose.

#### 4. Usage-based Restrictions

The restrictions set forth in Appendix A are considered usage-based restrictions. Therefore, you may not use the model or its derivative works for the designated restricted uses. You may use the model under this license only for lawful purposes and in compliance with its terms. Usage includes creating any content with, fine-tuning, updating, running, training, evaluating, and/or re-parameterizing the model. You must require all users of the model or its derivative works to comply with the terms of this section.

## Part Three: Other Terms

#### 5. Trademarks and Related

This license does not grant you the right to use the OpenBMB, 面壁智能, or MiniCPM trademarks, trade names, or logos, or to otherwise imply a relationship between the parties; any rights not expressly granted herein are reserved by OpenBMB.

#### 6. Disclaimer

Unless required by applicable law or agreed to in writing, OpenBMB provides the model and supplementary materials "as is," without warranties or conditions of any kind, express or implied, including but not limited to warranties or conditions of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the model, its derivative works, and the supplementary materials, and you assume any risks arising from exercising rights under this license.

## Appendix A: Usage Restrictions

You agree not to use the model or its derivative works for:

- Any use that violates applicable national or international laws or regulations, or infringes the lawful rights and interests of any third party;
- Any military purpose;
- Exploiting, harming, or attempting to exploit or harm minors in any way;
- Generating or disseminating verifiably false information and/or content with the intent to harm others;
- Generating or disseminating inappropriate content subject to applicable regulatory requirements;
- Generating or disseminating personally identifiable information without authorization, or making unreasonable use of it;
- Defaming, disparaging, or otherwise harassing others;
- Fully automated decision-making that adversely affects an individual's legal rights or creates or modifies binding, enforceable obligations;
- Any use intended to discriminate against or harm, or having the effect of discriminating against or harming, individuals or groups based on online or offline social behavior or known or predicted personal characteristics;
- Exploiting the vulnerabilities of a specific group based on their age or social, physical, or psychological characteristics in a manner that materially distorts the behavior of members of that group, causing or likely to cause physical or psychological harm to them or to others;
- Any use intended to discriminate against, or having the effect of discriminating against, individuals or groups based on legally protected characteristics or categories.
README_zh.md (4467 changes)
assets/minicpm-o-45-framework.pdf (BIN, new file)
assets/minicpm-o-45-framework.png (BIN, new file; 359 KiB)
assets/minicpm-o-45-radar.png (BIN, new file; 1.1 MiB)
assets/minicpm_o_45_main_exp_table.png (BIN, new file; 304 KiB)
assets/minicpmo4_5/assistant_ref.mp4 (BIN, new file)
assets/minicpmo4_5/assistant_response.mp4 (BIN, new file)
assets/minicpmo4_5/elon_musk_ref.mp4 (BIN, new file)
assets/minicpmo4_5/elon_musk_response.mp4 (BIN, new file)
assets/minicpmo4_5/en_cot.png (BIN, new file; 3.0 MiB)
assets/minicpmo4_5/en_doc.png (BIN, new file; 4.4 MiB)
assets/minicpmo4_5/video_play.png (BIN, new file; 7.5 MiB)
assets/minicpmo4_5/zh_doc.png (BIN, new file; 2.9 MiB)
assets/radar_minicpmo4.5.png (BIN, new file; 1.2 MiB)
docs/MiniCPM_V_4_5_Technical_Report.pdf (BIN, new file)
docs/minicpm-llama-v-2-5_languages.md (176 additions, new file)

@@ -0,0 +1,176 @@
- English
- 中文
- 한국어
- 日本語
- Deutsch
- Français
- Português
- Español
- မြန်မာဘာသာ
- ไทย
- Tiếng Việt
- Türkçe
- ܣܘܪܝܝܐ
- العربية
- हिन्दी
- বাংলা
- नेपाली
- Türkmençe
- Тоҷикӣ
- Кыргызча
- Русский
- Українська
- Беларуская
- ქართული
- Azərbaycanca
- Հայերեն
- Polski
- Lietuvių
- Eesti
- Latviešu
- Čeština
- Slovenčina
- Magyar
- Slovenščina
- Hrvatski
- Bosanski
- Crnogorski
- Српски
- Shqip
- Română
- Български
- Македонски

## Supported Languages

- English
- Chinese
- Korean
- Japanese
- German
- French
- Portuguese
- Spanish
- Burmese
- Thai
- Vietnamese
- Turkish
- Syriac
- Arabic
- Hindi
- Bengali
- Nepali
- Turkmen
- Tajik
- Kyrgyz
- Russian
- Ukrainian
- Belarusian
- Georgian
- Azerbaijani
- Armenian
- Polish
- Lithuanian
- Estonian
- Latvian
- Czech
- Slovak
- Hungarian
- Slovenian
- Croatian
- Bosnian
- Montenegrin
- Serbian
- Albanian
- Romanian
- Bulgarian
- Macedonian

@@ -15,7 +15,7 @@
   Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best-level performance within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).

 - 🌏 **Multilingual Support.**
-  Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
+  Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean etc.** [All Supported Languages](../docs/minicpm-llama-v-2-5_languages.md).

 - 🚀 **Efficient Deployment.**
   MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.
docs/minicpm_o2dot6_en.md (964 additions, new file)

@@ -0,0 +1,964 @@

## MiniCPM-o 2.6

> Archived at: 2026-02-02

**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

- 🔥 **Leading Visual Capability.**
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single-image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.

- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.

- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.

- 💪 **Strong OCR Capability and Others.**
  Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench among models under 25B, surpassing proprietary models such as GPT-4o-202405**.
  Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.

- 🚀 **Superior Efficiency.**
  In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M-pixel image, which is 75% fewer than most models.** This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPads.

- 💫 **Easy Usage.**
  MiniCPM-o 2.6 can be used in a variety of ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) a quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) an online web demo on our [server](https://minicpm-omni-webdemo-us.modelbest.cn/).

**Model Architecture.**

- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. The model is trained fully end-to-end with only the CE loss.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides the parallel omni-modality streams into sequential information within small periodic time slices.
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt that determines the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.

<div align="center">
<img src="./assets/minicpm-o-26-framework-v2.png" width="80%">
</div>

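The TDM idea above can be sketched in a few lines. This is only an illustrative simplification, not the model's actual implementation: the function name `tdm_interleave`, the two example streams, and the slice size are all hypothetical. It shows the core move of serializing parallel modality streams into one sequence by alternating fixed-size time slices.

```python
def tdm_interleave(video_frames, audio_frames, slice_size):
    """Time-division multiplexing sketch: serialize two parallel streams
    into one sequence by alternating fixed-size time slices."""
    sequence = []
    n = max(len(video_frames), len(audio_frames))
    for start in range(0, n, slice_size):            # one periodic time slice
        sequence.extend(video_frames[start:start + slice_size])
        sequence.extend(audio_frames[start:start + slice_size])
    return sequence

# Parallel 3-slice video and audio streams become one sequential stream
# that an autoregressive LLM backbone can consume.
print(tdm_interleave(["v0", "v1", "v2"], ["a0", "a1", "a2"], 1))
# -> ['v0', 'a0', 'v1', 'a1', 'v2', 'a2']
```

Within each slice the streams stay time-aligned, which is what lets the backbone interleave "listening" and "responding" in real time.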
### Evaluation <!-- omit in toc -->

<div align="center">
<img src="./assets/radar.jpg" width="80%">
</div>

<details>
<summary>Click to view visual understanding results.</summary>

**Image Understanding**

**Proprietary**

| Model | Size | Token Density<sup>+</sup> | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-20240513 | - | 1088 | <u>69.9</u> | 736 | 61.3 | 85.7 | **69.1** | 63.9 | 2328.7 | 82.2 | 84.6 | **69.2** | **55.0** | - | 92.8 | **50.2** | **30.4** | <u>3.6</u> |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | **90.8** | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | <u>65.9</u> | 49.9 | - | **95.2** | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |

**Open Source**

| Model | Size | Token Density<sup>+</sup> | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cambrian-34B | 34B | <u>1820</u> | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| VITA-1.5 | 8B | 784 | 63.3 | 741 | 66.2 | - | 52.7 | 60.2 | 2328.1 | 76.8 | 79.2 | 52.6 | 44.6 | - | - | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | <u>84.2</u> | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | <u>866</u> | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | **84.3** | <u>94.5</u> | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | **65.8** | 2261.0 | **85.0** | <u>85.6</u> | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | <u>64.4</u> | 84.8 | 62.8 | 62.8 | 2344.0 | <u>83.6</u> | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | **2822** | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | <u>2348.4*</u> | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | **2822** | **70.2** | **897*** | **71.9*** | <u>86.9*</u> | <u>67.5</u> | <u>64.0</u> | **2372.0*** | 80.5 | **85.8** | 50.4* | <u>51.9</u> | 82.0 | 93.5 | <u>41.4*</u> | <u>23.1*</u> | **3.8** |

\* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.

<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.

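The token-density figure reported for MiniCPM-o 2.6 can be checked directly from this definition, using the 1344x1344 maximum resolution and 640 visual tokens stated earlier in this document:

```python
# Token density = # pixels at maximum resolution / # visual tokens
max_pixels = 1344 * 1344   # ~1.8M pixels at the stated maximum resolution
visual_tokens = 640        # visual tokens produced for such an image

token_density = round(max_pixels / visual_tokens)
print(token_density)  # -> 2822
```

The result matches the 2822 listed for MiniCPM-V 2.6 and MiniCPM-o 2.6 in the table above.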
**Multi-image and Video Understanding**

**Proprietary**

| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| GPT-4o-20240513 | - | **68.0** | - | - | **71.9/77.2** |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |

**Open-source**

| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| VITA-1.5 | 8B | 45.0 | - | - | 56.1/58.7 |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | **77.6** | - | <u>66.2/69.5</u> |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | **67.6*** | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | <u>56.7</u> | <u>71.9</u> | <u>58.6</u> | 63.9/67.9 |

\* We evaluate officially released checkpoints by ourselves.

</details>

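The audio results below report CER and WER (character/word error rate; lower is better). As a reminder of the metric, here is a minimal WER computation via word-level edit distance; this is an illustrative sketch, not the evaluation code used for these benchmarks:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words (CER is the same, per character)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of three reference words -> WER = 1/3
print(wer("the cat sat", "the cat sit"))
```

The table entries are percentages of this quantity computed over each test set.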
<details>
<summary>Click to view audio understanding and speech conversation results.</summary>

**Audio Understanding**

**Proprietary**

| Model | Size | ASR zh: AISHELL-1 (CER↓) | ASR zh: Fleurs zh (CER↓) | ASR zh: WenetSpeech test-net (CER↓) | ASR en: LibriSpeech test-clean (WER↓) | ASR en: GigaSpeech (WER↓) | ASR en: TED-LIUM (WER↓) | AST: CoVoST en2zh (BLEU↑) | AST: CoVoST zh2en (BLEU↑) | Emotion: MELD (ACC↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-Realtime | - | 7.3* | <u>5.4*</u> | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | **3.0*** | <u>47.3*</u> | 22.6* | 48.4* |

**Open-Source**

| Model | Size | ASR zh: AISHELL-1 (CER↓) | ASR zh: Fleurs zh (CER↓) | ASR zh: WenetSpeech test-net (CER↓) | ASR en: LibriSpeech test-clean (WER↓) | ASR en: GigaSpeech (WER↓) | ASR en: TED-LIUM (WER↓) | AST: CoVoST en2zh (BLEU↑) | AST: CoVoST zh2en (BLEU↑) | Emotion: MELD (ACC↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio-7B | 8B | - | 7.5 | - | **1.6** | - | - | 45.2 | <u>24.4</u> | **55.3** |
| Qwen2-Audio-7B-Instruct | 8B | 2.6* | 6.9* | <u>10.3*</u> | 3.1* | <u>9.7</u>* | 5.9* | 39.5* | 22.9* | 17.4* |
| VITA-1.5 | 8B | 2.16 | - | 8.4 | 3.4 | - | - | - | - | - |
| GLM-4-Voice-Base | 9B | <u>2.5</u> | - | - | 2.8 | - | - | - | - |  |
| MiniCPM-o 2.6 | 8B | **1.6** | **4.4** | **6.9** |
|
||||
<td><u>1.7</u></td>
|
||||
<td><strong>8.7</strong></td>
|
||||
<td><strong>3.0</strong></td>
|
||||
<td><strong>48.2</strong></td>
|
||||
<td><strong>27.2</strong></td>
|
||||
<td><u>52.4</u></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
* We evaluated the officially released checkpoints ourselves.<br><br>
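The ASR columns above report CER/WER, the character- or word-level Levenshtein edit distance between the model transcript and the reference, normalized by reference length (lower is better). A minimal WER sketch for illustration; the benchmarks' exact text normalization rules are not shown here:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, sub)
    return dp[len(r)][len(h)] / len(r)
```

CER is the same computation applied over characters instead of words. For example, `wer("a b c d", "a x c d")` is 0.25 (one substitution out of four reference words).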
|
||||
|
||||
**Speech Generation**
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Task</th>
|
||||
<th>Size</th>
|
||||
<th colspan="9">SpeechQA</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Metric</th>
|
||||
<th></th>
|
||||
<th colspan="3">ACC↑</th>
|
||||
<th>G-Eval (10 point)↑</th>
|
||||
<th>Semantic ELO score↑</th>
|
||||
<th>Acoustic ELO score↑</th>
|
||||
<th>Overall ELO score↑</th>
|
||||
<th>UTMOS↑</th>
|
||||
<th>ASR-WER↓</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Dataset</th>
|
||||
<th></th>
|
||||
<th>Speech Llama Q.</th>
|
||||
<th>Speech Web Q.</th>
|
||||
<th>Speech Trivia QA</th>
|
||||
<th>Speech AlpacaEval</th>
|
||||
<th colspan="5">AudioArena</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
|
||||
<td></td>
|
||||
<td><strong>71.7</strong></td>
|
||||
<td><strong>51.6</strong></td>
|
||||
<td><strong>69.7</strong></td>
|
||||
<td><strong>7.4</strong></td>
|
||||
<td><strong>1157</strong></td>
|
||||
<td><strong>1203</strong></td>
|
||||
<td><strong>1200</strong></td>
|
||||
<td><strong>4.2</strong></td>
|
||||
<td><strong>2.3</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Open-Source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4-Voice</td>
|
||||
<td>9B</td>
|
||||
<td>50.0</td>
|
||||
<td>32.0</td>
|
||||
<td>36.4</td>
|
||||
<td><u>5.1</u></td>
|
||||
<td>999</td>
|
||||
<td>1147</td>
|
||||
<td>1035</td>
|
||||
<td><u>4.1</u></td>
|
||||
<td><u>11.7</u></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Llama-Omni</td>
|
||||
<td>8B</td>
|
||||
<td>45.3</td>
|
||||
<td>22.9</td>
|
||||
<td>10.7</td>
|
||||
<td>3.9</td>
|
||||
<td>960</td>
|
||||
<td>878</td>
|
||||
<td>897</td>
|
||||
<td>3.2</td>
|
||||
<td>24.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VITA-1.5</td>
|
||||
<td>8B</td>
|
||||
<td>46.7</td>
|
||||
<td>28.1</td>
|
||||
<td>23.3</td>
|
||||
<td>2.0</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Moshi</td>
|
||||
<td>7B</td>
|
||||
<td>43.7</td>
|
||||
<td>23.8</td>
|
||||
<td>16.7</td>
|
||||
<td>2.4</td>
|
||||
<td>871</td>
|
||||
<td>808</td>
|
||||
<td>875</td>
|
||||
<td>2.8</td>
|
||||
<td>8.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Mini-Omni</td>
|
||||
<td>1B</td>
|
||||
<td>22.0</td>
|
||||
<td>12.8</td>
|
||||
<td>6.9</td>
|
||||
<td>2.5</td>
|
||||
<td>926</td>
|
||||
<td>803</td>
|
||||
<td>865</td>
|
||||
<td>3.4</td>
|
||||
<td>10.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><u>61.0</u></td>
|
||||
<td><u>40.0</u></td>
|
||||
<td><u>40.2</u></td>
|
||||
<td><u>5.1</u></td>
|
||||
<td><u>1088</u></td>
|
||||
<td><u>1163</u></td>
|
||||
<td><u>1131</u></td>
|
||||
<td><strong>4.2</strong></td>
|
||||
<td>9.8</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
All results are from <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>, where the evaluation methods and further details can be found.<br><br>
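The AudioArena columns above are ELO scores derived from pairwise human preferences between systems. A minimal sketch of the standard Elo update rule; the K-factor and pairing scheme used by AudioArena are assumptions here, not taken from the benchmark:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> float:
    """One Elo update for player A; score_a is 1 (win), 0.5 (tie), or 0 (loss)."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    return rating_a + k * (score_a - expected_a)
```

For example, a system at 1000 beating an equally rated opponent moves to 1016 with K=32; repeated over many pairwise comparisons, the ratings converge to a relative ranking like the one in the table.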
|
||||
|
||||
**End-to-end Voice Cloning**
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Task</th>
|
||||
<th colspan="2">Voice cloning</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Metric</th>
|
||||
<th>SIMO↑</th>
|
||||
<th>SIMO↑</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Dataset</th>
|
||||
<th>Seed-TTS test-zh</th>
|
||||
<th>Seed-TTS test-en</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">F5-TTS</td>
|
||||
<td><strong>76</strong></td>
|
||||
<td><strong>67</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CosyVoice</td>
|
||||
<td><u>75</u></td>
|
||||
<td><u>64</u></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">FireRedTTS</td>
|
||||
<td>63</td>
|
||||
<td>46</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>57</td>
|
||||
<td>47</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
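SIMO in the table above measures speaker similarity between the cloned and reference voices, typically computed as the cosine similarity of speaker embeddings (scaled to 0-100). A minimal sketch of the cosine-similarity step, assuming the embeddings are plain vectors; the speaker-embedding model used by the benchmark is not shown:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```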
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view multimodal live streaming results.</summary>
|
||||
|
||||
**Multimodal Live Streaming**: results on StreamingBench
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Real-Time Video Understanding</th>
|
||||
<th>Omni-Source Understanding</th>
|
||||
<th>Contextual Understanding</th>
|
||||
<th>Overall</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="7" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
|
||||
<td>-</td>
|
||||
<td><u>77.4</u></td>
|
||||
<td><strong>67.8</strong></td>
|
||||
<td><strong>51.1</strong></td>
|
||||
<td><strong>70.3</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-202408</td>
|
||||
<td>-</td>
|
||||
<td>74.5</td>
|
||||
<td>51.0</td>
|
||||
<td><u>48.0</u></td>
|
||||
<td>64.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>74.0</td>
|
||||
<td>41.4</td>
|
||||
<td>37.8</td>
|
||||
<td>59.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="9" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VILA-1.5</td>
|
||||
<td>8B</td>
|
||||
<td>61.5</td>
|
||||
<td>37.5</td>
|
||||
<td>26.7</td>
|
||||
<td>49.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LongVA</td>
|
||||
<td>7B</td>
|
||||
<td>63.1</td>
|
||||
<td>35.9</td>
|
||||
<td>30.2</td>
|
||||
<td>50.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
|
||||
<td>34B</td>
|
||||
<td>69.8</td>
|
||||
<td>41.7</td>
|
||||
<td>34.3</td>
|
||||
<td>56.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
|
||||
<td>8B</td>
|
||||
<td>71.2</td>
|
||||
<td>40.7</td>
|
||||
<td>33.1</td>
|
||||
<td>57.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>70.1</td>
|
||||
<td>42.7</td>
|
||||
<td>34.1</td>
|
||||
<td>57.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VITA-1.5</td>
|
||||
<td>8B</td>
|
||||
<td>70.9</td>
|
||||
<td>40.8</td>
|
||||
<td>35.8</td>
|
||||
<td>57.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
|
||||
<td>8B</td>
|
||||
<td>74.3</td>
|
||||
<td>40.8</td>
|
||||
<td>31.0</td>
|
||||
<td>58.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
|
||||
<td>8B</td>
|
||||
<td>75.4</td>
|
||||
<td>46.2</td>
|
||||
<td>33.6</td>
|
||||
<td>60.8</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td>72.4</td>
|
||||
<td>40.2</td>
|
||||
<td>33.4</td>
|
||||
<td>57.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>79.9</strong></td>
|
||||
<td><u>53.4</u></td>
|
||||
<td>38.5</td>
|
||||
<td><u>66.0</u></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
|
||||
We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw-speed recording of an iPad Pro and a web demo.
|
||||
|
||||
<div align="center">
|
||||
<a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="./assets/minicpmo2_6/2dot6_o_demo_video_img.png" width="70%"></a>
|
||||
</div>
|
||||
|
||||
<br>
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
|
||||
<img src="assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
|
||||
<img src="assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
|
||||
927
docs/minicpm_o2dot6_zh.md
Normal file
@@ -0,0 +1,927 @@
|
||||
## MiniCPM-o 2.6
|
||||
|
||||
> Archived at: 2026-02-02
|
||||
|
||||
MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built upon SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters, and is trained and inferred in an end-to-end fashion. Compared with MiniCPM-V 2.6, it delivers significant performance improvements and adds new capabilities for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
|
||||
|
||||
|
||||
- 🔥 **Leading Visual Capability.**
|
||||
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass (a comprehensive evaluation over 8 popular multimodal benchmarks). **With only 8B parameters, it surpasses widely used proprietary multimodal models such as GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.** It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows strong in-context learning capability.
|
||||
|
||||
- 🎙 **State-of-the-art Speech Capability.**
|
||||
MiniCPM-o 2.6 **supports bilingual (Chinese and English) real-time speech conversation with configurable voices**. It **outperforms GPT-4o-realtime on speech understanding tasks (e.g., ASR and STT)** and shows **the best speech generation performance among open-source models** in semantic and acoustic evaluations of speech conversation. It also supports advanced capabilities such as emotion/speed/style control, voice cloning, and role play.
|
||||
|
||||
- 🎬 **Strong Multimodal Live Streaming Capability.**
|
||||
As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams and interact with users via real-time speech**. On StreamingBench, a comprehensive benchmark covering real-time video understanding, omni-source (video and audio) understanding, and multimodal contextual understanding, MiniCPM-o 2.6 achieves the best result in the open-source community and **surpasses GPT-4o-202408 and Claude 3.5 Sonnet**.
|
||||
|
||||
- 💪 **Strong OCR Capability and Others.**
|
||||
MiniCPM-o 2.6 further improves the many visual understanding capabilities of MiniCPM-V 2.6. It can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). On OCRBench it achieves the **best result among models under 25B, surpassing proprietary models such as GPT-4o-202405**. Based on the latest [RLHF-V](https://rlhf-v.github.io/), [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/), and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy multimodal behavior**, outperforming GPT-4o and Claude 3.5 on MMHal-Bench, and supports **more than 30 languages**, including English, Chinese, German, French, Italian, and Korean.
|
||||
|
||||
- 🚀 **Superior Efficiency.**
|
||||
In addition to its user-friendly size, MiniCPM-o 2.6 also shows **state-of-the-art visual token density** (i.e., the number of pixels encoded into each visual token). **It needs only 640 tokens to process a 1.8-million-pixel image, 75% fewer than most models.** This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
|
||||
|
||||
|
||||
- 💫 **Easy Usage.**
|
||||
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) quantized models in [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) formats in 16 sizes, (3) [vLLM](#基于-llamacppollamavllm-的高效推理) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with the [LLaMA-Factory](./docs/llamafactory_train_and_infer.md) framework, (5) quick local WebUI demo setup with [Gradio](#本地-webui-demo-), and (6) an online server-side [demo](https://minicpm-omni-webdemo-us.modelbest.cn/).
|
||||
|
||||
**Model Architecture.**
|
||||
|
||||
- **End-to-end Omni-modal Architecture.** Encoders/decoders of different modalities are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. The model is trained fully end-to-end with CE loss.
|
||||
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing mechanism for omni-modal stream processing** in the LLM backbone, which splits parallel streams of different modalities and regroups them into sequences of periodic time slices.
|
||||
- **Configurable Speech Modeling Design.** We devise a new multimodal system prompt that contains a traditional text system prompt and **a speech system prompt that determines the assistant's voice**. The model can flexibly control the voice style via text or voice samples at inference time, and supports advanced capabilities such as end-to-end voice cloning and voice creation.
|
||||
|
||||
<div align="center">
|
||||
<img src="./assets/minicpm-o-26-framework-v2.png" width="80%">
|
||||
</div>
|
||||
|
||||
<br>
|
||||
|
||||
|
||||
|
||||
### Evaluation <!-- omit in toc -->
|
||||
|
||||
<div align="center">
|
||||
<img src="./assets/radar.jpg" width="80%">
|
||||
</div>
|
||||
|
||||
<details>
|
||||
<summary>Click to view detailed visual understanding results.</summary>
|
||||
|
||||
**Image Understanding**
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Token Density<sup>+</sup></th>
|
||||
<th>OpenCompass</th>
|
||||
<th>OCRBench</th>
|
||||
<th>MathVista mini</th>
|
||||
<th>ChartQA</th>
|
||||
<th>MMVet</th>
|
||||
<th>MMStar</th>
|
||||
<th>MME</th>
|
||||
<th>MMB1.1 test</th>
|
||||
<th>AI2D</th>
|
||||
<th>MMMU val</th>
|
||||
<th>HallusionBench</th>
|
||||
<th>TextVQA val</th>
|
||||
<th>DocVQA test</th>
|
||||
<th>MathVerse mini</th>
|
||||
<th>MathVision</th>
|
||||
<th>MMHal Score</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="19" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td><u>69.9</u></td>
|
||||
<td>736</td>
|
||||
<td>61.3</td>
|
||||
<td>85.7</td>
|
||||
<td><strong>69.1</strong></td>
|
||||
<td>63.9</td>
|
||||
<td>2328.7</td>
|
||||
<td>82.2</td>
|
||||
<td>84.6</td>
|
||||
<td><strong>69.2</strong></td>
|
||||
<td><strong>55.0</strong></td>
|
||||
<td>-</td>
|
||||
<td>92.8</td>
|
||||
<td><strong>50.2</strong></td>
|
||||
<td><strong>30.4</strong></td>
|
||||
<td><u>3.6</u></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>750</td>
|
||||
<td>67.9</td>
|
||||
<td>788</td>
|
||||
<td>61.6</td>
|
||||
<td><strong>90.8</strong></td>
|
||||
<td>66.0</td>
|
||||
<td>62.2</td>
|
||||
<td>1920.0</td>
|
||||
<td>78.5</td>
|
||||
<td>80.2</td>
|
||||
<td><u>65.9</u></td>
|
||||
<td>49.9</td>
|
||||
<td>-</td>
|
||||
<td><strong>95.2</strong></td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>64.4</td>
|
||||
<td>754</td>
|
||||
<td>57.7</td>
|
||||
<td>81.3</td>
|
||||
<td>64.0</td>
|
||||
<td>59.1</td>
|
||||
<td>2110.6</td>
|
||||
<td>73.9</td>
|
||||
<td>79.1</td>
|
||||
<td>60.6</td>
|
||||
<td>45.6</td>
|
||||
<td>73.5</td>
|
||||
<td>86.5</td>
|
||||
<td>-</td>
|
||||
<td>19.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
|
||||
<td>-</td>
|
||||
<td>1088</td>
|
||||
<td>64.1</td>
|
||||
<td>785</td>
|
||||
<td>52.4</td>
|
||||
<td>-</td>
|
||||
<td>66.9</td>
|
||||
<td>54.8</td>
|
||||
<td>2003.4</td>
|
||||
<td>76.0</td>
|
||||
<td>77.8</td>
|
||||
<td>60.0</td>
|
||||
<td>46.1</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="19" align="left"><strong>Open Source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Cambrian-34B</td>
|
||||
<td>34B</td>
|
||||
<td><u>1820</u></td>
|
||||
<td>58.3</td>
|
||||
<td>591</td>
|
||||
<td>50.3</td>
|
||||
<td>75.6</td>
|
||||
<td>53.2</td>
|
||||
<td>54.2</td>
|
||||
<td>2049.9</td>
|
||||
<td>77.8</td>
|
||||
<td>79.5</td>
|
||||
<td>50.4</td>
|
||||
<td>41.6</td>
|
||||
<td>76.7</td>
|
||||
<td>75.5</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4V-9B</td>
|
||||
<td>13B</td>
|
||||
<td>784</td>
|
||||
<td>59.1</td>
|
||||
<td>776</td>
|
||||
<td>51.1</td>
|
||||
<td>-</td>
|
||||
<td>58.0</td>
|
||||
<td>54.8</td>
|
||||
<td>2018.8</td>
|
||||
<td>67.9</td>
|
||||
<td>71.2</td>
|
||||
<td>46.9</td>
|
||||
<td>45.0</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Pixtral-12B</td>
|
||||
<td>12B</td>
|
||||
<td>256</td>
|
||||
<td>61.0</td>
|
||||
<td>685</td>
|
||||
<td>56.9</td>
|
||||
<td>81.8</td>
|
||||
<td>58.5</td>
|
||||
<td>54.5</td>
|
||||
<td>-</td>
|
||||
<td>72.7</td>
|
||||
<td>79.0</td>
|
||||
<td>51.1</td>
|
||||
<td>47.0</td>
|
||||
<td>75.7</td>
|
||||
<td>90.7</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
|
||||
<td>27B</td>
|
||||
<td>672</td>
|
||||
<td>66.4</td>
|
||||
<td>809</td>
|
||||
<td>63.9</td>
|
||||
<td>86.0</td>
|
||||
<td>60.0</td>
|
||||
<td>61.9</td>
|
||||
<td>2253.0</td>
|
||||
<td>81.2</td>
|
||||
<td>83.8</td>
|
||||
<td>54.0</td>
|
||||
<td>45.3</td>
|
||||
<td><u>84.2</u></td>
|
||||
<td>93.3</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>3.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
|
||||
<td>8B</td>
|
||||
<td>784</td>
|
||||
<td>67.1</td>
|
||||
<td><u>866</u></td>
|
||||
<td>58.2</td>
|
||||
<td>83.0</td>
|
||||
<td>62.0</td>
|
||||
<td>60.7</td>
|
||||
<td>2326.0</td>
|
||||
<td>81.8</td>
|
||||
<td>83.0</td>
|
||||
<td>54.1</td>
|
||||
<td>50.6</td>
|
||||
<td><strong>84.3</strong></td>
|
||||
<td><u>94.5</u></td>
|
||||
<td>31.9</td>
|
||||
<td>16.3</td>
|
||||
<td>3.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
|
||||
<td>72B</td>
|
||||
<td>182</td>
|
||||
<td>68.1</td>
|
||||
<td>741</td>
|
||||
<td>67.5</td>
|
||||
<td>83.7</td>
|
||||
<td>60.6</td>
|
||||
<td><strong>65.8</strong></td>
|
||||
<td>2261.0</td>
|
||||
<td><strong>85.0</strong></td>
|
||||
<td><u>85.6</u></td>
|
||||
<td>56.8</td>
|
||||
<td>49.0</td>
|
||||
<td>80.5</td>
|
||||
<td>91.3</td>
|
||||
<td>39.1</td>
|
||||
<td>-</td>
|
||||
<td>3.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8B</td>
|
||||
<td>706</td>
|
||||
<td>68.3</td>
|
||||
<td>822</td>
|
||||
<td><u>64.4</u></td>
|
||||
<td>84.8</td>
|
||||
<td>62.8</td>
|
||||
<td>62.8</td>
|
||||
<td>2344.0</td>
|
||||
<td><u>83.6</u></td>
|
||||
<td>84.5</td>
|
||||
<td>56.0</td>
|
||||
<td>50.1</td>
|
||||
<td>79.1</td>
|
||||
<td>93.0</td>
|
||||
<td>39.5</td>
|
||||
<td>19.7</td>
|
||||
<td>3.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>2822</strong></td>
|
||||
<td>65.2</td>
|
||||
<td>852*</td>
|
||||
<td>60.6</td>
|
||||
<td>79.4</td>
|
||||
<td>60.0</td>
|
||||
<td>57.5</td>
|
||||
<td><u>2348.4*</u></td>
|
||||
<td>78.0</td>
|
||||
<td>82.1</td>
|
||||
<td>49.8*</td>
|
||||
<td>48.1*</td>
|
||||
<td>80.1</td>
|
||||
<td>90.8</td>
|
||||
<td>25.7</td>
|
||||
<td>18.3</td>
|
||||
<td>3.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>2822</strong></td>
|
||||
<td><strong>70.2</strong></td>
|
||||
<td><strong>897*</strong></td>
|
||||
<td><strong>71.9*</strong></td>
|
||||
<td><u>86.9*</u></td>
|
||||
<td><u>67.5</u></td>
|
||||
<td><u>64.0</u></td>
|
||||
<td><strong>2372.0*</strong></td>
|
||||
<td>80.5</td>
|
||||
<td><strong>85.8</strong></td>
|
||||
<td>50.4*</td>
|
||||
<td><u>51.9</u></td>
|
||||
<td>82.0</td>
|
||||
<td>93.5</td>
|
||||
<td><u>41.4*</u></td>
|
||||
<td><u>23.1*</u></td>
|
||||
<td><strong>3.8</strong></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
* We evaluate these benchmarks with chain-of-thought prompting; for MME, chain-of-thought is used only on the Cognition tasks.
|
||||
+ Token Density: the number of pixels encoded into each visual token at the maximum resolution, i.e., the number of pixels at the maximum resolution divided by the number of visual tokens.
|
||||
|
||||
Note: the Token Density of proprietary models is estimated from their API pricing.
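The Token Density definition above can be checked directly against the table: a maximum-resolution 1344x1344 image (about 1.8 million pixels) encoded into 640 visual tokens gives the reported value for MiniCPM-o 2.6. A minimal sketch of the computation:

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Pixels at the maximum resolution divided by the number of visual tokens."""
    return width * height / num_visual_tokens

# MiniCPM-o 2.6: a 1344x1344 image is encoded into 640 visual tokens
print(round(token_density(1344, 1344, 640)))  # 2822, matching the table
```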
|
||||
|
||||
**Multi-image and Video Understanding**
|
||||
|
||||
<div align="center">
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>BLINK val</th>
|
||||
<th>Mantis Eval</th>
|
||||
<th>MIRB</th>
|
||||
<th>Video-MME (wo / w subs)</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
|
||||
<td>-</td>
|
||||
<td><strong>68</strong></td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td><strong>71.9/77.2</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT4V</td>
|
||||
<td>-</td>
|
||||
<td>54.6</td>
|
||||
<td>62.7</td>
|
||||
<td>53.1</td>
|
||||
<td>59.9/63.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
|
||||
<td>14B</td>
|
||||
<td>52.6</td>
|
||||
<td>66.4</td>
|
||||
<td>30.2</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
|
||||
<td>72B</td>
|
||||
<td>55.4</td>
|
||||
<td><strong>77.6</strong></td>
|
||||
<td>-</td>
|
||||
<td><u>66.2/69.5</u></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MANTIS 8B</td>
|
||||
<td>8B</td>
|
||||
<td>49.1</td>
|
||||
<td>59.5</td>
|
||||
<td>34.8</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
|
||||
<td>8B</td>
|
||||
<td>53.2</td>
|
||||
<td>69.6*</td>
|
||||
<td><strong>67.6*</strong></td>
|
||||
<td>63.3/69.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
|
||||
<td>8B</td>
|
||||
<td>54.8</td>
|
||||
<td>67.7</td>
|
||||
<td>52.5</td>
|
||||
<td>64.2/66.9</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td>53.0</td>
|
||||
<td>69.1</td>
|
||||
<td>53.8</td>
|
||||
<td>60.9/63.6</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><u>56.7</u></td>
|
||||
<td><u>71.9</u></td>
|
||||
<td><u>58.6</u></td>
|
||||
<td>63.9/67.9</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</div>
|
||||
* Results are evaluated on the officially released model weights.
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Click to view detailed audio understanding and speech generation results.</summary>
|
||||
|
||||
**Audio Understanding**
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Task</th>
|
||||
<th>Size</th>
|
||||
<th colspan="3">ASR (zh)</th>
|
||||
<th colspan="3">ASR (en)</th>
|
||||
<th colspan="2">AST</th>
|
||||
<th>Emotion</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Metric</th>
|
||||
<td></td>
|
||||
<th colspan="3">CER↓</th>
|
||||
<th colspan="3">WER↓</th>
|
||||
<th colspan="2">BLEU↑</th>
|
||||
<th>ACC↑</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Dataset</th>
|
||||
<td></td>
|
||||
<th>AISHELL-1</th>
|
||||
<th>Fleurs zh</th>
|
||||
<th>WenetSpeech test-net</th>
|
||||
<th>LibriSpeech test-clean</th>
|
||||
<th>GigaSpeech</th>
|
||||
<th>TED-LIUM</th>
|
||||
<th>CoVoST en2zh</th>
|
||||
<th>CoVoST zh2en</th>
|
||||
<th>MELD emotion</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
|
||||
<td>-</td>
|
||||
<td>7.3*</td>
|
||||
<td><u>5.4*</u></td>
|
||||
<td>28.9*</td>
|
||||
<td>2.6*</td>
|
||||
<td>12.9*</td>
|
||||
<td>4.8*</td>
|
||||
<td>37.1*</td>
|
||||
<td>15.7*</td>
|
||||
<td>33.2*</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
|
||||
<td>-</td>
|
||||
<td>4.5*</td>
|
||||
<td>5.9*</td>
|
||||
<td>14.3*</td>
|
||||
<td>2.9*</td>
|
||||
<td>10.6*</td>
|
||||
<td><strong>3.0*</strong></td>
|
||||
<td><u>47.3*</u></td>
|
||||
<td>22.6*</td>
|
||||
<td>48.4*</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Open-Source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2-Audio-7B</td>
|
||||
<td>8B</td>
|
||||
<td>-</td>
|
||||
<td>7.5</td>
|
||||
<td>-</td>
|
||||
<td><strong>1.6</strong></td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>45.2</td>
|
||||
<td><u>24.4</u></td>
|
||||
<td><strong>55.3</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td>
|
||||
<td>8B</td>
|
||||
<td>2.6*</td>
|
||||
<td>6.9*</td>
|
||||
<td><u>10.3*</u></td>
|
||||
<td>3.1*</td>
|
||||
<td><u>9.7</u>*</td>
|
||||
<td>5.9*</td>
|
||||
<td>39.5*</td>
|
||||
<td>22.9*</td>
|
||||
<td>17.4*</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
|
||||
<td>9B</td>
|
||||
<td><u>2.5</u></td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>2.8</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>1.6</strong></td>
|
||||
<td><strong>4.4</strong></td>
|
||||
<td><strong>6.9</strong></td>
|
||||
<td><u>1.7</u></td>
|
||||
<td><strong>8.7</strong></td>
|
||||
<td><strong>3.0</strong></td>
|
||||
<td><strong>48.2</strong></td>
|
||||
<td><strong>27.2</strong></td>
|
||||
<td><u>52.4</u></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
* Results are evaluated on the officially released model weights.<br><br>
|
||||
|
||||
**Speech Generation**
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Task</th>
|
||||
<th>Size</th>
|
||||
<th colspan="9">SpeechQA</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Metric</th>
|
||||
<th></th>
|
||||
<th colspan="3">ACC↑</th>
|
||||
<th>G-Eval (10 point)↑</th>
|
||||
<th>Semantic ELO score↑</th>
|
||||
<th>Acoustic ELO score↑</th>
|
||||
<th>Overall ELO score↑</th>
|
||||
<th>UTMOS↑</th>
|
||||
<th>ASR-WER↓</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Dataset</th>
|
||||
<th></th>
|
||||
<th>Speech Llama Q.</th>
|
||||
<th>Speech Web Q.</th>
|
||||
<th>Speech Trivia QA</th>
|
||||
<th>Speech AlpacaEval</th>
|
||||
<th colspan="5">AudioArena</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
|
||||
<td></td>
|
||||
<td><strong>71.7</strong></td>
|
||||
<td><strong>51.6</strong></td>
|
||||
<td><strong>69.7</strong></td>
|
||||
<td><strong>7.4</strong></td>
|
||||
<td><strong>1157</strong></td>
|
||||
<td><strong>1203</strong></td>
|
||||
<td><strong>1200</strong></td>
|
||||
<td><strong>4.2</strong></td>
|
||||
<td><strong>2.3</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="11" align="left"><strong>Open-Source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4-Voice</td>
|
||||
<td>9B</td>
|
||||
<td>50.0</td>
|
||||
<td>32.0</td>
|
||||
<td>36.4</td>
|
||||
<td><u>5.1</u></td>
|
||||
<td>999</td>
|
||||
<td>1147</td>
|
||||
<td>1035</td>
|
||||
<td><u>4.1</u></td>
|
||||
<td><u>11.7</u></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Llama-Omni</td>
|
||||
<td>8B</td>
|
||||
<td>45.3</td>
|
||||
<td>22.9</td>
|
||||
<td>10.7</td>
|
||||
<td>3.9</td>
|
||||
<td>960</td>
|
||||
<td>878</td>
|
||||
<td>897</td>
|
||||
<td>3.2</td>
|
||||
<td>24.3</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VITA-1.5</td>
|
||||
<td>8B</td>
|
||||
<td>46.7</td>
|
||||
<td>28.1</td>
|
||||
<td>23.3</td>
|
||||
<td>2.0</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
<td>-</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Moshi</td>
|
||||
<td>7B</td>
|
||||
<td>43.7</td>
|
||||
<td>23.8</td>
|
||||
<td>16.7</td>
|
||||
<td>2.4</td>
|
||||
<td>871</td>
|
||||
<td>808</td>
|
||||
<td>875</td>
|
||||
<td>2.8</td>
|
||||
<td>8.2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Mini-Omni</td>
|
||||
<td>1B</td>
|
||||
<td>22.0</td>
|
||||
<td>12.8</td>
|
||||
<td>6.9</td>
|
||||
<td>2.5</td>
|
||||
<td>926</td>
|
||||
<td>803</td>
|
||||
<td>865</td>
|
||||
<td>3.4</td>
|
||||
<td>10.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><u>61.0</u></td>
|
||||
<td><u>40.0</u></td>
|
||||
<td><u>40.2</u></td>
|
||||
<td><u>5.1</u></td>
|
||||
<td><u>1088</u></td>
|
||||
<td><u>1163</u></td>
|
||||
<td><u>1131</u></td>
|
||||
<td><strong>4.2</strong></td>
|
||||
<td>9.8</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
All results are from <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
|
||||
|
||||
**End-to-end Voice Cloning**
|
||||
|
||||
<div align="center">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Task</th>
|
||||
<th colspan="2">TTS</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Metric</th>
|
||||
<th>SIMO↑</th>
|
||||
<th>SIMO↑</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th align="left">Dataset</th>
|
||||
<th>Seed-TTS test-zh</th>
|
||||
<th>Seed-TTS test-en</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">F5-TTS</td>
|
||||
<td><strong>76</strong></td>
|
||||
<td><strong>67</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">CosyVoice</td>
|
||||
<td><u>75</u></td>
|
||||
<td><u>64</u></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">FireRedTTS</td>
|
||||
<td>63</td>
|
||||
<td>46</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>57</td>
|
||||
<td>47</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Click to view detailed multimodal live streaming results.</summary>
|
||||
|
||||
**Multimodal Live Streaming**: results on StreamingBench
|
||||
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Real-Time Video Understanding</th>
|
||||
<th>Omni-Source Understanding</th>
|
||||
<th>Contextual Understanding</th>
|
||||
<th>Overall</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Proprietary</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
|
||||
<td>-</td>
|
||||
<td><u>77.4</u></td>
|
||||
<td><strong>67.8</strong></td>
|
||||
<td><strong>51.1</strong></td>
|
||||
<td><strong>70.3</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GPT-4o-202408</td>
|
||||
<td>-</td>
|
||||
<td>74.5</td>
|
||||
<td>51.0</td>
|
||||
<td><u>48.0</u></td>
|
||||
<td>64.1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
|
||||
<td>-</td>
|
||||
<td>74.0</td>
|
||||
<td>41.4</td>
|
||||
<td>37.8</td>
|
||||
<td>59.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="6" align="left"><strong>Open-source</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VILA-1.5</td>
|
||||
<td>8B</td>
|
||||
<td>61.5</td>
|
||||
<td>37.5</td>
|
||||
<td>26.7</td>
|
||||
<td>49.5</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LongVA</td>
|
||||
<td>7B</td>
|
||||
<td>63.1</td>
|
||||
<td>35.9</td>
|
||||
<td>30.2</td>
|
||||
<td>50.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
|
||||
<td>34B</td>
|
||||
<td>69.8</td>
|
||||
<td>41.7</td>
|
||||
<td>34.3</td>
|
||||
<td>56.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
|
||||
<td>8B</td>
|
||||
<td>71.2</td>
|
||||
<td>40.7</td>
|
||||
<td>33.1</td>
|
||||
<td>57.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternVL2-8B</td>
|
||||
<td>8B</td>
|
||||
<td>70.1</td>
|
||||
<td>42.7</td>
|
||||
<td>34.1</td>
|
||||
<td>57.0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">VITA-1.5</td>
|
||||
<td>8B</td>
|
||||
<td>70.9</td>
|
||||
<td>40.8</td>
|
||||
<td>35.8</td>
|
||||
<td>57.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
|
||||
<td>8B</td>
|
||||
<td>74.3</td>
|
||||
<td>40.8</td>
|
||||
<td>31.0</td>
|
||||
<td>58.4</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
|
||||
<td>8B</td>
|
||||
<td>75.4</td>
|
||||
<td>46.2</td>
|
||||
<td>33.6</td>
|
||||
<td>60.8</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
|
||||
<td>8B</td>
|
||||
<td>72.4</td>
|
||||
<td>40.2</td>
|
||||
<td>33.4</td>
|
||||
<td>57.7</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
|
||||
<td>8B</td>
|
||||
<td><strong>79.9</strong></td>
|
||||
<td><u>53.4</u></td>
|
||||
<td>38.5</td>
|
||||
<td><u>66.0</u></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
|
||||
Below are live iPad Pro demos and web demo samples of MiniCPM-o 2.6:
|
||||
|
||||
|
||||
<div align="center">
|
||||
<a href="https://www.youtube.com/watch?v=vRIMbxJzStY&t=2s"><img src="./assets/minicpmo2_6/2dot6_o_demo_video_img.png" width="70%"></a>
|
||||
</div>
|
||||
<br>
|
||||
|
||||
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
|
||||
<img src="assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
|
||||
<img src="assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
|
||||
|
||||
158
docs/minicpm_v4dot5_en.md
Normal file
@@ -0,0 +1,158 @@
|
||||
## MiniCPM-V 4.5
|
||||
|
||||
> Archived at: 2026-02-03
|
||||
|
||||
**MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces new useful features. Notable features of MiniCPM-V 4.5 include:
|
||||
|
||||
- 🔥 **State-of-the-art Vision-Language Capability.**
|
||||
MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.
|
||||
|
||||
- 🎬 **Efficient High-FPS and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448x448 video frames are jointly compressed into 64 video tokens, where most MLLMs would use about 1,536 tokens. The model can therefore perceive significantly more video frames without increasing the LLM inference cost, bringing state-of-the-art high-FPS (up to 10 FPS) video understanding and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, and more.
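The token-budget arithmetic behind the 96x figure can be checked directly. A minimal sketch; the 14-pixel patch size is an assumption based on SigLIP-style encoders and is not stated in this document:

```python
# Illustrative token-budget math for the 96x video-token compression claim.
# Assumption: a 448x448 frame is patchified into 14x14-pixel patches
# (SigLIP-style), giving 32*32 = 1024 visual tokens per frame.
FRAME_SIDE = 448
PATCH_SIDE = 14            # assumed encoder patch size
FRAMES_PER_GROUP = 6       # frames jointly compressed by the 3D-Resampler
RESAMPLER_TOKENS = 64      # output tokens per frame group

patches_per_frame = (FRAME_SIDE // PATCH_SIDE) ** 2   # 1024
raw_tokens = FRAMES_PER_GROUP * patches_per_frame     # 6144
compression_rate = raw_tokens // RESAMPLER_TOKENS     # 96

print(patches_per_frame, raw_tokens, compression_rate)
```

Under these assumptions, 6144 raw visual tokens collapse to 64, which matches the stated 96x rate and the "~1,536 tokens for most MLLMs" comparison (1,536 / 6 frames = 256 tokens per frame after typical 2x2 pooling).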
|
||||
|
||||
- ⚙️ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking, for efficient everyday use with competitive performance, and deep thinking, for more complex problem solving. The mode can be switched in a highly controllable fashion to cover the efficiency-performance trade-offs of different user scenarios.
|
||||
|
||||
- 💪 **Strong OCR, Document Parsing and Others.**
|
||||
Based on [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
|
||||
|
||||
|
||||
- 💫 **Easy Usage.**
|
||||
MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usage!
|
||||
|
||||
|
||||
### Key Techniques <!-- omit in toc -->
|
||||
|
||||
|
||||
<div align="center">
|
||||
<img src="../assets/minicpm-v-4dot5-framework.png" width="100%">
|
||||
</div>
|
||||
|
||||
- **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96x compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
|
||||
|
||||
- **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents through separate training procedures. We observe that the essential difference between the two lies in the visibility of text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to switch adaptively between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers for knowledge learning from documents and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
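The corrupt-and-reconstruct idea can be sketched in a few lines. This is a hypothetical stand-in, not the authors' pipeline: a text bounding box is degraded pixel-by-pixel with probability equal to a sampled noise level, while the reconstruction target stays the original text.

```python
import random

def corrupt_text_region(image, box, noise_level, rng=None):
    """Replace pixels inside a text bounding box with random noise.

    noise_level=0.0 leaves the text fully legible (pure OCR supervision);
    noise_level=1.0 obscures it entirely, forcing the model to answer from
    multimodal context instead of reading. Hypothetical stand-in for the
    document-corruption augmentation described above.
    """
    rng = rng or random.Random(0)
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]          # copy the grayscale image
    for y in range(y0, y1):
        for x in range(x0, x1):
            if rng.random() < noise_level:
                out[y][x] = rng.randrange(256)
    return out

img = [[255] * 8 for _ in range(8)]          # toy 8x8 white "document"
legible = corrupt_text_region(img, (2, 2, 6, 6), 0.0)
obscured = corrupt_text_region(img, (2, 2, 6, 6), 1.0)
```

In a training loop, `noise_level` would be sampled per document so a single objective smoothly interpolates between OCR and knowledge supervision.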
|
||||
|
||||
- **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.
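The switchable mode surfaces at the chat-template level: in Qwen3-style templates, fast thinking is expressed as an empty `<think>` block in the assistant turn, while deep thinking leaves the block open for reasoning tokens. A minimal sketch of that convention; `render_assistant_prefix` is a hypothetical helper, not the model's API:

```python
# Sketch of how fast vs. deep thinking surfaces in a Qwen3-style chat
# template. Fast mode pins an empty <think></think> span so the model
# answers immediately; deep mode omits it, letting the model emit
# reasoning tokens first. Hypothetical helper for illustration only.
def render_assistant_prefix(enable_thinking: bool) -> str:
    prefix = "<|im_start|>assistant\n"
    if not enable_thinking:
        prefix += "<think>\n\n</think>\n\n"  # empty reasoning span = fast mode
    return prefix

fast = render_assistant_prefix(enable_thinking=False)
deep = render_assistant_prefix(enable_thinking=True)
```

This mirrors the `enable_thinking` flag and the `"<think>\n\n</think>\n\n"` marker visible in the fine-tuning diff later in this page.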
|
||||
|
||||
### Evaluation <!-- omit in toc -->
|
||||
|
||||
<div align="center">
|
||||
<img src="../assets/radar_minicpm_v45.png" width="60%">
|
||||
</div>
|
||||
<div align="center">
|
||||
<img src="../assets/minicpmv_4_5_evaluation_result.png" width="80%">
|
||||
</div>
|
||||
|
||||
|
||||
### Inference Efficiency <!-- omit in toc -->
|
||||
|
||||
|
||||
**OpenCompass**
|
||||
<div align="left">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Avg Score ↑</th>
|
||||
<th>Total Inference Time ↓</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
|
||||
<td>10.3B</td>
|
||||
<td>76.6</td>
|
||||
<td>17.5h</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiMo-VL-7B-RL</td>
|
||||
<td>8.3B</td>
|
||||
<td>76.4</td>
|
||||
<td>11h</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
|
||||
<td>8.7B</td>
|
||||
<td><b>77.0</b></td>
|
||||
<td><b>7.5h</b></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
**Video-MME**
|
||||
|
||||
<div align="left">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Avg Score ↑</th>
|
||||
<th>Total Inference Time ↓</th>
|
||||
<th>GPU Mem ↓</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>71.6</td>
|
||||
<td>3h</td>
|
||||
<td>60G</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
|
||||
<td>10.3B</td>
|
||||
<td><b>73.6</b></td>
|
||||
<td>2.63h</td>
|
||||
<td>32G</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
|
||||
<td>8.7B</td>
|
||||
<td>73.5</td>
|
||||
<td><b>0.26h</b></td>
|
||||
<td><b>28G</b></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported Video-MME inference time includes full model-side computation and excludes the external cost of video frame extraction (which depends on the specific frame-extraction tool) for fair comparison.
|
||||
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
|
||||
<div align="center">
|
||||
<a href="https://www.youtube.com/watch?v=Cn23FujYMMU"><img src="../assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg" width="70%"></a>
|
||||
</div>
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv4_5/en_case1.png" alt="en_case1" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv4_5/en_case2.png" alt="en_case2" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
|
||||
<details>
|
||||
<summary>Click to view more cases.</summary>
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv4_5/zh_extra.jpeg" alt="zh_extra" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
|
||||
</details>
|
||||
|
||||
We deployed MiniCPM-V 4.5 on an iPad M4 with the [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo videos are raw screen recordings without any editing.
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4_5/v45_en_handwriting.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4_5/v45_en_cot.gif" width=45%/>
|
||||
</p>
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4_5/v45_cn_handwriting.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4_5/v45_cn_travel.gif" width=45%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
156
docs/minicpm_v4dot5_zh.md
Normal file
@@ -0,0 +1,156 @@
|
||||
## MiniCPM-V 4.5
|
||||
|
||||
> Archived at: 2026-02-03
|
||||
|
||||
**MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. Built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters, it delivers a significant performance improvement over previous MiniCPM-V and MiniCPM-o models and introduces a set of new useful features. Highlights include:
|
||||
|
||||
|
||||
- 🔥 **State-of-the-art Vision-Language Capability**
|
||||
MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, as well as strong open-source models like Qwen2.5-VL 72B**, making it the most performant MLLM under 30B parameters.
|
||||
|
||||
- 🎬 **Efficient High-FPS and Long Video Understanding**
|
||||
Powered by a new unified image-video 3D-Resampler, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448x448 video frames are jointly compressed into 64 tokens, where most MLLMs would use about 1,536. The model can therefore perceive significantly more video frames without increasing the LLM inference cost, delivering state-of-the-art high-FPS (up to 10 FPS) video understanding and long video understanding efficiently on Video-MME, LVBench, MLVU, MotionBench, FavorBench, and other benchmarks.
|
||||
|
||||
- ⚙️ **Controllable Hybrid Fast/Deep Thinking**
|
||||
MiniCPM-V 4.5 supports both fast thinking, for efficient high-frequency use with competitive performance, and deep thinking, for complex problem solving. Users can freely switch between the two modes to balance efficiency and performance for their scenario, keeping the reasoning process highly controllable.
|
||||
|
||||
- 💪 **Strong OCR, Document Parsing and Multilingual Capability**
|
||||
Built on the [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images of any aspect ratio up to 1.8 million pixels (e.g., 1344x1344), while using only 1/4 the visual tokens of most MLLMs. It surpasses proprietary models such as GPT-4o-latest and Gemini 2.5 on OCRBench, and demonstrates state-of-the-art PDF document parsing on OmniDocBench. With the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, the model shows trustworthy behavior, outperforming GPT-4o-latest on MMHal-Bench, and supports more than 30 languages.
|
||||
|
||||
- 💫 **Easy Usage**
|
||||
MiniCPM-V 4.5 can be used in a variety of ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/master/docs/multimodal/minicpmo4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices; (2) quantized models in 16 sizes in [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf), and [AWQ](https://github.com/tc-mb/AutoAWQ) formats; (3) SGLang and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) compatibility; (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md); (5) a quick local [WebUI demo](#chat-with-our-demo-on-gradio); (6) an optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) that runs efficiently on iPhone and iPad; and (7) an online [web demo](http://101.126.42.235:30910/). See the [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for more usage options.
|
||||
|
||||
### Key Techniques <!-- omit in toc -->
|
||||
|
||||
- **Architecture: unified high-density image-video compression with a 3D-Resampler.** MiniCPM-V 4.5 introduces a 3D-Resampler that breaks the performance-efficiency trade-off in video understanding. It compresses up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), achieving a 96x compression rate for video tokens. The model can thus process more video frames without additional LLM computational cost, enabling high-FPS and long video understanding. The architecture supports unified encoding of single images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
|
||||
|
||||
- **Training mechanism: unified learning of OCR and document knowledge.** Existing MLLMs typically train OCR capability and document knowledge separately in different stages. We observe that the essential difference between these two processes lies in the visibility of text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to switch adaptively between accurate text recognition (when text is clear) and multimodal context-based knowledge reasoning (when text is heavily obscured). This frees MiniCPM-V from relying on error-prone document parsers for knowledge learning and avoids hallucinations from over-augmented OCR data, achieving top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
|
||||
|
||||
- **Post-training: hybrid fast/deep thinking with multimodal reinforcement learning.** MiniCPM-V 4.5 offers a balanced experience through two switchable reasoning modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a novel hybrid reinforcement learning method, the model jointly optimizes both modes, significantly improving fast-mode performance while preserving deep-mode capability. Combined with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from massive multimodal data while effectively reducing hallucinations.
|
||||
|
||||
<div align="center">
|
||||
<img src="../assets/minicpm-v-4dot5-framework.png" width="80%">
|
||||
</div>
|
||||
|
||||
### Evaluation <!-- omit in toc -->
|
||||
|
||||
<div align="center">
|
||||
<img src="../assets/radar_minicpm_v45.png" width="80%">
|
||||
</div>
|
||||
<div align="center">
|
||||
<img src="../assets/minicpmv_4_5_evaluation_result.png" width="80%">
|
||||
</div>
|
||||
|
||||
|
||||
### Inference Efficiency <!-- omit in toc -->
|
||||
|
||||
|
||||
**OpenCompass**
|
||||
<div align="left">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Avg Score ↑</th>
|
||||
<th>Total Inference Time ↓</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
|
||||
<td>10.3B</td>
|
||||
<td>76.6</td>
|
||||
<td>17.5h</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiMo-VL-7B-RL</td>
|
||||
<td>8.3B</td>
|
||||
<td>76.4</td>
|
||||
<td>11h</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
|
||||
<td>8.7B</td>
|
||||
<td><b>77.0</b></td>
|
||||
<td><b>7.5h</b></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
**Video-MME**
|
||||
|
||||
<div align="left">
|
||||
<table style="margin: 0px auto;">
|
||||
<thead>
|
||||
<tr>
|
||||
<th align="left">Model</th>
|
||||
<th>Size</th>
|
||||
<th>Avg Score ↑</th>
|
||||
<th>Total Inference Time ↓</th>
|
||||
<th>GPU Mem ↓</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody align="center">
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
|
||||
<td>8.3B</td>
|
||||
<td>71.6</td>
|
||||
<td>3h</td>
|
||||
<td>60G</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">GLM-4.1V-9B-Thinking</td>
|
||||
<td>10.3B</td>
|
||||
<td><b>73.6</b></td>
|
||||
<td>2.63h</td>
|
||||
<td>32G</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td nowrap="nowrap" align="left">MiniCPM-V 4.5</td>
|
||||
<td>8.7B</td>
|
||||
<td>73.5</td>
|
||||
<td><b>0.26h</b></td>
|
||||
<td><b>28G</b></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
||||
Both OpenCompass and Video-MME were evaluated using 8×A100 GPUs for inference; the reported Video-MME inference time excludes video frame extraction.
|
||||
|
||||
### Examples <!-- omit in toc -->
|
||||
<div align="center">
|
||||
<a href="https://www.youtube.com/watch?v=Cn23FujYMMU"><img src="../assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg" width="70%"></a>
|
||||
</div>
|
||||
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv4_5/zh_case1.jpeg" alt="zh_case1" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv4_5/zh_case2.jpeg" alt="zh_case2" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
|
||||
<details>
|
||||
<summary>Click to view more cases.</summary>
|
||||
<div style="display: flex; flex-direction: column; align-items: center;">
|
||||
<img src="../assets/minicpmv4_5/en_extra.jpg" alt="en_extra" style="margin-bottom: 5px;">
|
||||
<img src="../assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
|
||||
</div>
|
||||
</details>
|
||||
|
||||
|
||||
We deployed MiniCPM-V 4.5 on an iPad M4 using the [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS) and recorded the following demo videos; the recordings are unedited.
|
||||
|
||||
<table align="center">
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4_5/v45_en_handwriting.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4_5/v45_en_cot.gif" width=45%/>
|
||||
</p>
|
||||
<p align="center">
|
||||
<img src="../assets/minicpmv4_5/v45_cn_handwriting.gif" width=45%/>
|
||||
|
||||
<img src="../assets/minicpmv4_5/v45_cn_travel.gif" width=45%/>
|
||||
</p>
|
||||
</table>
|
||||
|
||||
@@ -69,6 +69,7 @@ class SupervisedDataset(Dataset):
|
||||
batch_vision=self.batch_vision,
|
||||
max_length=self.max_length
|
||||
)
|
||||
|
||||
ret = dict(
|
||||
input_ids=ret["input_ids"],
|
||||
position_ids=ret["position_ids"],
|
||||
@@ -80,7 +81,7 @@ class SupervisedDataset(Dataset):
|
||||
)
|
||||
except:
|
||||
logger.error(f"data fetch error")
|
||||
return self.__getitem__(random.randint(0, len(self)))
|
||||
return self.__getitem__(random.randint(0, len(self)))
|
||||
return ret
|
||||
|
||||
|
||||
@@ -283,20 +284,30 @@ def conversation_to_ids_qwen2(conversation, tokenizer):
|
||||
chat.append({"role":prefix, "content":message})
|
||||
raw_msg += prefix + message
|
||||
assert set([i['role'] for i in chat]) & set(['assistant'])
|
||||
if '<think>' in chat[-1]['content'] and '</think>' in chat[-1]['content']:
|
||||
enable_thinking = True
|
||||
else:
|
||||
enable_thinking = False
|
||||
|
||||
ret = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
|
||||
input_ids = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=False)
|
||||
ret = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False, enable_thinking=enable_thinking)
|
||||
input_ids = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=False, enable_thinking=enable_thinking)
|
||||
input_ids = np.array(input_ids)
|
||||
|
||||
if "<think>\n\n</think>\n\n" in ret:
|
||||
offset = 4
|
||||
else:
|
||||
offset = 0
|
||||
start_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_start|>'))[0]
|
||||
assistant_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('assistant'))[0]
|
||||
end_idxs = np.where(input_ids == tokenizer.convert_tokens_to_ids('<|im_end|>'))[0]
|
||||
|
||||
context = np.ones_like(input_ids, dtype=np.int8)
|
||||
|
||||
for assistant_idx in assistant_idxs:
|
||||
for i, assistant_idx in enumerate(assistant_idxs):
|
||||
if assistant_idx-1 in set(start_idxs):
|
||||
st = assistant_idx + 1
|
||||
if i == len(assistant_idxs) -1:
|
||||
st = assistant_idx + 2 + offset
|
||||
else:
|
||||
st = assistant_idx + 2
|
||||
for end_idx in end_idxs:
|
||||
if end_idx > st:
|
||||
context[st: end_idx + 1] = 0
|
||||
|
||||
@@ -52,7 +52,7 @@ torchrun $DISTRIBUTED_ARGS finetune.py \
|
||||
--per_device_train_batch_size 1 \
|
||||
--per_device_eval_batch_size 1 \
|
||||
--gradient_accumulation_steps 1 \
|
||||
--evaluation_strategy "steps" \
|
||||
--eval_strategy "steps" \
|
||||
--save_strategy "steps" \
|
||||
--save_steps 1000 \
|
||||
--save_total_limit 10 \
|
||||
|
||||
@@ -35,8 +35,8 @@ http://thunlp.oss-cn-qingdao.aliyuncs.com/multi_modal/never_delete/modelscope_st
|
||||
decord
|
||||
aiosignal
|
||||
tensorboard
|
||||
deepspeed==0.12.3
|
||||
transformers==4.44.2
|
||||
deepspeed
|
||||
transformers==4.51.2
|
||||
librosa==0.9.0
|
||||
soundfile==0.12.1
|
||||
vector-quantize-pytorch==1.18.5
|
||||
|
||||
@@ -7,7 +7,7 @@ from transformers.trainer_pt_utils import nested_detach
|
||||
from transformers.utils import is_sagemaker_mp_enabled
|
||||
from transformers.trainer import *
|
||||
from transformers.integrations import is_deepspeed_zero3_enabled
|
||||
|
||||
from typing import Dict, List, Optional, Tuple
|
||||
|
||||
class CPMTrainer(Trainer):
|
||||
def compute_loss(self, model, inputs, return_outputs=False):
|
||||
@@ -170,7 +170,7 @@ class CPMTrainer(Trainer):
|
||||
|
||||
return (loss, logits, labels)
|
||||
|
||||
def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
|
||||
def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]], num_items_in_batch=None) -> torch.Tensor:
|
||||
"""
|
||||
Perform a training step on a batch of inputs.
|
||||
|
||||
@@ -189,8 +189,7 @@ class CPMTrainer(Trainer):
|
||||
`torch.Tensor`: The tensor with training loss on this batch.
|
||||
"""
|
||||
model.train()
|
||||
inputs = self._prepare_inputs(inputs)
|
||||
|
||||
inputs = self._prepare_inputs(inputs)
|
||||
if is_sagemaker_mp_enabled():
|
||||
loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps)
|
||||
return loss_mb.reduce_mean().detach().to(self.args.device)
|
||||
|
||||