diff --git a/README.md b/README.md index cac9306..f0250e8 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@
- + **A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone** @@ -20,21 +20,21 @@ -
+

- MiniCPM-V 4.0 🤗 🤖 | MiniCPM-o 2.6 🤗 🤖 | MiniCPM-V 2.6 🤗 🤖 | 🍳 Cookbook | - 📄 Technical Blog [English/中文] + MiniCPM-V 4.5 🤗 🤖 | MiniCPM-o 2.6 🤗 🤖 | 🍳 Cookbook | + 📄 Technical Report (Coming Soon)

-**MiniCPM-o** is the latest series of end-side multimodal LLMs (MLLMs) ungraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion. Since February 2024, we have released 6 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in the series currently include:
+**MiniCPM-V** is a series of efficient end-side multimodal LLMs (MLLMs), which accept images, videos and text as inputs and deliver high-quality text outputs. **MiniCPM-o** additionally takes audio as input and provides high-quality speech outputs in an end-to-end fashion. Since February 2024, we have released 7 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in the series currently include:

-- **MiniCPM-V 4.0**: 🚀🚀🚀 The latest efficient model in the MiniCPM-V series. With a total of 4B parameters, the model **surpasses GPT-4.1-mini-20250414, Qwen2.5-VL-3B-Instruct, and InternVL2.5-8B** in image understanding on the OpenCompass evaluation. With its small parameter-size and efficient architecure, MiniCPM-V 4.0 is an ideal choice for on-device deployment on the phone (e.g., **less than 2s first token delay and more than 17 token/s decoding** on iPhone 16 Pro Max using the open-sourced iOS App).

-- **MiniCPM-o 2.6**: 🔥🔥🔥 The most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model **achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming**, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 **supports bilingual real-time speech conversation with configurable voices**, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc. It also advances MiniCPM-V 2.6's visual capabilities such **strong OCR capability, trustworthy behavior, multilingual support, and video understanding**. Due to its superior token density, MiniCPM-o 2.6 can for the first time **support multimodal live streaming on end-side devices** such as iPad.

+- **MiniCPM-V 4.5**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, this model **outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B** in vision-language capabilities, making it the most performant on-device multimodal model in the open-source community. This version brings **new features including efficient high refresh rate and long video understanding (up to 96x compression rate for video tokens), controllable hybrid fast/deep thinking, strong handwritten OCR and complex table/document parsing**. It also advances MiniCPM-V's popular features such as trustworthy behavior, multilingual support and end-side deployability.
+
+- **MiniCPM-o 2.6**: ⭐️⭐️⭐️ The most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model **achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming**, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 **supports bilingual real-time speech conversation with configurable voices**, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
Due to its superior token density, MiniCPM-o 2.6 can for the first time **support multimodal live streaming on end-side devices** such as iPad. -- **MiniCPM-V 2.6**: The most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single-image, multi-image and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet** in single image understanding, and can for the first time support real-time video understanding on iPad. @@ -42,19 +42,16 @@ #### 📌 Pinned -* [2025.08.05] 🚀🚀🚀 We open-source MiniCPM-V 4.0, which outperforms GPT-4.1-mini-20250414 in image understanding. It advances popular features of MiniCPM-V 2.6, and largely improves the efficiency. We also open-source the iOS App on iPhone and iPad. Try it now! - -* [2025.08.01] 🔥🔥🔥 We've open-sourced the [MiniCPM-V & o Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook)! It provides comprehensive guides for diverse user scenarios, paired with our new [Docs Site](https://minicpm-o.readthedocs.io/en/latest/index.html) for smoother onboarding. +* [2025.08.26] 🔥🔥🔥 We open-source MiniCPM-V 4.5, which outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B. It advances popular capabilities of MiniCPM-V, and brings useful new features. Try it now! +* [2025.08.01] ⭐️⭐️⭐️ We open-sourced the [MiniCPM-V & o Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook)! It provides comprehensive guides for diverse user scenarios, paired with our new [Docs Site](https://minicpm-o.readthedocs.io/en/latest/index.html) for smoother onboarding. * [2025.06.20] ⭐️⭐️⭐️ Our official [Ollama repository](https://ollama.com/openbmb) is released. Try our latest models with [one click](https://ollama.com/openbmb/minicpm-o2.6)! -* [2025.03.01] 🚀🚀🚀 RLAIF-V, which is the alignment technique of MiniCPM-o, is accepted by CVPR 2025!The [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), [paper](https://arxiv.org/abs/2405.17220) are open-sourced! +* [2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, is accepted by CVPR 2025 Highlights!The [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), [paper](https://arxiv.org/abs/2405.17220) are open-sourced! * [2025.01.24] 📢📢📢 MiniCPM-o 2.6 technical report is released! See [here](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9). -* [2025.01.23] 💡💡💡 MiniCPM-o 2.6 is now supported by [Align-Anything](https://github.com/PKU-Alignment/align-anything), a framework by PKU-Alignment Team for aligning any-to-any modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio. Try it now! - * [2025.01.19] 📢 **ATTENTION!** We are currently working on merging MiniCPM-o 2.6 into the official repositories of llama.cpp, Ollama, and vllm. Until the merge is complete, please USE OUR LOCAL FORKS of [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md), [Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md), and [vllm](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm). **Using the official repositories before the merge may lead to unexpected issues**. * [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending! 
@@ -76,6 +73,10 @@
Click to view more news.

+* [2025.08.02] 🚀🚀🚀 We open-source MiniCPM-V 4.0, which outperforms GPT-4.1-mini-20250414 in image understanding. It advances popular features of MiniCPM-V 2.6, and largely improves the efficiency. We also open-source the iOS App on iPhone and iPad. Try it now!
+
+* [2025.01.23] 💡💡💡 MiniCPM-o 2.6 is now supported by [Align-Anything](https://github.com/PKU-Alignment/align-anything), a framework by PKU-Alignment Team for aligning any-to-any modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio. Try it now!
+
* [2024.08.15] We now also support multi-image SFT. For more details, please refer to the [document](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune).
* [2024.08.14] MiniCPM-V 2.6 now also supports [fine-tuning](https://github.com/modelscope/ms-swift/issues/1613) with the SWIFT framework!
* [2024.08.10] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported by [official](https://github.com/ggerganov/llama.cpp) llama.cpp! GGUF models of various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf).

@@ -107,10 +108,9 @@

## Contents

-- [MiniCPM-V 4.0](#minicpm-v-40)
-  - [Examples](#examples)
+- [MiniCPM-V 4.5](#minicpm-v-45)
+  - [Key Techniques](#key-techniques)
- [MiniCPM-o 2.6](#minicpm-o-26)
-- [MiniCPM-V 2.6](#minicpm-v-26)
- [MiniCPM-V \& o Cookbook](#minicpm-v--o-cookbook)
- [Chat with Our Demo on Gradio 🤗](#chat-with-our-demo-on-gradio-)
- [Inference](#inference)

@@ -130,559 +130,82 @@

- [Limitations](#limitations)

-## MiniCPM-V 4.0
+## MiniCPM-V 4.5

-**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. The model is built based on SigLIP2-400M and MiniCPM4-3B with a total of 4.1B parameters. It inherits the strong single-image, multi-image and video understanding performance of MiniCPM-V 2.6 with largely improved efficiency. Notable features of MiniCPM-V 4.0 include:
+**MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces new useful features. Notable features of MiniCPM-V 4.5 include:

-- 🔥 **Leading Visual Capability.**
-  With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2) and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also shows good performance in multi-image understanding and video understanding.
+- 🔥 **State-of-the-art Vision-Language Capability.**
+  MiniCPM-V 4.5 achieves an average score of 77.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.
+
+- 🎬 **Efficient High Refresh Rate and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve a 96x compression rate for video tokens, where six 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means that the model can perceive significantly more video frames without increasing the LLM inference cost.
This brings efficient, state-of-the-art high refresh rate (up to 10FPS) video understanding and long video understanding on benchmarks such as Video-MME, LVBench, MLVU, MotionBench, and FavorBench.
+
+- ⚙️ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking, for efficient frequent use with competitive performance, and deep thinking, for more complex problem solving. The two modes can be switched in a highly controlled fashion to cover the efficiency/performance trade-offs of different user scenarios.
+
+- 💪 **Strong OCR, Document Parsing and Others.**
+Based on the [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art PDF document parsing performance on OmniDocBench among general MLLMs. Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.

-- 🚀 **Superior Efficiency.**
-  Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it devlivers **less than 2s first token delay and more than 17 token/s decoding on iPhone 16 Pro Max**, without heating problems. It also shows superior throughput under concurrent requests.

- 💫 **Easy Usage.**
-MiniCPM-V 4.0 can be easily used in various ways including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory and local web demo** etc. We also open-source iOS App that can run on iPhone and iPad. Get started easily with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), featuring detailed instructions and practical examples.
+MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usage guides!
+
+### Key Techniques
+
+ +
+
+- **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96x compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high refresh rate video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer.
+
+- **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and document knowledge through isolated training procedures. We observe that the essential difference between the two is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates the reliance on error-prone document parsers when learning knowledge from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
+
+- **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.

### Evaluation
-Click to view single image results on OpenCompass.

| Model | Size | OpenCompass | OCRBench | MathVista | HallusionBench | MMMU | MMVet | MMBench V1.1 | MMStar | AI2D |
|:------|:----:|:-----------:|:--------:|:---------:|:--------------:|:----:|:-----:|:------------:|:------:|:----:|
| **Proprietary** | | | | | | | | | | |
| GPT-4v-20240409 | - | 63.5 | 656 | 55.2 | 43.9 | 61.7 | 67.5 | 79.8 | 56.0 | 78.6 |
| Gemini-1.5-Pro | - | 64.5 | 754 | 58.3 | 45.6 | 60.6 | 64.0 | 73.9 | 59.1 | 79.1 |
| GPT-4.1-mini-20250414 | - | 68.9 | 840 | 70.9 | 49.3 | 55.0 | 74.3 | 80.9 | 60.9 | 76.0 |
| Claude 3.5 Sonnet-20241022 | - | 70.6 | 798 | 65.3 | 55.5 | 66.4 | 70.1 | 81.7 | 65.1 | 81.2 |
| **Open-source** | | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | 64.5 | 828 | 61.2 | 46.6 | 51.2 | 60.0 | 76.8 | 56.3 | 81.4 |
| InternVL2.5-4B | 3.7B | 65.1 | 820 | 60.8 | 46.6 | 51.8 | 61.5 | 78.2 | 58.7 | 81.4 |
| Qwen2.5-VL-7B-Instruct | 8.3B | 70.9 | 888 | 68.1 | 51.9 | 58.0 | 69.7 | 82.2 | 64.1 | 84.3 |
| InternVL2.5-8B | 8.1B | 68.1 | 821 | 64.5 | 49.0 | 56.2 | 62.8 | 82.5 | 63.2 | 84.6 |
| MiniCPM-V-2.6 | 8.1B | 65.2 | 852 | 60.8 | 48.1 | 49.8 | 60.0 | 78.0 | 57.5 | 82.1 |
| MiniCPM-o-2.6 | 8.7B | 70.2 | 889 | 73.3 | 51.1 | 50.9 | 67.2 | 80.6 | 63.3 | 86.1 |
| MiniCPM-V-4.0 | 4.1B | 69.0 | 894 | 66.9 | 50.8 | 51.2 | 68.0 | 79.7 | 62.8 | 82.9 |
+ +
+
+
-
-
-Click to view single image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench and MM Halbench. +### Examples

| Model | Size | ChartQA | MME | RealWorldQA | TextVQA | DocVQA | MathVision | DynaMath | WeMath | Obj Hal CHAIRs↓ | Obj Hal CHAIRi↓ | MM Hal score avg@3↑ | MM Hal hall rate avg@3↓ |
|:------|:----:|:-------:|:---:|:-----------:|:-------:|:------:|:----------:|:--------:|:------:|:---------------:|:---------------:|:-------------------:|:-----------------------:|
| **Proprietary** | | | | | | | | | | | | | |
| GPT-4v-20240409 | - | 78.5 | 1927 | 61.4 | 78.0 | 88.4 | - | - | - | - | - | - | - |
| Gemini-1.5-Pro | - | 87.2 | - | 67.5 | 78.8 | 93.1 | 41.0 | 31.5 | 50.5 | - | - | - | - |
| GPT-4.1-mini-20250414 | - | - | - | - | - | - | 45.3 | 47.7 | - | - | - | - | - |
| Claude 3.5 Sonnet-20241022 | - | 90.8 | - | 60.1 | 74.1 | 95.2 | 35.6 | 35.7 | 44.0 | - | - | - | - |
| **Open-source** | | | | | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | 84.0 | 2157 | 65.4 | 79.3 | 93.9 | 21.9 | 13.2 | 22.9 | 18.3 | 10.8 | 3.9 | 33.3 |
| InternVL2.5-4B | 3.7B | 84.0 | 2338 | 64.3 | 76.8 | 91.6 | 18.4 | 15.2 | 21.2 | 13.7 | 8.7 | 3.2 | 46.5 |
| Qwen2.5-VL-7B-Instruct | 8.3B | 87.3 | 2347 | 68.5 | 84.9 | 95.7 | 25.4 | 21.8 | 36.2 | 13.3 | 7.9 | 4.1 | 31.6 |
| InternVL2.5-8B | 8.1B | 84.8 | 2344 | 70.1 | 79.1 | 93.0 | 17.0 | 9.4 | 23.5 | 18.3 | 11.6 | 3.6 | 37.2 |
| MiniCPM-V-2.6 | 8.1B | 79.4 | 2348 | 65.0 | 80.1 | 90.8 | 17.5 | 9.0 | 20.4 | 7.3 | 4.7 | 4.0 | 29.9 |
| MiniCPM-o-2.6 | 8.7B | 86.9 | 2372 | 68.1 | 82.0 | 93.5 | 21.7 | 10.4 | 25.2 | 6.3 | 3.4 | 4.1 | 31.3 |
| MiniCPM-V-4.0 | 4.1B | 84.4 | 2298 | 68.5 | 80.8 | 92.9 | 20.7 | 14.2 | 32.7 | 6.3 | 3.5 | 4.1 | 29.2 |
+
-
- -
-Click to view multi-image and video understanding results on Mantis, Blink and Video-MME. -

| Model | Size | Mantis | Blink | Video-MME (w/o subs) | Video-MME (w subs) |
|:------|:----:|:------:|:-----:|:--------------------:|:------------------:|
| **Proprietary** | | | | | |
| GPT-4v-20240409 | - | 62.7 | 54.6 | 59.9 | 63.3 |
| Gemini-1.5-Pro | - | - | 59.1 | 75.0 | 81.3 |
| GPT-4o-20240513 | - | - | 68.0 | 71.9 | 77.2 |
| **Open-source** | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | - | 47.6 | 61.5 | 67.6 |
| InternVL2.5-4B | 3.7B | 62.7 | 50.8 | 62.3 | 63.6 |
| Qwen2.5-VL-7B-Instruct | 8.3B | - | 56.4 | 65.1 | 71.6 |
| InternVL2.5-8B | 8.1B | 67.7 | 54.8 | 64.2 | 66.9 |
| MiniCPM-V-2.6 | 8.1B | 69.1 | 53.0 | 60.9 | 63.6 |
| MiniCPM-o-2.6 | 8.7B | 71.9 | 56.7 | 63.9 | 69.6 |
| MiniCPM-V-4.0 | 4.1B | 71.4 | 54.0 | 61.2 | 65.8 |
-
- -
- -### Examples -
- math + en_case1 + en_case2 + en_case3
-We deploy MiniCPM-V 4.0 on iPhone 16 Pro Max with [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md). The demo video is the raw screen recording without edition. +
+Click to view more cases. +
+ zh_extra +
+ +
+We deploy MiniCPM-V 4.5 on an iPad M4 with the [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is the raw screen recording without any editing.

- +      - +

- +      - +

-
- + ## MiniCPM-o 2.6 @@ -693,7 +216,7 @@ We deploy MiniCPM-V 4.0 on iPhone 16 Pro Max with [iOS demo](https://github.com/ - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc. -- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding. +- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding. - 💪 **Strong OCR Capability and Others.** Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**. @@ -1647,780 +1170,12 @@ We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw-speed recordin -## MiniCPM-V 2.6 - -
-Click to view more details of MiniCPM-V 2.6 - -**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include: - -- 🔥 **Leading Performance.** - MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. - -- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability. - -- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles. - -- 💪 **Strong OCR Capability and Others.** - MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. - Based on the the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** on English, Chinese, German, French, Italian, Korean, etc. - - -- 🚀 **Superior Efficiency.** - In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad. - -- 💫 **Easy Usage.** -MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/). - -### Evaluation -
- -
- -
-Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. -
-Single-image results for MiniCPM-V 2.6 against GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o mini, GPT-4V, Step-1V, Qwen-VL-Max, LLaVA-NeXT-Yi-34B, Mini-Gemini-HD-34B, Cambrian-34B, GLM-4V-9B, InternVL2-8B and MiniCPM-Llama-V 2.5 on Token Density, OpenCompass, MME, MMVet, OCRBench, MMMU val, MathVista mini, MMB1.1 test, AI2D, TextVQA val, DocVQA test, HallusionBench and Object HalBench (full HTML table omitted).
- -
-* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set. - -+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens. - -Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation. - -
- - -
-Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB. -
-Multi-image results for MiniCPM-V 2.6 against GPT-4V, LLaVA-NeXT-Interleave-14B, Emu2-Chat, CogVLM, VPG-C, VILA 8B, InternLM-XComposer-2.5 and InternVL2-8B on Mantis Eval, BLINK val, Mathverse mv, Sciverse mv and MIRB (full HTML table omitted).
- -
-* We evaluate the officially released checkpoint by ourselves. -
- -
-Click to view video results on Video-MME and Video-ChatGPT. -
-Video results for MiniCPM-V 2.6 against Claude 3.5 Sonnet, GPT-4V, LLaVA-NeXT-7B/34B, CogVLM2-Video, LongVA, InternVL2-8B, InternLM-XComposer-2.5 and LLaVA-NeXT-Video on Video-MME (with/without subtitles) and Video-ChatGPT (Correctness, Detail, Context, Temporal, Consistency) (full HTML table omitted).
-
-
- - -
-Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA. -
-Few-shot (0/4/8-shot) results for MiniCPM-V 2.6 against Flamingo, IDEFICS, OmniCorpus, Emu2 and MM1 on TextVQA val, VizWiz test-dev, VQAv2 test-dev and OK-VQA val (full HTML table omitted).
- - -
-* denotes zero image shot and two additional text shots following Flamingo. - -+ We evaluate the pretraining ckpt without SFT. -
- -### Examples - -
- Bike - Menu - Code - Mem - medal -
-
- Click to view more cases. -
- elec - Menu -
-
- -We deploy MiniCPM-V 2.6 on end devices. The demo video is the raw screen recording on a iPad Pro without edition. - - -

- -      - -

-
- - -

- -      - -

-
- - -

- - -

-
- -
- ## Legacy Models | Model | Introduction and Guidance | |:----------------------|:-------------------:| +| MiniCPM-V 4.0 | [Document](./docs/minicpm_v4_en.md) | +| MiniCPM-V 2.6 | [Document](./docs/minicpm_v2dot6_en.md) | | MiniCPM-Llama3-V 2.5 | [Document](./docs/minicpm_llama3_v2dot5.md) | | MiniCPM-V 2.0 | [Document](./docs/minicpm_v2.md) | | MiniCPM-V 1.0 | [Document](./docs/minicpm_v1.md) | @@ -2511,10 +1266,9 @@ Open `http://localhost:8000/` in browser and enjoy the vision mode chatbot. | Model | Device | Memory |          Description | Download | |:-----------|:--:|:-----------:|:-------------------|:---------------:| -| MiniCPM-V 4.0| GPU | 9 GB | The latest version, strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4) | -| MiniCPM-V 4.0 gguf | CPU | 4 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-gguf)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-gguf) | -| MiniCPM-V 4.0 int4 | GPU | 5 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-int4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-int4) | -| MiniCPM-V 4.0 AWQ | GPU | 5 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-AWQ)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-AWQ) | +| MiniCPM-V 4.5| GPU | 18 GB | The latest version, strong end-side multimodal performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5) | +| MiniCPM-V 4.5 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-gguf) | +| MiniCPM-V 4.5 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-int4) | | MiniCPM-o 2.6| GPU | 18 GB | The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) | | MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) | | MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) | @@ -2524,7 +1278,8 @@ Open `http://localhost:8000/` in browser and enjoy the vision mode chatbot. ### Multi-turn Conversation -Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues. We are investigating this issue. +If you wish to enable long-thinking mode, provide the argument `enable_thinking=True` to the chat function. 
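For example, a minimal sketch of switching modes per request (a hedged illustration only: it assumes `model`, `tokenizer` and `image` are already loaded exactly as in the example below, and the two questions are hypothetical; `enable_thinking` is the only argument that changes):

```python
# Minimal sketch: toggling fast vs. deep (long-thinking) mode per request.
# Assumes `model`, `tokenizer` and `image` are loaded as in the example below.

fast_msgs = [{'role': 'user', 'content': [image, "What is the landform in the picture?"]}]
deep_msgs = [{'role': 'user', 'content': [image, "Estimate how long a hike across this terrain might take and justify your estimate."]}]

fast_answer = model.chat(msgs=fast_msgs, tokenizer=tokenizer, enable_thinking=False)  # fast thinking
deep_answer = model.chat(msgs=deep_msgs, tokenizer=tokenizer, enable_thinking=True)   # deep thinking
print(fast_answer)
print(deep_answer)
```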
+ ```shell pip install -r requirements_o2.6.txt ``` @@ -2543,20 +1298,23 @@ from transformers import AutoModel, AutoTokenizer torch.manual_seed(100) -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB') +enable_thinking=False # If `enable_thinking=True`, the long-thinking mode is enabled. + # First round chat question = "What is the landform in the picture?" msgs = [{'role': 'user', 'content': [image, question]}] answer = model.chat( msgs=msgs, - tokenizer=tokenizer + tokenizer=tokenizer, + enable_thinking=enable_thinking ) print(answer) @@ -2573,25 +1331,36 @@ print(answer) You will get the following output: -``` -"The landform in the picture is karst topography, characterized by its unique and striking limestone formations that rise dramatically from the surrounding landscape." +```shell +# round1 +The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion. -"When traveling to this picturesque location, you should pay attention to the weather conditions as they can change rapidly in such areas. It's also important to respect local ecosystems and wildlife by staying on designated paths and not disturbing natural habitats. Additionally, bringing appropriate gear for photography is advisable due to the stunning reflections and lighting during sunrise or sunset." +This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views. + +# round2 +When traveling to a karst landscape like this, here are some important tips: + +1. Wear comfortable shoes: The terrain can be uneven and hilly. +2. Bring water and snacks for energy during hikes or boat rides. +3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots. +4. Respect local customs and nature regulations by not littering or disturbing wildlife. + +By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains. ``` #### Chat with Multiple Images
- Click to view Python code running MiniCPM-V-4 with multiple images input. + Click to view Python code running MiniCPM-V-4_5 with multiple images input. ```python import torch from PIL import Image from transformers import AutoModel, AutoTokenizer -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 image1 = Image.open('image1.jpg').convert('RGB') image2 = Image.open('image2.jpg').convert('RGB') @@ -2609,17 +1378,17 @@ print(answer) #### In-context Few-shot Learning
- Click to view Python code running MiniCPM-V-4 with few-shot input. + Click to view Python code running MiniCPM-V-4_5 with few-shot input. ```python import torch from PIL import Image from transformers import AutoModel, AutoTokenizer -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 question = "production date" image1 = Image.open('example1.jpg').convert('RGB') @@ -2644,60 +1413,109 @@ print(answer) #### Chat with Video
- Click to view Python code running MiniCPM-V-4 with video input. + Click to view Python code running MiniCPM-V-4_5 by with video input and 3D-Resampler. ```python +## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids. +# To achieve this, you need to organize your video data into two corresponding sequences: +# frames: List[Image] +# temporal_ids: List[List[Int]]. + import torch from PIL import Image from transformers import AutoModel, AutoTokenizer from decord import VideoReader, cpu # pip install decord +from scipy.spatial import cKDTree +import numpy as np +import math -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 -MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number +MAX_NUM_FRAMES=180 # Indicates the maximum number of frames received after the videos are packed. The actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING. +MAX_NUM_PACKING=3 # indicates the maximum packing number of video frames. valid range: 1-6 +TIME_SCALE = 0.1 -def encode_video(video_path): +def map_to_nearest_scale(values, scale): + tree = cKDTree(np.asarray(scale)[:, None]) + _, indices = tree.query(np.asarray(values)[:, None]) + return np.asarray(scale)[indices] + + +def group_array(arr, size): + return [arr[i:i+size] for i in range(0, len(arr), size)] + +def encode_video(video_path, choose_fps=3, force_packing=None): def uniform_sample(l, n): gap = len(l) / n idxs = [int(i * gap + gap / 2) for i in range(n)] return [l[i] for i in idxs] - vr = VideoReader(video_path, ctx=cpu(0)) - sample_fps = round(vr.get_avg_fps() / 1) # FPS - frame_idx = [i for i in range(0, len(vr), sample_fps)] - if len(frame_idx) > MAX_NUM_FRAMES: - frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES) + fps = vr.get_avg_fps() + video_duration = len(vr) / fps + + if choose_fps * int(video_duration) <= MAX_NUM_FRAMES: + packing_nums = 1 + choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration)) + + else: + packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES) + if packing_nums <= MAX_NUM_PACKING: + choose_frames = round(video_duration * choose_fps) + else: + choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING) + packing_nums = MAX_NUM_PACKING + + frame_idx = [i for i in range(0, len(vr))] + frame_idx = np.array(uniform_sample(frame_idx, choose_frames)) + + if force_packing: + packing_nums = min(force_packing, MAX_NUM_PACKING) + + print(video_path, ' duration:', video_duration) + print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}') + frames = vr.get_batch(frame_idx).asnumpy() - frames = [Image.fromarray(v.astype('uint8')) for v in frames] - print('num frames:', len(frames)) - return frames + + frame_idx_ts = frame_idx / fps + scale = np.arange(0, video_duration, TIME_SCALE) + + frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE + frame_ts_id = frame_ts_id.astype(np.int32) + + assert len(frames) == len(frame_ts_id) + + frames = [Image.fromarray(v.astype('uint8')).convert('RGB') 
for v in frames] + frame_ts_id_group = group_array(frame_ts_id, packing_nums) + + return frames, frame_ts_id_group + video_path="video_test.mp4" -frames = encode_video(video_path) +fps = 5 # fps for video +force_packing = None # You can set force_packing to ensure that 3D packing is forcibly enabled; otherwise, encode_video will dynamically set the packing quantity based on the duration. +frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing) + question = "Describe the video" msgs = [ {'role': 'user', 'content': frames + [question]}, ] -# Set decode params for video -params = {} -params["use_image_id"] = False -params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448 answer = model.chat( msgs=msgs, tokenizer=tokenizer, - **params + use_image_id=False, + max_slice_nums=1, + temporal_ids=frame_ts_id_group ) print(answer) ```
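For intuition about the visual-token budget this produces, the snippet below mirrors the frame/packing selection logic of `encode_video` above. It is a simplified sketch, not part of the official API: it ignores the native-fps cap used in `encode_video`, and the 64-tokens-per-packed-group figure is taken from the 3D-Resampler description earlier in this README.

```python
import math

MAX_NUM_FRAMES = 180   # same constants as in encode_video above
MAX_NUM_PACKING = 3

def estimate_video_tokens(duration_s, choose_fps=5):
    """Rough visual-token estimate mirroring the frame/packing selection in encode_video."""
    if choose_fps * int(duration_s) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(choose_fps * min(MAX_NUM_FRAMES, duration_s))
    else:
        packing_nums = math.ceil(duration_s * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(duration_s * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING
    groups = math.ceil(choose_frames / packing_nums)
    # Each packed group of frames is compressed to 64 visual tokens by the 3D-Resampler.
    return choose_frames, packing_nums, groups * 64

# A 10-minute video sampled at 5 fps is capped at 540 frames packed 3-at-a-time,
# i.e. 180 groups and roughly 11,520 visual tokens.
print(estimate_video_tokens(600, choose_fps=5))   # (540, 3, 11520)
```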
- #### Speech and Audio Mode Model initialization @@ -3122,7 +1940,7 @@ pip install vllm ### Simple Fine-tuning -We support simple fine-tuning with Hugging Face for MiniCPM-V 4.0, MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0. +We support simple fine-tuning with Hugging Face for MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0. [Reference Document](./finetune/readme.md) @@ -3139,7 +1957,7 @@ Best Practices: [MiniCPM-o 2.6](https://github.com/PKU-Alignment/align-anything/ We support fine-tuning MiniCPM-o 2.6 and MiniCPM-V 2.6 with the LLaMA-Factory framework. LLaMA-Factory provides a solution for flexibly customizing the fine-tuning (Lora/Full/Qlora) of 200+ LLMs without the need for coding through the built-in web UI LLaMABoard. It supports various training methods like sft/ppo/dpo/kto and advanced algorithms like Galore/BAdam/LLaMA-Pro/Pissa/LongLoRA. -Best Practices: [MiniCPM-V 4.0 | MiniCPM-o 2.6 | MiniCPM-V 2.6](./docs/llamafactory_train_and_infer.md). +Best Practices: [MiniCPM-o 2.6 | MiniCPM-V 2.6](./docs/llamafactory_train_and_infer.md). ### With the SWIFT Framework diff --git a/README_zh.md b/README_zh.md index d64ec39..1efd0e1 100644 --- a/README_zh.md +++ b/README_zh.md @@ -1,6 +1,7 @@
- + + **端侧可用的 GPT-4o 级视觉、语音、多模态实时流式大模型** @@ -18,27 +19,27 @@

- MiniCPM-V 4.0 🤗 🤖 | MiniCPM-o 2.6 🤗 🤖 | MiniCPM-V 2.6 🤗 🤖 | - 📄 技术报告 [中文/English] + MiniCPM-V 4.5 🤗 🤖 | MiniCPM-o 2.6 🤗 🤖 | + 📄 技术报告 [即将推出]

-**MiniCPM-o** 是从 MiniCPM-V 升级的最新端侧多模态大模型系列。该系列模型可以以端到端方式,接受图像、视频、文本、音频作为输入,并生成高质量文本和语音输出。自2024年2月以来,我们以实现高性能和高效部署为目标,发布了6个版本的模型。目前系列中最值得关注的模型包括: +**MiniCPM-V** 端侧多模态大模型系列可以以端到端方式,接受图像、视频、文本、音频作为输入,并生成高质量文本和语音输出。自2024年2月以来,我们以实现高性能和高效部署为目标,发布了7个版本的模型。目前系列中最值得关注的模型包括: -- **MiniCPM-V 4.0**:🚀🚀🚀 MiniCPM-V 系列中最新的高效模型,参数总量为 4B。该模型在 OpenCompass 评测中图像理解能力超越了 GPT-4.1-mini-20250414、Qwen2.5-VL-3B-Instruct 和 InternVL2.5-8B。凭借小巧的参数规模和高效的架构,MiniCPM-V 4.0 是移动端部署的理想选择(例如,在 iPhone 16 Pro Max 上使用开源 iOS 应用时,首 token 延迟低于 2 秒,解码速度超过 17 token/s)。 +- **MiniCPM-V 4.5**:🔥🔥🔥 MiniCPM-V 系列中最新、最强大的模型。总参数量 8B,在**视觉能力上超越了 GPT-4o-latest、Gemini-2.0 Pro 以及 Qwen2.5-VL 72B**,成为开源社区中性能最强的端侧多模态模型。本版本带来了全新特性,**包括高效的高帧率与长视频理解(视频 token 压缩率最高可达 96 倍)、可控的快思考/深思考模式、出色的手写体 OCR 与复杂表格/文档解析能力**。同时,它进一步强化了 MiniCPM-V 系列广受欢迎的特性,如可靠性、多语言支持与端侧可部署性。 - -- **MiniCPM-o 2.6**: 🔥🔥🔥 MiniCPM-o 系列的最新、性能最佳模型。总参数量 8B,**视觉、语音和多模态流式能力达到了 GPT-4o-202405 级别**,是开源社区中模态支持最丰富、性能最佳的模型之一。在新的语音模式中,MiniCPM-o 2.6 **支持可配置声音的中英双语语音对话,还具备情感/语速/风格控制、端到端声音克隆、角色扮演等进阶能力**。模型也进一步提升了 MiniCPM-V 2.6 的 **OCR、可信行为、多语言支持和视频理解等视觉能力**。基于其领先的视觉 token 密度,MiniCPM-V 2.6 成为了**首个支持在 iPad 等端侧设备上进行多模态实时流式交互**的多模态大模型。 - -- **MiniCPM-V 2.6**: MiniCPM-V 系列中性能最佳的模型。总参数量 8B,单图、多图和视频理解性能**超越了 GPT-4V**。它取得了优于 **GPT-4o mini、Gemini 1.5 Pro 和 Claude 3.5 Sonnet**等的单图理解表现,并成为了首个支持在 iPad 等端侧设备上进行实时视频理解的多模态大模型。 +- **MiniCPM-o 2.6**: ⭐️⭐️⭐️ MiniCPM-o 系列中性能最佳模型。总参数量 8B,**视觉、语音和多模态流式能力达到了 GPT-4o-202405 级别**,是开源社区中模态支持最丰富、性能最佳的模型之一。在新的语音模式中,MiniCPM-o 2.6 **支持可配置声音的中英双语语音对话,还具备情感/语速/风格控制、端到端声音克隆、角色扮演等进阶能力**。模型也进一步提升了 MiniCPM-V 2.6 的 **OCR、可信行为、多语言支持和视频理解等视觉能力**。基于其领先的视觉 token 密度,MiniCPM-V 2.6 成为了**首个支持在 iPad 等端侧设备上进行多模态实时流式交互**的多模态大模型。 ## 更新日志 #### 📌 置顶 +* [2025.08.26] 🔥🔥🔥 我们开源了 MiniCPM-V 4.5,其视觉性能超越了 GPT-4o-latest、Gemini-2.0 Pro 和 Qwen2.5-VL 72B。它不仅延续并强化了 MiniCPM-V 的热门能力,还带来了诸多实用的新功能。欢迎试用! + + * [2025.08.05] 🚀🚀🚀 我们开源了 MiniCPM-V 4.0,该模型在图像理解能力上超越了 GPT-4.1-mini-20250414。该模型不仅继承了 MiniCPM-V 2.6 的众多实用特性,还大幅提升了推理效率。我们还同步开源了适用于 iPhone 和 iPad 的 iOS 应用,欢迎试用! 
@@ -98,9 +99,8 @@ ## 目录 -- [MiniCPM-V 4.0](#minicpm-v-40) +- [MiniCPM-V 4.5](#minicpm-v-45) - [MiniCPM-o 2.6](#minicpm-o-26) -- [MiniCPM-V 2.6](#minicpm-v-26) - [Chat with Our Demo on Gradio 🤗](#chat-with-our-demo-on-gradio-) - [推理](#推理) - [模型库](#模型库) @@ -120,561 +120,83 @@ - [模型局限性](#模型局限性) -## MiniCPM-V 4.0 +## MiniCPM-V 4.5 -MiniCPM-V 4.0 是 MiniCPM-V 系列中的最新模型。该模型基于 SigLIP2-400M 和 MiniCPM4-3B 构建,参数总量为 4.1B。它延续了 MiniCPM-V 2.6 在单图、多图和视频理解方面的强大能力,同时大幅提升了推理效率。MiniCPM-V 4.0 的主要特点包括: -- 🔥 **领先的视觉能力。** -MiniCPM-V 4.0 在 OpenCompass 上获得了平均 69.0 的高分,超越了 MiniCPM-V 2.6(8.1B,得分 65.2)、 Qwen2.5-VL-3B-Instruct(3.8B,得分 64.5)和**广泛使用的闭源模型 GPT-4.1-mini-20250414**。在多图理解与视频理解任务上,MiniCPM-V 4.0 也表现出色。 +**MiniCPM-V 4.5** 是 MiniCPM-V 系列中最新、最强大的模型。该模型基于 Qwen3-8B 与 SigLIP2-400M 构建,总参数量为 8B。其在性能上较前代 MiniCPM-V 与 MiniCPM-o 有显著提升,并引入了一系列全新的实用特性。其主要亮点包括: -- 🚀 **卓越的效率。** -MiniCPM-V 4.0 专为端侧设备优化,**可在 iPhone 16 Pro Max 上流畅运行,首 token 延迟低至 2 秒,解码速度达 17.9 tokens/s**,且无发热问题。MiniCPM-V 4.0 在并发请求场景下表现出领先的吞吐率指标。 -- 💫 **易于使用。** -MiniCPM-V 4.0 支持多种推理方式,包括 **llama.cpp、Ollama、vLLM、SGLang、LLaMA-Factory 及本地 Web Demo 等**。我们还开源了可以在 iPhone 和 iPad 运行的 iOS App。欢迎参考我们开源的 **结构清晰的[使用手册](https://github.com/OpenSQZ/MiniCPM-V-CookBook)** 玩转 MiniCPM-V 4.0,其中涵盖了详细的部署指南和真实示例。 +- 🔥 **领先的视觉理解能力** + MiniCPM-V 4.5 在 OpenCompass 综合评测(涵盖 8 个主流评测基准)中取得了 77.2 的高分。**在仅 8B 参数的情况下超越了广泛使用的闭源模型(如 GPT-4o-latest、Gemini-2.0 Pro)以及强大的开源模型(如 Qwen2.5-VL 72B)**,成为 30B 参数规模以下最强的多模态大模型。 +- 🎬 **高效的高帧率与长视频理解** + 借助全新的图像-视频统一 3D-Resampler,MiniCPM-V 4.5 能够实现 96 倍视频 token 压缩率,即将 6 帧 448x448 视频帧联合压缩为 64 个 token(大多数多模态大模型需约 1536 个 token)。这意味着模型在语言模型推理成本不增加的情况下,可以感知显著更多的视频帧,从而实现业界领先的 高帧率(最高 10FPS)视频理解与长视频理解,并在 Video-MME、LVBench、MLVU、MotionBench、FavorBench 等基准上高效率地展现出色性能。 + +- ⚙️ **可控的快思考 / 深思考模式** + MiniCPM-V 4.5 同时支持 快思考(用于高频高效推理,性能具竞争力)与 深思考(用于复杂问题求解)。用户可根据不同场景对效率与性能的权衡,自由切换两种模式,实现高度可控的推理过程。 + +- 💪 **优秀的 OCR、文档解析与多语言能力** + 基于 [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) 架构,MiniCPM-V 4.5 能处理任意长宽比、最高达 180 万像素(如 1344x1344) 的高分辨率图像,同时使用的视觉 token 数仅为多数 MLLM 的 1/4。其在 OCRBench 上取得超越 GPT-4o-latest 与 Gemini 2.5 等闭源模型的性能,并在 OmniDocBench 上展现了业界顶尖的 PDF 文档解析能力。借助最新的 [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) 和 [VisCPM](https://github.com/OpenBMB/VisCPM) 技术,模型在可靠性上表现优异,在 MMHal-Bench 上超越 GPT-4o-latest,并支持 30+ 种语言的多语言能力。 + +- 💫 **便捷易用的部署方式** + MiniCPM-V 4.5 提供丰富灵活的使用方式:(1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/master/docs/multimodal/minicpmo4.5.md) 与 [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) 支持本地 CPU 高效推理;(2) 提供 [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4)、[GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf)、[AWQ](https://github.com/tc-mb/AutoAWQ) 等 16 种规格的量化模型;(3)兼容 SGLang 与 [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) (4) 借助 [Transformers](https://github.com/tc-mb/transformers/tree/main) 与 [LLaMA-Factory](./docs/llamafactory_train_and_infer.md) 在新领域与任务上进行微调;(5) 快速启动本地 [WebUI demo](#chat-with-our-demo-on-gradio);(6) 优化适配的 [iOS 本地应用](https://github.com/tc-mb/MiniCPM-o-demo-iOS),可在 iPhone 与 iPad 上高效运行;(7) 在线 [Web demo](http://101.126.42.235:30910/) 体验。更多使用方式请见 [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook)。 + +### 技术亮点 + +- **架构:图像-视频统一的高密度视觉压缩 3D-Resampler**。 MiniCPM-V 4.5 在架构上引入了 3D-Resampler,成功突破了视频理解任务中性能与效率难以兼得的瓶颈。该方法能够将多达 6 帧连续视频帧压缩为仅 64 个 token(与 MiniCPM-V 系列中单张图像所用的 token 数相同),从而实现 96× 的视频 token 压缩率。这使得模型在语言模型计算成本不增加的情况下,可以处理更多的视频帧,从而实现高帧率视频理解和长视频理解。该架构统一支持单图、多图和视频的编码处理,确保了能力与知识的无缝迁移。 + +- **学习机制:OCR与文档知识的统一学习**。现有多模态大模型一般在不同训练阶段分别单独训练 OCR 
能力与文档知识。我们发现这两个训练过程的本质差异在于图像中文本的可见性。通过动态对文档文本区域施加不同强度的噪声干扰,并要求模型重建文本,使其学会自适应地在准确文本识别(当文本清晰时)与基于多模态上下文的知识推理(当文本严重遮挡时)之间切换。这种方法使得 MiniCPM-V 在文档知识学习中摆脱了对高错误率的文档解析器的依赖,同时避免了过度增强的 OCR 数据产生的幻觉问题,以最小工程开销实现了顶尖的 OCR 与多模态知识处理性能。 + +- **后训练优化:基于多模态强化学习的混合快思考/深度思考模式**。 MiniCPM-V 4.5 通过两种可切换推理模式提供均衡的体验:面向高效日常应用的快速思考模式,以及处理复杂任务的深度思考模式。采用新颖的混合强化学习方法,模型可联合优化两种模式,在保持深度模式能力的同时显著提升快速模式性能。结合 [RLPR](https://github.com/OpenBMB/RLPR) 和 [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) 技术,该模型可以从海量多模态数据中泛化出强大的推理能力,并有效减少幻觉现象。 + +
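+下面给出一个简单的换算示例,说明上文 96 倍视频 token 压缩率的一种推算方式(其中每帧约 1,024 个视觉 patch token 仅为示意性假设,实际数值取决于视觉编码器):
+
+```python
+# 仅作示意:按上文描述,6 帧 448x448 视频帧被 3D-Resampler 联合压缩为 64 个视觉 token。
+# 假设每帧原始 patch token 数约为 (448 // 14) ** 2 = 1024(示意性假设)。
+patch_tokens_per_frame = (448 // 14) ** 2      # 1024
+frames_per_group = 6
+raw_tokens = patch_tokens_per_frame * frames_per_group    # 6144
+compressed_tokens = 64
+print(raw_tokens / compressed_tokens)          # 96.0,对应上文的 96 倍压缩率
+```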
+ +
### 性能评估 - -
-点击查看在OpenCompass上的单图理解能力的评测结果。

| Model | Size | OpenCompass | OCRBench | MathVista | HallusionBench | MMMU | MMVet | MMBench V1.1 | MMStar | AI2D |
|:------|:----:|:-----------:|:--------:|:---------:|:--------------:|:----:|:-----:|:------------:|:------:|:----:|
| **Proprietary** | | | | | | | | | | |
| GPT-4v-20240409 | - | 63.5 | 656 | 55.2 | 43.9 | 61.7 | 67.5 | 79.8 | 56.0 | 78.6 |
| Gemini-1.5-Pro | - | 64.5 | 754 | 58.3 | 45.6 | 60.6 | 64.0 | 73.9 | 59.1 | 79.1 |
| GPT-4.1-mini-20250414 | - | 68.9 | 840 | 70.9 | 49.3 | 55.0 | 74.3 | 80.9 | 60.9 | 76.0 |
| Claude 3.5 Sonnet-20241022 | - | 70.6 | 798 | 65.3 | 55.5 | 66.4 | 70.1 | 81.7 | 65.1 | 81.2 |
| **Open-source** | | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | 64.5 | 828 | 61.2 | 46.6 | 51.2 | 60.0 | 76.8 | 56.3 | 81.4 |
| InternVL2.5-4B | 3.7B | 65.1 | 820 | 60.8 | 46.6 | 51.8 | 61.5 | 78.2 | 58.7 | 81.4 |
| Qwen2.5-VL-7B-Instruct | 8.3B | 70.9 | 888 | 68.1 | 51.9 | 58.0 | 69.7 | 82.2 | 64.1 | 84.3 |
| InternVL2.5-8B | 8.1B | 68.1 | 821 | 64.5 | 49.0 | 56.2 | 62.8 | 82.5 | 63.2 | 84.6 |
| MiniCPM-V-2.6 | 8.1B | 65.2 | 852 | 60.8 | 48.1 | 49.8 | 60.0 | 78.0 | 57.5 | 82.1 |
| MiniCPM-o-2.6 | 8.7B | 70.2 | 889 | 73.3 | 51.1 | 50.9 | 67.2 | 80.6 | 63.3 | 86.1 |
| MiniCPM-V-4.0 | 4.1B | 69.0 | 894 | 66.9 | 50.8 | 51.2 | 68.0 | 79.7 | 62.8 | 82.9 |
+ +
+
+
-
- -
-点击查看在图表理解、文档理解、数学推理、幻觉等领域的评测结果。 +### 典型示例

| Model | Size | ChartQA | MME | RealWorldQA | TextVQA | DocVQA | MathVision | DynaMath | WeMath | Obj Hal CHAIRs↓ | Obj Hal CHAIRi↓ | MM Hal score avg@3↑ | MM Hal hall rate avg@3↓ |
|:------|:----:|:-------:|:---:|:-----------:|:-------:|:------:|:----------:|:--------:|:------:|:---------------:|:---------------:|:-------------------:|:-----------------------:|
| **Proprietary** | | | | | | | | | | | | | |
| GPT-4v-20240409 | - | 78.5 | 1927 | 61.4 | 78.0 | 88.4 | - | - | - | - | - | - | - |
| Gemini-1.5-Pro | - | 87.2 | - | 67.5 | 78.8 | 93.1 | 41.0 | 31.5 | 50.5 | - | - | - | - |
| GPT-4.1-mini-20250414 | - | - | - | - | - | - | 45.3 | 47.7 | - | - | - | - | - |
| Claude 3.5 Sonnet-20241022 | - | 90.8 | - | 60.1 | 74.1 | 95.2 | 35.6 | 35.7 | 44.0 | - | - | - | - |
| **Open-source** | | | | | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | 84.0 | 2157 | 65.4 | 79.3 | 93.9 | 21.9 | 13.2 | 22.9 | 18.3 | 10.8 | 3.9 | 33.3 |
| InternVL2.5-4B | 3.7B | 84.0 | 2338 | 64.3 | 76.8 | 91.6 | 18.4 | 15.2 | 21.2 | 13.7 | 8.7 | 3.2 | 46.5 |
| Qwen2.5-VL-7B-Instruct | 8.3B | 87.3 | 2347 | 68.5 | 84.9 | 95.7 | 25.4 | 21.8 | 36.2 | 13.3 | 7.9 | 4.1 | 31.6 |
| InternVL2.5-8B | 8.1B | 84.8 | 2344 | 70.1 | 79.1 | 93.0 | 17.0 | 9.4 | 23.5 | 18.3 | 11.6 | 3.6 | 37.2 |
| MiniCPM-V-2.6 | 8.1B | 79.4 | 2348 | 65.0 | 80.1 | 90.8 | 17.5 | 9.0 | 20.4 | 7.3 | 4.7 | 4.0 | 29.9 |
| MiniCPM-o-2.6 | 8.7B | 86.9 | 2372 | 68.1 | 82.0 | 93.5 | 21.7 | 10.4 | 25.2 | 6.3 | 3.4 | 4.1 | 31.3 |
| MiniCPM-V-4.0 | 4.1B | 84.4 | 2298 | 68.5 | 80.8 | 92.9 | 20.7 | 14.2 | 32.7 | 6.3 | 3.5 | 4.1 | 29.2 |
+
-
- -
-点击查看多图和视频理解能力的评测结果。 -

| Model | Size | Mantis | Blink | Video-MME (w/o subs) | Video-MME (w subs) |
|:------|:----:|:------:|:-----:|:--------------------:|:------------------:|
| **Proprietary** | | | | | |
| GPT-4v-20240409 | - | 62.7 | 54.6 | 59.9 | 63.3 |
| Gemini-1.5-Pro | - | - | 59.1 | 75.0 | 81.3 |
| GPT-4o-20240513 | - | - | 68.0 | 71.9 | 77.2 |
| **Open-source** | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | - | 47.6 | 61.5 | 67.6 |
| InternVL2.5-4B | 3.7B | 62.7 | 50.8 | 62.3 | 63.6 |
| Qwen2.5-VL-7B-Instruct | 8.3B | - | 56.4 | 65.1 | 71.6 |
| InternVL2.5-8B | 8.1B | 67.7 | 54.8 | 64.2 | 66.9 |
| MiniCPM-V-2.6 | 8.1B | 69.1 | 53.0 | 60.9 | 63.6 |
| MiniCPM-o-2.6 | 8.7B | 71.9 | 56.7 | 63.9 | 69.6 |
| MiniCPM-V-4.0 | 4.1B | 71.4 | 54.0 | 61.2 | 65.8 |
-
- -
- -### 典型示例 -
- math + zh_case1 + zh_case2
+
+点击查看更多示例 +
+ en_extra + en_extra +
+
-我们在 iPhone 16 Pro Max 上部署了 MiniCPM-V 4.0 [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md),并录制了以下演示录屏,视频未经加速等任何编辑:
+
+我们使用 [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS) 将 MiniCPM-V 4.5 部署在 iPad M4,并录制以下演示录屏,视频未经任何编辑。

- +      - +

- +      - +

-
+ + ## MiniCPM-o 2.6 @@ -1606,773 +1128,13 @@ MiniCPM-o 2.6 可以通过多种方式轻松使用:(1) [llama.cpp](https://git Click to view more details of MiniCPM-V 2.6 -## MiniCPM-V 2.6 - -**MiniCPM-V 2.6** 是 MiniCPM-V 系列中最新、性能最佳的模型。该模型基于 SigLip-400M 和 Qwen2-7B 构建,共 8B 参数。与 MiniCPM-Llama3-V 2.5 相比,MiniCPM-V 2.6 性能提升显著,并引入了多图和视频理解的新功能。MiniCPM-V 2.6 的主要特点包括: - - -- 🔥 **领先的性能。** - MiniCPM-V 2.6 在最新版本 OpenCompass 榜单上(综合 8 个主流多模态评测基准)平均得分 65.2,**以8B量级的大小在单图理解方面超越了 GPT-4o mini、GPT-4V、Gemini 1.5 Pro 和 Claude 3.5 Sonnet 等主流商用闭源多模态大模型**。 - -- 🖼️ **多图理解和上下文学习。** - MiniCPM-V 2.6 还支持**多图对话和推理**。它在 Mantis-Eval、BLINK、Mathverse mv 和 Sciverse mv 等主流多图评测基准中取得了**最佳水平**,并展现出了优秀的上下文学习能力。 - -- 🎬 **视频理解。** - MiniCPM-V 2.6 还可以**接受视频输入**,进行对话和提供涵盖时序和空间信息的详细视频描述。模型在 有/无字幕 评测场景下的 Video-MME 表现均超过了 **GPT-4V、Claude 3.5 Sonnet 和 LLaVA-NeXT-Video-34B**等商用闭源模型。 - -- 💪 **强大的 OCR 能力及其他功能。** - MiniCPM-V 2.6 可以处理任意长宽比的图像,像素数可达 180 万(如 1344x1344)。在 OCRBench 上取得**最佳水平,超过 GPT-4o、GPT-4V 和 Gemini 1.5 Pro 等商用闭源模型**。基于最新的 [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) 和 [VisCPM](https://github.com/OpenBMB/VisCPM) 技术,其具备了**可信的多模态行为**,在 Object HalBench 上的幻觉率显著低于 GPT-4o 和 GPT-4V,并支持英语、中文、德语、法语、意大利语、韩语等**多种语言**。 - -- 🚀 **卓越的效率。** - 除了对个人用户友好的模型大小,MiniCPM-V 2.6 还表现出**最先进的视觉 token 密度**(即每个视觉 token 编码的像素数量)。它**仅需 640 个 token 即可处理 180 万像素图像,比大多数模型少 75%**。这一特性优化了模型的推理速度、首 token 延迟、内存占用和功耗。因此,MiniCPM-V 2.6 可以支持 iPad 等终端设备上的高效**实时视频理解**。 - -- 💫 **易于使用。** - MiniCPM-V 2.6 可以通过多种方式轻松使用:(1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) 和 [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) 支持在本地设备上进行高效的 CPU 推理,(2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) 和 [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) 格式的量化模型,有 16 种尺寸,(3) [vLLM](#vllm-部署-) 支持高吞吐量和内存高效的推理,(4) 针对新领域和任务进行微调,(5) 使用 [Gradio](#本地-webui-demo-) 快速设置本地 WebUI 演示,(6) 在线[demo](http://120.92.209.146:8887/)即可体验。 - -### 性能评估 -
- -
- -
-点击查看 OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench 上的单图评测结果详情。 -
-MiniCPM-V 2.6 与 GPT-4o、Claude 3.5 Sonnet、Gemini 1.5 Pro、GPT-4o mini、GPT-4V、Step-1V、Qwen-VL-Max、LLaVA-NeXT-Yi-34B、Mini-Gemini-HD-34B、Cambrian-34B、GLM-4V-9B、InternVL2-8B、MiniCPM-Llama-V 2.5 在 Token Density、OpenCompass、MME、MMVet、OCRBench、MMMU val、MathVista mini、MMB1.1 test、AI2D、TextVQA val、DocVQA test、HallusionBench、Object HalBench 上的单图评测数据表(此处省略 HTML 表格)。
- -
-* 我们使用思维链提示词来评估这些基准。 - -+ Token Density:每个视觉 token 在最大分辨率下编码的像素数,即最大分辨率下的像素数 / 视觉 token 数。 - -注意:闭源模型的 Token Density 由 API 收费方式估算得到。 -
- - -
-点击查看 Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB 上的多图评测结果详情。 -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ModelSizeMantis EvalBLINK valMathverse mvSciverse mvMIRB
Proprietary
GPT-4V-62.754.660.366.953.1
LLaVA-NeXT-Interleave-14B14B66.452.632.730.2-
Open-source
Emu2-Chat37B37.836.2-27.2-
CogVLM17B45.241.1---
VPG-C7B52.443.124.323.1-
VILA 8B8B51.239.3-36.5-
InternLM-XComposer-2.58B53.1*48.932.1*-42.5
InternVL2-8B8B59.0*50.930.5*34.4*56.9*
MiniCPM-V 2.68B69.153.084.974.953.8
- - -
-* 正式开源模型权重的评测结果。 -
- -
-点击查看 Video-MME 和 Video-ChatGPT 上的视频评测结果详情。 -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ModelSizeVideo-MMEVideo-ChatGPT
w/o subsw subsCorrectnessDetailContextTemporalConsistency
Proprietary
Claude 3.5 Sonnet-60.062.9-----
GPT-4V-59.963.3-----
Open-source
LLaVA-NeXT-7B7B--3.393.293.922.603.12
LLaVA-NeXT-34B34B--3.293.233.832.513.47
CogVLM2-Video12B--3.493.463.232.983.64
LongVA7B52.454.33.053.093.772.443.64
InternVL2-8B8B54.056.9-----
InternLM-XComposer-2.58B55.8------
LLaVA-NeXT-Video32B60.263.03.483.373.952.643.28
MiniCPM-V 2.68B60.963.63.593.283.932.733.62
-
-
- - -
-点击查看 TextVQA, VizWiz, VQAv2, OK-VQA上的少样本评测结果详情。 -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ModelSizeShotTextVQA valVizWiz test-devVQAv2 test-devOK-VQA val
Flamingo80B0*35.031.656.340.6
436.539.663.157.4
837.344.865.657.5
IDEFICS80B0*30.936.060.045.2
434.340.463.652.4
835.746.164.855.1
OmniCorpus7B0*43.049.863.245.5
445.451.364.546.5
845.652.264.746.6
Emu237B026.440.433.526.7
448.254.667.053.2
849.354.767.854.1
MM130B026.240.448.926.7
849.354.770.954.1
MiniCPM-V 2.6+8B043.933.845.423.9
463.660.565.550.1
864.663.468.251.4
- - -
-* 使用 Flamingo 方式 zero image shot 和 two additional text shots 评估零样本性能。 - -+ 我们在没有进行监督微调 (SFT) 的情况下评估预训练的模型权重 (ckpt)。 -
- -### 典型示例 - -
- Bike - Menu - Code - Mem - medal -
-
- 点击查看更多示例。 -
- elec - Menu -
-
- -我们将 MiniCPM-V 2.6 部署在iPad Pro上,并录制了以下演示视频。 - - -

- -      - -

-
- - -

- - -

-
- -
- ## 历史版本模型 | 模型 | 介绍信息和使用教程 | |:----------------------|:-------------------:| +| MiniCPM-V 4.0 | [文档](./docs/minicpm_v4_zh.md) | +| MiniCPM-V 2.6 | [文档](./docs/minicpm_v2dot6_zh.md) | | MiniCPM-Llama3-V 2.5 | [文档](./docs/minicpm_llama3_v2dot5.md) | | MiniCPM-V 2.0 | [文档](./docs/minicpm_v2.md) | | MiniCPM-V 1.0 | [文档](./docs/minicpm_v1.md) | @@ -2433,10 +1195,9 @@ python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py | 模型 | 设备 | 资源 |          简介 | 下载链接 | |:--------------|:-:|:----------:|:-------------------|:---------------:| -| MiniCPM-V 4.0| GPU | 9 GB | 提供出色的端侧单图、多图、视频理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4) | -| MiniCPM-V 4.0 gguf | CPU | 4 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-gguf)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-gguf) | -| MiniCPM-V 4.0 int4 | GPU | 5 GB | int4量化版,更低显存占用 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-int4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-int4) | -| MiniCPM-V 4.0 AWQ | GPU | 5 GB | int4量化版,更低显存占用 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-AWQ)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-AWQ) | +| MiniCPM-V 4.5| GPU | 18 GB | 提供出色的端侧单图、多图、视频理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5) | +| MiniCPM-V 4.5 gguf | CPU | 8 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-gguf) | +| MiniCPM-V 4.5 int4 | GPU | 9 GB | int4量化版,更低显存占用 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4_5-int4) | | MiniCPM-o 2.6| GPU | 18 GB | 最新版本,提供端侧 GPT-4o 级的视觉、语音、多模态流式交互能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) | | MiniCPM-o 2.6 gguf | CPU | 8 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) | | MiniCPM-o 2.6 int4 | GPU | 9 GB | int4量化版,更低显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) | @@ -2448,7 +1209,7 @@ python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py ### 多轮对话 -请确保 `transformers==4.44.2`,其他版本目前可能会有兼容性问题 +如果您希望开启长思考模式,请向 `chat` 函数传入参数 `enable_thinking=True` ```shell pip install -r requirements_o2.6.txt @@ -2466,21 +1227,25 @@ from transformers import AutoModel, AutoTokenizer torch.manual_seed(100) -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB') +enable_thinking=False # If `enable_thinking=True`, the long-thinking mode is enabled. + # First round chat question = "What is the landform in the picture?" 
msgs = [{'role': 'user', 'content': [image, question]}] answer = model.chat( msgs=msgs, - tokenizer=tokenizer + tokenizer=tokenizer, + enable_thinking=enable_thinking ) + print(answer) # Second round chat, pass history context of multi-turn conversation @@ -2496,25 +1261,36 @@ print(answer) 你可以得到如下推理结果: -``` -"The landform in the picture is karst topography, characterized by its unique and striking limestone formations that rise dramatically from the surrounding landscape." +```shell +# round1 +The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion. -"When traveling to this picturesque location, you should pay attention to the weather conditions as they can change rapidly in such areas. It's also important to respect local ecosystems and wildlife by staying on designated paths and not disturbing natural habitats. Additionally, bringing appropriate gear for photography is advisable due to the stunning reflections and lighting during sunrise or sunset." +This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views. + +# round2 +When traveling to a karst landscape like this, here are some important tips: + +1. Wear comfortable shoes: The terrain can be uneven and hilly. +2. Bring water and snacks for energy during hikes or boat rides. +3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots. +4. Respect local customs and nature regulations by not littering or disturbing wildlife. + +By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains. ``` #### 多图对话
- 点击查看 MiniCPM-V-4 多图输入的 Python 代码。 + 点击查看 MiniCPM-V-4_5 多图输入的 Python 代码。 ```python import torch from PIL import Image from transformers import AutoModel, AutoTokenizer -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) image1 = Image.open('image1.jpg').convert('RGB') image2 = Image.open('image2.jpg').convert('RGB') @@ -2539,10 +1315,10 @@ import torch from PIL import Image from transformers import AutoModel, AutoTokenizer -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) question = "production date" image1 = Image.open('example1.jpg').convert('RGB') @@ -2567,53 +1343,103 @@ print(answer) #### 视频对话
- 点击查看 MiniCPM-V-4 视频输入的 Python 代码。 + 点击查看 MiniCPM-V-4_5 视频输入的 3D-Resampler 推理的 Python 代码。 ```python +## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids. +# To achieve this, you need to organize your video data into two corresponding sequences: +# frames: List[Image] +# temporal_ids: List[List[Int]]. + import torch from PIL import Image from transformers import AutoModel, AutoTokenizer from decord import VideoReader, cpu # pip install decord +from scipy.spatial import cKDTree +import numpy as np +import math -model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6 attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6 -MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number +MAX_NUM_FRAMES=180 # Indicates the maximum number of frames received after the videos are packed. The actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING. +MAX_NUM_PACKING=3 # indicates the maximum packing number of video frames. valid range: 1-6 +TIME_SCALE = 0.1 -def encode_video(video_path): +def map_to_nearest_scale(values, scale): + tree = cKDTree(np.asarray(scale)[:, None]) + _, indices = tree.query(np.asarray(values)[:, None]) + return np.asarray(scale)[indices] + + +def group_array(arr, size): + return [arr[i:i+size] for i in range(0, len(arr), size)] + +def encode_video(video_path, choose_fps=3, force_packing=None): def uniform_sample(l, n): gap = len(l) / n idxs = [int(i * gap + gap / 2) for i in range(n)] return [l[i] for i in idxs] - vr = VideoReader(video_path, ctx=cpu(0)) - sample_fps = round(vr.get_avg_fps() / 1) # FPS - frame_idx = [i for i in range(0, len(vr), sample_fps)] - if len(frame_idx) > MAX_NUM_FRAMES: - frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES) + fps = vr.get_avg_fps() + video_duration = len(vr) / fps + + if choose_fps * int(video_duration) <= MAX_NUM_FRAMES: + packing_nums = 1 + choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration)) + + else: + packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES) + if packing_nums <= MAX_NUM_PACKING: + choose_frames = round(video_duration * choose_fps) + else: + choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING) + packing_nums = MAX_NUM_PACKING + + frame_idx = [i for i in range(0, len(vr))] + frame_idx = np.array(uniform_sample(frame_idx, choose_frames)) + + if force_packing: + packing_nums = min(force_packing, MAX_NUM_PACKING) + + print(video_path, ' duration:', video_duration) + print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}') + frames = vr.get_batch(frame_idx).asnumpy() - frames = [Image.fromarray(v.astype('uint8')) for v in frames] - print('num frames:', len(frames)) - return frames + + frame_idx_ts = frame_idx / fps + scale = np.arange(0, video_duration, TIME_SCALE) + + frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE + frame_ts_id = frame_ts_id.astype(np.int32) + + assert len(frames) == len(frame_ts_id) + + frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames] + frame_ts_id_group = group_array(frame_ts_id, packing_nums) + + return 
frames, frame_ts_id_group + video_path="video_test.mp4" -frames = encode_video(video_path) +fps = 5 # fps for video +force_packing = None # You can set force_packing to ensure that 3D packing is forcibly enabled; otherwise, encode_video will dynamically set the packing quantity based on the duration. +frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing) + question = "Describe the video" msgs = [ {'role': 'user', 'content': frames + [question]}, ] -# Set decode params for video -params = {} -params["use_image_id"] = False -params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448 answer = model.chat( msgs=msgs, tokenizer=tokenizer, - **params + use_image_id=False, + max_slice_nums=1, + temporal_ids=frame_ts_id_group ) print(answer) ``` diff --git a/assets/minicpm-v-4dot5-framework.png b/assets/minicpm-v-4dot5-framework.png new file mode 100644 index 0000000..7fd5a95 Binary files /dev/null and b/assets/minicpm-v-4dot5-framework.png differ diff --git a/assets/minicpm_v_and_minicpm_o_title.png b/assets/minicpm_v_and_minicpm_o_title.png new file mode 100644 index 0000000..9c4c23a Binary files /dev/null and b/assets/minicpm_v_and_minicpm_o_title.png differ diff --git a/assets/minicpmv4_5/MiniCPM-V 4.5-8.26.mp4 b/assets/minicpmv4_5/MiniCPM-V 4.5-8.26.mp4 new file mode 100644 index 0000000..875d3c1 Binary files /dev/null and b/assets/minicpmv4_5/MiniCPM-V 4.5-8.26.mp4 differ diff --git a/assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg b/assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg new file mode 100644 index 0000000..e67c808 Binary files /dev/null and b/assets/minicpmv4_5/MiniCPM-V 4.5-8.26_img.jpeg differ diff --git a/assets/minicpmv4_5/en_case1.jpeg b/assets/minicpmv4_5/en_case1.jpeg new file mode 100644 index 0000000..18a796f Binary files /dev/null and b/assets/minicpmv4_5/en_case1.jpeg differ diff --git a/assets/minicpmv4_5/en_case1.png b/assets/minicpmv4_5/en_case1.png new file mode 100644 index 0000000..406c3a3 Binary files /dev/null and b/assets/minicpmv4_5/en_case1.png differ diff --git a/assets/minicpmv4_5/en_case2.jpeg b/assets/minicpmv4_5/en_case2.jpeg new file mode 100644 index 0000000..c7d3e6b Binary files /dev/null and b/assets/minicpmv4_5/en_case2.jpeg differ diff --git a/assets/minicpmv4_5/en_case2.png b/assets/minicpmv4_5/en_case2.png new file mode 100644 index 0000000..02be125 Binary files /dev/null and b/assets/minicpmv4_5/en_case2.png differ diff --git a/assets/minicpmv4_5/en_case3.jpeg b/assets/minicpmv4_5/en_case3.jpeg new file mode 100644 index 0000000..bf2ec37 Binary files /dev/null and b/assets/minicpmv4_5/en_case3.jpeg differ diff --git a/assets/minicpmv4_5/en_case4.jpeg b/assets/minicpmv4_5/en_case4.jpeg new file mode 100644 index 0000000..ac99502 Binary files /dev/null and b/assets/minicpmv4_5/en_case4.jpeg differ diff --git a/assets/minicpmv4_5/en_extra.jpg b/assets/minicpmv4_5/en_extra.jpg new file mode 100644 index 0000000..9f3c2ec Binary files /dev/null and b/assets/minicpmv4_5/en_extra.jpg differ diff --git a/assets/minicpmv4_5/v45_cn_handwriting.gif b/assets/minicpmv4_5/v45_cn_handwriting.gif new file mode 100644 index 0000000..49c48f1 Binary files /dev/null and b/assets/minicpmv4_5/v45_cn_handwriting.gif differ diff --git a/assets/minicpmv4_5/v45_cn_travel.gif b/assets/minicpmv4_5/v45_cn_travel.gif new file mode 100644 index 0000000..f2bdbb0 Binary files /dev/null and b/assets/minicpmv4_5/v45_cn_travel.gif differ diff --git a/assets/minicpmv4_5/v45_en_cot.gif b/assets/minicpmv4_5/v45_en_cot.gif new file mode 
100644 index 0000000..0330a3e Binary files /dev/null and b/assets/minicpmv4_5/v45_en_cot.gif differ diff --git a/assets/minicpmv4_5/v45_en_handwriting.gif b/assets/minicpmv4_5/v45_en_handwriting.gif new file mode 100644 index 0000000..9a00e4d Binary files /dev/null and b/assets/minicpmv4_5/v45_en_handwriting.gif differ diff --git a/assets/minicpmv4_5/zh_case1.jpeg b/assets/minicpmv4_5/zh_case1.jpeg new file mode 100644 index 0000000..e671d0f Binary files /dev/null and b/assets/minicpmv4_5/zh_case1.jpeg differ diff --git a/assets/minicpmv4_5/zh_case2.jpeg b/assets/minicpmv4_5/zh_case2.jpeg new file mode 100644 index 0000000..d4353d6 Binary files /dev/null and b/assets/minicpmv4_5/zh_case2.jpeg differ diff --git a/assets/minicpmv4_5/zh_extra.jpeg b/assets/minicpmv4_5/zh_extra.jpeg new file mode 100644 index 0000000..36ce5cc Binary files /dev/null and b/assets/minicpmv4_5/zh_extra.jpeg differ diff --git a/assets/minicpmv_4_5_evaluation_results.jpg b/assets/minicpmv_4_5_evaluation_results.jpg new file mode 100644 index 0000000..66baeb8 Binary files /dev/null and b/assets/minicpmv_4_5_evaluation_results.jpg differ diff --git a/assets/minicpmv_4_5_evaluation_results.png b/assets/minicpmv_4_5_evaluation_results.png new file mode 100644 index 0000000..c37fc13 Binary files /dev/null and b/assets/minicpmv_4_5_evaluation_results.png differ diff --git a/assets/radar_minicpm_v45.png b/assets/radar_minicpm_v45.png new file mode 100644 index 0000000..9586444 Binary files /dev/null and b/assets/radar_minicpm_v45.png differ diff --git a/docs/minicpm_v2dot6_en.md b/docs/minicpm_v2dot6_en.md new file mode 100644 index 0000000..9ef6dac --- /dev/null +++ b/docs/minicpm_v2dot6_en.md @@ -0,0 +1,945 @@ +## MiniCPM-V 2.6 + +> Archieve at: 2025-01-13 + +**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include: + +- 🔥 **Leading Performance.** + MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. + +- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability. + +- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles. + +- 💪 **Strong OCR Capability and Others.** + MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. 
+  Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** on English, Chinese, German, French, Italian, Korean, etc.
+
+- 🚀 **Superior Efficiency.**
+  In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad.
+
+- 💫 **Easy Usage.**
+  MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](http://120.92.209.146:8887/).
+
+### Evaluation
+ +
+ +
+Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeToken Density+OpenCompassMMEMMVetOCRBenchMMMU valMathVista miniMMB1.1 testAI2DTextVQA valDocVQA testHallusionBenchObject HalBench
Proprietary
GPT-4o-108869.92328.769.173669.261.382.284.6-92.855.017.6
Claude 3.5 Sonnet-75067.91920.066.078865.961.678.580.2-95.249.913.8
Gemini 1.5 Pro--64.42110.664.075460.657.773.979.173.586.545.6-
GPT-4o mini-108864.12003.466.978560.052.476.077.8--46.112.4
GPT-4V-108863.52070.267.565661.754.779.878.678.087.243.914.2
Step-1V--59.52206.463.362549.944.878.079.271.6-48.4-
Qwen-VL-Max-78458.32281.761.868452.043.474.675.779.593.141.213.4
Open-source
LLaVA-NeXT-Yi-34B34B15755.02006.550.757448.840.477.878.969.3-34.812.6
Mini-Gemini-HD-34B34B157-2141.059.351848.043.3-80.574.178.9--
Cambrian-34B34B182058.32049.953.259150.450.377.879.576.775.541.614.7
GLM-4V-9B13B78459.12018.858.077646.951.167.971.2--45.0-
InternVL2-8B8B70664.12215.154.379451.258.379.483.677.491.645.021.3
MiniCPM-Llama-V 2.58B188258.82024.652.872545.854.372.078.476.684.842.410.3
MiniCPM-V 2.68B282265.22348.4*60.0852*49.8*60.678.082.180.190.848.1*8.2
+ +
+* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set. + ++ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens. + +Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation. + +
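The Token Density column can be reproduced directly from this definition. Below is a minimal sketch (the `token_density` helper is ours for illustration, not part of the repository or the model API), using the 1344x1344 maximum resolution and 640 visual tokens quoted above for MiniCPM-V 2.6:

```python
# Illustration only: token density = pixels at the maximum supported resolution / number of visual tokens.
def token_density(max_pixels: int, num_visual_tokens: int) -> float:
    """Average number of pixels encoded into each visual token."""
    return max_pixels / num_visual_tokens

# MiniCPM-V 2.6 encodes a 1344x1344 (~1.8M pixel) image into 640 visual tokens.
print(round(token_density(1344 * 1344, 640)))  # -> 2822, matching the table above
```

The same formula only approximates the proprietary-model entries, since their visual token counts are upper-bound estimates derived from the API charging strategy.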
+ + +
+Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeMantis EvalBLINK valMathverse mvSciverse mvMIRB
Proprietary
GPT-4V-62.754.660.366.953.1
LLaVA-NeXT-Interleave-14B14B66.452.632.730.2-
Open-source
Emu2-Chat37B37.836.2-27.2-
CogVLM17B45.241.1---
VPG-C7B52.443.124.323.1-
VILA 8B8B51.239.3-36.5-
InternLM-XComposer-2.58B53.1*48.932.1*-42.5
InternVL2-8B8B59.0*50.930.5*34.4*56.9*
MiniCPM-V 2.68B69.153.084.974.953.8
+ +
+* We evaluate the officially released checkpoint by ourselves. +
+ +
+Click to view video results on Video-MME and Video-ChatGPT. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeVideo-MMEVideo-ChatGPT
w/o subsw subsCorrectnessDetailContextTemporalConsistency
Proprietary
Claude 3.5 Sonnet-60.062.9-----
GPT-4V-59.963.3-----
Open-source
LLaVA-NeXT-7B7B--3.393.293.922.603.12
LLaVA-NeXT-34B34B--3.293.233.832.513.47
CogVLM2-Video12B--3.493.463.232.983.64
LongVA7B52.454.33.053.093.772.443.64
InternVL2-8B8B54.056.9-----
InternLM-XComposer-2.58B55.8------
LLaVA-NeXT-Video32B60.263.03.483.373.952.643.28
MiniCPM-V 2.68B60.963.63.593.283.932.733.62
+
+
+ + +
+Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeShotTextVQA valVizWiz test-devVQAv2 test-devOK-VQA val
Flamingo80B0*35.031.656.340.6
436.539.663.157.4
837.344.865.657.5
IDEFICS80B0*30.936.060.045.2
434.340.463.652.4
835.746.164.855.1
OmniCorpus7B0*43.049.863.245.5
445.451.364.546.5
845.652.264.746.6
Emu237B026.440.433.526.7
448.254.667.053.2
849.354.767.854.1
MM130B026.240.448.926.7
849.354.770.954.1
MiniCPM-V 2.6+8B043.933.845.423.9
463.660.565.550.1
864.663.468.251.4
+ + +
+* denotes zero image shot and two additional text shots following Flamingo. + ++ We evaluate the pretraining ckpt without SFT. +
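To make the footnote concrete, the zero-image-shot protocol can be written with the same `msgs` format used by `model.chat` elsewhere in this document: the two in-context shots are text-only question/answer pairs, and only the test sample carries an image. The snippet below is a rough sketch of that prompt layout under our reading of the Flamingo-style setup; the file name, questions, and answers are placeholders, and it is not the actual evaluation harness.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

# Two additional text shots (no images), following the Flamingo zero-shot setup.
text_shots = [
    {'role': 'user', 'content': ["Question: What color is the bus? Short answer:"]},              # placeholder
    {'role': 'assistant', 'content': ["blue"]},                                                   # placeholder
    {'role': 'user', 'content': ["Question: How many people are in the photo? Short answer:"]},   # placeholder
    {'role': 'assistant', 'content': ["two"]},                                                    # placeholder
]

# Zero image shots: only the test sample contains an image.
test_image = Image.open('test.jpg').convert('RGB')  # placeholder path
msgs = text_shots + [
    {'role': 'user', 'content': [test_image, "Question: What is written on the sign? Short answer:"]}
]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```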
+ +### Examples + +
+ Bike + Menu + Code + Mem + medal +
+
+ Click to view more cases. +
+ elec + Menu +
+
+ +We deploy MiniCPM-V 2.6 on end devices. The demo video is the raw screen recording on an iPad Pro without any editing. + +

+ +
+ + + +### Multi-turn Conversation + + +
+ +
+ + +```python +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer + +torch.manual_seed(0) + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) + +image = Image.open('./assets/airplane.jpeg').convert('RGB') + +# First round chat +question = "Tell me the model of this aircraft." +msgs = [{'role': 'user', 'content': [image, question]}] + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) +print(answer) + +# Second round chat +# pass history context of multi-turn conversation +msgs.append({"role": "assistant", "content": [answer]}) +msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]}) + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) +print(answer) +``` + +You could get the following output: + +``` +"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database." + +"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry." +``` + +#### Multi-image Understanding +
+ Click to view Python example of MiniCPM-V 2.6 multi-image understanding + +```python +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) + +image1 = Image.open('image1.jpg').convert('RGB') +image2 = Image.open('image2.jpg').convert('RGB') +question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.' + +msgs = [{'role': 'user', 'content': [image1, image2, question]}] + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) +print(answer) +``` +
+ +#### Few-shot In-Context-Learning + +
+ Click to view Python example of MiniCPM-V 2.6 few-shot in-context-learning example + +```python +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) + +question = "production date" +image1 = Image.open('example1.jpg').convert('RGB') +answer1 = "2023.08.04" +image2 = Image.open('example2.jpg').convert('RGB') +answer2 = "2007.04.24" +image_test = Image.open('test.jpg').convert('RGB') + +msgs = [ + {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]}, + {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]}, + {'role': 'user', 'content': [image_test, question]} +] + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) +print(answer) +``` +
+ +#### Video understanding +
+ Click to view Python example of MiniCPM-V 2.6 video understanding + +```python +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer +from decord import VideoReader, cpu # pip install decord + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) + +MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number + +def encode_video(video_path): + def uniform_sample(l, n): + gap = len(l) / n + idxs = [int(i * gap + gap / 2) for i in range(n)] + return [l[i] for i in idxs] + + vr = VideoReader(video_path, ctx=cpu(0)) + sample_fps = round(vr.get_avg_fps() / 1) # FPS + frame_idx = [i for i in range(0, len(vr), sample_fps)] + if len(frame_idx) > MAX_NUM_FRAMES: + frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES) + frames = vr.get_batch(frame_idx).asnumpy() + frames = [Image.fromarray(v.astype('uint8')) for v in frames] + print('num frames:', len(frames)) + return frames + +video_path="video_test.mp4" +frames = encode_video(video_path) +question = "Describe the video" +msgs = [ + {'role': 'user', 'content': frames + [question]}, +] + +# Set decode params for video +params = {} +params["use_image_id"] = False +params["max_slice_nums"] = 2 # 如果cuda OOM且视频分辨率大于448*448可设为1 + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer, + **params +) +print(answer) +``` +
diff --git a/docs/minicpm_v2dot6_zh.md b/docs/minicpm_v2dot6_zh.md new file mode 100644 index 0000000..5b58aa3 --- /dev/null +++ b/docs/minicpm_v2dot6_zh.md @@ -0,0 +1,763 @@ +## MiniCPM-V 2.6 + +> Archieve at: 2025-08-25 + +**MiniCPM-V 2.6** 是 MiniCPM-V 系列中最新、性能最佳的模型。该模型基于 SigLip-400M 和 Qwen2-7B 构建,共 8B 参数。与 MiniCPM-Llama3-V 2.5 相比,MiniCPM-V 2.6 性能提升显著,并引入了多图和视频理解的新功能。MiniCPM-V 2.6 的主要特点包括: + + +- 🔥 **领先的性能。** + MiniCPM-V 2.6 在最新版本 OpenCompass 榜单上(综合 8 个主流多模态评测基准)平均得分 65.2,**以8B量级的大小在单图理解方面超越了 GPT-4o mini、GPT-4V、Gemini 1.5 Pro 和 Claude 3.5 Sonnet 等主流商用闭源多模态大模型**。 + +- 🖼️ **多图理解和上下文学习。** + MiniCPM-V 2.6 还支持**多图对话和推理**。它在 Mantis-Eval、BLINK、Mathverse mv 和 Sciverse mv 等主流多图评测基准中取得了**最佳水平**,并展现出了优秀的上下文学习能力。 + +- 🎬 **视频理解。** + MiniCPM-V 2.6 还可以**接受视频输入**,进行对话和提供涵盖时序和空间信息的详细视频描述。模型在 有/无字幕 评测场景下的 Video-MME 表现均超过了 **GPT-4V、Claude 3.5 Sonnet 和 LLaVA-NeXT-Video-34B**等商用闭源模型。 + +- 💪 **强大的 OCR 能力及其他功能。** + MiniCPM-V 2.6 可以处理任意长宽比的图像,像素数可达 180 万(如 1344x1344)。在 OCRBench 上取得**最佳水平,超过 GPT-4o、GPT-4V 和 Gemini 1.5 Pro 等商用闭源模型**。基于最新的 [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) 和 [VisCPM](https://github.com/OpenBMB/VisCPM) 技术,其具备了**可信的多模态行为**,在 Object HalBench 上的幻觉率显著低于 GPT-4o 和 GPT-4V,并支持英语、中文、德语、法语、意大利语、韩语等**多种语言**。 + +- 🚀 **卓越的效率。** + 除了对个人用户友好的模型大小,MiniCPM-V 2.6 还表现出**最先进的视觉 token 密度**(即每个视觉 token 编码的像素数量)。它**仅需 640 个 token 即可处理 180 万像素图像,比大多数模型少 75%**。这一特性优化了模型的推理速度、首 token 延迟、内存占用和功耗。因此,MiniCPM-V 2.6 可以支持 iPad 等终端设备上的高效**实时视频理解**。 + +- 💫 **易于使用。** + MiniCPM-V 2.6 可以通过多种方式轻松使用:(1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) 和 [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) 支持在本地设备上进行高效的 CPU 推理,(2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) 和 [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) 格式的量化模型,有 16 种尺寸,(3) [vLLM](#vllm-部署-) 支持高吞吐量和内存高效的推理,(4) 针对新领域和任务进行微调,(5) 使用 [Gradio](#本地-webui-demo-) 快速设置本地 WebUI 演示,(6) 在线[demo](http://120.92.209.146:8887/)即可体验。 + +### 性能评估 +
+ +
+ +
+点击查看 OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench 上的单图评测结果详情。 +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeToken Density+OpenCompassMMEMMVetOCRBenchMMMU valMathVista miniMMB1.1 testAI2DTextVQA valDocVQA testHallusionBenchObject HalBench
Proprietary
GPT-4o-108869.92328.769.173669.261.382.284.6-92.855.017.6
Claude 3.5 Sonnet-75067.91920.066.078865.961.678.580.2-95.249.913.8
Gemini 1.5 Pro--64.42110.664.075460.657.773.979.173.586.545.6-
GPT-4o mini-108864.12003.466.978560.052.476.077.8--46.112.4
GPT-4V-108863.52070.267.565661.754.779.878.678.087.243.914.2
Step-1V--59.52206.463.362549.944.878.079.271.6-48.4-
Qwen-VL-Max-78458.32281.761.868452.043.474.675.779.593.141.213.4
Open-source
LLaVA-NeXT-Yi-34B34B15755.02006.550.757448.840.477.878.969.3-34.812.6
Mini-Gemini-HD-34B34B157-214159.351848.043.3-80.574.178.9--
Cambrian-34B34B182058.32049.953.259150.450.377.879.576.775.541.614.7
GLM-4V-9B13B78459.12018.858.077646.951.167.971.2--45.0-
InternVL2-8B8B70664.12215.154.379451.258.379.483.677.491.645.021.3
MiniCPM-Llama-V 2.58B188258.82024.652.872545.854.372.078.476.684.842.410.3
MiniCPM-V 2.68B282265.22348.4*60.0852*49.8*60.678.082.180.190.848.1*8.2
+ +
+* 我们使用思维链提示词来评估这些基准。 + ++ Token Density:每个视觉 token 在最大分辨率下编码的像素数,即最大分辨率下的像素数 / 视觉 token 数。 + +注意:闭源模型的 Token Density 由 API 收费方式估算得到。 +
+ + +
+点击查看 Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB 上的多图评测结果详情。 +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeMantis EvalBLINK valMathverse mvSciverse mvMIRB
Proprietary
GPT-4V-62.754.660.366.953.1
LLaVA-NeXT-Interleave-14B14B66.452.632.730.2-
Open-source
Emu2-Chat37B37.836.2-27.2-
CogVLM17B45.241.1---
VPG-C7B52.443.124.323.1-
VILA 8B8B51.239.3-36.5-
InternLM-XComposer-2.58B53.1*48.932.1*-42.5
InternVL2-8B8B59.0*50.930.5*34.4*56.9*
MiniCPM-V 2.68B69.153.084.974.953.8
+ + +
+* 正式开源模型权重的评测结果。 +
+ +
+点击查看 Video-MME 和 Video-ChatGPT 上的视频评测结果详情。 +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeVideo-MMEVideo-ChatGPT
w/o subsw subsCorrectnessDetailContextTemporalConsistency
Proprietary
Claude 3.5 Sonnet-60.062.9-----
GPT-4V-59.963.3-----
Open-source
LLaVA-NeXT-7B7B--3.393.293.922.603.12
LLaVA-NeXT-34B34B--3.293.233.832.513.47
CogVLM2-Video12B--3.493.463.232.983.64
LongVA7B52.454.33.053.093.772.443.64
InternVL2-8B8B54.056.9-----
InternLM-XComposer-2.58B55.8------
LLaVA-NeXT-Video32B60.263.03.483.373.952.643.28
MiniCPM-V 2.68B60.963.63.593.283.932.733.62
+
+
+ + +
+点击查看 TextVQA, VizWiz, VQAv2, OK-VQA上的少样本评测结果详情。 +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ModelSizeShotTextVQA valVizWiz test-devVQAv2 test-devOK-VQA val
Flamingo80B0*35.031.656.340.6
436.539.663.157.4
837.344.865.657.5
IDEFICS80B0*30.936.060.045.2
434.340.463.652.4
835.746.164.855.1
OmniCorpus7B0*43.049.863.245.5
445.451.364.546.5
845.652.264.746.6
Emu237B026.440.433.526.7
448.254.667.053.2
849.354.767.854.1
MM130B026.240.448.926.7
849.354.770.954.1
MiniCPM-V 2.6+8B043.933.845.423.9
463.660.565.550.1
864.663.468.251.4
+ + +
+* 使用 Flamingo 方式 zero image shot 和 two additional text shots 评估零样本性能。 + ++ 我们在没有进行监督微调 (SFT) 的情况下评估预训练的模型权重 (ckpt)。 +
+ +### 典型示例 + +
+ Bike + Menu + Code + Mem + medal +
+
+ 点击查看更多示例。 +
+ elec + Menu +
+
+ +我们将 MiniCPM-V 2.6 部署在iPad Pro上,并录制了以下演示视频。 + + +

+ +      + +

+
+ + +

+ + +

+
+ +
diff --git a/docs/minicpm_v4_en.md b/docs/minicpm_v4_en.md
new file mode 100644
index 0000000..02ae692
--- /dev/null
+++ b/docs/minicpm_v4_en.md
@@ -0,0 +1,556 @@
+## MiniCPM-V 4.0
+
+> Archived at: 2025-08-25
+
+**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. The model is built on SigLIP2-400M and MiniCPM4-3B with a total of 4.1B parameters. It inherits the strong single-image, multi-image and video understanding performance of MiniCPM-V 2.6 with largely improved efficiency. Notable features of MiniCPM-V 4.0 include:
+
+- 🔥 **Leading Visual Capability.**
+  With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2) and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also shows good performance in multi-image understanding and video understanding.
+
+- 🚀 **Superior Efficiency.**
+  Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it delivers **less than 2s first token delay and more than 17 token/s decoding on iPhone 16 Pro Max**, without heating problems. It also shows superior throughput under concurrent requests.
+
+- 💫 **Easy Usage.**
+  MiniCPM-V 4.0 can be easily used in various ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory and a local web demo**. We also open-source an iOS App that can run on iPhone and iPad. Get started easily with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), featuring detailed instructions and practical examples.
+
+### Evaluation
+
+Click to view single image results on OpenCompass. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
modelSizeOpencompassOCRBenchMathVistaHallusionBenchMMMUMMVetMMBench V1.1MMStarAI2D
Proprietary
GPT-4v-20240409-63.565655.243.961.767.579.856.078.6
Gemini-1.5-Pro-64.575458.345.660.664.073.959.179.1
GPT-4.1-mini-20250414-68.984070.949.355.074.380.960.976.0
Claude 3.5 Sonnet-20241022-70.679865.355.566.470.181.765.181.2
Open-source
Qwen2.5-VL-3B-Instruct3.8B64.582861.246.651.260.076.856.381.4
InternVL2.5-4B3.7B65.182060.846.651.861.578.258.781.4
Qwen2.5-VL-7B-Instruct8.3B70.988868.151.958.069.782.264.184.3
InternVL2.5-8B8.1B68.182164.549.056.262.882.563.284.6
MiniCPM-V-2.68.1B65.285260.848.149.860.078.057.582.1
MiniCPM-o-2.68.7B70.288973.351.150.967.280.663.386.1
MiniCPM-V-4.04.1B69.089466.950.851.268.079.762.882.9
+
+ +
+ +
+Click to view single image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench and MM Halbench. + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
modelSizeChartQAMMERealWorldQATextVQADocVQAMathVisionDynaMathWeMathObj HalMM Hal
CHAIRs↓CHAIRi↓score avg@3↑hall rate avg@3↓
Proprietary
GPT-4v-20240409-78.5192761.478.088.4-------
Gemini-1.5-Pro-87.2-67.578.893.141.031.550.5----
GPT-4.1-mini-20250414------45.347.7-----
Claude 3.5 Sonnet-20241022-90.8-60.174.195.235.635.744.0----
Open-source
Qwen2.5-VL-3B-Instruct3.8B84.0215765.479.393.921.913.222.918.310.83.9 33.3
InternVL2.5-4B3.7B84.0233864.376.891.618.415.221.213.78.73.2 46.5
Qwen2.5-VL-7B-Instruct8.3B87.3234768.584.995.725.421.836.213.37.94.1 31.6
InternVL2.5-8B8.1B84.8234470.179.193.017.09.423.518.311.63.6 37.2
MiniCPM-V-2.68.1B79.4234865.080.190.817.59.020.47.34.74.0 29.9
MiniCPM-o-2.68.7B86.9237268.182.093.521.710.425.26.33.44.1 31.3
MiniCPM-V-4.04.1B84.4229868.580.892.920.714.232.76.33.54.1 29.2
+
+ +
+ +
+Click to view multi-image and video understanding results on Mantis, Blink and Video-MME. +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| model | Size | Mantis | Blink | Video-MME (wo subs) | Video-MME (w subs) |
|:------|:----:|:------:|:-----:|:-------------------:|:-------------------:|
| **Proprietary** | | | | | |
| GPT-4v-20240409 | - | 62.7 | 54.6 | 59.9 | 63.3 |
| Gemini-1.5-Pro | - | - | 59.1 | 75.0 | 81.3 |
| GPT-4o-20240513 | - | - | 68.0 | 71.9 | 77.2 |
| **Open-source** | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | - | 47.6 | 61.5 | 67.6 |
| InternVL2.5-4B | 3.7B | 62.7 | 50.8 | 62.3 | 63.6 |
| Qwen2.5-VL-7B-Instruct | 8.3B | - | 56.4 | 65.1 | 71.6 |
| InternVL2.5-8B | 8.1B | 67.7 | 54.8 | 64.2 | 66.9 |
| MiniCPM-V-2.6 | 8.1B | 69.1 | 53.0 | 60.9 | 63.6 |
| MiniCPM-o-2.6 | 8.7B | 71.9 | 56.7 | 63.9 | 69.6 |
| MiniCPM-V-4.0 | 4.1B | 71.4 | 54.0 | 61.2 | 65.8 |
+
+ +
+ +### Examples + +
+ math +
+ +We deploy MiniCPM-V 4.0 on iPhone 16 Pro Max with [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md). The demo video is the raw screen recording without any editing. + +

+ +      + +

+

+ +      + +

+
+ + diff --git a/docs/minicpm_v4_zh.md b/docs/minicpm_v4_zh.md new file mode 100644 index 0000000..20c7e83 --- /dev/null +++ b/docs/minicpm_v4_zh.md @@ -0,0 +1,557 @@ +## MiniCPM-V 4.0 + +> Archieve at: 2025-08-25 + +MiniCPM-V 4.0 是 MiniCPM-V 系列中的最新模型。该模型基于 SigLIP2-400M 和 MiniCPM4-3B 构建,参数总量为 4.1B。它延续了 MiniCPM-V 2.6 在单图、多图和视频理解方面的强大能力,同时大幅提升了推理效率。MiniCPM-V 4.0 的主要特点包括: + +- 🔥 **领先的视觉能力。** +MiniCPM-V 4.0 在 OpenCompass 上获得了平均 69.0 的高分,超越了 MiniCPM-V 2.6(8.1B,得分 65.2)、 Qwen2.5-VL-3B-Instruct(3.8B,得分 64.5)和**广泛使用的闭源模型 GPT-4.1-mini-20250414**。在多图理解与视频理解任务上,MiniCPM-V 4.0 也表现出色。 + +- 🚀 **卓越的效率。** +MiniCPM-V 4.0 专为端侧设备优化,**可在 iPhone 16 Pro Max 上流畅运行,首 token 延迟低至 2 秒,解码速度达 17.9 tokens/s**,且无发热问题。MiniCPM-V 4.0 在并发请求场景下表现出领先的吞吐率指标。 + +- 💫 **易于使用。** +MiniCPM-V 4.0 支持多种推理方式,包括 **llama.cpp、Ollama、vLLM、SGLang、LLaMA-Factory 及本地 Web Demo 等**。我们还开源了可以在 iPhone 和 iPad 运行的 iOS App。欢迎参考我们开源的 **结构清晰的[使用手册](https://github.com/OpenSQZ/MiniCPM-V-CookBook)** 玩转 MiniCPM-V 4.0,其中涵盖了详细的部署指南和真实示例。 + + +### 性能评估 + + +
+点击查看在OpenCompass上的单图理解能力的评测结果。 +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
modelSizeOpencompassOCRBenchMathVistaHallusionBenchMMMUMMVetMMBench V1.1MMStarAI2D
Proprietary
GPT-4v-20240409-63.565655.243.961.767.579.856.078.6
Gemini-1.5-Pro-64.575458.345.660.664.073.959.179.1
GPT-4.1-mini-20250414-68.984070.949.355.074.380.960.976.0
Claude 3.5 Sonnet-20241022-70.679865.355.566.470.181.765.181.2
Open-source
Qwen2.5-VL-3B-Instruct3.8B64.582861.246.651.260.076.856.381.4
InternVL2.5-4B3.7B65.182060.846.651.861.578.258.781.4
Qwen2.5-VL-7B-Instruct8.3B70.988868.151.958.069.782.264.184.3
InternVL2.5-8B8.1B68.182164.549.056.262.882.563.284.6
MiniCPM-V-2.68.1B65.285260.848.149.860.078.057.582.1
MiniCPM-o-2.68.7B70.288973.351.150.967.280.663.386.1
MiniCPM-V-4.04.1B69.089466.950.851.268.079.762.882.9
+
+ +
+ +
+点击查看在图表理解、文档理解、数学推理、幻觉等领域的评测结果。 + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
modelSizeChartQAMMERealWorldQATextVQADocVQAMathVisionDynaMathWeMathObj HalMM Hal
CHAIRs↓CHAIRi↓score avg@3↑hall rate avg@3↓
Proprietary
GPT-4v-20240409-78.5192761.478.088.4-------
Gemini-1.5-Pro-87.2-67.578.893.141.031.550.5----
GPT-4.1-mini-20250414------45.347.7-----
Claude 3.5 Sonnet-20241022-90.8-60.174.195.235.635.744.0----
Open-source
Qwen2.5-VL-3B-Instruct3.8B84.0215765.479.393.921.913.222.918.310.83.9 33.3
InternVL2.5-4B3.7B84.0233864.376.891.618.415.221.213.78.73.2 46.5
Qwen2.5-VL-7B-Instruct8.3B87.3234768.584.995.725.421.836.213.37.94.1 31.6
InternVL2.5-8B8.1B84.8234470.179.193.017.09.423.518.311.63.6 37.2
MiniCPM-V-2.68.1B79.4234865.080.190.817.59.020.47.34.74.0 29.9
MiniCPM-o-2.68.7B86.9237268.182.093.521.710.425.26.33.44.1 31.3
MiniCPM-V-4.04.1B84.4229868.580.892.920.714.232.76.33.54.1 29.2
+
+ +
+ +
+点击查看多图和视频理解能力的评测结果。 +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| model | Size | Mantis | Blink | Video-MME (wo subs) | Video-MME (w subs) |
|:------|:----:|:------:|:-----:|:-------------------:|:-------------------:|
| **Proprietary** | | | | | |
| GPT-4v-20240409 | - | 62.7 | 54.6 | 59.9 | 63.3 |
| Gemini-1.5-Pro | - | - | 59.1 | 75.0 | 81.3 |
| GPT-4o-20240513 | - | - | 68.0 | 71.9 | 77.2 |
| **Open-source** | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | - | 47.6 | 61.5 | 67.6 |
| InternVL2.5-4B | 3.7B | 62.7 | 50.8 | 62.3 | 63.6 |
| Qwen2.5-VL-7B-Instruct | 8.3B | - | 56.4 | 65.1 | 71.6 |
| InternVL2.5-8B | 8.1B | 67.7 | 54.8 | 64.2 | 66.9 |
| MiniCPM-V-2.6 | 8.1B | 69.1 | 53.0 | 60.9 | 63.6 |
| MiniCPM-o-2.6 | 8.7B | 71.9 | 56.7 | 63.9 | 69.6 |
| MiniCPM-V-4.0 | 4.1B | 71.4 | 54.0 | 61.2 | 65.8 |
+
+ +
+ +### 典型示例 + +
+ math +
+ + +我们在 iPhone 16 Pro Max 上部署了 MiniCPM-V 4.0 [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md),并录制了以下演示录屏,视频未经加速等任何编辑: + + +

+ +      + +

+

+ +      + +

+