From 509e934a59e5a229ff9ae9226411298fadd3ff36 Mon Sep 17 00:00:00 2001 From: tc-mb Date: Fri, 29 Aug 2025 01:00:45 +0800 Subject: [PATCH 1/7] update video link Signed-off-by: tc-mb --- README.md | 2 +- README_zh.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 1247680..2a66c0b 100644 --- a/README.md +++ b/README.md @@ -174,7 +174,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github ### Examples
- +
diff --git a/README_zh.md b/README_zh.md index 7540562..68d846d 100644 --- a/README_zh.md +++ b/README_zh.md @@ -165,7 +165,7 @@ ### 典型示例
- +
From 02c68764d4ef723098e99fad0591e424dca428bf Mon Sep 17 00:00:00 2001 From: YuzaChongyi <490083538@qq.com> Date: Fri, 29 Aug 2025 23:52:31 +0800 Subject: [PATCH 2/7] update readme (#984) Co-authored-by: wangchongyi <> --- README.md | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++ README_zh.md | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 157 insertions(+) diff --git a/README.md b/README.md index 2a66c0b..5d7d269 100644 --- a/README.md +++ b/README.md @@ -171,6 +171,85 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
+### Inference Efficiency + + +**OpenCompass** +
+<table>
+  <tr>
+    <th>Model</th>
+    <th>Size</th>
+    <th>Avg Score ↑</th>
+    <th>Total Inference Time ↓</th>
+  </tr>
+  <tr>
+    <td>GLM-4.1V-9B-Thinking</td>
+    <td>10.3B</td>
+    <td>76.6</td>
+    <td>17.5h</td>
+  </tr>
+  <tr>
+    <td>MiMo-VL-7B-RL</td>
+    <td>8.3B</td>
+    <td>76.4</td>
+    <td>11h</td>
+  </tr>
+  <tr>
+    <td>MiniCPM-V-4_5</td>
+    <td>8.7B</td>
+    <td>77.0</td>
+    <td>7.5h</td>
+  </tr>
+</table>
+ +**Video-MME** + +
+<table>
+  <tr>
+    <th>Model</th>
+    <th>Size</th>
+    <th>Avg Score ↑</th>
+    <th>Total Inference Time ↓</th>
+    <th>GPU Mem ↓</th>
+  </tr>
+  <tr>
+    <td>Qwen2.5-VL-7B-Instruct</td>
+    <td>8.3B</td>
+    <td>71.6</td>
+    <td>3h</td>
+    <td>60G</td>
+  </tr>
+  <tr>
+    <td>GLM-4.1V-9B-Thinking</td>
+    <td>10.3B</td>
+    <td>73.6</td>
+    <td>2.63h</td>
+    <td>32G</td>
+  </tr>
+  <tr>
+    <td>MiniCPM-V-4_5</td>
+    <td>8.7B</td>
+    <td>73.5</td>
+    <td>0.26h</td>
+    <td>28G</td>
+  </tr>
+</table>
+ +Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME excludes the cost of video frame extraction. + + ### Examples
diff --git a/README_zh.md b/README_zh.md index 68d846d..eff0b4f 100644 --- a/README_zh.md +++ b/README_zh.md @@ -163,6 +163,84 @@
+### 推理效率 + + +**OpenCompass** +
+<table>
+  <tr>
+    <th>Model</th>
+    <th>Size</th>
+    <th>Avg Score ↑</th>
+    <th>Total Inference Time ↓</th>
+  </tr>
+  <tr>
+    <td>GLM-4.1V-9B-Thinking</td>
+    <td>10.3B</td>
+    <td>76.6</td>
+    <td>17.5h</td>
+  </tr>
+  <tr>
+    <td>MiMo-VL-7B-RL</td>
+    <td>8.3B</td>
+    <td>76.4</td>
+    <td>11h</td>
+  </tr>
+  <tr>
+    <td>MiniCPM-V-4_5</td>
+    <td>8.7B</td>
+    <td>77.0</td>
+    <td>7.5h</td>
+  </tr>
+</table>
+ +**Video-MME** + +
+<table>
+  <tr>
+    <th>Model</th>
+    <th>Size</th>
+    <th>Avg Score ↑</th>
+    <th>Total Inference Time ↓</th>
+    <th>GPU Mem ↓</th>
+  </tr>
+  <tr>
+    <td>Qwen2.5-VL-7B-Instruct</td>
+    <td>8.3B</td>
+    <td>71.6</td>
+    <td>3h</td>
+    <td>60G</td>
+  </tr>
+  <tr>
+    <td>GLM-4.1V-9B-Thinking</td>
+    <td>10.3B</td>
+    <td>73.6</td>
+    <td>2.63h</td>
+    <td>32G</td>
+  </tr>
+  <tr>
+    <td>MiniCPM-V-4_5</td>
+    <td>8.7B</td>
+    <td>73.5</td>
+    <td>0.26h</td>
+    <td>28G</td>
+  </tr>
+</table>
+ + +OpenCompass 和 Video-MME 均采用 A100*8卡 推理,其中 Video-MME 的推理时间未统计视频抽帧时间 + ### 典型示例
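To put the wall-clock and memory figures in the tables above on a common scale, the relative cost of each model can be read directly off the reported numbers. A small illustrative sketch of that arithmetic follows; the figures are copied from the Video-MME table, while the dictionary layout and variable names are ours:

```python
# Relative inference cost on Video-MME, derived from the reported figures
# (8xA100 inference; times exclude video frame extraction).
video_mme = {
    # model name:             (total hours, peak GPU memory in GB)
    "Qwen2.5-VL-7B-Instruct": (3.00, 60),
    "GLM-4.1V-9B-Thinking":   (2.63, 32),
    "MiniCPM-V 4.5":          (0.26, 28),
}

ref_hours, ref_mem = video_mme["MiniCPM-V 4.5"]
for name, (hours, mem) in video_mme.items():
    print(f"{name:24s} {hours / ref_hours:5.1f}x time  {mem / ref_mem:4.1f}x GPU memory")
# Qwen2.5-VL-7B-Instruct works out to roughly 11.5x the inference time and 2.1x the
# GPU memory of MiniCPM-V 4.5; GLM-4.1V-9B-Thinking to roughly 10.1x time and 1.1x memory.
```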
From b9a95ee0ea5da4676630a3c9b2ec9312d1998d1f Mon Sep 17 00:00:00 2001 From: YuzaChongyi <490083538@qq.com> Date: Fri, 29 Aug 2025 23:58:10 +0800 Subject: [PATCH 3/7] update readme (#985) Co-authored-by: wangchongyi <> --- README.md | 4 ++-- README_zh.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 5d7d269..017886f 100644 --- a/README.md +++ b/README.md @@ -199,7 +199,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github 11h - MiniCPM-V-4_5 + MiniCPM-V 4.5 8.7B 77.0 7.5h @@ -237,7 +237,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github 32G - MiniCPM-V-4_5 + MiniCPM-V 4.5 8.7B 73.5 0.26h diff --git a/README_zh.md b/README_zh.md index eff0b4f..fe9b2f4 100644 --- a/README_zh.md +++ b/README_zh.md @@ -191,7 +191,7 @@ 11h - MiniCPM-V-4_5 + MiniCPM-V 4.5 8.7B 77.0 7.5h @@ -229,7 +229,7 @@ 32G - MiniCPM-V-4_5 + MiniCPM-V 4.5 8.7B 73.5 0.26h From da79d55ad4c1498df085f011d0b28bb27afb1117 Mon Sep 17 00:00:00 2001 From: YuzaChongyi <490083538@qq.com> Date: Sat, 30 Aug 2025 00:02:35 +0800 Subject: [PATCH 4/7] update readme (#986) Co-authored-by: wangchongyi <> --- README.md | 10 +++++----- README_zh.md | 10 +++++----- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 017886f..47e6e62 100644 --- a/README.md +++ b/README.md @@ -201,8 +201,8 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github MiniCPM-V 4.5 8.7B - 77.0 - 7.5h + 77.0 + 7.5h @@ -232,7 +232,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github GLM-4.1V-9B-Thinking 10.3B - 73.6 + 73.6 2.63h 32G @@ -240,8 +240,8 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github MiniCPM-V 4.5 8.7B 73.5 - 0.26h - 28G + 0.26h + 28G diff --git a/README_zh.md b/README_zh.md index fe9b2f4..dfeee03 100644 --- a/README_zh.md +++ b/README_zh.md @@ -193,8 +193,8 @@ MiniCPM-V 4.5 8.7B - 77.0 - 7.5h + 77.0 + 7.5h @@ -224,7 +224,7 @@ GLM-4.1V-9B-Thinking 10.3B - 73.6 + 73.6 2.63h 32G @@ -232,8 +232,8 @@ MiniCPM-V 4.5 8.7B 73.5 - 0.26h - 28G + 0.26h + 28G From 1c89161d65690cd9b78004baeba116c1c38f8f20 Mon Sep 17 00:00:00 2001 From: Yuan Yao Date: Sat, 30 Aug 2025 10:26:52 +0800 Subject: [PATCH 5/7] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 47e6e62..7912558 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ -**A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone** +**A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone** [中文](./README_zh.md) | English From d16875b1205240f2027e95f9127c7bde8c9de39a Mon Sep 17 00:00:00 2001 From: Yuan Yao Date: Sat, 30 Aug 2025 10:52:45 +0800 Subject: [PATCH 6/7] Update README.md --- README.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 7912558..99f2ff1 100644 --- a/README.md +++ b/README.md @@ -28,10 +28,10 @@
-**MiniCPM-V** is a series of efficient end-side multimodal LLMs (MLLMs), which accept images, videos and text as inputs and deliver high-quality text outputs. **MiniCPM-o** additionally takes audio as inputs and provide high-quality speech outputs in an end-to-end fashion. Since February 2024, we have released 7 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in the series currently include: +**MiniCPM-V** is a series of efficient end-side multimodal LLMs (MLLMs), which accept images, videos and text as inputs and deliver high-quality text outputs. **MiniCPM-o** additionally takes audio as inputs and provides high-quality speech outputs in an end-to-end fashion. Since February 2024, we have released 7 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in the series currently include: -- **MiniCPM-V 4.5**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, this model **outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B** in vision-language capabilities, making it the most performant on-device multimodal model in the open-source community. This version brings **new features including efficient high refresh rate and long video understanding (up to 96x compression rate for video tokens), controllable hybrid fast/deep thinking, strong handwritten OCR and complex table/document parsing**. It also advances MiniCPM-V's popular features such as trustworthy behavior, multilingual support and end-side deployability. +- **MiniCPM-V 4.5**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, this model **outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B** in vision-language capabilities, making it the most performant on-device multimodal model in the open-source community. This version brings **new features including efficient high-FPS and long video understanding (up to 96x compression rate for video tokens), controllable hybrid fast/deep thinking, strong handwritten OCR and complex table/document parsing**. It also advances MiniCPM-V's popular features such as trustworthy behavior, multilingual support and end-side deployability. - **MiniCPM-o 2.6**: ⭐️⭐️⭐️ The most capable model in the MiniCPM-o series. With a total of 8B parameters, this end-to-end model **achieves comparable performance to GPT-4o-202405 in vision, speech, and multimodal live streaming**, making it one of the most versatile and performant models in the open-source community. For the new voice mode, MiniCPM-o 2.6 **supports bilingual real-time speech conversation with configurable voices**, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning, role play, etc. Due to its superior token density, MiniCPM-o 2.6 can for the first time **support multimodal live streaming on end-side devices** such as iPad. @@ -83,7 +83,7 @@ * [2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See [here](#inference-with-vllm). -* [2024.06.03] Now, you can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs(12 GB or 16 GB) by distributing the model's layers across multiple GPUs. For more details, Check this [link](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md). +* [2024.06.03] Now, you can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs(12 GB or 16 GB) by distributing the model's layers across multiple GPUs. 
For more details, check this [link](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md). * [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 now fully supports its feature in llama.cpp and Ollama! Please pull the latest code **of our provided forks** ([llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md), [Ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)). GGUF models in various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). MiniCPM-Llama3-V 2.5 series is **not supported by the official repositories yet**, and we are working hard to merge PRs. Please stay tuned! * [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics [here](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics). @@ -91,7 +91,7 @@ * [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage)! * [2024.05.24] We release the MiniCPM-Llama3-V 2.5 [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf), which supports [llama.cpp](#inference-with-llamacpp) inference and provides a 6~8 token/s smooth decoding on mobile phones. Try it now! -* [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmarks evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. Click [here](./docs/compare_with_phi-3_vision.md) to view more details. +* [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. Click [here](./docs/compare_with_phi-3_vision.md) to view more details. * [2024.05.20] We open-soure MiniCPM-Llama3-V 2.5, it has improved OCR capability and supports 30+ languages, representing the first end-side MLLM achieving GPT-4V level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md). Try it now! * [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#inference-with-vllm) to view more details. @@ -100,7 +100,7 @@ * [2024.04.15] MiniCPM-V-2.0 now also supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md) with the SWIFT framework! * [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Click here to view the MiniCPM-V 2.0 technical blog. * [2024.03.14] MiniCPM-V now supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md) with the SWIFT framework. Thanks to [Jintao](https://github.com/Jintao-Huang) for the contribution! -* [2024.03.01] MiniCPM-V now can be deployed on Mac! +* [2024.03.01] MiniCPM-V can now be deployed on Mac! * [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities correspondingly. 
@@ -136,16 +136,16 @@ - 🔥 **State-of-the-art Vision-Language Capability.** MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters. -- 🎬 **Efficient High Refresh Rate and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve 96x compression rate for video tokens, where 6 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means that the model can percieve significantly more video frames without increasing the LLM inference cost. This brings state-of-the-art high refresh rate (up to 10FPS) video understanding and long video understanding capabilities on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc., efficiently. +- 🎬 **Efficient High-FPS and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve 96x compression rate for video tokens, where 6 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means that the model can perceive significantly more video frames without increasing the LLM inference cost. This brings state-of-the-art high-FPS (up to 10FPS) video understanding and long video understanding capabilities on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc., efficiently. - ⚙️ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking for efficient frequent usage with competitive performance, and deep thinking for more complex problem solving. To cover efficiency and performance trade-offs in different user scenarios, this fast/deep thinking mode can be switched in a highly controlled fashion. - 💪 **Strong OCR, Document Parsing and Others.** -Based on [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x less visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. Based on the the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages. +Based on [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**. It also achieves state-of-the-art performance for PDF document parsing capability on OmniDocBench among general MLLMs. 
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages. - 💫 **Easy Usage.** -MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usages! +MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) format quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) on iPhone and iPad, and (7) online web demo on [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usage! ### Key Techniques @@ -155,9 +155,9 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
-- **Architechture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high refresh rate video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer. +- **Architecture: Unified 3D-Resampler for High-density Video Compression.** MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs, and videos, ensuring seamless capability and knowledge transfer. -- **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents in isolated training approaches. We observe the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead. +- **Pre-training: Unified Learning for OCR and Knowledge from Documents.** Existing MLLMs learn OCR capability and knowledge from documents in isolated training approaches. We observe that the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead. - **Post-training: Hybrid Fast/Deep Thinking with Multimodal RL.** MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. 
Incorporated with [RLPR](https://github.com/OpenBMB/RLPR) and [RLAIF-V](https://github.com/RLHF-V/RLAIF-V), it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations. @@ -294,11 +294,11 @@ We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/Mini - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc. -- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding. +- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding. - 💪 **Strong OCR Capability and Others.** Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**. - Based on the the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages. + Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages. - 🚀 **Superior Efficiency.** From 0d8b90df9755a3e94993efdcf0952a53b7206f2a Mon Sep 17 00:00:00 2001 From: Yuan Yao Date: Sun, 31 Aug 2025 11:24:58 +0800 Subject: [PATCH 7/7] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 99f2ff1..9b77675 100644 --- a/README.md +++ b/README.md @@ -247,7 +247,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
-Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME excludes the cost of video frame extraction. +Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes all model-side computation and, for a fair comparison, excludes the external cost of video frame extraction (which depends on the specific frame extraction tool used). ### Examples
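The 96x compression rate quoted above for the unified 3D-Resampler follows from simple token accounting. Below is a minimal sketch of that arithmetic; the 14x14 ViT patch size (so each 448x448 frame yields 32x32 = 1,024 raw patch tokens) and the 2x2 token-merging baseline behind the 1,536-token figure are illustrative assumptions, not details stated in the README.

```python
# Token accounting behind the "96x compression rate" for video tokens.
# Assumptions (not stated in the README): 14x14 ViT patches, and a 2x2
# token-merging baseline for the "normally 1,536 tokens" comparison.
frames_per_group = 6                          # frames jointly compressed by the 3D-Resampler
patch_size = 14                               # assumed ViT patch size
patches_per_side = 448 // patch_size          # 32
raw_tokens_per_frame = patches_per_side ** 2  # 1024 raw patch tokens per 448x448 frame
resampled_tokens = 64                         # tokens per 6-frame group (same budget as one image)

compression_rate = frames_per_group * raw_tokens_per_frame / resampled_tokens
baseline_tokens = frames_per_group * raw_tokens_per_frame // 4  # 2x2 merge baseline

print(compression_rate)  # 96.0 -> the quoted 96x compression rate
print(baseline_tokens)   # 1536 -> the "normally 1,536 tokens" figure for the same 6 frames
```

Because each 6-frame group costs the language model the same 64 tokens as a single image, sampling more frames (up to the advertised 10 FPS) raises only the vision-encoder workload, which is what allows high-FPS and long-video understanding without increasing LLM inference cost.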