diff --git a/README.md b/README.md
index 259c5da..9624ae6 100644
--- a/README.md
+++ b/README.md
@@ -22,15 +22,17 @@
@@ -1874,6 +2427,32 @@ We deploy MiniCPM-V 2.6 on end devices. The demo video is the raw screen recordi
| OmniLMM-12B | [Document](./docs/omnilmm_en.md) |
+## MiniCPM-V & o Cookbook
+
+Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
+
+**Easy Usage Documentation**
+
+Our comprehensive [documentation website](https://minicpm-o.readthedocs.io/en/latest/index.html) presents every recipe in a clear, well-organized manner.
+All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
+
+**Broad User Spectrum**
+
+We support a wide range of users, from individuals to enterprises and researchers.
+
+* **Individuals**: Enjoy effortless inference using [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md) and [Llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md) with minimal setup.
+* **Enterprises**: Achieve high-throughput, scalable performance with [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md).
+* **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
+
+**Versatile Deployment Scenarios**
+
+Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.
+
+* **Web demo**: Launch an interactive multimodal AI web demo with [FastAPI](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/README.md).
+* **Quantized deployment**: Maximize efficiency and minimize resource consumption using [GGUF](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_gguf_quantize.md) and [BNB](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_bnb_quantize.md).
+* **End devices**: Bring powerful AI experiences to [iPhone and iPad](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md), supporting offline and privacy-sensitive applications.
+
+
## Chat with Our Demo on Gradio 🤗
We provide online and local demos powered by Hugging Face Gradio, the most popular model deployment framework nowadays. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.
@@ -1932,6 +2511,10 @@ Open `http://localhost:8000/` in browser and enjoy the vision mode chatbot.
| Model | Device | Memory | Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
+| MiniCPM-V 4.0 | GPU | 9 GB | The latest version, with strong end-side multimodal performance for single-image, multi-image, and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4) |
+| MiniCPM-V 4.0 gguf | CPU | 4 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-gguf) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-gguf) |
+| MiniCPM-V 4.0 int4 | GPU | 5 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-int4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-int4) |
+| MiniCPM-V 4.0 AWQ | GPU | 5 GB | The int4 AWQ-quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-AWQ) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-AWQ) |
| MiniCPM-o 2.6 | GPU | 18 GB | The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
| MiniCPM-o 2.6 gguf | CPU | 8 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
| MiniCPM-o 2.6 int4 | GPU | 9 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
@@ -1960,10 +2543,10 @@ from transformers import AutoModel, AutoTokenizer
torch.manual_seed(100)
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
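The hunk above elides the rest of this example. A minimal sketch of the remaining steps, consistent with the two answers shown below (the question strings mirror the README's own example; treat the exact wording as illustrative):

```python
# First round: ask about the image using the msgs chat format.
question = 'What is the landform in the picture?'
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)

# Second round: append the first answer and a follow-up question to the history.
msgs.append({'role': 'assistant', 'content': [answer]})
msgs.append({'role': 'user', 'content': ['What should I pay attention to when traveling here?']})

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```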
@@ -1991,24 +2574,24 @@ print(answer)
You will get the following output:
```
-"The landform in the picture is a mountain range. The mountains appear to be karst formations, characterized by their steep, rugged peaks and smooth, rounded shapes. These types of mountains are often found in regions with limestone bedrock and are shaped by processes such as erosion and weathering. The reflection of the mountains in the water adds to the scenic beauty of the landscape."
+"The landform in the picture is karst topography, characterized by its unique and striking limestone formations that rise dramatically from the surrounding landscape."
-"When traveling to this scenic location, it's important to pay attention to the weather conditions, as the area appears to be prone to fog and mist, especially during sunrise or sunset. Additionally, ensure you have proper footwear for navigating the potentially slippery terrain around the water. Lastly, respect the natural environment by not disturbing the local flora and fauna."
+"When traveling to this picturesque location, you should pay attention to the weather conditions as they can change rapidly in such areas. It's also important to respect local ecosystems and wildlife by staying on designated paths and not disturbing natural habitats. Additionally, bringing appropriate gear for photography is advisable due to the stunning reflections and lighting during sunrise or sunset."
```
#### Chat with Multiple Images
- Click to view Python code running MiniCPM-o 2.6 with multiple images input.
+ Click to view Python code running MiniCPM-V-4 with multi-image input.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
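
# --- Sketch of the elided remainder of this multi-image example (illustrative): ---
# several images plus a text question go into a single user turn.
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)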
@@ -2026,17 +2609,17 @@ print(answer)
#### In-context Few-shot Learning
- Click to view Python code running MiniCPM-o 2.6 with few-shot input.
+ Click to view Python code running MiniCPM-V-4 with few-shot input.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
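
# --- Sketch of the elided remainder of this few-shot example (illustrative; the extra
# example images and the date answers are hypothetical placeholders): ---
image2 = Image.open('example2.jpg').convert('RGB')
image_test = Image.open('test.jpg').convert('RGB')

# Prior user/assistant turns act as in-context examples for the final query.
msgs = [
    {'role': 'user', 'content': [image1, question]},
    {'role': 'assistant', 'content': ['2023.08.04']},
    {'role': 'user', 'content': [image2, question]},
    {'role': 'assistant', 'content': ['2007.04.24']},
    {'role': 'user', 'content': [image_test, question]},
]

answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)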
@@ -2061,7 +2644,7 @@ print(answer)
#### Chat with Video
- Click to view Python code running MiniCPM-o 2.6 with video input.
+ Click to view Python code running MiniCPM-V-4 with video input.
```python
import torch
@@ -2069,10 +2652,10 @@ from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
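
# --- Sketch of the elided remainder of this video example (illustrative): sample frames
# roughly once per second with decord, cap at MAX_NUM_FRAMES, then pass the frame list
# plus the question as one user turn. The video path is a placeholder. ---
def encode_video(video_path):
    def uniform_sample(seq, n):
        gap = len(seq) / n
        return [seq[int(i * gap + gap / 2)] for i in range(n)]

    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = list(range(0, len(vr), round(vr.get_avg_fps())))  # ~1 frame per second
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

frames = encode_video('video_test.mp4')
msgs = [{'role': 'user', 'content': frames + ['Describe the video']}]

# use_image_id/max_slice_nums follow the repository's video example; treat them as assumptions.
answer = model.chat(msgs=msgs, tokenizer=tokenizer, use_image_id=False, max_slice_nums=2)
print(answer)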
@@ -2135,7 +2718,7 @@ model.tts.float()
-##### Mimick
+##### Mimick
The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.
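The diff elides the full example, so here is a minimal mimick sketch. It assumes `librosa` for 16 kHz audio loading and the TTS-related chat keywords (`use_tts_template`, `generate_audio`, `output_audio_path`) used by the speech examples in this repository; the audio paths are placeholders.

```python
import librosa  # assumed dependency for loading audio at 16 kHz

mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('input_example.wav', sr=16000, mono=True)  # placeholder path
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output_mimick.wav',  # the reconstructed audio is written here
)
```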
@@ -2163,7 +2746,7 @@ res = model.chat(
-##### General Speech Conversation with Configurable Voices
+##### General Speech Conversation with Configurable Voices
A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on an audio prompt. The model mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the character's voice in an end-to-end manner.
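A minimal role-play sketch, assuming the model exposes `get_sys_prompt` with an `audio_roleplay` mode (as in the repository's speech examples) and `librosa` for audio loading; all file paths are placeholders.

```python
# Build a system prompt from a reference voice, then chat with spoken user input.
ref_audio, _ = librosa.load('character_voice.wav', sr=16000, mono=True)  # placeholder path
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True)  # placeholder path
msgs = [sys_prompt, {'role': 'user', 'content': [user_audio]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay.wav',
)
```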
@@ -2205,7 +2788,7 @@ print(res)
-##### Speech Conversation as an AI Assistant
+##### Speech Conversation as an AI Assistant
An enhanced feature of `MiniCPM-o-2.6` is to act as an AI assistant, though with a limited choice of voices. In this mode, `MiniCPM-o-2.6` sounds **less human-like and more like a voice assistant**, and follows instructions more closely. For demos, we suggest using `assistant_female_voice`, `assistant_male_voice`, or `assistant_default_female_voice`. Other voices may work, but are not as stable as the defaults.
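A minimal assistant-mode sketch, assuming an `audio_assistant` mode for `get_sys_prompt` and that the suggested voices ship as reference audio files; the exact asset path is an assumption.

```python
# Assistant mode: same chat interface, but seeded with one of the suggested voices.
ref_audio, _ = librosa.load('assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True)
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')

user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True)  # placeholder path
msgs = [sys_prompt, {'role': 'user', 'content': [user_audio]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant.wav',
)
```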
@@ -2248,7 +2831,7 @@ print(res)
-##### Instruction-to-Speech
+##### Instruction-to-Speech
`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more sample Instruction-to-Speech instructions, see https://voxinstruct.github.io/VoxInstruct/.
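A minimal voice-creation sketch; the `voice_creation` mode name and the instruction text are assumptions modeled on the repository's other speech examples.

```python
# Describe the desired voice in natural language; the model synthesizes a matching voice.
sys_prompt = model.get_sys_prompt(mode='voice_creation', language='en')
instruction = 'Speak like a male charming superstar, radiating confidence and style in every sentence.'
msgs = [sys_prompt, {'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)
```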
@@ -2271,7 +2854,7 @@ res = model.chat(
-##### Voice Cloning
+##### Voice Cloning
`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. In this mode, the model acts like a TTS model.
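A minimal voice-cloning sketch, assuming a `voice_cloning` mode for `get_sys_prompt`; the reference audio path and the text to read are placeholders.

```python
# Zero-shot TTS: clone the voice from a short reference clip, then read arbitrary text.
ref_audio, _ = librosa.load('voice_to_clone.wav', sr=16000, mono=True)  # placeholder path
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_to_read = "MiniCPM-o 2.6 can clone a voice from a short reference clip."
msgs = [sys_prompt, {'role': 'user', 'content': ["Please read the text below.", text_to_read]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)
```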
@@ -2298,7 +2881,7 @@ res = model.chat(
-##### Addressing Various Audio Understanding Tasks
+##### Addressing Various Audio Understanding Tasks
`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.
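A minimal ASR sketch reusing the same chat interface; the task prompt wording and audio path are placeholders, and no audio output is requested here.

```python
# Audio understanding: a task prompt plus raw audio in one user turn; text-only output.
task_prompt = "Please listen to the audio snippet carefully and transcribe the content.\n"
audio_input, _ = librosa.load('speech.wav', sr=16000, mono=True)  # placeholder path
msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
)
print(res)
```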
@@ -2515,11 +3098,11 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
-### Efficient Inference with llama.cpp, ollama, vLLM
+### Efficient Inference with llama.cpp, Ollama, vLLM
See [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md) for more details. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).
-See [our fork of ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more detail. This implementation supports smooth inference of 16~18 token/s on iPad (test environment:iPad Pro + M4).
+See [our fork of Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more details. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).
@@ -2565,31 +3148,6 @@ We now support MiniCPM-V series fine-tuning with the SWIFT framework. SWIFT supp
Best Practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md), [MiniCPM-V 2.6](https://github.com/modelscope/ms-swift/issues/1613).
-## MiniCPM-V & o Cookbook
-
-Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured [cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which empowers developers to rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:
-
-**Easy Usage Documentation**
-
-Our comprehensive [documentation website](https://minicpm-o.readthedocs.io/en/latest/index.html) presents every recipe in a clear, well-organized manner.
-All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
-
-**Broad User Spectrum**
-
-We support a wide range of users, from individuals to enterprises and researchers.
-
-* **Individuals**: Enjoy effortless inference using [Ollama](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/ollama/minicpm-v4_ollama.md) and [Llama.cpp](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/llama.cpp/minicpm-v4_llamacpp.md) with minimal setup.
-* **Enterprises**: Achieve high-throughput, scalable performance with [vLLM](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/vllm/minicpm-v4_vllm.md) and [SGLang](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/deployment/sglang/MiniCPM-v4_sglang.md).
-* **Researchers**: Leverage advanced frameworks including [Transformers](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_full.md), [LLaMA-Factory](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/finetune_llamafactory.md), [SWIFT](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/swift.md), and [Align-anything](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/finetune/align_anything.md) to enable flexible model development and cutting-edge experimentation.
-
-**Versatile Deployment Scenarios**
-
-Our ecosystem delivers optimal solution for a variety of hardware environments and deployment demands.
-
-* **Web demo**: Launch interactive multimodal AI web demo with [FastAPI](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/README.md).
-* **Quantized deployment**: Maximize efficiency and minimize resource consumption using [GGUF](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/gguf/minicpm-v4_gguf_quantize.md) and [BNB](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/quantization/bnb/minicpm-v4_bnb_quantize.md).
-* **Edge devices**: Bring powerful AI experiences to [iPhone and iPad](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md), supporting offline and privacy-sensitive applications.
-
## Awesome work using MiniCPM-V & MiniCPM-o
- [text-extract-api](https://github.com/CatchTheTornado/text-extract-api): Document extraction API using OCR and Ollama-supported models
diff --git a/README_zh.md b/README_zh.md
index 3267c9d..8d04b47 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -13,12 +13,12 @@
微信社区 |
- MiniCPM-V 📖 最佳实践
+ 🍳 使用指南
- MiniCPM-o 2.6 🤗 🤖 | MiniCPM-V 2.6 🤗 🤖 |
+ MiniCPM-V 4.0 🤗 🤖 | MiniCPM-o 2.6 🤗 🤖 | MiniCPM-V 2.6 🤗 🤖 |
📄 技术报告 [中文/English]
@@ -27,6 +27,8 @@
**MiniCPM-o** 是从 MiniCPM-V 升级的最新端侧多模态大模型系列。该系列模型可以以端到端方式,接受图像、视频、文本、音频作为输入,并生成高质量文本和语音输出。自2024年2月以来,我们以实现高性能和高效部署为目标,发布了6个版本的模型。目前系列中最值得关注的模型包括:
+- **MiniCPM-V 4.0**:🚀🚀🚀 MiniCPM-V 系列中最新的高效模型,参数总量为 4B。该模型在 OpenCompass 评测中图像理解能力超越了 GPT-4.1-mini-20250414、Qwen2.5-VL-3B-Instruct 和 InternVL2.5-8B。凭借小巧的参数规模和高效的架构,MiniCPM-V 4.0 是移动端部署的理想选择(例如,在 iPhone 16 Pro Max 上使用开源 iOS 应用时,首 token 延迟低于 2 秒,解码速度超过 17 token/s)。
+
- **MiniCPM-o 2.6**: 🔥🔥🔥 MiniCPM-o 系列的最新、性能最佳模型。总参数量 8B,**视觉、语音和多模态流式能力达到了 GPT-4o-202405 级别**,是开源社区中模态支持最丰富、性能最佳的模型之一。在新的语音模式中,MiniCPM-o 2.6 **支持可配置声音的中英双语语音对话,还具备情感/语速/风格控制、端到端声音克隆、角色扮演等进阶能力**。模型也进一步提升了 MiniCPM-V 2.6 的 **OCR、可信行为、多语言支持和视频理解等视觉能力**。基于其领先的视觉 token 密度,MiniCPM-o 2.6 成为了**首个支持在 iPad 等端侧设备上进行多模态实时流式交互**的多模态大模型。
@@ -37,10 +39,12 @@
#### 📌 置顶
+* [2025.08.02] 🚀🚀🚀 我们开源了 MiniCPM-V 4.0,该模型在图像理解能力上超越了 GPT-4.1-mini-20250414。该模型不仅继承了 MiniCPM-V 2.6 的众多实用特性,还大幅提升了推理效率。我们还同步开源了适用于 iPhone 和 iPad 的 iOS 应用,欢迎试用!
+
* [2025.08.01] 🔥🔥🔥 我们开源了 [MiniCPM-V & o Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook),提供针对不同人群的全场景使用指南,配合最新的[文档网站](https://minicpm-o.readthedocs.io/en/latest/index.html)上手更轻松!
-* [2025.06.20] ⭐️⭐️⭐️ MiniCPM-o 的 ollama [官方仓库](https://ollama.com/openbmb)正式支持 MiniCPM-o 2.6 等模型啦,欢迎[一键使用](https://ollama.com/openbmb/minicpm-o2.6)!
+* [2025.06.20] ⭐️⭐️⭐️ MiniCPM-o 的 Ollama [官方仓库](https://ollama.com/openbmb)正式支持 MiniCPM-o 2.6 等模型啦,欢迎[一键使用](https://ollama.com/openbmb/minicpm-o2.6)!
* [2025.03.01] 🚀🚀🚀 MiniCPM-o 系列的对齐技术 RLAIF-V 被 CVPR 2025 接收了!其[代码](https://github.com/RLHF-V/RLAIF-V)、[数据](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset)、[论文](https://arxiv.org/abs/2405.17220)均已开源。
@@ -48,7 +52,7 @@
* [2025.01.23] 💡💡💡 MiniCPM-o 2.6 现在已被北大团队开发的 [Align-Anything](https://github.com/PKU-Alignment/align-anything),一个用于对齐全模态大模型的框架集成,支持 DPO 和 SFT 在视觉和音频模态上的微调。欢迎试用!
-* [2025.01.19] 📢 **注意!** 我们正在努力将 MiniCPM-o 2.6 的支持合并到 llama.cpp、ollama、vLLM 的官方仓库,但还未完成。请大家暂时先使用我们提供的 fork 来进行部署:[llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md)、[ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md)、[vllm](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm)。 **合并完成前,使用官方仓库可能会导致不可预期的问题**。
+* [2025.01.19] 📢 **注意!** 我们正在努力将 MiniCPM-o 2.6 的支持合并到 llama.cpp、Ollama、vLLM 的官方仓库,但还未完成。请大家暂时先使用我们提供的 fork 来进行部署:[llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md)、[Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md)、[vllm](https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm)。 **合并完成前,使用官方仓库可能会导致不可预期的问题**。
* [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o 在 GitHub Trending 上登顶, Hugging Face Trending 上也达到了第二!
@@ -76,7 +80,7 @@
* [2024.07.19] MiniCPM-Llama3-V 2.5 现已支持[vLLM](#vllm-部署-) !
* [2024.06.03] 现在,你可以利用多张低显存显卡(12G/16G)进行GPU串行推理。详情请参见该[文档](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md)配置。
* [2024.05.28] 💫 我们现在支持 MiniCPM-Llama3-V 2.5 的 LoRA 微调,更多内存使用统计信息可以在[这里](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics)找到。
-* [2024.05.28] 💥 MiniCPM-Llama3-V 2.5 现在在 llama.cpp 和 ollama 中完全支持其功能!**请拉取我们最新的 fork 来使用**:[llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) & [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)。我们还发布了各种大小的 GGUF 版本,请点击[这里](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main)查看。请注意,**目前官方仓库尚未支持 MiniCPM-Llama3-V 2.5**,我们也正积极推进将这些功能合并到 llama.cpp & ollama 官方仓库,敬请关注!
+* [2024.05.28] 💥 MiniCPM-Llama3-V 2.5 现在在 llama.cpp 和 Ollama 中完全支持其功能!**请拉取我们最新的 fork 来使用**:[llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) & [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)。我们还发布了各种大小的 GGUF 版本,请点击[这里](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main)查看。请注意,**目前官方仓库尚未支持 MiniCPM-Llama3-V 2.5**,我们也正积极推进将这些功能合并到 llama.cpp & ollama 官方仓库,敬请关注!
* [2024.05.25] MiniCPM-Llama3-V 2.5 [支持流式输出和自定义系统提示词](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage)了,欢迎试用!
* [2024.05.24] 我们开源了 MiniCPM-Llama3-V 2.5 [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf),支持 [llama.cpp](#llamacpp-部署) 推理!实现端侧 6-8 tokens/s 的流畅解码,欢迎试用!
* [2024.05.23] 🔍 我们添加了Phi-3-vision-128k-instruct 与 MiniCPM-Llama3-V 2.5的全面对比,包括基准测试评估、多语言能力和推理效率 🌟📊🌍🚀。点击[这里](./docs/compare_with_phi-3_vision.md)查看详细信息。
@@ -94,6 +98,7 @@
## 目录
+- [MiniCPM-V 4.0](#minicpm-v-40)
- [MiniCPM-o 2.6](#minicpm-o-26)
- [MiniCPM-V 2.6](#minicpm-v-26)
- [Chat with Our Demo on Gradio 🤗](#chat-with-our-demo-on-gradio-)
@@ -104,19 +109,573 @@
- [少样本上下文对话](#少样本上下文对话)
- [视频对话](#视频对话)
- [语音对话](#语音对话)
- - [Mimick](#mimick)
- - [可配置声音的语音对话](#可配置声音的语音对话)
- - [更多语音任务](#更多语音任务)
- [多模态流式交互](#多模态流式交互)
- [多卡推理](#多卡推理)
- [Mac 推理](#mac-推理)
- - [基于 llama.cpp、ollama、vLLM 的高效推理](#基于-llamacppollamavllm-的高效推理)
+ - [基于 llama.cpp、Ollama、vLLM 的高效推理](#基于-llamacppollamavllm-的高效推理)
- [微调](#微调)
- [MiniCPM-V \& o 使用手册](#minicpm-v--o-使用手册)
- [基于 MiniCPM-V \& MiniCPM-o 的更多项目](#基于-minicpm-v--minicpm-o-的更多项目)
- [FAQs](#faqs)
- [模型局限性](#模型局限性)
+
+## MiniCPM-V 4.0
+
+MiniCPM-V 4.0 是 MiniCPM-V 系列中的最新模型。该模型基于 SigLIP2-400M 和 MiniCPM4-3B 构建,参数总量为 4.1B。它延续了 MiniCPM-V 2.6 在单图、多图和视频理解方面的强大能力,同时大幅提升了推理效率。MiniCPM-V 4.0 的主要特点包括:
+
+- 🔥 **领先的视觉能力。**
+MiniCPM-V 4.0 在 OpenCompass 上获得了平均 69.0 的高分,超越了 MiniCPM-V 2.6(8.1B,得分 65.2)、Qwen2.5-VL-3B-Instruct(3.8B,得分 64.5)和**广泛使用的闭源模型 GPT-4.1-mini-20250414**。在多图理解与视频理解任务上,MiniCPM-V 4.0 也表现出色。
+
+- 🚀 **卓越的效率。**
+MiniCPM-V 4.0 专为端侧设备优化,**可在 iPhone 16 Pro Max 上流畅运行,首 token 延迟低至 2 秒,解码速度达 17.9 tokens/s**,且无发热问题。MiniCPM-V 4.0 在并发请求场景下表现出领先的吞吐率指标。
+
+- 💫 **易于使用。**
+MiniCPM-V 4.0 支持多种推理方式,包括 **llama.cpp、Ollama、vLLM、SGLang、LLaMA-Factory 及本地 Web Demo 等**。我们还开源了可以在 iPhone 和 iPad 运行的 iOS App。欢迎参考我们开源的 **结构清晰的[使用手册](https://github.com/OpenSQZ/MiniCPM-V-CookBook)** 玩转 MiniCPM-V 4.0,其中涵盖了详细的部署指南和真实示例。
+
+
+### 性能评估
+
+
+
+<details>
+<summary>点击查看在 OpenCompass 上的单图理解能力评测结果。</summary>
+
+| Model | Size | OpenCompass | OCRBench | MathVista | HallusionBench | MMMU | MMVet | MMBench V1.1 | MMStar | AI2D |
+|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| **Proprietary** | | | | | | | | | | |
+| GPT-4v-20240409 | - | 63.5 | 656 | 55.2 | 43.9 | 61.7 | 67.5 | 79.8 | 56.0 | 78.6 |
+| Gemini-1.5-Pro | - | 64.5 | 754 | 58.3 | 45.6 | 60.6 | 64.0 | 73.9 | 59.1 | 79.1 |
+| GPT-4.1-mini-20250414 | - | 68.9 | 840 | 70.9 | 49.3 | 55.0 | 74.3 | 80.9 | 60.9 | 76.0 |
+| Claude 3.5 Sonnet-20241022 | - | 70.6 | 798 | 65.3 | 55.5 | 66.4 | 70.1 | 81.7 | 65.1 | 81.2 |
+| **Open-source** | | | | | | | | | | |
+| Qwen2.5-VL-3B-Instruct | 3.8B | 64.5 | 828 | 61.2 | 46.6 | 51.2 | 60.0 | 76.8 | 56.3 | 81.4 |
+| InternVL2.5-4B | 3.7B | 65.1 | 820 | 60.8 | 46.6 | 51.8 | 61.5 | 78.2 | 58.7 | 81.4 |
+| Qwen2.5-VL-7B-Instruct | 8.3B | 70.9 | 888 | 68.1 | 51.9 | 58.0 | 69.7 | 82.2 | 64.1 | 84.3 |
+| InternVL2.5-8B | 8.1B | 68.1 | 821 | 64.5 | 49.0 | 56.2 | 62.8 | 82.5 | 63.2 | 84.6 |
+| MiniCPM-V-2.6 | 8.1B | 65.2 | 852 | 60.8 | 48.1 | 49.8 | 60.0 | 78.0 | 57.5 | 82.1 |
+| MiniCPM-o-2.6 | 8.7B | 70.2 | 889 | 73.3 | 51.1 | 50.9 | 67.2 | 80.6 | 63.3 | 86.1 |
+| MiniCPM-V-4.0 | 4.1B | 69.0 | 894 | 66.9 | 50.8 | 51.2 | 68.0 | 79.7 | 62.8 | 82.9 |
+
+</details>
+
+<details>
+<summary>点击查看在图表理解、文档理解、数学推理、幻觉等领域的评测结果。</summary>
+
+| Model | Size | ChartQA | MME | RealWorldQA | TextVQA | DocVQA | MathVision | DynaMath | WeMath | Obj Hal CHAIRs↓ | Obj Hal CHAIRi↓ | MM Hal score avg@3↑ | MM Hal hall rate avg@3↓ |
+|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| **Proprietary** | | | | | | | | | | | | | |
+| GPT-4v-20240409 | - | 78.5 | 1927 | 61.4 | 78.0 | 88.4 | - | - | - | - | - | - | - |
+| Gemini-1.5-Pro | - | 87.2 | - | 67.5 | 78.8 | 93.1 | 41.0 | 31.5 | 50.5 | - | - | - | - |
+| GPT-4.1-mini-20250414 | - | - | - | - | - | - | 45.3 | 47.7 | - | - | - | - | - |
+| Claude 3.5 Sonnet-20241022 | - | 90.8 | - | 60.1 | 74.1 | 95.2 | 35.6 | 35.7 | 44.0 | - | - | - | - |
+| **Open-source** | | | | | | | | | | | | | |
+| Qwen2.5-VL-3B-Instruct | 3.8B | 84.0 | 2157 | 65.4 | 79.3 | 93.9 | 21.9 | 13.2 | 22.9 | 18.3 | 10.8 | 3.9 | 33.3 |
+| InternVL2.5-4B | 3.7B | 84.0 | 2338 | 64.3 | 76.8 | 91.6 | 18.4 | 15.2 | 21.2 | 13.7 | 8.7 | 3.2 | 46.5 |
+| Qwen2.5-VL-7B-Instruct | 8.3B | 87.3 | 2347 | 68.5 | 84.9 | 95.7 | 25.4 | 21.8 | 36.2 | 13.3 | 7.9 | 4.1 | 31.6 |
+| InternVL2.5-8B | 8.1B | 84.8 | 2344 | 70.1 | 79.1 | 93.0 | 17.0 | 9.4 | 23.5 | 18.3 | 11.6 | 3.6 | 37.2 |
+| MiniCPM-V-2.6 | 8.1B | 79.4 | 2348 | 65.0 | 80.1 | 90.8 | 17.5 | 9.0 | 20.4 | 7.3 | 4.7 | 4.0 | 29.9 |
+| MiniCPM-o-2.6 | 8.7B | 86.9 | 2372 | 68.1 | 82.0 | 93.5 | 21.7 | 10.4 | 25.2 | 6.3 | 3.4 | 4.1 | 31.3 |
+| MiniCPM-V-4.0 | 4.1B | 84.4 | 2298 | 68.5 | 80.8 | 92.9 | 20.7 | 14.2 | 32.7 | 6.3 | 3.5 | 4.1 | 29.2 |
+
+</details>
+
+<details>
+<summary>点击查看多图和视频理解能力的评测结果。</summary>
+
+| Model | Size | Mantis | Blink | Video-MME (wo subs) | Video-MME (w subs) |
+|:--|:--:|:--:|:--:|:--:|:--:|
+| **Proprietary** | | | | | |
+| GPT-4v-20240409 | - | 62.7 | 54.6 | 59.9 | 63.3 |
+| Gemini-1.5-Pro | - | - | 59.1 | 75.0 | 81.3 |
+| GPT-4o-20240513 | - | - | 68.0 | 71.9 | 77.2 |
+| **Open-source** | | | | | |
+| Qwen2.5-VL-3B-Instruct | 3.8B | - | 47.6 | 61.5 | 67.6 |
+| InternVL2.5-4B | 3.7B | 62.7 | 50.8 | 62.3 | 63.6 |
+| Qwen2.5-VL-7B-Instruct | 8.3B | - | 56.4 | 65.1 | 71.6 |
+| InternVL2.5-8B | 8.1B | 67.7 | 54.8 | 64.2 | 66.9 |
+| MiniCPM-V-2.6 | 8.1B | 69.1 | 53.0 | 60.9 | 63.6 |
+| MiniCPM-o-2.6 | 8.7B | 71.9 | 56.7 | 63.9 | 69.6 |
+| MiniCPM-V-4.0 | 4.1B | 71.4 | 54.0 | 61.2 | 65.8 |
+
+</details>
+
+### 典型示例
+
+
+
+
+
+
+
+我们在 iPhone 16 Pro Max 上部署了 MiniCPM-V 4.0 [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md),并录制了以下演示录屏,视频未经加速等任何编辑:
+
+
+
## MiniCPM-o 2.6
@@ -1874,6 +2433,10 @@ python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py
| 模型 | 设备 | 资源 | 简介 | 下载链接 |
|:--------------|:-:|:----------:|:-------------------|:---------------:|
+| MiniCPM-V 4.0 | GPU | 9 GB | 提供出色的端侧单图、多图、视频理解能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4) |
+| MiniCPM-V 4.0 gguf | CPU | 4 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-gguf) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-gguf) |
+| MiniCPM-V 4.0 int4 | GPU | 5 GB | int4 量化版,更低的显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-int4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-int4) |
+| MiniCPM-V 4.0 AWQ | GPU | 5 GB | int4 AWQ 量化版,更低的显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-V-4-AWQ) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-V-4-AWQ) |
| MiniCPM-o 2.6 | GPU | 18 GB | 最新版本,提供端侧 GPT-4o 级的视觉、语音、多模态流式交互能力。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6) |
| MiniCPM-o 2.6 gguf | CPU | 8 GB | gguf 版本,更低的内存占用和更高的推理效率。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-gguf) |
| MiniCPM-o 2.6 int4 | GPU | 9 GB | int4 量化版,更低的显存占用。 | [🤗](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) [🤖](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6-int4) |
@@ -1903,10 +2466,10 @@ from transformers import AutoModel, AutoTokenizer
torch.manual_seed(100)
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
@@ -1934,24 +2497,24 @@ print(answer)
你可以得到如下推理结果:
```
-"The landform in the picture is a mountain range. The mountains appear to be karst formations, characterized by their steep, rugged peaks and smooth, rounded shapes. These types of mountains are often found in regions with limestone bedrock and are shaped by processes such as erosion and weathering. The reflection of the mountains in the water adds to the scenic beauty of the landscape."
+"The landform in the picture is karst topography, characterized by its unique and striking limestone formations that rise dramatically from the surrounding landscape."
-"When traveling to this scenic location, it's important to pay attention to the weather conditions, as the area appears to be prone to fog and mist, especially during sunrise or sunset. Additionally, ensure you have proper footwear for navigating the potentially slippery terrain around the water. Lastly, respect the natural environment by not disturbing the local flora and fauna."
+"When traveling to this picturesque location, you should pay attention to the weather conditions as they can change rapidly in such areas. It's also important to respect local ecosystems and wildlife by staying on designated paths and not disturbing natural habitats. Additionally, bringing appropriate gear for photography is advisable due to the stunning reflections and lighting during sunrise or sunset."
```
#### 多图对话
- 点击查看 MiniCPM-o 2.6 多图输入的 Python 代码。
+ 点击查看 MiniCPM-V-4 多图输入的 Python 代码。
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
@@ -1969,17 +2532,17 @@ print(answer)
#### 少样本上下文对话
- 点击查看 MiniCPM-o 2.6 少样本上下文对话的 Python 代码。
+ 点击查看 MiniCPM-V-4 少样本上下文对话的 Python 代码。
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
@@ -2004,7 +2567,7 @@ print(answer)
#### 视频对话
- 点击查看 MiniCPM-o 2.6 视频输入的 Python 代码。
+ 点击查看 MiniCPM-V-4 视频输入的 Python 代码。
```python
import torch
@@ -2012,10 +2575,10 @@ from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
-model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
-tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
@@ -2076,7 +2639,7 @@ model.tts.float()
-##### Mimick
+##### Mimick
点击查看 MiniCPM-o 2.6 端到端语音理解生成的 Python 代码。
@@ -2099,7 +2662,7 @@ res = model.chat(
-##### 可配置声音的语音对话
+##### 可配置声音的语音对话
点击查看个性化配置 MiniCPM-o 2.6 对话声音的 Python 代码。
```python
@@ -2145,7 +2708,7 @@ print(res)
-##### 更多语音任务
+##### 更多语音任务
点击查看 MiniCPM-o 2.6 完成更多语音任务的 Python 代码。
@@ -2398,11 +2961,11 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
-### 基于 llama.cpp、ollama、vLLM 的高效推理
+### 基于 llama.cpp、Ollama、vLLM 的高效推理
llama.cpp 用法请参考[我们的fork llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md), 在iPad上可以支持 16~18 token/s 的流畅推理(测试环境:iPad Pro + M4)。
-ollama 用法请参考[我们的fork ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md), 在iPad上可以支持 16~18 token/s 的流畅推理(测试环境:iPad Pro + M4)。
+Ollama 用法请参考[我们的fork Ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md), 在iPad上可以支持 16~18 token/s 的流畅推理(测试环境:iPad Pro + M4)。
点击查看, vLLM 现已官方支持MiniCPM-o 2.6、MiniCPM-V 2.6、MiniCPM-Llama3-V 2.5 和 MiniCPM-V 2.0。