diff --git a/README.md b/README.md index ce1a180..38faa7e 100644 --- a/README.md +++ b/README.md @@ -2,23 +2,25 @@ -**A GPT-4V Level Multimodal LLM on Your Phone** +**A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone** [中文](./README_zh.md) | English +Join our 💬 WeChat + +

- MiniCPM-Llama3-V 2.5 🤗 🤖 | - MiniCPM-V 2.0 🤗 🤖 | - Technical Blog + MiniCPM-V 2.6 🤗 🤖 | MiniCPM-Llama3-V 2.5 🤗 🤖 | + MiniCPM-Llama3-V 2.5 Technical Report

-**MiniCPM-V** is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image and text as inputs and provide high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in this series currently include: +**MiniCPM-V** is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image, video and text as inputs and provide high-quality text outputs. Since February 2024, we have released 5 versions of the model, aiming to achieve **strong performance and efficient deployment**. The most notable models in this series currently include: -- **MiniCPM-Llama3-V 2.5**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3** in overall performance. Equipped with the enhanced OCR and instruction-following capability, the model can also support multimodal conversation for **over 30 languages** including English, Chinese, French, Spanish, German etc. With help of quantization, compilation optimizations, and several efficient inference techniques on CPUs and NPUs, MiniCPM-Llama3-V 2.5 can be **efficiently deployed on end-side devices**. +- **MiniCPM-V 2.6**: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model **surpasses GPT-4V in single image, multi-image and video understanding**. It outperforms **GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet** in single image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Due to its superior token density, MiniCPM-V 2.6 can for the first time support real-time video understanding on end-side devices such as iPad. - **MiniCPM-V 2.0**: The lightest model in the MiniCPM-V series. With 2B parameters, it surpasses larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It can accept image inputs of any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving comparable performance with Gemini Pro in understanding scene-text and matches GPT-4V in low hallucination rates. @@ -27,18 +29,26 @@ #### 📌 Pinned -* [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 now fully supports its feature in [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5)! Please pull the latest code for llama.cpp & ollama. We also release GGUF in various sizes [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). FAQ list for Ollama usage is comming within a day. Please stay tuned! +* [2024.08.14] MiniCPM-V 2.6 now also supports [fine-tuning](https://github.com/modelscope/ms-swift/issues/1613) with the SWIFT framework! +* [2024.08.10] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported by [official](https://github.com/ggerganov/llama.cpp) llama.cpp! GGUF models of various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf). +* [2024.08.06] 🔥🔥🔥 We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single image, multi-image and video understanding. It advances popular features of MiniCPM-Llama3-V 2.5, and can support real-time video understanding on iPad. Try it now! 
+* [2024.08.03] MiniCPM-Llama3-V 2.5 technical report is released! See [here](https://arxiv.org/abs/2408.01800). +* [2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See [here](#inference-with-vllm). +* [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 now fully supports its feature in llama.cpp and ollama! Please pull the latest code **of our provided forks** ([llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md), [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)). GGUF models in various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). MiniCPM-Llama3-V 2.5 series is **not supported by the official repositories yet**, and we are working hard to merge PRs. Please stay tuned! * [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics [here](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#model-fine-tuning-memory-usage-statistics). * [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmarks evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. Click [here](./docs/compare_with_phi-3_vision.md) to view more details. * [2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio’s official account, is available [here](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5). Come and try it out!
+
+Click to view more news. +* [2024.06.03] Now, you can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs(12 GB or 16 GB) by distributing the model's layers across multiple GPUs. For more details, Check this [link](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md). * [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage)! * [2024.05.24] We release the MiniCPM-Llama3-V 2.5 [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf), which supports [llama.cpp](#inference-with-llamacpp) inference and provides a 6~8 token/s smooth decoding on mobile phones. Try it now! * [2024.05.20] We open-soure MiniCPM-Llama3-V 2.5, it has improved OCR capability and supports 30+ languages, representing the first end-side MLLM achieving GPT-4V level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md). Try it now! -* [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#vllm) to view more details. +* [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#inference-with-vllm) to view more details. * [2024.04.18] We create a HuggingFace Space to host the demo of MiniCPM-V 2.0 at [here](https://huggingface.co/spaces/openbmb/MiniCPM-V-2)! * [2024.04.17] MiniCPM-V-2.0 supports deploying [WebUI Demo](#webui-demo) now! * [2024.04.15] MiniCPM-V-2.0 now also supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md) with the SWIFT framework! @@ -46,30 +56,803 @@ * [2024.03.14] MiniCPM-V now supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md) with the SWIFT framework. Thanks to [Jintao](https://github.com/Jintao-Huang) for the contribution! * [2024.03.01] MiniCPM-V now can be deployed on Mac! * [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities correspondingly. +
## Contents +- [MiniCPM-V 2.6](#minicpm-v-26) - [MiniCPM-Llama3-V 2.5](#minicpm-llama3-v-25) - [MiniCPM-V 2.0](#minicpm-v-20) -- [Online Demo](#online-demo) +- [Chat with Our Demo on Gradio 🤗](#chat-with-our-demo-on-gradio-) - [Install](#install) - [Inference](#inference) - [Model Zoo](#model-zoo) - [Multi-turn Conversation](#multi-turn-conversation) + - [Chat with multiple images](#chat-with-multiple-images) + - [In-context few-shot learning](#in-context-few-shot-learning) + - [Chat with video](#chat-with-video) + - [Inference on Multiple GPUs](#inference-on-multiple-gpus) - [Inference on Mac](#inference-on-mac) - [Deployment on Mobile Phone](#deployment-on-mobile-phone) - - [WebUI Demo](#webui-demo) - [Inference with llama.cpp](#inference-with-llamacpp) + - [Inference with ollama](#inference-with-ollama) - [Inference with vLLM](#inference-with-vllm) - [Fine-tuning](#fine-tuning) -- [TODO](#todo) -- [🌟 Star History](#-star-history) -- [Citation](#citation) +- [FAQs](#faqs) + + +## MiniCPM-V 2.6 + +**MiniCPM-V 2.6** is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include: + +- 🔥 **Leading Performance.** + MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. + +- 🖼️ **Multi Image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform **conversation and reasoning over multiple images**. It achieves **state-of-the-art performance** on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability. + +- 🎬 **Video Understanding.** MiniCPM-V 2.6 can also **accept video inputs**, performing conversation and providing dense captions for spatial-temporal information. It outperforms **GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B** on Video-MME with/without subtitles. + +- 💪 **Strong OCR Capability and Others.** + MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro**. + Based on the the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports **multilingual capabilities** on English, Chinese, German, French, Italian, Korean, etc. + + +- 🚀 **Superior Efficiency.** + In addition to its friendly size, MiniCPM-V 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support **real-time video understanding** on end-side devices such as iPad. 
+ +- 💫 **Easy Usage.** +MiniCPM-V 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#inference-with-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web [demo](https://huggingface.co/spaces/openbmb/MiniCPM-V-2_6). + +### Evaluation +
+Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench. +
| Model | Size | Token Density+ | OpenCompass | MME | MMVet | OCRBench | MMMU val | MathVista mini | MMB1.1 test | AI2D | TextVQA val | DocVQA test | HallusionBench | Object HalBench |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| **Proprietary** | | | | | | | | | | | | | | |
| GPT-4o | - | 1088 | 69.9 | 2328.7 | 69.1 | 736 | 69.2 | 61.3 | 82.2 | 84.6 | - | 92.8 | 55.0 | 17.6 |
| Claude 3.5 Sonnet | - | 750 | 67.9 | 1920.0 | 66.0 | 788 | 65.9 | 61.6 | 78.5 | 80.2 | - | 95.2 | 49.9 | 13.8 |
| Gemini 1.5 Pro | - | - | 64.4 | 2110.6 | 64.0 | 754 | 60.6 | 57.7 | 73.9 | 79.1 | 73.5 | 86.5 | 45.6 | - |
| GPT-4o mini | - | 1088 | 64.1 | 2003.4 | 66.9 | 785 | 60.0 | 52.4 | 76.0 | 77.8 | - | - | 46.1 | 12.4 |
| GPT-4V | - | 1088 | 63.5 | 2070.2 | 67.5 | 656 | 61.7 | 54.7 | 79.8 | 78.6 | 78.0 | 87.2 | 43.9 | 14.2 |
| Step-1V | - | - | 59.5 | 2206.4 | 63.3 | 625 | 49.9 | 44.8 | 78.0 | 79.2 | 71.6 | - | 48.4 | - |
| Qwen-VL-Max | - | 784 | 58.3 | 2281.7 | 61.8 | 684 | 52.0 | 43.4 | 74.6 | 75.7 | 79.5 | 93.1 | 41.2 | 13.4 |
| **Open-source** | | | | | | | | | | | | | | |
| LLaVA-NeXT-Yi-34B | 34B | 157 | 55.0 | 2006.5 | 50.7 | 574 | 48.8 | 40.4 | 77.8 | 78.9 | 69.3 | - | 34.8 | 12.6 |
| Mini-Gemini-HD-34B | 34B | 157 | - | 2141.0 | 59.3 | 518 | 48.0 | 43.3 | - | 80.5 | 74.1 | 78.9 | - | - |
| Cambrian-34B | 34B | 1820 | 58.3 | 2049.9 | 53.2 | 591 | 50.4 | 50.3 | 77.8 | 79.5 | 76.7 | 75.5 | 41.6 | 14.7 |
| GLM-4V-9B | 13B | 784 | 59.1 | 2018.8 | 58.0 | 776 | 46.9 | 51.1 | 67.9 | 71.2 | - | - | 45.0 | - |
| InternVL2-8B | 8B | 706 | 64.1 | 2215.1 | 54.3 | 794 | 51.2 | 58.3 | 79.4 | 83.6 | 77.4 | 91.6 | 45.0 | 21.3 |
| MiniCPM-Llama3-V 2.5 | 8B | 1882 | 58.8 | 2024.6 | 52.8 | 725 | 45.8 | 54.3 | 72.0 | 78.4 | 76.6 | 84.8 | 42.4 | 10.3 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 2348.4* | 60.0 | 852* | 49.8* | 60.6 | 78.0 | 82.1 | 80.1 | 90.8 | 48.1* | 8.2 |
+* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set. + ++ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens. + +Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation. + +
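As a worked example of the Token Density column (a minimal sketch — the 1.8M-pixel resolution and the 640 visual tokens are the figures quoted in this README for MiniCPM-V 2.6, not values probed from the model):

```python
# Token density = number of pixels at maximum resolution / number of visual tokens.
max_pixels = 1344 * 1344       # ~1.8 million pixels, the maximum resolution cited above
visual_tokens = 640            # visual tokens MiniCPM-V 2.6 produces for such an image
token_density = max_pixels / visual_tokens
print(round(token_density))    # -> 2822, matching the Token Density entry in the table
```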
+Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB. +
| Model | Size | Mantis Eval | BLINK val | Mathverse mv | Sciverse mv | MIRB |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| **Proprietary** | | | | | | |
| GPT-4V | - | 62.7 | 54.6 | 60.3 | 66.9 | 53.1 |
| LLaVA-NeXT-Interleave-14B | 14B | 66.4 | 52.6 | 32.7 | 30.2 | - |
| **Open-source** | | | | | | |
| Emu2-Chat | 37B | 37.8 | 36.2 | - | 27.2 | - |
| CogVLM | 17B | 45.2 | 41.1 | - | - | - |
| VPG-C | 7B | 52.4 | 43.1 | 24.3 | 23.1 | - |
| VILA 8B | 8B | 51.2 | 39.3 | - | 36.5 | - |
| InternLM-XComposer-2.5 | 8B | 53.1* | 48.9 | 32.1* | - | 42.5 |
| InternVL2-8B | 8B | 59.0* | 50.9 | 30.5* | 34.4* | 56.9* |
| MiniCPM-V 2.6 | 8B | 69.1 | 53.0 | 84.9 | 74.9 | 53.8 |
+* We evaluate the officially released checkpoint by ourselves. +
+Click to view video results on Video-MME and Video-ChatGPT. +
| Model | Size | Video-MME w/o subs | Video-MME w subs | Video-ChatGPT Correctness | Video-ChatGPT Detail | Video-ChatGPT Context | Video-ChatGPT Temporal | Video-ChatGPT Consistency |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| **Proprietary** | | | | | | | | |
| Claude 3.5 Sonnet | - | 60.0 | - | - | - | - | - | - |
| GPT-4V | - | 59.9 | - | - | - | - | - | - |
| **Open-source** | | | | | | | | |
| LLaVA-NeXT-7B | 7B | - | - | 3.39 | 3.29 | 3.92 | 2.60 | 3.12 |
| LLaVA-NeXT-34B | 34B | - | - | 3.29 | 3.23 | 3.83 | 2.51 | 3.47 |
| CogVLM2-Video | 12B | - | - | 3.49 | 3.46 | 3.23 | 2.98 | 3.64 |
| LongVA | 7B | 52.4 | 54.3 | 3.05 | 3.09 | 3.77 | 2.44 | 3.64 |
| InternVL2-8B | 8B | 54.0 | 56.9 | - | - | - | - | - |
| InternLM-XComposer-2.5 | 8B | 55.8 | - | - | - | - | - | - |
| LLaVA-NeXT-Video | 32B | 60.2 | 63.0 | 3.48 | 3.37 | 3.95 | 2.64 | 3.28 |
| MiniCPM-V 2.6 | 8B | 60.9 | 63.6 | 3.59 | 3.28 | 3.93 | 2.73 | 3.62 |
+Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA. +
| Model | Size | Shot | TextVQA val | VizWiz test-dev | VQAv2 test-dev | OK-VQA val |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| Flamingo | 80B | 0* | 35.0 | 31.6 | 56.3 | 40.6 |
| | | 4 | 36.5 | 39.6 | 63.1 | 57.4 |
| | | 8 | 37.3 | 44.8 | 65.6 | 57.5 |
| IDEFICS | 80B | 0* | 30.9 | 36.0 | 60.0 | 45.2 |
| | | 4 | 34.3 | 40.4 | 63.6 | 52.4 |
| | | 8 | 35.7 | 46.1 | 64.8 | 55.1 |
| OmniCorpus | 7B | 0* | 43.0 | 49.8 | 63.2 | 45.5 |
| | | 4 | 45.4 | 51.3 | 64.5 | 46.5 |
| | | 8 | 45.6 | 52.2 | 64.7 | 46.6 |
| Emu2 | 37B | 0 | 26.4 | 40.4 | 33.5 | 26.7 |
| | | 4 | 48.2 | 54.6 | 67.0 | 53.2 |
| | | 8 | 49.3 | 54.7 | 67.8 | 54.1 |
| MM1 | 30B | 0 | 26.2 | 40.4 | 48.9 | 26.7 |
| | | 8 | 49.3 | 54.7 | 70.9 | 54.1 |
| MiniCPM-V 2.6+ | 8B | 0 | 43.9 | 33.8 | 45.4 | 23.9 |
| | | 4 | 63.6 | 60.5 | 65.5 | 50.1 |
| | | 8 | 64.6 | 63.4 | 68.2 | 51.4 |
+* denotes zero image shot and two additional text shots following Flamingo. + ++ We evaluate the pretraining ckpt without SFT. +
+ +### Examples + +
(Example images: bike, menu, code, mem, medal, and more cases.)
+We deploy MiniCPM-V 2.6 on end devices. The demo video is the raw screen recording on an iPad Pro without editing.

## MiniCPM-Llama3-V 2.5 +
+Click to view more details of MiniCPM-Llama3-V 2.5 + **MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include: - 🔥 **Leading Performance.** @@ -87,6 +870,9 @@ - 🚀 **Efficient Deployment.** MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**. +- 💫 **Easy Usage.** +MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5). + ### Evaluation
@@ -385,20 +1171,8 @@

-We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is the raw screen recording on a Xiaomi 14 Pro without edition. +
- -

- - -

-
- - -

- -

-
## MiniCPM-V 2.0 @@ -455,9 +1229,29 @@ We deploy MiniCPM-V 2.0 on end devices. The demo video is the raw screen recordi | OmniLMM-12B | [Document](./omnilmm_en.md) | +## Chat with Our Demo on Gradio 🤗 + +We provide online and local demos powered by Hugging Face Gradio , the most popular model deployment framework nowadays. It supports streaming outputs, progress bars, queuing, alerts, and other useful features. + + +### Online Demo + +Click here to try out the online demo of [MiniCPM-V 2.6](https://huggingface.co/spaces/openbmb/MiniCPM-V-2_6) | [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2). + +### Local WebUI Demo + +You can easily build your own local WebUI demo with Gradio using the following commands. + +```shell +pip install -r requirements.txt +``` + +```shell +# For NVIDIA GPUs, run: +python web_demo_2.6.py --device cuda + +``` -## Online Demo -Click here to try out the Demo of [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2). ## Install @@ -488,9 +1282,12 @@ pip install -r requirements.txt | Model | Device | Memory |          Description | Download | |:-----------|:--:|:-----------:|:-------------------|:---------------:| -| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | The lastest version, achieving state-of-the end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) | -| MiniCPM-Llama3-V 2.5 gguf | CPU | 5 GB | The gguf version, lower GPU memory and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf)   [](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) | -| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version,lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) | +| MiniCPM-V 2.6| GPU | 17 GB | The latest version, achieving state-of-the-art end-side performance for single image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) | +| MiniCPM-V 2.6 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) | +| MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) | +| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) | +| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf)   [](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) | +| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) | | MiniCPM-V 2.0 | GPU | 8 GB | Light version, balance the performance the computation cost. 
| [🤗](https://huggingface.co/openbmb/MiniCPM-V-2)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) | | MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V)    [](https://modelscope.cn/models/OpenBMB/MiniCPM-V) | @@ -504,30 +1301,40 @@ Please refer to the following codes to run. ```python -from chat import MiniCPMVChat, img2base64 import torch -import json +from PIL import Image +from transformers import AutoModel, AutoTokenizer torch.manual_seed(0) -chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5') +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) -im_64 = img2base64('./assets/airplane.jpeg') +image = Image.open('./assets/airplane.jpeg').convert('RGB') # First round chat -msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}] +question = "Tell me the model of this aircraft." +msgs = [{'role': 'user', 'content': [image, question]}] -inputs = {"image": im_64, "question": json.dumps(msgs)} -answer = chat_model.chat(inputs) +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) print(answer) # Second round chat # pass history context of multi-turn conversation -msgs.append({"role": "assistant", "content": answer}) -msgs.append({"role": "user", "content": "Introduce something about Airbus A380."}) +msgs.append({"role": "assistant", "content": [answer]}) +msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]}) -inputs = {"image": im_64, "question": json.dumps(msgs)} -answer = chat_model.chat(inputs) +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) print(answer) ``` @@ -539,6 +1346,129 @@ You will get the following output: "The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry." ``` +#### Chat with multiple images +
+ Click to view Python code running MiniCPM-V 2.6 with multiple images input. + +```python +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) + +image1 = Image.open('image1.jpg').convert('RGB') +image2 = Image.open('image2.jpg').convert('RGB') +question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.' + +msgs = [{'role': 'user', 'content': [image1, image2, question]}] + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) +print(answer) +``` +
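The `model.chat` interface used above can also stream its answer. A minimal sketch, assuming MiniCPM-V 2.6 accepts the same `sampling=True, stream=True` flags documented for MiniCPM-Llama3-V 2.5's streaming output (`model`, `msgs` and `tokenizer` as defined in the examples above):

```python
# Streaming sketch: with stream=True, chat() is assumed to return a generator of
# text chunks instead of a single string (as documented for MiniCPM-Llama3-V 2.5).
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,   # streaming is documented together with sampling-based decoding
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end="")
```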
+ +#### In-context few-shot learning +
+ Click to view Python code running MiniCPM-V 2.6 with few-shot input. + +```python +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) + +question = "production date" +image1 = Image.open('example1.jpg').convert('RGB') +answer1 = "2023.08.04" +image2 = Image.open('example2.jpg').convert('RGB') +answer2 = "2007.04.24" +image_test = Image.open('test.jpg').convert('RGB') + +msgs = [ + {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]}, + {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]}, + {'role': 'user', 'content': [image_test, question]} +] + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer +) +print(answer) +``` +
+ +#### Chat with video +
+ Click to view Python code running MiniCPM-V 2.6 with video input. + +```python +import torch +from PIL import Image +from transformers import AutoModel, AutoTokenizer +from decord import VideoReader, cpu # pip install decord + +model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, + attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager +model = model.eval().cuda() +tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) + +MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number + +def encode_video(video_path): + def uniform_sample(l, n): + gap = len(l) / n + idxs = [int(i * gap + gap / 2) for i in range(n)] + return [l[i] for i in idxs] + + vr = VideoReader(video_path, ctx=cpu(0)) + sample_fps = round(vr.get_avg_fps() / 1) # FPS + frame_idx = [i for i in range(0, len(vr), sample_fps)] + if len(frame_idx) > MAX_NUM_FRAMES: + frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES) + frames = vr.get_batch(frame_idx).asnumpy() + frames = [Image.fromarray(v.astype('uint8')) for v in frames] + print('num frames:', len(frames)) + return frames + +video_path="video_test.mp4" +frames = encode_video(video_path) +question = "Describe the video" +msgs = [ + {'role': 'user', 'content': frames + [question]}, +] + +# Set decode params for video +params = {} +params["use_image_id"] = False +params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448 + +answer = model.chat( + image=None, + msgs=msgs, + tokenizer=tokenizer, + **params +) +print(answer) +``` +
+
+
+### Inference on Multiple GPUs
+You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. Please refer to this [tutorial](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) for detailed instructions on how to load the model and run inference across multiple low-VRAM GPUs.

### Inference on Mac
@@ -577,52 +1507,100 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py

### Deployment on Mobile Phone
-MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click [here](https://github.com/OpenBMB/mlc-MiniCPM) to install apk. MiniCPM-Llama3-V 2.5 coming soon.
+MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click [MiniCPM-V 2.0](https://github.com/OpenBMB/mlc-MiniCPM) to install the apk.

-### WebUI Demo
+### Inference with llama.cpp
+MiniCPM-V 2.6 can run with llama.cpp now! See [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md) for more details. This implementation supports smooth inference of 16~18 tokens/s on iPad (test environment: iPad Pro + M4).
+
+### Inference with ollama
+MiniCPM-V 2.6 can run with ollama now! See [our fork of ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more details. This implementation supports smooth inference of 16~18 tokens/s on iPad (test environment: iPad Pro + M4).
+
+### Inference with vLLM
-Click to see how to deploy WebUI demo on different devices - -```shell -pip install -r requirements.txt -``` - -```shell -# For NVIDIA GPUs, run: -python web_demo_2.5.py --device cuda + vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0, Click to see. -# For Mac with MPS (Apple silicon or AMD GPUs), run: -PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps -``` -
- -### Inference with llama.cpp -MiniCPM-Llama3-V 2.5 can run with llama.cpp now! See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more detail. This implementation supports smooth inference of 6~8 token/s on mobile phones (test environment:Xiaomi 14 pro + Snapdragon 8 Gen 3). - -### Inference with vLLM - -
-Click to see how to inference with vLLM -Because our pull request to vLLM is still waiting for reviewing, we fork this repository to build and test our vLLM demo. Here are the steps: - -1. Clone our version of vLLM: +1. Install vLLM(>=0.5.4): ```shell -git clone https://github.com/OpenBMB/vllm.git +pip install vllm ``` -2. Install vLLM: +2. Install timm: (optional, MiniCPM-V 2.0 need timm) ```shell -cd vllm -pip install -e . +pip install timm==0.9.10 ``` -3. Install timm: -```shell -pip install timm=0.9.10 -``` -4. Run our demo: -```shell -python examples/minicpmv_example.py +3. Run the example(for image): +```python +from transformers import AutoTokenizer +from PIL import Image +from vllm import LLM, SamplingParams + +MODEL_NAME = "openbmb/MiniCPM-V-2_6" +# Also available for previous models +# MODEL_NAME = "openbmb/MiniCPM-Llama3-V-2_5" +# MODEL_NAME = "HwwwH/MiniCPM-V-2" + +image = Image.open("xxx.png").convert("RGB") +tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True) +llm = LLM( + model=MODEL_NAME, + trust_remote_code=True, + gpu_memory_utilization=1, + max_model_len=2048 +) + +messages = [{ + "role": + "user", + "content": + # Number of images + "(./)" + \ + "\nWhat is the content of this image?" +}] +prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True +) + +# Single Inference +inputs = { + "prompt": prompt, + "multi_modal_data": { + "image": image + # Multi images, the number of images should be equal to that of `(./)` + # "image": [image, image] + }, +} +# Batch Inference +# inputs = [{ +# "prompt": prompt, +# "multi_modal_data": { +# "image": image +# }, +# } for _ in 2] + + +# 2.6 +stop_tokens = ['<|im_end|>', '<|endoftext|>'] +stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens] +# 2.0 +# stop_token_ids = [tokenizer.eos_id] +# 2.5 +# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id] + +sampling_params = SamplingParams( + stop_token_ids=stop_token_ids, + use_beam_search=True, + temperature=0, + best_of=3, + max_tokens=1024 +) + +outputs = llm.generate(inputs, sampling_params=sampling_params) + +print(outputs[0].outputs[0].text) ``` +4. click [here](https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf?from=from_copylink) if you want to use it with *video*, or get more details about `vLLM`.
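For batch inference, the commented-out block in the example above can be written out as follows (a minimal sketch reusing `llm`, `prompt` and `sampling_params` from that example; `image1.png` and `image2.png` are placeholder file names):

```python
# Batch inference sketch: one request per image, submitted in a single generate() call.
# Assumes `llm`, `prompt` and `sampling_params` from the vLLM example above.
from PIL import Image

batch_images = [Image.open(p).convert("RGB") for p in ("image1.png", "image2.png")]
batch_inputs = [
    {"prompt": prompt, "multi_modal_data": {"image": img}}
    for img in batch_images
]

outputs = llm.generate(batch_inputs, sampling_params=sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```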
## Fine-tuning @@ -637,30 +1615,25 @@ We support simple fine-tuning with Hugging Face for MiniCPM-V 2.0 and MiniCPM-Ll We now support MiniCPM-V series fine-tuning with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs . It supports the lightweight training solutions provided by PEFT and a complete Adapters Library including techniques such as NEFTune, LoRA+ and LLaMA-PRO. -Best Practices:[MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md) +Best Practices:[MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md), [MiniCPM-V 2.6](https://github.com/modelscope/ms-swift/issues/1613). - - -## TODO - -- [x] MiniCPM-V fine-tuning support -- [ ] Code release for real-time interactive assistant +## FAQs +Click here to view the [FAQs](./docs/faqs.md) ## Model License -The code in this repo is released according to [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) +* This repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. -The usage of MiniCPM-V's and OmniLMM's parameters is subject to "[General Model License Agreement - Source Notes - Publicity Restrictions - Commercial License](https://github.com/OpenBMB/General-Model-License/blob/main/通用模型许可协议-来源说明-宣传限制-商业授权.md)" +* The usage of MiniCPM-V model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). -The parameters are fully open to academic research - -Please contact cpm@modelbest.cn to obtain written authorization for commercial uses. Free commercial use is also allowed after registration. +* The models and weights of MiniCPM are completely free for academic research. after filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, are also available for free commercial use. + ## Statement As LMMs, MiniCPM-V models (including OmniLMM) generate contents by learning a large amount of multimodal corpora, but they cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-V models does not represent the views and positions of the model developers -We will not be liable for any problems arising from the use of MiniCPMV-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. +We will not be liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. 
## Institutions @@ -671,15 +1644,16 @@ This project is developed by the following institutions: - [ModelBest](https://modelbest.cn/) - [Zhihu](https://www.zhihu.com/ ) -## Other Multimodal Projects from Our Team +## 🌟 Star History -👏 Welcome to explore other multimodal projects of our team: -[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) + +

+ +

+
-## 🌟 Star History - - + -## Citation +## Key Techniques and Other Multimodal Projects + +👏 Welcome to explore key techniques of MiniCPM-V and other multimodal projects of our team: + +[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) + + +## Citation If you find our model/code/paper helpful, please consider cite our papers 📝 and star us ⭐️! ```bib -@article{yu2023rlhf, - title={Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback}, - author={Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and others}, - journal={arXiv preprint arXiv:2312.00849}, - year={2023} -} -@article{viscpm, - title={Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages}, - author={Jinyi Hu and Yuan Yao and Chongyi Wang and Shan Wang and Yinxu Pan and Qianyu Chen and Tianyu Yu and Hanghao Wu and Yue Zhao and Haoye Zhang and Xu Han and Yankai Lin and Jiao Xue and Dahai Li and Zhiyuan Liu and Maosong Sun}, - journal={arXiv preprint arXiv:2308.12038}, - year={2023} -} -@article{xu2024llava-uhd, - title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images}, - author={Xu, Ruyi and Yao, Yuan and Guo, Zonghao and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao}, - journal={arXiv preprint arXiv:2403.11703}, - year={2024} -} -@article{yu2024rlaifv, - title={RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness}, - author={Yu, Tianyu and Zhang, Haoye and Yao, Yuan and Dang, Yunkai and Chen, Da and Lu, Xiaoman and Cui, Ganqu and He, Taiwen and Liu, Zhiyuan and Chua, Tat-Seng and Sun, Maosong}, - journal={arXiv preprint arXiv:2405.17220}, +@article{yao2024minicpm, + title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone}, + author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}, + journal={arXiv preprint arXiv:2408.01800}, year={2024} } ```