@@ -385,20 +1171,8 @@
-We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is the raw screen recording on a Xiaomi 14 Pro without edition.
+
## MiniCPM-V 2.0
@@ -455,9 +1229,29 @@ We deploy MiniCPM-V 2.0 on end devices. The demo video is the raw screen recordi
| OmniLMM-12B | [Document](./omnilmm_en.md) |
+## Chat with Our Demo on Gradio 🤗
+
+We provide online and local demos powered by Hugging Face Gradio, the most popular model deployment framework nowadays. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.
+
+
+### Online Demo
+
+Click here to try out the online demo of [MiniCPM-V 2.6](https://huggingface.co/spaces/openbmb/MiniCPM-V-2_6) | [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2).
+
+### Local WebUI Demo
+
+You can easily build your own local WebUI demo with Gradio using the following commands.
+
+```shell
+pip install -r requirements.txt
+```
+
+```shell
+# For NVIDIA GPUs, run:
+python web_demo_2.6.py --device cuda
+
+```
-## Online Demo
-Click here to try out the Demo of [MiniCPM-Llama3-V 2.5](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | [MiniCPM-V 2.0](https://huggingface.co/spaces/openbmb/MiniCPM-V-2).
## Install
@@ -488,9 +1282,12 @@ pip install -r requirements.txt
| Model | Device | Memory | Description | Download |
|:-----------|:--:|:-----------:|:-------------------|:---------------:|
-| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | The lastest version, achieving state-of-the end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
-| MiniCPM-Llama3-V 2.5 gguf | CPU | 5 GB | The gguf version, lower GPU memory and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
-| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version,lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
+| MiniCPM-V 2.6 | GPU | 17 GB | The latest version, achieving state-of-the-art end-side performance for single-image, multi-image and video understanding. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
+| MiniCPM-V 2.6 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-gguf) |
+| MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6-int4) |
+| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5) |
+| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-gguf) |
+| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | [🤗](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-int4/) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-Llama3-V-2_5-int4) |
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | [🤗](https://huggingface.co/openbmb/MiniCPM-V-2) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2) |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | [🤗](https://huggingface.co/openbmb/MiniCPM-V) [ModelScope](https://modelscope.cn/models/OpenBMB/MiniCPM-V) |
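+
+If you want to fetch weights from the Model Zoo ahead of time instead of letting `transformers` download them on first use, the sketch below uses `huggingface_hub`. The repo id comes from the table above; the local directory name is only an illustrative choice.
+
+```python
+# Minimal sketch: pre-download Model Zoo weights with huggingface_hub.
+# `local_dir` is an arbitrary example path, not something MiniCPM-V requires.
+from huggingface_hub import snapshot_download
+
+local_path = snapshot_download(
+    repo_id="openbmb/MiniCPM-V-2_6",   # e.g. "openbmb/MiniCPM-V-2_6-int4" for the int4 version
+    local_dir="./MiniCPM-V-2_6",       # example destination directory
+)
+print("Model files downloaded to:", local_path)
+```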
@@ -504,30 +1301,40 @@ Please refer to the following codes to run.
```python
-from chat import MiniCPMVChat, img2base64
import torch
-import json
+from PIL import Image
+from transformers import AutoModel, AutoTokenizer
torch.manual_seed(0)
-chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5')
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
+model = model.eval().cuda()
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
-im_64 = img2base64('./assets/airplane.jpeg')
+image = Image.open('./assets/airplane.jpeg').convert('RGB')
# First round chat
-msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}]
+question = "Tell me the model of this aircraft."
+msgs = [{'role': 'user', 'content': [image, question]}]
-inputs = {"image": im_64, "question": json.dumps(msgs)}
-answer = chat_model.chat(inputs)
+answer = model.chat(
+ image=None,
+ msgs=msgs,
+ tokenizer=tokenizer
+)
print(answer)
# Second round chat
# pass history context of multi-turn conversation
-msgs.append({"role": "assistant", "content": answer})
-msgs.append({"role": "user", "content": "Introduce something about Airbus A380."})
+msgs.append({"role": "assistant", "content": [answer]})
+msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})
-inputs = {"image": im_64, "question": json.dumps(msgs)}
-answer = chat_model.chat(inputs)
+answer = model.chat(
+ image=None,
+ msgs=msgs,
+ tokenizer=tokenizer
+)
print(answer)
```
@@ -539,6 +1346,129 @@ You will get the following output:
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
```
+#### Chat with multiple images
+
+The following Python example chats with MiniCPM-V 2.6 using multiple images as input.
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoModel, AutoTokenizer
+
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
+model = model.eval().cuda()
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
+
+image1 = Image.open('image1.jpg').convert('RGB')
+image2 = Image.open('image2.jpg').convert('RGB')
+question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
+
+msgs = [{'role': 'user', 'content': [image1, image2, question]}]
+
+answer = model.chat(
+ image=None,
+ msgs=msgs,
+ tokenizer=tokenizer
+)
+print(answer)
+```
+
+
+#### In-context few-shot learning
+
+The following Python example runs MiniCPM-V 2.6 with in-context few-shot input.
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoModel, AutoTokenizer
+
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
+model = model.eval().cuda()
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
+
+question = "production date"
+image1 = Image.open('example1.jpg').convert('RGB')
+answer1 = "2023.08.04"
+image2 = Image.open('example2.jpg').convert('RGB')
+answer2 = "2007.04.24"
+image_test = Image.open('test.jpg').convert('RGB')
+
+msgs = [
+ {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
+ {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
+ {'role': 'user', 'content': [image_test, question]}
+]
+
+answer = model.chat(
+ image=None,
+ msgs=msgs,
+ tokenizer=tokenizer
+)
+print(answer)
+```
+
+
+#### Chat with video
+
+The following Python example runs MiniCPM-V 2.6 with video input.
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoModel, AutoTokenizer
+from decord import VideoReader, cpu # pip install decord
+
+model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
+ attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
+model = model.eval().cuda()
+tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
+
+MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
+
+def encode_video(video_path):
+    def uniform_sample(l, n):
+        # evenly sample n indices across the list l
+        gap = len(l) / n
+        idxs = [int(i * gap + gap / 2) for i in range(n)]
+        return [l[i] for i in idxs]
+
+    vr = VideoReader(video_path, ctx=cpu(0))
+    sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly one frame per second
+    frame_idx = [i for i in range(0, len(vr), sample_fps)]
+    if len(frame_idx) > MAX_NUM_FRAMES:
+        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
+    frames = vr.get_batch(frame_idx).asnumpy()
+    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
+    print('num frames:', len(frames))
+    return frames
+
+video_path="video_test.mp4"
+frames = encode_video(video_path)
+question = "Describe the video"
+msgs = [
+ {'role': 'user', 'content': frames + [question]},
+]
+
+# Set decode params for video
+params = {}
+params["use_image_id"] = False
+params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448
+
+answer = model.chat(
+ image=None,
+ msgs=msgs,
+ tokenizer=tokenizer,
+ **params
+)
+print(answer)
+```
+
+
+
+### Inference on Multiple GPUs
+You can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across multiple GPUs. Please refer to this [tutorial](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) for detailed instructions on how to load the model and inference using multiple low VRAM GPUs.
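+
+The tutorial above is the authoritative recipe. As a rough, hedged illustration of the idea only, Hugging Face Accelerate's `device_map` can also shard a checkpoint across GPUs, assuming the custom model code is compatible with it; the memory budgets below are example values, not recommendations from the tutorial.
+
+```python
+# Rough sketch (requires `pip install accelerate`): shard MiniCPM-Llama3-V 2.5
+# across two low-VRAM GPUs. Per-GPU budgets are illustrative; tune them for your
+# cards, and prefer the linked tutorial for the officially supported loading code.
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+model_id = 'openbmb/MiniCPM-Llama3-V-2_5'
+model = AutoModel.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    torch_dtype=torch.float16,
+    device_map='auto',                    # let accelerate place layers on the available GPUs
+    max_memory={0: '11GiB', 1: '11GiB'},  # example budgets for two 12 GB cards
+)
+model.eval()
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+```
+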
### Inference on Mac
@@ -577,52 +1507,100 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
### Deployment on Mobile Phone
-MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click [here](https://github.com/OpenBMB/mlc-MiniCPM) to install apk. MiniCPM-Llama3-V 2.5 coming soon.
+MiniCPM-V 2.0 can be deployed on mobile phones running Android. 🚀 Click [MiniCPM-V 2.0](https://github.com/OpenBMB/mlc-MiniCPM) to install the APK.
-### WebUI Demo
+### Inference with llama.cpp
+MiniCPM-V 2.6 can run with llama.cpp now! See [our fork of llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpmv-main/examples/llava/README-minicpmv2.6.md) for more details. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro with the M4 chip).
+
+### Inference with ollama
+MiniCPM-V 2.6 can run with ollama now! See [our fork of ollama](https://github.com/OpenBMB/ollama/blob/minicpm-v2.6/examples/minicpm-v2.6/README.md) for more details. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro with the M4 chip).
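+
+If you also want to call an ollama-served MiniCPM-V model from Python, a hedged sketch using the `ollama` Python client is shown below. The model tag `minicpm-v` and the image path are assumptions; use whatever tag you created when following the linked instructions.
+
+```python
+# Hedged sketch: query a MiniCPM-V model served by ollama from Python.
+# Assumes `pip install ollama`, a running ollama server, and a local model
+# tagged "minicpm-v" set up as described in the linked README.
+import ollama
+
+response = ollama.chat(
+    model='minicpm-v',                         # assumed tag; replace with your local model name
+    messages=[{
+        'role': 'user',
+        'content': 'What is in this image?',
+        'images': ['./assets/airplane.jpeg'],  # example image path
+    }],
+)
+print(response['message']['content'])
+```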
+
+### Inference with vLLM
-Click to see how to deploy WebUI demo on different devices
-
-```shell
-pip install -r requirements.txt
-```
-
-```shell
-# For NVIDIA GPUs, run:
-python web_demo_2.5.py --device cuda
+vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0. Follow the steps below to use it:
-# For Mac with MPS (Apple silicon or AMD GPUs), run:
-PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
-```
-
-
-### Inference with llama.cpp
-MiniCPM-Llama3-V 2.5 can run with llama.cpp now! See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more detail. This implementation supports smooth inference of 6~8 token/s on mobile phones (test environment:Xiaomi 14 pro + Snapdragon 8 Gen 3).
-
-### Inference with vLLM
-
-
-Click to see how to inference with vLLM
-Because our pull request to vLLM is still waiting for reviewing, we fork this repository to build and test our vLLM demo. Here are the steps:
-
-1. Clone our version of vLLM:
+1. Install vLLM (>= 0.5.4):
```shell
-git clone https://github.com/OpenBMB/vllm.git
+pip install vllm
```
-2. Install vLLM:
+2. Install timm (optional; only MiniCPM-V 2.0 needs timm):
```shell
-cd vllm
-pip install -e .
+pip install timm==0.9.10
```
-3. Install timm:
-```shell
-pip install timm=0.9.10
-```
-4. Run our demo:
-```shell
-python examples/minicpmv_example.py
+3. Run the example below (for image input):
+```python
+from transformers import AutoTokenizer
+from PIL import Image
+from vllm import LLM, SamplingParams
+
+MODEL_NAME = "openbmb/MiniCPM-V-2_6"
+# Also available for previous models
+# MODEL_NAME = "openbmb/MiniCPM-Llama3-V-2_5"
+# MODEL_NAME = "HwwwH/MiniCPM-V-2"
+
+image = Image.open("xxx.png").convert("RGB")
+tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
+llm = LLM(
+ model=MODEL_NAME,
+ trust_remote_code=True,
+ gpu_memory_utilization=1,
+ max_model_len=2048
+)
+
+messages = [{
+ "role":
+ "user",
+ "content":
+ # one "(./)" image placeholder per input image
+ "(./)" + \
+ "\nWhat is the content of this image?"
+}]
+prompt = tokenizer.apply_chat_template(
+ messages,
+ tokenize=False,
+ add_generation_prompt=True
+)
+
+# Single Inference
+inputs = {
+ "prompt": prompt,
+ "multi_modal_data": {
+ "image": image
+ # Multi images, the number of images should be equal to that of `(./)`
+ # "image": [image, image]
+ },
+}
+# Batch Inference
+# inputs = [{
+# "prompt": prompt,
+# "multi_modal_data": {
+# "image": image
+# },
+# } for _ in range(2)]
+
+
+# 2.6
+stop_tokens = ['<|im_end|>', '<|endoftext|>']
+stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
+# 2.0
+# stop_token_ids = [tokenizer.eos_id]
+# 2.5
+# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
+
+sampling_params = SamplingParams(
+ stop_token_ids=stop_token_ids,
+ use_beam_search=True,
+ temperature=0,
+ best_of=3,
+ max_tokens=1024
+)
+
+outputs = llm.generate(inputs, sampling_params=sampling_params)
+
+print(outputs[0].outputs[0].text)
```
+4. Click [here](https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf?from=from_copylink) if you want to use vLLM with *video* input or to get more details about `vLLM`. A minimal sketch of calling vLLM's OpenAI-compatible server is shown below.
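+
+Besides the offline `LLM` API shown above, vLLM also provides an OpenAI-compatible server (started with something like `vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code --max-model-len 2048`). The sketch below is a minimal client example under that assumption; the host, port, API key and image path are placeholders.
+
+```python
+# Minimal sketch: query a vLLM OpenAI-compatible server that is already serving
+# MiniCPM-V 2.6. Host, port, API key and image path are example values.
+import base64
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+with open("xxx.png", "rb") as f:
+    image_b64 = base64.b64encode(f.read()).decode()
+
+response = client.chat.completions.create(
+    model="openbmb/MiniCPM-V-2_6",
+    messages=[{
+        "role": "user",
+        "content": [
+            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
+            {"type": "text", "text": "What is the content of this image?"},
+        ],
+    }],
+    max_tokens=1024,
+)
+print(response.choices[0].message.content)
+```
+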
## Fine-tuning
@@ -637,30 +1615,25 @@ We support simple fine-tuning with Hugging Face for MiniCPM-V 2.0 and MiniCPM-Ll
We now support MiniCPM-V series fine-tuning with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete adapters library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.
-Best Practices:[MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md)
+Best Practices: [MiniCPM-V 1.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md), [MiniCPM-V 2.0](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md), [MiniCPM-V 2.6](https://github.com/modelscope/ms-swift/issues/1613).
-
-
-## TODO
-
-- [x] MiniCPM-V fine-tuning support
-- [ ] Code release for real-time interactive assistant
+## FAQs
+Click here to view the [FAQs](./docs/faqs.md)
## Model License
-The code in this repo is released according to [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)
+* This repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
-The usage of MiniCPM-V's and OmniLMM's parameters is subject to "[General Model License Agreement - Source Notes - Publicity Restrictions - Commercial License](https://github.com/OpenBMB/General-Model-License/blob/main/通用模型许可协议-来源说明-宣传限制-商业授权.md)"
+* The usage of MiniCPM-V model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
-The parameters are fully open to academic research
-
-Please contact cpm@modelbest.cn to obtain written authorization for commercial uses. Free commercial use is also allowed after registration.
+* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.
+
## Statement
As LMMs, MiniCPM-V models (including OmniLMM) generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions or make value judgments. Anything generated by MiniCPM-V models does not represent the views and positions of the model developers.
-We will not be liable for any problems arising from the use of MiniCPMV-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.
+We will not be liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the misdirection, misuse, or dissemination of the models.
## Institutions
@@ -671,15 +1644,16 @@ This project is developed by the following institutions:
- [ModelBest](https://modelbest.cn/)
- [Zhihu](https://www.zhihu.com/)
-## Other Multimodal Projects from Our Team
+## 🌟 Star History
-👏 Welcome to explore other multimodal projects of our team:
-[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
+
-## 🌟 Star History
-
-
+
-## Citation
+## Key Techniques and Other Multimodal Projects
+
+👏 Welcome to explore key techniques of MiniCPM-V and other multimodal projects of our team:
+
+[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
+
+
+## Citation
If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
```bib
-@article{yu2023rlhf,
- title={Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback},
- author={Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and others},
- journal={arXiv preprint arXiv:2312.00849},
- year={2023}
-}
-@article{viscpm,
- title={Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages},
- author={Jinyi Hu and Yuan Yao and Chongyi Wang and Shan Wang and Yinxu Pan and Qianyu Chen and Tianyu Yu and Hanghao Wu and Yue Zhao and Haoye Zhang and Xu Han and Yankai Lin and Jiao Xue and Dahai Li and Zhiyuan Liu and Maosong Sun},
- journal={arXiv preprint arXiv:2308.12038},
- year={2023}
-}
-@article{xu2024llava-uhd,
- title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
- author={Xu, Ruyi and Yao, Yuan and Guo, Zonghao and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
- journal={arXiv preprint arXiv:2403.11703},
- year={2024}
-}
-@article{yu2024rlaifv,
- title={RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness},
- author={Yu, Tianyu and Zhang, Haoye and Yao, Yuan and Dang, Yunkai and Chen, Da and Lu, Xiaoman and Cui, Ganqu and He, Taiwen and Liu, Zhiyuan and Chua, Tat-Seng and Sun, Maosong},
- journal={arXiv preprint arXiv:2405.17220},
+@article{yao2024minicpm,
+ title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
+ author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
+ journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
```