mirror of
https://github.com/OpenBMB/MiniCPM-V.git
synced 2026-02-04 17:59:18 +08:00
Merge pull request #334 from LDLINGLINGLING/main
Added a quantization script and inference docs for SWIFT and Xinference, and added quick navigation for common and new modules to the README
@@ -70,6 +70,15 @@ Join our <a href="docs/wechat.md" target="_blank"> 💬 WeChat</a>
- [🌟 Star History](#-star-history)
- [Citation](#citation)

## MiniCPM-Llama3-V 2.5 Common Module Navigation <!-- omit in toc -->
You can click on the following table to quickly access the commonly used content you need in MiniCPM-Llama3-V 2.5.

| Functional Categories | | | | | | | | |
|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
| Finetune | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
| Edge Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
| Quantize | [Bnb](./quantize/bnb_quantize.py) | | | | | | | |

## MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
10 README_zh.md
@@ -75,6 +75,16 @@
- [🌟 Star History](#-star-history)
- [Citation](#引用)

## MiniCPM-Llama3-V 2.5 Quick Navigation <!-- omit in toc -->

You can click on the following table to quickly access the commonly used content you need in MiniCPM-Llama3-V 2.5.

| Functional Categories | | | | | | | | |
|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
| Fine-tuning | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
| Android Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
| Quantization | [Bnb](./quantize/bnb_quantize.py) | | | | | | | |

## MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. Built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters, it achieves a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
BIN assets/xinferenc_demo_image/xinference_register_model1.png (new file, 120 KiB)
BIN assets/xinferenc_demo_image/xinference_register_model2.png (new file, 88 KiB)
BIN assets/xinferenc_demo_image/xinference_search_box.png (new file, 8.7 KiB)
BIN assets/xinferenc_demo_image/xinference_webui_button.png (new file, 29 KiB)
135 docs/swift_train_and_infer.md (new file)
@@ -0,0 +1,135 @@
## SWIFT Install

You can quickly install SWIFT using bash commands.

```bash
git clone https://github.com/modelscope/swift.git
cd swift
pip install -r requirements.txt
pip install -e '.[llm]'
```

## SWIFT Infer
Inference using SWIFT can be carried out in two ways: through a command line interface and via Python code.

### Quick start

Here are the steps to launch SWIFT inference from the bash command line:

1. Running the following command will download the MiniCPM-Llama3-V-2_5 model and run inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
```

2. You can also pass the additional arguments below to control inference:
```
model_id_or_path  # Can be the model ID from Hugging Face or the local path to the model
infer_backend ['AUTO', 'vllm', 'pt']  # Backend for inference, default is 'AUTO'
dtype ['bf16', 'fp16', 'fp32', 'AUTO']  # Computational precision
max_length  # Maximum length
max_new_tokens: int = 2048  # Maximum number of tokens to generate
do_sample: bool = True  # Whether to sample during generation
temperature: float = 0.3  # Temperature coefficient during generation
top_k: int = 20
top_p: float = 0.7
repetition_penalty: float = 1.  # Penalty for repetition
num_beams: int = 1  # Number of beams for beam search
stop_words: List[str] = None  # List of stop words
quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm']  # Quantization method for the model
quantization_bit [0, 1, 2, 3, 4, 8]  # Default is 0, which means no quantization is used
```
3. Example:
```shell
CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --model_type minicpm-v-v2_5-chat \
    --model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
    --dtype bf16
```

### SWIFT inference from Python code

The following demonstrates how to run inference with the MiniCPM-Llama3-V-2_5 model through SWIFT from Python code.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # Set which GPUs to use

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)  # Import the required modules
from swift.utils import seed_everything  # For setting the random seed
import torch

model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type)  # Get the template type, mainly used to build special tokens and the image-processing pipeline
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
                                       model_kwargs={'device_map': 'auto'})  # Load the model: model type, local path, device allocation and computation precision
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)  # Build the template from the template type
seed_everything(42)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']  # Image URL
query = '距离各城市多远?'  # "How far is it to each city?"
response, history = inference(model, template, query, images=images)  # Run inference
print(f'query: {query}')
print(f'response: {response}')

# Streaming output
query = '距离最远的城市是哪?'  # "Which city is the farthest away?"
gen = inference_stream(model, template, query, history, images=images)  # Call the streaming interface
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
```

## SWIFT Train

SWIFT supports training on a local dataset. The training steps are as follows:

1. Prepare the training data in the following format (a short script for writing such a file is sketched after this section):
```jsonl
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "images": ["local_image_path"]}
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "history": [], "images": ["image_path"]}
{"query": "Is bamboo tasty?", "response": "It seems pretty tasty judging by the panda's expression.", "history": [["What's in this picture?", "There's a giant panda in this picture."], ["What is the panda doing?", "Eating bamboo."]], "images": ["image_url"]}
```
2. LoRA Tuning:

The LoRA targets are the k and v projection weights in the LLM. Pay attention to `eval_steps`: evaluation may run out of memory in SWIFT, so set `eval_steps` to a very large value such as 200000 to effectively skip it.
```shell
# Experimental environment: A100
# 32GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-v2_5-chat \
    --dataset coco-en-2-mini
```
3. Full-parameter fine-tuning:

When the `lora_target_modules` argument is `ALL`, the model fine-tunes all the parameters.
```shell
CUDA_VISIBLE_DEVICES=0,1 swift sft \
    --model_type minicpm-v-v2_5-chat \
    --dataset coco-en-2-mini \
    --lora_target_modules ALL \
    --eval_steps 200000
```
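
As referenced in step 1, here is a minimal sketch for writing a dataset file in the format shown above. The file name `train.jsonl` and the image paths are placeholders; how the resulting file is then passed to `swift sft` as a local dataset depends on your SWIFT version, so check `swift sft --help`.

```python
import json

# Hypothetical samples; replace the image paths with files that exist locally.
samples = [
    {"query": "What does this picture describe?",
     "response": "This picture has a giant panda.",
     "images": ["./images/panda_1.jpg"]},
    {"query": "Is bamboo tasty?",
     "response": "It seems pretty tasty judging by the panda's expression.",
     "history": [["What's in this picture?", "There's a giant panda in this picture."],
                 ["What is the panda doing?", "Eating bamboo."]],
     "images": ["./images/panda_2.jpg"]},
]

# Write one JSON object per line, keeping non-ASCII characters readable.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```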

## LoRA Merge and Infer

The LoRA weights can be merged into the base model, and the merged model can then be loaded for inference.

1. To run inference with the LoRA weights loaded on top of the base model, run:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir /your/lora/save/checkpoint
```
2. Merge the LoRA weights into the base model:

The following command loads the LoRA weights, merges them into the base model, saves the merged model to the LoRA checkpoint path, and then loads the merged model for inference. A Python loading sketch follows the command.
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir your/lora/save/checkpoint \
    --merge_lora true
```
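
If you prefer to load the merged checkpoint from Python, a minimal sketch follows, reusing the API from the inference example above. The merged directory name is hypothetical, and whether a merged checkpoint can be passed directly as `model_id_or_path` may depend on your SWIFT version.

```python
import torch
from swift.llm import (ModelType, get_default_template_type,
                       get_model_tokenizer, get_template, inference)

model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type)

# Hypothetical path: the merged directory written by `swift infer ... --merge_lora true`.
merged_ckpt = '/your/lora/save/checkpoint-merged'

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_id_or_path=merged_ckpt,
                                       model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
response, _ = inference(model, template, 'What is in this picture?', images=images)
print(response)
```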
67 docs/xinference_infer.md (new file)
@@ -0,0 +1,67 @@

## Xinference Infer

Xinference is a unified inference platform that provides a single interface over different inference engines. It supports LLMs, text generation, image generation, and more, and it is not much heavier than SWIFT.

### Xinference install

Xinference can be installed simply with the following command:
```shell
pip install "xinference[all]"
```

### Quick start

On the first launch, Xinference downloads the model before inference can start. The steps are:

1. Start Xinference in the terminal:
```shell
xinference
```
2. Open the web UI.
3. Search for "MiniCPM-Llama3-V-2_5" in the search box.

![search box](../assets/xinferenc_demo_image/xinference_search_box.png)

4. Find and click the MiniCPM-Llama3-V-2_5 button.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size   : 8
Quantization : none
N-GPU        : auto
Replica      : 1
```
6. The first time you click the launch button, Xinference downloads the model from Hugging Face. Once it is ready, click the web UI button.

![webui button](../assets/xinferenc_demo_image/xinference_webui_button.png)

7. Upload an image and chat with MiniCPM-Llama3-V-2_5. (For a programmatic alternative through the OpenAI-compatible API, see the sketch below.)
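
Besides the web UI, a launched model can also be queried programmatically. The sketch below assumes the default Xinference endpoint `http://localhost:9997/v1`, that the model was launched under the name `MiniCPM-Llama3-V-2_5`, and that your Xinference version exposes the OpenAI-compatible chat API for vision models; the image URL is a placeholder.

```python
from openai import OpenAI

# Placeholder endpoint and model name; adjust them to your Xinference deployment.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this picture?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/panda.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```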

### Local MiniCPM-Llama3-V-2_5 Launch

If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can run Xinference inference with the following steps:

1. Start Xinference:
```shell
xinference
```
2. Open the web UI.
3. Register a new model. The settings highlighted in red are fixed and cannot be changed; the others can be customized to your needs. Complete the process by clicking the 'Register Model' button.

![xinference register model 1](../assets/xinferenc_demo_image/xinference_register_model1.png)
![xinference register model 2](../assets/xinferenc_demo_image/xinference_register_model2.png)

4. After completing the model registration, go to 'Custom Models' and locate the model you just registered.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size   : 8
Quantization : none
N-GPU        : auto
Replica      : 1
```
6. Click the launch button; Xinference loads the model you registered. Once it is ready, click the chat button.

![launch and chat](../assets/xinferenc_demo_image/xinference_webui_button.png)

7. Upload an image and chat with MiniCPM-Llama3-V-2_5.

### FAQ

1. Why can't the web UI in step 6 be opened?

Your firewall or macOS security settings may be preventing the page from opening.
81 quantize/bnb_quantize.py (new file)
@@ -0,0 +1,81 @@
"""
|
||||
the script will use bitandbytes to quantize the MiniCPM-Llama3-V-2_5 model.
|
||||
the be quantized model can be finetuned by MiniCPM-Llama3-V-2_5 or not.
|
||||
you only need to set the model_path 、save_path and run bash code
|
||||
|
||||
cd MiniCPM-V
|
||||
python quantize/bnb_quantize.py
|
||||
|
||||
you will get the quantized model in save_path、quantized_model test time and gpu usage
|
||||
"""
import os
import time

import torch
import GPUtil
from PIL import Image
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

assert torch.cuda.is_available(), "CUDA is not available, but this code requires a GPU."

device = 'cuda' # Select GPU to use
model_path = '/root/ld/ld_model_pretrained/MiniCPM-Llama3-V-2_5' # Model download path
save_path = '/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5_int4' # Quantized model save path
image_path = './assets/airplane.jpeg'
# Create a configuration object to specify quantization parameters
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Whether to perform 4-bit quantization
    load_in_8bit=False,  # Whether to perform 8-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # Computation precision setting
    bnb_4bit_quant_storage=torch.uint8,  # Storage format for quantized weights
    bnb_4bit_quant_type="nf4",  # Quantization format, here using normal-float 4-bit (nf4)
    bnb_4bit_use_double_quant=True,  # Whether to use double quantization, i.e., quantizing the zero point and scaling parameters
    llm_int8_enable_fp32_cpu_offload=False,  # Whether LLM uses int8, with fp32 parameters stored on the CPU
    llm_int8_has_fp16_weight=False,  # Whether mixed precision is enabled
    llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"],  # Modules not to be quantized
    llm_int8_threshold=6.0  # Outlier threshold in the llm.int8() algorithm; values beyond it are kept in higher precision
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    device_map=device,  # Allocate the model to the device
    quantization_config=quantization_config,
    trust_remote_code=True
)
gpu_usage = GPUtil.getGPUs()[0].memoryUsed
start = time.time()
response = model.chat(
    image=Image.open(image_path).convert("RGB"),
    msgs=[
        {
            "role": "user",
            "content": "What is in this picture?"
        }
    ],
    tokenizer=tokenizer
)  # Model inference
print('Output after quantization:', response)
print('Inference time after quantization:', time.time() - start)
print(f"GPU memory usage after quantization: {round(gpu_usage / 1024, 2)} GB")
"""
|
||||
Expected output:
|
||||
|
||||
Output after quantization: This picture contains specific parts of an airplane, including wings, engines, and tail sections. These components are key parts of large commercial aircraft.
|
||||
The wings support lift during flight, while the engines provide thrust to move the plane forward. The tail section is typically used for stabilizing flight and plays a role in airline branding.
|
||||
The design and color of the airplane indicate that it belongs to Air China, likely a passenger aircraft due to its large size and twin-engine configuration.
|
||||
There are no markings or insignia on the airplane indicating the specific model or registration number; such information may require additional context or a clearer perspective to discern.
|
||||
Inference time after quantization: 8.583992719650269 seconds
|
||||
GPU memory usage after quantization: 6.41 GB
|
||||
"""
# Save the model and tokenizer
os.makedirs(save_path, exist_ok=True)
model.save_pretrained(save_path, safe_serialization=True)
tokenizer.save_pretrained(save_path)
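
A minimal sketch of reloading the saved int4 checkpoint for inference (not part of bnb_quantize.py). It assumes the quantized weights saved above can be loaded back through the standard `from_pretrained` path; whether this works may depend on your transformers and bitsandbytes versions.

```python
# Reload sketch (assumption: the int4 checkpoint saved above loads back via
# from_pretrained; behaviour may vary with transformers/bitsandbytes versions).
from PIL import Image
from transformers import AutoModel, AutoTokenizer

save_path = '/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5_int4'  # same path as above

tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
model = AutoModel.from_pretrained(save_path, device_map='cuda', trust_remote_code=True)

response = model.chat(
    image=Image.open('./assets/airplane.jpeg').convert("RGB"),
    msgs=[{"role": "user", "content": "What is in this picture?"}],
    tokenizer=tokenizer,
)
print(response)
```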