Merge pull request #334 from LDLINGLINGLING/main

Added a quantization script and inference docs for SWIFT and Xinference, and added quick navigation for common and new modules to the README
This commit is contained in:
Tianyu Yu
2024-07-31 06:59:04 +08:00
committed by GitHub
9 changed files with 302 additions and 0 deletions


@@ -70,6 +70,15 @@ Join our <a href="docs/wechat.md" target="_blank"> 💬 WeChat</a>
- [🌟 Star History](#-star-history)
- [Citation](#citation)
## MiniCPM-Llama3-V 2.5 Common Module Navigation <!-- omit in toc -->
Click the links in the table below to quickly access the commonly used resources for MiniCPM-Llama3-V 2.5.
| Functional Categories | | | | | | | | |
|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
| Finetune | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
| Edge Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
| Quantize | [Bnb](./quantize/bnb_quantize.py) | | | | | | | |
## MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:


@@ -75,6 +75,16 @@
- [🌟 Star History](#-star-history)
- [Citation](#引用)
## MiniCPM-Llama3-V 2.5 Quick Navigation <!-- omit in toc -->
Click the links in the table below to quickly access the commonly used resources for MiniCPM-Llama3-V 2.5.
| Functional Categories | | | | | | | | |
|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
| Finetune | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
| Android Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
| Quantize | [Bnb](./quantize/bnb_quantize.py) | | | | | | | |
## MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. Built on SigLip-400M and Llama3-8B-Instruct with 8B parameters in total, it achieves a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

Binary files added (not shown): four new screenshots under assets/xinferenc_demo_image/ (120 KiB, 88 KiB, 8.7 KiB, and 29 KiB).

docs/swift_train_and_infer.md Normal file

@@ -0,0 +1,135 @@
## SWIFT Installation
You can quickly install SWIFT with the following commands:
``` bash
git clone https://github.com/modelscope/swift.git
cd swift
pip install -r requirements.txt
pip install -e '.[llm]'
```
## SWIFT Inference
Inference using SWIFT can be carried out in two ways: through a command line interface and via Python code.
### Quick start
Here are the steps to launch SWIFT inference from the command line:
1. Running the following command will download the MiniCPM-Llama3-V-2_5 model and start inference:
``` shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
```
2. You can also pass the following additional arguments to control inference:
```
model_id_or_path # Can be the model ID from Hugging Face or the local path to the model
infer_backend ['AUTO', 'vllm', 'pt'] # Backend for inference, default is auto
dtype ['bf16', 'fp16', 'fp32', 'AUTO'] # Computational precision
max_length # Maximum length
max_new_tokens: int = 2048 # Maximum number of tokens to generate
do_sample: bool = True # Whether to sample during generation
temperature: float = 0.3 # Temperature coefficient during generation
top_k: int = 20
top_p: float = 0.7
repetition_penalty: float = 1. # Penalty for repetition
num_beams: int = 1 # Number of beams for beam search
stop_words: List[str] = None # List of stop words
quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm'] # Quantization method for the model
quantization_bit [0, 1, 2, 3, 4, 8] # Default is 0, which means no quantization is used
```
3. Example:
``` shell
CUDA_VISIBLE_DEVICES=0,1 swift infer \
--model_type minicpm-v-v2_5-chat \
--model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
--dtype bf16
```
### Python Inference with SWIFT
The following example uses Python code to run inference on MiniCPM-Llama3-V-2_5 through SWIFT.
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' # Select which GPUs to use
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType,
get_default_template_type, inference_stream
) # Import necessary modules
from swift.utils import seed_everything # Set random seed
import torch
model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type) # Obtain the template type, primarily used for constructing special tokens and image processing workflow
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
                       model_kwargs={'device_map': 'auto'}) # Load the model: set the model type, model path, device allocation, and computation precision
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer) # Construct the template based on the template type
seed_everything(42)
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png'] # Image URL
query = '距离各城市多远?'  # "How far is it to each city?"
response, history = inference(model, template, query, images=images) # Obtain results through inference
print(f'query: {query}')
print(f'response: {response}')
# Streaming output
query = '距离最远的城市是哪?'  # "Which city is the farthest away?"
gen = inference_stream(model, template, query, history, images=images) # Call the streaming output interface
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
delta = response[print_idx:]
print(delta, end='', flush=True)
print_idx = len(response)
print()
print(f'history: {history}')
```
## SWIFT Training
SWIFT supports training on a local dataset. The training steps are as follows:
1. Prepare the training data in the following JSON Lines format (a minimal Python sketch for writing such a file appears after this list):
```jsonl
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "images": ["local_image_path"]}
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "history": [], "images": ["image_path"]}
{"query": "Is bamboo tasty?", "response": "It seems pretty tasty judging by the panda's expression.", "history": [["What's in this picture?", "There's a giant panda in this picture."], ["What is the panda doing?", "Eating bamboo."]], "images": ["image_url"]}
```
2. LoRA Tuning:
By default, the LoRA target modules are the k and v projection weights of the LLM. Pay attention to `eval_steps`: evaluation can hit an out-of-memory error in SWIFT, so set `eval_steps` to a very large value (e.g., 200000) to effectively skip evaluation.
```shell
# Experimental environment: A100
# 32GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
--eval_steps 200000
```
3. Fine-tuning all linear modules:
When `lora_target_modules` is set to `ALL`, LoRA is applied to every linear module in the model rather than only the default k and v projections.
```shell
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
--lora_target_modules ALL \
--eval_steps 200000
```
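For reference, the following is a minimal sketch of writing a `train.jsonl` file in the format shown in step 1. The file name, image paths, and texts are placeholders rather than anything SWIFT requires; replace them with your own data.
```python
import json

# Placeholder samples in the multimodal format shown above;
# "images" entries may be local paths or URLs.
samples = [
    {
        "query": "What does this picture describe?",
        "response": "This picture has a giant panda.",
        "images": ["/path/to/panda.jpg"],
    },
    {
        "query": "Is bamboo tasty?",
        "response": "It seems pretty tasty judging by the panda's expression.",
        "history": [
            ["What's in this picture?", "There's a giant panda in this picture."],
            ["What is the panda doing?", "Eating bamboo."],
        ],
        "images": ["/path/to/panda.jpg"],
    },
]

# Write one JSON object per line (JSON Lines).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```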
## LoRA Merge and Inference
The LoRA weights can be merged into the base model and then loaded for inference.
1. To load the LoRA weights for inference, run:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir /your/lora/save/checkpoint
```
2. Merge the LoRA weights into the base model:
The following command loads the LoRA weights, merges them into the base model, saves the merged model to the LoRA checkpoint path, and loads the merged model for inference (a Python sketch for loading the merged model follows).
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir your/lora/save/checkpoint \
--merge_lora true
```
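If you prefer to run the merged model from Python, the SWIFT API shown in the inference section above can point at the merged checkpoint directory instead of the original model path. The sketch below reuses that API; the `-merged` directory suffix and the checkpoint path are assumptions about where SWIFT saves the merged weights, so use the path actually reported by `swift infer --merge_lora true`.
```python
import torch
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type
)

model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type)

# Assumption: the merged weights live in "<your LoRA checkpoint>-merged";
# replace this with the directory printed by the merge step.
merged_dir = '/your/lora/save/checkpoint-merged'

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_id_or_path=merged_dir,
                                       model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
response, history = inference(model, template, 'What is in this picture?', images=images)
print(response)
```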

docs/xinference_infer.md Normal file

@@ -0,0 +1,67 @@
## Xinference Infer
Xinference is a unified inference platform that provides a common interface over different inference engines. It supports LLMs, text generation, image generation, and more, and it is not much heavier to set up than SWIFT.
### Xinference Installation
Xinference can be installed with a single pip command:
```shell
pip install "xinference[all]"
```
### Quick start
On the first launch, Xinference downloads the model before inference can start. The steps are as follows:
1. Start Xinference in the terminal:
```shell
xinference
```
2. Open the web UI in your browser.
3. Search for "MiniCPM-Llama3-V-2_5" in the search box.
![alt text](../assets/xinferenc_demo_image/xinference_search_box.png)
4. Find and click the MiniCPM-Llama3-V-2_5 button.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size : 8
Quantization : none
N-GPU : auto
Replica : 1
```
6. After you click the launch button for the first time, Xinference downloads the model from Hugging Face. Once it is ready, click the web UI button.
![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
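
Besides the web UI, a launched model can also be called programmatically: Xinference exposes an OpenAI-compatible REST API. The following is a minimal sketch under these assumptions: the server listens on the default `http://localhost:9997`, the model was launched with the UID `MiniCPM-Llama3-V-2_5`, the `openai` Python package is installed, and the image URL is a placeholder; adjust all of these to your deployment.
```python
from openai import OpenAI

# Point the OpenAI client at the local Xinference server (the API key is not checked).
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",  # the model UID shown in the Xinference UI
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/airplane.jpeg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```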
### Local MiniCPM-Llama3-V-2_5 Launch
If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can proceed with Xinference inference following these steps:
1. Start Xinference
```shell
xinference
```
2. Open the web UI in your browser.
3. Register a new model: the settings highlighted in red are fixed and cannot be changed, while the others can be customized to your needs. Finish by clicking the 'Register Model' button.
![alt text](../assets/xinferenc_demo_image/xinference_register_model1.png)
![alt text](../assets/xinferenc_demo_image/xinference_register_model2.png)
4. After completing the model registration, proceed to 'Custom Models' and locate the model you just registered.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size : 8
Quantization : none
N-GPU : auto
Replica : 1
```
6. After you click the launch button, Xinference loads the registered local model. Once it is ready, click the chat button.
![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
### FAQ
1. Why doesn't step 6 open the web UI?
Your firewall or macOS security settings may be blocking the page from opening.

quantize/bnb_quantize.py Normal file

@@ -0,0 +1,81 @@
"""
This script uses bitsandbytes to quantize the MiniCPM-Llama3-V-2_5 model.
The quantized model can still be fine-tuned with the MiniCPM-Llama3-V-2_5 fine-tuning code, or used as is.
You only need to set model_path and save_path, then run:
    cd MiniCPM-V
    python quantize/bnb_quantize.py
You will get the quantized model in save_path, along with the inference time and GPU memory usage of the quantized model.
"""
import os
import time

import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
from PIL import Image
import GPUtil

assert torch.cuda.is_available(), "CUDA is not available, but this script requires a GPU."
device = 'cuda' # Select GPU to use
model_path = '/root/ld/ld_model_pretrained/MiniCPM-Llama3-V-2_5' # Model download path
save_path = '/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5_int4' # Quantized model save path
image_path = './assets/airplane.jpeg'
# Create a configuration object to specify quantization parameters
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # Whether to perform 4-bit quantization
load_in_8bit=False, # Whether to perform 8-bit quantization
bnb_4bit_compute_dtype=torch.float16, # Computation precision setting
bnb_4bit_quant_storage=torch.uint8, # Storage format for quantized weights
bnb_4bit_quant_type="nf4", # Quantization format, here using normally distributed int4
bnb_4bit_use_double_quant=True, # Whether to use double quantization, i.e., quantizing zeropoint and scaling parameters
llm_int8_enable_fp32_cpu_offload=False, # Whether LLM uses int8, with fp32 parameters stored on the CPU
llm_int8_has_fp16_weight=False, # Whether mixed precision is enabled
llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"], # Modules not to be quantized
llm_int8_threshold=6.0 # Outlier value in the llm.int8() algorithm, distinguishing whether to perform quantization based on this value
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_path,
device_map=device, # Allocate model to device
quantization_config=quantization_config,
trust_remote_code=True
)
gpu_usage = GPUtil.getGPUs()[0].memoryUsed
start = time.time()
response = model.chat(
image=Image.open(image_path).convert("RGB"),
msgs=[
{
"role": "user",
"content": "What is in this picture?"
}
],
tokenizer=tokenizer
)  # Model inference
print('Output after quantization:', response)
print('Inference time after quantization:', time.time() - start)
print(f"GPU memory usage after quantization: {round(gpu_usage / 1024, 2)} GB")
"""
Expected output:
Output after quantization: This picture contains specific parts of an airplane, including wings, engines, and tail sections. These components are key parts of large commercial aircraft.
The wings support lift during flight, while the engines provide thrust to move the plane forward. The tail section is typically used for stabilizing flight and plays a role in airline branding.
The design and color of the airplane indicate that it belongs to Air China, likely a passenger aircraft due to its large size and twin-engine configuration.
There are no markings or insignia on the airplane indicating the specific model or registration number; such information may require additional context or a clearer perspective to discern.
Inference time after quantization: 8.583992719650269 seconds
GPU memory usage after quantization: 6.41 GB
"""
# Save the model and tokenizer
os.makedirs(save_path, exist_ok=True)
model.save_pretrained(save_path, safe_serialization=True)
tokenizer.save_pretrained(save_path)
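
# Optional sanity check (a minimal sketch): reload the quantized model from save_path.
# Assumption: your transformers/bitsandbytes versions support loading serialized
# 4-bit weights; the quantization settings are read back from the saved config.
reload_check = False  # set to True to verify that the saved model loads correctly
if reload_check:
    reloaded_tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
    reloaded_model = AutoModel.from_pretrained(
        save_path,
        device_map=device,
        trust_remote_code=True
    )
    print("Reloaded quantized model from", save_path)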