Merge pull request #334 from LDLINGLINGLING/main

Added a quantization script and inference docs for SWIFT and Xinference, and added quick navigation for common and new modules to the README
This commit is contained in:
Tianyu Yu
2024-07-31 06:59:04 +08:00
committed by GitHub
9 changed files with 302 additions and 0 deletions


@@ -70,6 +70,15 @@ Join our <a href="docs/wechat.md" target="_blank"> 💬 WeChat</a>
- [🌟 Star History](#-star-history)
- [Citation](#citation)
## MiniCPM-Llama3-V 2.5 Common Module Navigation <!-- omit in toc -->
Click the links in the table below to quickly access the commonly used resources for MiniCPM-Llama3-V 2.5.
| Functional Categories | | | | | | | | |
|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
| Finetune | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
| Edge Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
| Quantize | [Bnb](./quantize/bnb_quantize.py) | | | | | | | |
## MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:


@@ -75,6 +75,16 @@
- [🌟 Star History](#-star-history)
- [Citation](#引用)
## MiniCPM-Llama3-V 2.5 Quick Navigation <!-- omit in toc -->
Click the links in the table below to quickly access the commonly used resources for MiniCPM-Llama3-V 2.5.
| Functional Categories | | | | | | | | |
|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
| Finetune | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
| Android Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
| Quantize | [Bnb](./quantize/bnb_quantize.py) | | | | | | | |
## MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. Built on SigLip-400M and Llama3-8B-Instruct with 8B parameters in total, it achieves a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

Binary files added (not shown): four new screenshots under assets/xinferenc_demo_image/ (120 KiB, 88 KiB, 8.7 KiB, and 29 KiB).

docs/swift_train_and_infer.md Normal file

@@ -0,0 +1,135 @@
## SWIFT Installation
You can quickly install SWIFT with the following commands:
``` bash
git clone https://github.com/modelscope/swift.git
cd swift
pip install -r requirements.txt
pip install -e '.[llm]'
```
## SWIFT Inference
Inference using SWIFT can be carried out in two ways: through a command line interface and via Python code.
### Quick start
Here are the steps to launch SWIFT inference from the command line:
1. Running the following command will download the MiniCPM-Llama3-V-2_5 model and start inference:
``` shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
```
2. You can also pass the following additional arguments to control inference:
```
model_id_or_path # Can be the model ID from Hugging Face or the local path to the model
infer_backend ['AUTO', 'vllm', 'pt'] # Backend for inference, default is auto
dtype ['bf16', 'fp16', 'fp32', 'AUTO'] # Computational precision
max_length # Maximum length
max_new_tokens: int = 2048 # Maximum number of tokens to generate
do_sample: bool = True # Whether to sample during generation
temperature: float = 0.3 # Temperature coefficient during generation
top_k: int = 20
top_p: float = 0.7
repetition_penalty: float = 1. # Penalty for repetition
num_beams: int = 1 # Number of beams for beam search
stop_words: List[str] = None # List of stop words
quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm'] # Quantization method for the model
quantization_bit [0, 1, 2, 3, 4, 8] # Default is 0, which means no quantization is used
```
3. Example:
``` shell
CUDA_VISIBLE_DEVICES=0,1 swift infer \
--model_type minicpm-v-v2_5-chat \
--model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
--dtype bf16
```
### Python Inference with SWIFT
The following example uses Python code to run inference on MiniCPM-Llama3-V-2_5 through SWIFT.
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' # Select which GPUs to use
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType,
get_default_template_type, inference_stream
) # Import necessary modules
from swift.utils import seed_everything # Set random seed
import torch
model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type) # Obtain the template type, primarily used for constructing special tokens and image processing workflow
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
                       model_kwargs={'device_map': 'auto'}) # Load the model: set the model type, model path, device allocation, and computation precision
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer) # Construct the template based on the template type
seed_everything(42)
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png'] # Image URL
query = '距离各城市多远?'  # "How far is it to each city?"
response, history = inference(model, template, query, images=images) # Obtain results through inference
print(f'query: {query}')
print(f'response: {response}')
# Streaming output
query = '距离最远的城市是哪?'  # "Which city is the farthest away?"
gen = inference_stream(model, template, query, history, images=images) # Call the streaming output interface
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
delta = response[print_idx:]
print(delta, end='', flush=True)
print_idx = len(response)
print()
print(f'history: {history}')
```
## SWIFT Training
SWIFT supports training on a local dataset. The training steps are as follows:
1. Prepare the training data in the following JSON Lines format (a minimal Python sketch for writing such a file appears after this list):
```jsonl
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "images": ["local_image_path"]}
{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "history": [], "images": ["image_path"]}
{"query": "Is bamboo tasty?", "response": "It seems pretty tasty judging by the panda's expression.", "history": [["What's in this picture?", "There's a giant panda in this picture."], ["What is the panda doing?", "Eating bamboo."]], "images": ["image_url"]}
```
2. LoRA Tuning:
By default, the LoRA target modules are the k and v projection weights of the LLM. Pay attention to `eval_steps`: evaluation can hit an out-of-memory error in SWIFT, so set `eval_steps` to a very large value (e.g., 200000) to effectively skip evaluation.
```shell
# Experimental environment: A100
# 32GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
--eval_steps 200000
```
3. Fine-tuning all linear modules:
When `lora_target_modules` is set to `ALL`, LoRA is applied to every linear module in the model rather than only the default k and v projections.
```shell
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
--lora_target_modules ALL \
--eval_steps 200000
```
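For reference, the following is a minimal sketch of writing a `train.jsonl` file in the format shown in step 1. The file name, image paths, and texts are placeholders rather than anything SWIFT requires; replace them with your own data.
```python
import json

# Placeholder samples in the multimodal format shown above;
# "images" entries may be local paths or URLs.
samples = [
    {
        "query": "What does this picture describe?",
        "response": "This picture has a giant panda.",
        "images": ["/path/to/panda.jpg"],
    },
    {
        "query": "Is bamboo tasty?",
        "response": "It seems pretty tasty judging by the panda's expression.",
        "history": [
            ["What's in this picture?", "There's a giant panda in this picture."],
            ["What is the panda doing?", "Eating bamboo."],
        ],
        "images": ["/path/to/panda.jpg"],
    },
]

# Write one JSON object per line (JSON Lines).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```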
## LoRA Merge and Inference
The LoRA weights can be merged into the base model and then loaded for inference.
1. To load the LoRA weights for inference, run:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir /your/lora/save/checkpoint
```
2. Merge the LoRA weights into the base model:
The following command loads the LoRA weights, merges them into the base model, saves the merged model to the LoRA checkpoint path, and loads the merged model for inference (a Python sketch for loading the merged model follows).
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir your/lora/save/checkpoint \
--merge_lora true
```
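If you prefer to run the merged model from Python, the SWIFT API shown in the inference section above can point at the merged checkpoint directory instead of the original model path. The sketch below reuses that API; the `-merged` directory suffix and the checkpoint path are assumptions about where SWIFT saves the merged weights, so use the path actually reported by `swift infer --merge_lora true`.
```python
import torch
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type
)

model_type = ModelType.minicpm_v_v2_5_chat
template_type = get_default_template_type(model_type)

# Assumption: the merged weights live in "<your LoRA checkpoint>-merged";
# replace this with the directory printed by the merge step.
merged_dir = '/your/lora/save/checkpoint-merged'

model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_id_or_path=merged_dir,
                                       model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
response, history = inference(model, template, 'What is in this picture?', images=images)
print(response)
```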

docs/xinference_infer.md Normal file

@@ -0,0 +1,67 @@
## Xinference Infer
Xinference is a unified inference platform that provides a common interface over different inference engines. It supports LLMs, text generation, image generation, and more, and it is not much heavier to set up than SWIFT.
### Xinference Installation
Xinference can be installed with a single pip command:
```shell
pip install "xinference[all]"
```
### Quick start
On the first launch, Xinference downloads the model before inference can start. The steps are as follows:
1. Start Xinference in the terminal:
```shell
xinference
```
2. Open the web UI in your browser.
3. Search for "MiniCPM-Llama3-V-2_5" in the search box.
![alt text](../assets/xinferenc_demo_image/xinference_search_box.png)
4. Find and click the MiniCPM-Llama3-V-2_5 button.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size : 8
Quantization : none
N-GPU : auto
Replica : 1
```
6. After you click the launch button for the first time, Xinference downloads the model from Hugging Face. Once it is ready, click the web UI button.
![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
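
Besides the web UI, a launched model can also be called programmatically: Xinference exposes an OpenAI-compatible REST API. The following is a minimal sketch under these assumptions: the server listens on the default `http://localhost:9997`, the model was launched with the UID `MiniCPM-Llama3-V-2_5`, the `openai` Python package is installed, and the image URL is a placeholder; adjust all of these to your deployment.
```python
from openai import OpenAI

# Point the OpenAI client at the local Xinference server (the API key is not checked).
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",  # the model UID shown in the Xinference UI
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/airplane.jpeg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```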
### Local MiniCPM-Llama3-V-2_5 Launch
If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can proceed with Xinference inference following these steps:
1. Start Xinference
```shell
xinference
```
2. Open the web UI in your browser.
3. Register a new model: the settings highlighted in red are fixed and cannot be changed, while the others can be customized to your needs. Finish by clicking the 'Register Model' button.
![alt text](../assets/xinferenc_demo_image/xinference_register_model1.png)
![alt text](../assets/xinferenc_demo_image/xinference_register_model2.png)
4. After completing the model registration, proceed to 'Custom Models' and locate the model you just registered.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size : 8
Quantization : none
N-GPU : auto
Replica : 1
```
6. After you click the launch button, Xinference loads the registered local model. Once it is ready, click the chat button.
![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
### FAQ
1. Why doesn't step 6 open the web UI?
Your firewall or macOS security settings may be blocking the page from opening.

quantize/bnb_quantize.py Normal file

@@ -0,0 +1,81 @@
"""
This script uses bitsandbytes to quantize the MiniCPM-Llama3-V-2_5 model.
The quantized model can still be fine-tuned with the MiniCPM-Llama3-V-2_5 fine-tuning code, or used as is.
You only need to set model_path and save_path, then run:
    cd MiniCPM-V
    python quantize/bnb_quantize.py
You will get the quantized model in save_path, along with the inference time and GPU memory usage of the quantized model.
"""
import os
import time

import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
from PIL import Image
import GPUtil

assert torch.cuda.is_available(), "CUDA is not available, but this script requires a GPU."
device = 'cuda' # Select GPU to use
model_path = '/root/ld/ld_model_pretrained/MiniCPM-Llama3-V-2_5' # Model download path
save_path = '/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5_int4' # Quantized model save path
image_path = './assets/airplane.jpeg'
# Create a configuration object to specify quantization parameters
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # Whether to perform 4-bit quantization
load_in_8bit=False, # Whether to perform 8-bit quantization
bnb_4bit_compute_dtype=torch.float16, # Computation precision setting
bnb_4bit_quant_storage=torch.uint8, # Storage format for quantized weights
bnb_4bit_quant_type="nf4", # Quantization format, here using normally distributed int4
bnb_4bit_use_double_quant=True, # Whether to use double quantization, i.e., quantizing zeropoint and scaling parameters
llm_int8_enable_fp32_cpu_offload=False, # Whether LLM uses int8, with fp32 parameters stored on the CPU
llm_int8_has_fp16_weight=False, # Whether mixed precision is enabled
llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"], # Modules not to be quantized
llm_int8_threshold=6.0 # Outlier value in the llm.int8() algorithm, distinguishing whether to perform quantization based on this value
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_path,
device_map=device, # Allocate model to device
quantization_config=quantization_config,
trust_remote_code=True
)
gpu_usage = GPUtil.getGPUs()[0].memoryUsed
start = time.time()
response = model.chat(
image=Image.open(image_path).convert("RGB"),
msgs=[
{
"role": "user",
"content": "What is in this picture?"
}
],
tokenizer=tokenizer
)  # Model inference
print('Output after quantization:', response)
print('Inference time after quantization:', time.time() - start)
print(f"GPU memory usage after quantization: {round(gpu_usage / 1024, 2)} GB")
"""
Expected output:
Output after quantization: This picture contains specific parts of an airplane, including wings, engines, and tail sections. These components are key parts of large commercial aircraft.
The wings support lift during flight, while the engines provide thrust to move the plane forward. The tail section is typically used for stabilizing flight and plays a role in airline branding.
The design and color of the airplane indicate that it belongs to Air China, likely a passenger aircraft due to its large size and twin-engine configuration.
There are no markings or insignia on the airplane indicating the specific model or registration number; such information may require additional context or a clearer perspective to discern.
Inference time after quantization: 8.583992719650269 seconds
GPU memory usage after quantization: 6.41 GB
"""
# Save the model and tokenizer
os.makedirs(save_path, exist_ok=True)
model.save_pretrained(save_path, safe_serialization=True)
tokenizer.save_pretrained(save_path)
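
# Optional sanity check (a minimal sketch): reload the quantized model from save_path.
# Assumption: your transformers/bitsandbytes versions support loading serialized
# 4-bit weights; the quantization settings are read back from the saved config.
reload_check = False  # set to True to verify that the saved model loads correctly
if reload_check:
    reloaded_tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
    reloaded_model = AutoModel.from_pretrained(
        save_path,
        device_map=device,
        trust_remote_code=True
    )
    print("Reloaded quantized model from", save_path)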