diff --git a/README.md b/README.md
index 6469b63..4bb9ec8 100644
--- a/README.md
+++ b/README.md
@@ -70,6 +70,15 @@ Join our 💬 WeChat
 - [🌟 Star History](#-star-history)
 - [Citation](#citation)
+## MiniCPM-Llama3-V 2.5 Common Module Navigation
+You can click the table below to jump directly to the commonly used MiniCPM-Llama3-V 2.5 content you need.
+
+| Functional Categories | | | | | | | | |
+|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
+| Inference | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
+| Finetune | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
+| Edge Deployment | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
+| Quantize | [BnB](./quantize/bnb_quantize.py) | | | | | | | |
+
 ## MiniCPM-Llama3-V 2.5
 **MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
diff --git a/README_zh.md b/README_zh.md
index 8f5ca2c..34eee2f 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -75,6 +75,16 @@
 - [🌟 Star History](#-star-history)
 - [引用](#引用)
+## MiniCPM-Llama3-V 2.5 快速导航
+你可以点击以下表格,快速访问 MiniCPM-Llama3-V 2.5 中你所需要的常用内容。
+
+| 功能分类 | | | | | | | | |
+|:--------:|:------:|:--------------:|:--------:|:-------:|:-----------:|:-----------:|:--------:|:-----------:|
+| 推理 | [Transformers](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md) | [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) | [SWIFT](./docs/swift_train_and_infer.md) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | [Xinference](./docs/xinference_infer.md) | [Gradio](./web_demo_2.5.py) | [Streamlit](./web_demo_streamlit-2_5.py) | [vLLM](#vllm) |
+| 微调 | [Full-parameter](./finetune/readme.md) | [LoRA](./finetune/readme.md) | [SWIFT](./docs/swift_train_and_infer.md) | | | | | |
+| 安卓部署 | [apk](http://minicpm.modelbest.cn/android/modelbest-release-20240528_182155.apk) | [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) | | | | | | |
+| 量化 | [BnB](./quantize/bnb_quantize.py) | | | | | | | |
+
 ## MiniCPM-Llama3-V 2.5
 **MiniCPM-Llama3-V 2.5** 是 MiniCPM-V 系列的最新版本模型,基于 SigLip-400M 和 Llama3-8B-Instruct 构建,共 8B 参数量,相较于 MiniCPM-V 2.0 性能取得较大幅度提升。MiniCPM-Llama3-V 2.5 值得关注的特点包括:
diff --git a/assets/xinferenc_demo_image/xinference_register_model1.png b/assets/xinferenc_demo_image/xinference_register_model1.png
new file mode 100644
index 0000000..7f8b883
Binary files /dev/null and b/assets/xinferenc_demo_image/xinference_register_model1.png differ
diff --git a/assets/xinferenc_demo_image/xinference_register_model2.png b/assets/xinferenc_demo_image/xinference_register_model2.png
new file mode 100644
index 0000000..ad847d7
Binary files /dev/null and b/assets/xinferenc_demo_image/xinference_register_model2.png differ
diff --git a/assets/xinferenc_demo_image/xinference_search_box.png b/assets/xinferenc_demo_image/xinference_search_box.png
new file mode 100644
index 0000000..a2a44c1
Binary files /dev/null and b/assets/xinferenc_demo_image/xinference_search_box.png differ
diff --git a/assets/xinferenc_demo_image/xinference_webui_button.png b/assets/xinferenc_demo_image/xinference_webui_button.png
new file mode 100644
index 0000000..f01b688
Binary files /dev/null and b/assets/xinferenc_demo_image/xinference_webui_button.png differ
diff --git a/docs/swift_train_and_infer.md b/docs/swift_train_and_infer.md
new file mode 100644
index 0000000..693efe6
--- /dev/null
+++ b/docs/swift_train_and_infer.md
@@ -0,0 +1,135 @@
+## SWIFT Install
+You can install SWIFT quickly with the following commands:
+
+``` bash
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -r requirements.txt
+pip install -e '.[llm]'
+```
+
+## SWIFT Infer
+Inference with SWIFT can be run in two ways: from the command line or from Python code.
+
+### Quick start
+Here are the steps to launch SWIFT inference from the command line:
+
+1. The following command downloads the MiniCPM-Llama3-V-2_5 model and starts inference:
+``` shell
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-v2_5-chat
+```
+
+2. The arguments below can be added to further control inference:
+```
+model_id_or_path  # Hugging Face model ID or local path to the model
+infer_backend ['AUTO', 'vllm', 'pt']  # Backend for inference, default is AUTO
+dtype ['bf16', 'fp16', 'fp32', 'AUTO']  # Computation precision
+max_length  # Maximum sequence length
+max_new_tokens: int = 2048  # Maximum number of tokens to generate
+do_sample: bool = True  # Whether to sample during generation
+temperature: float = 0.3  # Sampling temperature during generation
+top_k: int = 20  # Top-k sampling
+top_p: float = 0.7  # Top-p (nucleus) sampling
+repetition_penalty: float = 1.  # Penalty for repetition
+num_beams: int = 1  # Number of beams for beam search
+stop_words: List[str] = None  # List of stop words
+quant_method ['bnb', 'hqq', 'eetq', 'awq', 'gptq', 'aqlm']  # Quantization method for the model
+quantization_bit [0, 1, 2, 3, 4, 8]  # Default is 0, which means no quantization is used
+```
+3. Example:
+``` shell
+CUDA_VISIBLE_DEVICES=0,1 swift infer \
+--model_type minicpm-v-v2_5-chat \
+--model_id_or_path /root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5 \
+--dtype bf16
+```
+
+### Python inference with SWIFT
+The following Python code runs inference with the MiniCPM-Llama3-V-2_5 model through SWIFT.
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # Select the GPUs to use
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)  # Import the necessary SWIFT modules
+
+from swift.utils import seed_everything  # Utility for setting the random seed
+import torch
+
+model_type = ModelType.minicpm_v_v2_5_chat
+template_type = get_default_template_type(model_type)  # Obtain the template type, used to construct special tokens and the image-processing workflow
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
+                                        model_id_or_path='/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5',
+                                        model_kwargs={'device_map': 'auto'})  # Load the model: model type, local path, device allocation and computation precision
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)  # Construct the template from the template type
+seed_everything(42)
+
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']  # Image URL
+query = '距离各城市多远?'  # "How far away is each city?"
+response, history = inference(model, template, query, images=images)  # Run inference
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming output
+query = '距离最远的城市是哪?'  # "Which city is the farthest away?"
+gen = inference_stream(model, template, query, history, images=images)  # Call the streaming inference interface
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
+```
+
+## SWIFT Train
+SWIFT supports training on a local dataset. The training steps are as follows:
+1. Prepare the training data in jsonl format as shown below (a script that writes such a file is sketched after this list):
+```jsonl
+{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "images": ["local_image_path"]}
+{"query": "What does this picture describe?", "response": "This picture has a giant panda.", "history": [], "images": ["image_path"]}
+{"query": "Is bamboo tasty?", "response": "It seems pretty tasty judging by the panda's expression.", "history": [["What's in this picture?", "There's a giant panda in this picture."], ["What is the panda doing?", "Eating bamboo."]], "images": ["image_url"]}
+```
+2. LoRA tuning:
+
+By default, LoRA is applied to the k and v projection weights of the LLM. Pay attention to eval_steps: evaluation can run out of GPU memory, so set eval_steps to a very large value (e.g. 200000) to effectively skip evaluation.
+```shell
+# Experimental environment: A100
+# 32GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+--model_type minicpm-v-v2_5-chat \
+--dataset coco-en-2-mini \
+--eval_steps 200000
+```
+3. Full-parameter fine-tuning:
+
+When lora_target_modules is set to ALL, all model parameters are fine-tuned.
+```shell
+CUDA_VISIBLE_DEVICES=0,1 swift sft \
+--model_type minicpm-v-v2_5-chat \
+--dataset coco-en-2-mini \
+--lora_target_modules ALL \
+--eval_steps 200000
+```
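+
+The snippet below is a minimal, self-contained sketch (not part of SWIFT itself) showing how to write a local `train.jsonl` file in the format used in step 1 above; the image paths are placeholders that you should replace with your own files or URLs.
+
+```python
+import json
+
+# Example samples in the jsonl schema shown in step 1:
+# "query"/"response" for the current turn, optional "history", and "images".
+samples = [
+    {"query": "What does this picture describe?",
+     "response": "This picture has a giant panda.",
+     "images": ["local_image_path"]},
+    {"query": "Is bamboo tasty?",
+     "response": "It seems pretty tasty judging by the panda's expression.",
+     "history": [["What's in this picture?", "There's a giant panda in this picture."],
+                 ["What is the panda doing?", "Eating bamboo."]],
+     "images": ["image_url"]},
+]
+
+# Write one JSON object per line, keeping non-ASCII text readable.
+with open("train.jsonl", "w", encoding="utf-8") as f:
+    for sample in samples:
+        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
+```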
+
+## LoRA Merge and Infer
+The LoRA weights can be merged into the base model, and the merged model can then be loaded for inference.
+
+1. To load the LoRA weights for inference, run the following:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+--ckpt_dir /your/lora/save/checkpoint
+```
+2. Merge the LoRA weights into the base model:
+
+The following command loads the LoRA weights, merges them into the base model, saves the merged model under the LoRA checkpoint path, and then loads the merged model for inference.
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+--ckpt_dir your/lora/save/checkpoint \
+--merge_lora true
+```
\ No newline at end of file
diff --git a/docs/xinference_infer.md b/docs/xinference_infer.md
new file mode 100644
index 0000000..ab06a2b
--- /dev/null
+++ b/docs/xinference_infer.md
@@ -0,0 +1,67 @@
+## Xinference Infer
+Xinference is a unified inference platform that exposes a single interface over different inference engines. It supports LLMs, text generation, image generation, and more, and it is not much heavier to set up than SWIFT.
+
+### Xinference Install
+Xinference can be installed with a single pip command:
+```shell
+pip install "xinference[all]"
+```
+
+### Quick start
+On the first launch, Xinference downloads the model before inference can start.
+1. Start Xinference in the terminal:
+```shell
+xinference
+```
+2. Open the web UI.
+3. Search for "MiniCPM-Llama3-V-2_5" in the search box.
+
+![alt text](../assets/xinferenc_demo_image/xinference_search_box.png)
+
+4. Find and click the MiniCPM-Llama3-V-2_5 button.
+5. Use the following configuration and launch the model:
+```plaintext
+Model engine : Transformers
+Model format : pytorch
+Model size : 8
+Quantization : none
+N-GPU : auto
+Replica : 1
+```
+6. The first time you click the launch button, Xinference downloads the model from Hugging Face. Then click the WebUI button.
+
+![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
+
+7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
+
+### Local MiniCPM-Llama3-V-2_5 Launch
+If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can run Xinference inference as follows:
+1. Start Xinference:
+```shell
+xinference
+```
+2. Open the web UI.
+3. Register a new model. The settings highlighted in red are fixed and cannot be changed; the others can be customized to your needs. Finish by clicking the 'Register Model' button.
+
+![alt text](../assets/xinferenc_demo_image/xinference_register_model1.png)
+![alt text](../assets/xinferenc_demo_image/xinference_register_model2.png)
+
+4. After registering the model, go to 'Custom Models' and locate the model you just registered.
+5. Use the following configuration and launch the model:
+```plaintext
+Model engine : Transformers
+Model format : pytorch
+Model size : 8
+Quantization : none
+N-GPU : auto
+Replica : 1
+```
+6. After you click the launch button, Xinference loads the registered local model. Then click the chat button.
+![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
+7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
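+
+### Calling the served model from Python
+Once the model is launched, it can also be queried programmatically instead of through the web UI. The snippet below is a minimal sketch, assuming Xinference is serving its OpenAI-compatible API at the default local endpoint `http://127.0.0.1:9997/v1` and that the model was launched with the UID `MiniCPM-Llama3-V-2_5`; both the endpoint and the UID may differ in your deployment, so check the values shown in the web UI.
+
+```python
+import base64
+from openai import OpenAI  # pip install openai
+
+# Assumed endpoint and model UID; adjust them to what your Xinference instance reports.
+client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-needed")
+
+# Encode a local image as a data URL so it can be sent in the chat request.
+with open("your_image.jpg", "rb") as f:
+    image_b64 = base64.b64encode(f.read()).decode()
+
+response = client.chat.completions.create(
+    model="MiniCPM-Llama3-V-2_5",
+    messages=[{
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "What is in this picture?"},
+            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
+        ],
+    }],
+)
+print(response.choices[0].message.content)
+```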
+
+### FAQ
+1. Why doesn't the WebUI open in step 6?
+
+Your firewall or macOS security settings may be blocking the page from opening.
\ No newline at end of file
diff --git a/quantize/bnb_quantize.py b/quantize/bnb_quantize.py
new file mode 100644
index 0000000..7aa7b46
--- /dev/null
+++ b/quantize/bnb_quantize.py
@@ -0,0 +1,81 @@
+"""
+This script uses bitsandbytes to quantize the MiniCPM-Llama3-V-2_5 model.
+The model to be quantized can be the original MiniCPM-Llama3-V-2_5 or a fine-tuned version of it.
+You only need to set model_path and save_path, then run:
+
+cd MiniCPM-V
+python quantize/bnb_quantize.py
+
+You will get the quantized model in save_path, along with the quantized model's inference time and GPU memory usage.
+"""
+
+import torch
+from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
+from PIL import Image
+import time
+import GPUtil
+import os
+
+assert torch.cuda.is_available(), "CUDA is not available, but this code requires a GPU."
+
+device = 'cuda'  # GPU device to use
+model_path = '/root/ld/ld_model_pretrained/MiniCPM-Llama3-V-2_5'  # Model download path
+save_path = '/root/ld/ld_model_pretrain/MiniCPM-Llama3-V-2_5_int4'  # Quantized model save path
+image_path = './assets/airplane.jpeg'
+
+# Create a configuration object to specify quantization parameters
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,  # Whether to perform 4-bit quantization
+    load_in_8bit=False,  # Whether to perform 8-bit quantization
+    bnb_4bit_compute_dtype=torch.float16,  # Computation precision
+    bnb_4bit_quant_storage=torch.uint8,  # Storage format for quantized weights
+    bnb_4bit_quant_type="nf4",  # Quantization format, here 4-bit NormalFloat
+    bnb_4bit_use_double_quant=True,  # Whether to use double quantization, i.e. also quantizing the zero point and scaling parameters
+    llm_int8_enable_fp32_cpu_offload=False,  # Whether to keep fp32 parameters offloaded on the CPU when using int8
+    llm_int8_has_fp16_weight=False,  # Whether mixed precision is enabled
+    llm_int8_skip_modules=["out_proj", "kv_proj", "lm_head"],  # Modules not to be quantized
+    llm_int8_threshold=6.0  # Outlier threshold in the llm.int8() algorithm; values beyond it are kept unquantized
+)
+
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModel.from_pretrained(
+    model_path,
+    device_map=device,  # Allocate the model to the device
+    quantization_config=quantization_config,
+    trust_remote_code=True
+)
+
+gpu_usage = GPUtil.getGPUs()[0].memoryUsed  # GPU memory in use after loading the quantized model
+start = time.time()
+response = model.chat(
+    image=Image.open(image_path).convert("RGB"),
+    msgs=[
+        {
+            "role": "user",
+            "content": "What is in this picture?"
+        }
+    ],
+    tokenizer=tokenizer
+)  # Model inference
+print('Output after quantization:', response)
+print('Inference time after quantization:', time.time() - start)
+print(f"GPU memory usage after quantization: {round(gpu_usage / 1024, 2)}GB")
+
+"""
+Expected output:
+
+    Output after quantization: This picture contains specific parts of an airplane, including wings, engines, and tail sections. These components are key parts of large commercial aircraft.
+    The wings support lift during flight, while the engines provide thrust to move the plane forward. The tail section is typically used for stabilizing flight and plays a role in airline branding.
+    The design and color of the airplane indicate that it belongs to Air China, likely a passenger aircraft due to its large size and twin-engine configuration.
+    There are no markings or insignia on the airplane indicating the specific model or registration number; such information may require additional context or a clearer perspective to discern.
+    Inference time after quantization: 8.583992719650269 seconds
+    GPU memory usage after quantization: 6.41 GB
+"""
+
+# Save the model and tokenizer
+os.makedirs(save_path, exist_ok=True)
+model.save_pretrained(save_path, safe_serialization=True)
+tokenizer.save_pretrained(save_path)
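+
+# The lines below are a minimal optional sketch (not part of the original script)
+# showing how the saved int4 checkpoint could be loaded again for inference.
+# They assume the same save_path, image_path, and device as above; uncomment to try.
+# model_int4 = AutoModel.from_pretrained(save_path, device_map=device, trust_remote_code=True)
+# tokenizer_int4 = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
+# print(model_int4.chat(
+#     image=Image.open(image_path).convert("RGB"),
+#     msgs=[{"role": "user", "content": "What is in this picture?"}],
+#     tokenizer=tokenizer_int4
+# ))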