From cd0972c7a1f66bcae99e458801538161122779c6 Mon Sep 17 00:00:00 2001 From: Zhangchi Feng <64362896+BUAADreamer@users.noreply.github.com> Date: Thu, 16 Jan 2025 09:50:18 +0800 Subject: [PATCH] Best Practice with LLaMA-Factory (#711) * add llamafactory examples * tiny fix * update doc about inference --- README.md | 6 +- README_zh.md | 6 +- docs/llamafactory_train_and_infer.md | 382 +++++++++++++++++++++++++++ 3 files changed, 388 insertions(+), 6 deletions(-) create mode 100644 docs/llamafactory_train_and_infer.md diff --git a/README.md b/README.md index 282f427..9c68497 100644 --- a/README.md +++ b/README.md @@ -131,8 +131,8 @@ Advancing popular visual capabilites from MiniCPM-V series, MiniCPM-o 2.6 can pr In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad. - 💫 **Easy Usage.** -MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/). +MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) quick [local WebUI demo](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/). **Model Architecture.** @@ -2488,7 +2488,7 @@ We support simple fine-tuning with Hugging Face for MiniCPM-o 2.6, MiniCPM-V 2.6 We support fine-tuning MiniCPM-o-2.6 and MiniCPM-V 2.6 with the LLaMA-Factory framework. LLaMA-Factory provides a solution for flexibly customizing the fine-tuning (Lora/Full/Qlora) of 200+ LLMs without the need for coding through the built-in web UI LLaMABoard. It supports various training methods like sft/ppo/dpo/kto and advanced algorithms like Galore/BAdam/LLaMA-Pro/Pissa/LongLoRA. -Best Practices: [MiniCPM-V-2.6 | MiniCPM-o-2.6](./docs/llamafactory_train.md). +Best Practices: [MiniCPM-o-2.6 | MiniCPM-V-2.6](./docs/llamafactory_train_and_infer.md). 
### With the SWIFT Framework @@ -2574,4 +2574,4 @@ If you find our model/code/paper helpful, please consider citing our papers 📝 journal={arXiv preprint arXiv:2408.01800}, year={2024} } -``` +``` \ No newline at end of file diff --git a/README_zh.md b/README_zh.md index 4609706..d245f12 100644 --- a/README_zh.md +++ b/README_zh.md @@ -121,8 +121,8 @@ MiniCPM-o 2.6 进一步优化了 MiniCPM-V 2.6 的众多视觉理解能力,其 - 💫 **易于使用。** -MiniCPM-o 2.6 可以通过多种方式轻松使用:(1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) 支持在本地设备上进行高效的 CPU 推理,(2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) 和 [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) 格式的量化模型,有 16 种尺寸,(3) [vLLM](#基于-llamacppollamavllm-的高效推理) 支持高吞吐量和内存高效的推理,(4) 通过[LLaMA-Factory](./docs/llamafactory_train.md)框架针对新领域和任务进行微调,(5) 使用 [Gradio](#本地-webui-demo-) 快速设置本地 WebUI 演示,(6) 部署于服务器的在线 [demo](https://minicpm-omni-webdemo-us.modelbest.cn/)。 +MiniCPM-o 2.6 可以通过多种方式轻松使用:(1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) 支持在本地设备上进行高效的 CPU 推理,(2) [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) 和 [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) 格式的量化模型,有 16 种尺寸,(3) [vLLM](#基于-llamacppollamavllm-的高效推理) 支持高吞吐量和内存高效的推理,(4) 通过[LLaMA-Factory](./docs/llamafactory_train_and_infer.md)框架针对新领域和任务进行微调,(5) 使用 [Gradio](#本地-webui-demo-) 快速设置本地 WebUI 演示,(6) 部署于服务器的在线 [demo](https://minicpm-omni-webdemo-us.modelbest.cn/)。 **模型架构。** @@ -2498,7 +2498,7 @@ ollama 用法请参考[我们的fork ollama](https://github.com/OpenBMB/ollama/b 我们支持使用 LLaMA-Factory 微调 MiniCPM-o-2.6 和 MiniCPM-V 2.6。LLaMA-Factory 提供了一种灵活定制 200 多个大型语言模型(LLM)微调(Lora/Full/Qlora)解决方案,无需编写代码,通过内置的 Web 用户界面 LLaMABoard 即可实现训练/推理/评估。它支持多种训练方法,如 sft/ppo/dpo/kto,并且还支持如 Galore/BAdam/LLaMA-Pro/Pissa/LongLoRA 等高级算法。 -最佳实践: [MiniCPM-V-2.6 | MiniCPM-o-2.6](https://github.com/openbmb/MiniCPM-V/blob/main/docs/llamafactory_train.md). +最佳实践: [MiniCPM-o-2.6 | MiniCPM-V-2.6](./docs/llamafactory_train_and_infer.md). ### 使用 SWIFT 框架 @@ -2586,4 +2586,4 @@ ollama 用法请参考[我们的fork ollama](https://github.com/OpenBMB/ollama/b journal={arXiv preprint arXiv:2408.01800}, year={2024} } -``` +``` \ No newline at end of file diff --git a/docs/llamafactory_train_and_infer.md b/docs/llamafactory_train_and_infer.md new file mode 100644 index 0000000..108d1e9 --- /dev/null +++ b/docs/llamafactory_train_and_infer.md @@ -0,0 +1,382 @@ +# Best Practice with LLaMA-Factory + +## Contents + +- [Support Models](#Support-Models) +- [LLaMA-Factory Installation](#LLaMA-Factory-Installation) +- [Dataset Prepare](#Dataset-Prepare) +- [Lora Fine-Tuning](#Lora-Fine-Tuning) +- [Full Parameters Fine-Tuning](#Full-Parameters-Fine-Tuning) +- [Inference](#Inference) + +## Support Models +* [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6) +* [openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6) + +## LLaMA-Factory Installation + +You can install LLaMA-Factory using commands below. + +``` +git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git +cd LLaMA-Factory +pip install -e ".[torch,metrics,deepspeed,minicpm_v]" +mkdir configs # let's put all yaml files here +``` + +## Dataset Prepare + +Refer to [data/dataset_info.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) to add your customised dataset. Let's use the two existing demo datasets `mllm_demo` and `mllm_video_demo` as examples. 
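A customised dataset has to be registered in `dataset_info.json` before it can be referenced from a training config. For a ShareGPT-style multimodal file like the two demos, the entry looks roughly like the sketch below (the dataset name `my_mllm_data` and its file name are placeholders; the column and tag names mirror the built-in `mllm_demo` entry):

```json
"my_mllm_data": {
  "file_name": "my_mllm_data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant"
  }
}
```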
### Image Dataset

Refer to the image SFT demo data: [data/mllm_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_demo.json)
+ + data/mllm_demo.json + + +```json +[ + { + "messages": [ + { + "content": "Who are they?", + "role": "user" + }, + { + "content": "They're Kane and Gretzka from Bayern Munich.", + "role": "assistant" + }, + { + "content": "What are they doing?", + "role": "user" + }, + { + "content": "They are celebrating on the soccer field.", + "role": "assistant" + } + ], + "images": [ + "mllm_demo_data/1.jpg" + ] + }, + { + "messages": [ + { + "content": "Who is he?", + "role": "user" + }, + { + "content": "He's Thomas Muller from Bayern Munich.", + "role": "assistant" + }, + { + "content": "Why is he on the ground?", + "role": "user" + }, + { + "content": "Because he's sliding on his knees to celebrate.", + "role": "assistant" + } + ], + "images": [ + "mllm_demo_data/2.jpg" + ] + }, + { + "messages": [ + { + "content": "Please describe this image", + "role": "user" + }, + { + "content": "Chinese astronaut Gui Haichao is giving a speech.", + "role": "assistant" + }, + { + "content": "What has he accomplished?", + "role": "user" + }, + { + "content": "He was appointed to be a payload specialist on Shenzhou 16 mission in June 2022, thus becoming the first Chinese civilian of Group 3 in space on 30 May 2023. He is responsible for the on-orbit operation of space science experimental payloads.", + "role": "assistant" + } + ], + "images": [ + "mllm_demo_data/3.jpg" + ] + } +] +``` + +
### Video Dataset

Refer to the video SFT demo data: [data/mllm_video_demo.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/mllm_video_demo.json)
The demo file `data/mllm_video_demo.json` uses the same message schema as `mllm_demo.json`, except that each sample carries a `videos` list instead of `images` and the user content contains a `<video>` placeholder for every referenced video.
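A minimal sample in this format (illustrative only, not the verbatim demo content) could look like:

```json
[
  {
    "messages": [
      {
        "content": "<video>What is happening in this video?",
        "role": "user"
      },
      {
        "content": "A short description of the video content.",
        "role": "assistant"
      }
    ],
    "videos": [
      "mllm_demo_data/1.mp4"
    ]
  }
]
```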
## LoRA Fine-Tuning

LoRA SFT can be launched with a single command:

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train configs/minicpmo_2_6_lora_sft.yaml
```
+ + configs/minicpmo_2_6_lora_sft.yaml + + +```yaml +### model +model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6 +trust_remote_code: true + +### method +stage: sft +do_train: true +finetuning_type: lora +lora_target: q_proj,v_proj + +### dataset +dataset: mllm_demo # mllm_demo mllm_video_demo +template: minicpm_v +cutoff_len: 3072 +max_samples: 1000 +overwrite_cache: true +preprocessing_num_workers: 16 + +### output +output_dir: saves/minicpmo_2_6/lora/sft +logging_steps: 1 +save_steps: 100 +plot_loss: true +overwrite_output_dir: true +save_total_limit: 10 + +### train +per_device_train_batch_size: 2 +gradient_accumulation_steps: 1 +learning_rate: 1.0e-5 +num_train_epochs: 20.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +bf16: true +ddp_timeout: 180000000 +save_only_model: true + +### eval +do_eval: false +``` + +
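If you prefer a no-code workflow, the same LoRA fine-tuning can also be configured and launched from the LLaMA Board web UI:

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webui
```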
### LoRA Model Export

The trained LoRA adapter can be merged and exported with one command:

```shell
llamafactory-cli export configs/minicpmo_2_6_lora_export.yaml
```
configs/minicpmo_2_6_lora_export.yaml

```yaml
### model
model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6
adapter_name_or_path: saves/minicpmo_2_6/lora/sft
template: minicpm_v
finetuning_type: lora
trust_remote_code: true

### export
export_dir: models/minicpmo_2_6_lora_sft
export_size: 2
export_device: cpu
export_legacy_format: false
```
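Merging is optional for quick experiments: LLaMA-Factory can also load the un-merged adapter on top of the base model at inference time. A sketch of such a config (the file name `configs/minicpmo_2_6_lora_infer.yaml` is only a suggestion):

```yaml
model_name_or_path: openbmb/MiniCPM-o-2_6
adapter_name_or_path: saves/minicpmo_2_6/lora/sft
template: minicpm_v
infer_backend: huggingface
trust_remote_code: true
```

After exporting, the merged model in `models/minicpmo_2_6_lora_sft` can be used directly wherever a `model_name_or_path` is expected, e.g. in the inference configs below.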
## Full Parameters Fine-Tuning

Full-parameter SFT can be launched with a single command:

```shell
llamafactory-cli train configs/minicpmo_2_6_full_sft.yaml
```
+ + configs/minicpmo_2_6_full_sft.yaml + + +```yaml +### model +model_name_or_path: openbmb/MiniCPM-o-2_6 # MiniCPM-o-2_6 MiniCPM-V-2_6 +trust_remote_code: true +freeze_vision_tower: true +print_param_status: true +flash_attn: fa2 + +### method +stage: sft +do_train: true +finetuning_type: full +deepspeed: configs/deepspeed/ds_z2_config.json + +### dataset +dataset: mllm_demo # mllm_demo mllm_video_demo +template: minicpm_v +cutoff_len: 3072 +max_samples: 1000 +overwrite_cache: true +preprocessing_num_workers: 16 + +### output +output_dir: saves/minicpmo_2_6/full/sft +logging_steps: 1 +save_steps: 100 +plot_loss: true +overwrite_output_dir: true +save_total_limit: 10 + +### train +per_device_train_batch_size: 2 +gradient_accumulation_steps: 1 +learning_rate: 1.0e-5 +num_train_epochs: 20.0 +lr_scheduler_type: cosine +warmup_ratio: 0.1 +bf16: true +ddp_timeout: 180000000 +save_only_model: true + +### eval +do_eval: false +``` +
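The `deepspeed` entry above points to a ZeRO-2 config that the commands so far do not create. LLaMA-Factory ships reference DeepSpeed configs under `examples/deepspeed/`; a minimal ZeRO-2 sketch placed at `configs/deepspeed/ds_z2_config.json` could look like:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}
```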
## Inference

### Web UI Chat

Refer to the [LLaMA-Factory documentation](https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#inferring-lora-fine-tuned-models) for more inference options.

For example, the web chat demo can be launched with one command:

```shell
CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat configs/minicpmo_2_6_infer.yaml
```
configs/minicpmo_2_6_infer.yaml

```yaml
model_name_or_path: saves/minicpmo_2_6/full/sft
template: minicpm_v
infer_backend: huggingface
trust_remote_code: true
```
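The same config also works with LLaMA-Factory's other inference entry points, for example a terminal chat or an OpenAI-compatible API server:

```shell
# interactive chat in the terminal
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat configs/minicpmo_2_6_infer.yaml

# OpenAI-compatible API server
CUDA_VISIBLE_DEVICES=0 llamafactory-cli api configs/minicpmo_2_6_infer.yaml
```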
### Official Code

You can also run inference with the official MiniCPM-o code:
```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# load the fine-tuned checkpoint
model_id = "saves/minicpmo_2_6/full/sft"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    attn_implementation='sdpa',  # sdpa or flash_attention_2, no eager
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# build a single-turn multimodal message: the image comes first, then the question
image = Image.open('data/mllm_demo_data/1.jpg').convert('RGB')
question = 'Who are they?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)
```
\ No newline at end of file