mirror of https://github.com/FunAudioLLM/CosyVoice.git

update readme
@@ -1,5 +1,5 @@
 FROM verlai/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
-COPY requirements-cosyvoice.txt /myworkspace/requirements.txt
+COPY requirements.txt /myworkspace/requirements.txt
 RUN pip install -r /myworkspace/requirements.txt
 RUN pip install -U nvidia-pytriton
 RUN git clone https://github.com/yuekaizhang/verl.git /myworkspace/verl -b thread && cd /myworkspace/verl && pip install --no-deps -e .
@@ -18,6 +18,7 @@ We recommend using the pre-built Docker image below. Alternatively, you can manu
 ```bash
 docker pull soar97/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
 ```
+If Docker is not available, you can refer to `run.sh` `stage -2` to install the dependencies locally.
 
 ## Data Preparation
 
@@ -43,16 +44,16 @@ data/parquet_tiny/train.parquet
 data/parquet_tiny/test.parquet
 ```
 
-Each sample is automatically wrapped into a cosyvoice2-style prompt so that the LLM learns to output CosyVoice2 speech tokens.
+Each sample is automatically wrapped into a CosyVoice2-style prompt so that the LLM learns to output CosyVoice2 speech tokens.
 
 
 ## Reward Function & ASR Server
 
-To compute rewards we run a lightweight server that:
+To compute rewards, we run a lightweight server that:
 
 1. Converts generated speech tokens back to a 16 kHz waveform with the **CosyVoice2** pretrained U-Net model.
 2. Transcribes the waveform with **SenseVoice** ASR.
-3. Calculates the pinyin-level error rate relative to the ground-truth text and maps it to a score in the range \[0-1\].
+3. Calculates the pinyin-level error rate relative to the ground-truth text and maps it to a score between 0 and 1.
 
 Start the server (stage `1`) in a dedicated terminal or on a separate GPU:
 
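Step 3 above is what turns the ASR transcript into a scalar reward. For intuition, a minimal sketch of such a pinyin-level score is shown below. It is not the repository's `reward_tts.py`: the `pypinyin` dependency, the plain Levenshtein distance, and the linear mapping from error rate to a score in [0, 1] are assumptions used only for illustration.

```python
# Hypothetical sketch of a pinyin-level reward, NOT the repository's reward_tts.py.
# Assumes `pip install pypinyin`.
from pypinyin import lazy_pinyin


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]


def pinyin_reward(ground_truth: str, transcript: str) -> float:
    """Map the pinyin-level error rate to a score between 0 and 1 (assumed linear mapping)."""
    ref = lazy_pinyin(ground_truth)   # e.g. "你好" -> ["ni", "hao"]
    hyp = lazy_pinyin(transcript)
    error_rate = edit_distance(ref, hyp) / max(len(ref), 1)
    return max(0.0, 1.0 - error_rate)


if __name__ == "__main__":
    print(pinyin_reward("今天天气很好", "今天天气真好"))  # 1 substitution out of 6 syllables -> ~0.83
```

In the actual pipeline this comparison happens on the server side, against the SenseVoice transcript of the re-synthesized audio.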
@@ -61,7 +62,7 @@ bash run.sh 1 1
 # Triton server listens on ports 8000/8001/8002
 ```
 
-The custom reward implementation lives in [`reward_tts.py`](./reward_tts.py) and calls the server to obtain the reward score.
+The custom reward implementation is located in [`reward_tts.py`](./reward_tts.py) and calls the server to obtain the reward score.
 
 ## Training
 
@@ -78,10 +79,12 @@ Key CLI arguments passed to `verl.trainer.main_ppo`:
 * `custom_reward_function.path=reward_tts.py` – custom reward function described above.
 
 Adjust `CUDA_VISIBLE_DEVICES`, batch sizes, and other hyperparameters to match your hardware.
+> [!TIP]
+> The lm_head bias is disabled during training to make the model compatible with vLLM and Transformers' Qwen model.
 
 ## Evaluation
 
-After training completes, collect the sharded FSDP weights and export a Hugging Face-style checkpoint (stage `3`):
+After training is complete, collect the sharded FSDP weights and export a Hugging Face-style checkpoint (stage `3`):
 
 ```bash
 bash run.sh 3 3 # merges weights into $llm_path/merged_hf_model
@@ -107,15 +110,16 @@ bash run.sh 5 5
 ```
 
 The script converts the Hugging Face checkpoint back into the format expected by the CosyVoice repository.
+> [!TIP]
+> However, we observed a slight accuracy drop when using the RL-trained model after conversion, compared with the Hugging Face format.
 
 ## Results
 
 | Model | Seed-TTS `test_zh` CER | CosyVoice3 `zero_shot_zh` CER | Comment |
 |-------|------------------------|------------------------------|---------|
 | CosyVoice2 LLM (official) | 1.45% | 4.08% | See the [paper](https://arxiv.org/abs/2412.10117) |
-| CosyVoice2 LLM + GRPO | 1.37 % | **3.36 %** | See the [decoding results](yuekai/official-cosyvoice-llm-grpo-aishell3) |
+| CosyVoice2 LLM + GRPO | 1.37% | **3.36%** | See the [decoding results](yuekai/official-cosyvoice-llm-grpo-aishell3), Hugging Face-format model |
 
 ## Acknowledgement
 
 This work was inspired by the implementation in [ch-tts-llasa-rl-grpo](https://github.com/channel-io/ch-tts-llasa-rl-grpo).
 
@@ -1,3 +1,4 @@
+#!/usr/bin/env python3
 
 # SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
@@ -94,7 +95,8 @@ if __name__ == "__main__":
 with torch.no_grad():
     # set the weight and bias of the new lm_head to 0
     new_lm_head.weight.data.zero_()
-    new_lm_head.bias.data.zero_()
+    # make bias value -inf
+    new_lm_head.bias.data.fill_(-float('inf'))
     new_lm_head.weight[original_tokenizer_vocab_size:original_tokenizer_vocab_size + cosyvoice2_token_size + 3] = llm_decoder.weight
     new_lm_head.bias[original_tokenizer_vocab_size:original_tokenizer_vocab_size + cosyvoice2_token_size + 3] = llm_decoder.bias
 
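The `-inf` fill above means that, once the speech-token rows are copied in, every other row of the enlarged `lm_head` keeps a `-inf` bias, so those tokens receive zero probability at sampling time and generation can only emit the CosyVoice2 speech tokens (plus the few special tokens copied with them). A small self-contained illustration of the effect (not code from the conversion script):

```python
# Minimal illustration: a -inf bias entry drives that token's probability to
# exactly zero after softmax, so the token can never be sampled.
import torch

logits = torch.tensor([2.0, 1.0, 0.5])           # raw scores for three tokens
bias = torch.tensor([-float("inf"), 0.0, 0.0])   # token 0 is masked out
probs = torch.softmax(logits + bias, dim=-1)
print(probs)  # tensor([0.0000, 0.6225, 0.3775]); token 0 has zero probability
```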
@@ -107,8 +109,7 @@ if __name__ == "__main__":
 
 eos_token_ids = [original_tokenizer_vocab_size + cosyvoice2_token_size,
                  original_tokenizer_vocab_size + cosyvoice2_token_size + 1,
-                 original_tokenizer_vocab_size + cosyvoice2_token_size + 2,
-                 original_tokenizer_vocab_size + cosyvoice2_token_size + 3]
+                 original_tokenizer_vocab_size + cosyvoice2_token_size + 2]
 llm.generation_config.eos_token_id = eos_token_ids
 llm.generation_config.temperature = 1.0
 llm.generation_config.top_p = 0.8
@@ -121,6 +122,14 @@ if __name__ == "__main__":
 llm.to(torch.bfloat16)
 llm.save_pretrained(args.save_path)
 
-TEMPLATE = "{%- for message in messages %}{%- if message['role'] == 'user' %}{{- '<|sos|>' + message['content'] + '<|task_id|>' }}{%- elif message['role'] == 'assistant' %}{{- message['content']}}{%- endif %}{%- endfor %}"
+TEMPLATE = (
+    "{%- for message in messages %}"
+    "{%- if message['role'] == 'user' %}"
+    "{{- '<|sos|>' + message['content'] + '<|task_id|>' }}"
+    "{%- elif message['role'] == 'assistant' %}"
+    "{{- message['content']}}"
+    "{%- endif %}"
+    "{%- endfor %}"
+)
 tokenizer.chat_template = TEMPLATE
 tokenizer.save_pretrained(args.save_path)
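As a quick sanity check of what this chat template renders, the snippet below applies it with `apply_chat_template`. It assumes the converted checkpoint from stage `-1` is available at `./transformers_cosyvoice2_llm`; the example text and the speech-token strings are placeholders.

```python
# Illustrative check of the chat template saved above; assumes the converted
# checkpoint from stage -1 exists at ./transformers_cosyvoice2_llm.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./transformers_cosyvoice2_llm")
messages = [
    {"role": "user", "content": "今天天气很好"},               # target text to synthesize
    {"role": "assistant", "content": "<|s_123|><|s_456|>"},   # speech tokens (placeholders)
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # <|sos|>今天天气很好<|task_id|><|s_123|><|s_456|>
```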
@@ -3,7 +3,7 @@
 set -eou pipefail
 
 stage=-1
-stop_stage=5
+stop_stage=4
 
 log() {
     # This function is from espnet
@@ -15,6 +15,22 @@ export PYTHONPATH=/workspace/CosyVoice
 model_scope_model_path=./CosyVoice2-0.5B
 sft_model_path=./transformers_cosyvoice2_llm
 
+if [ $stage -le -2 ] && [ $stop_stage -ge -2 ]; then
+  log "stage -2: install dependencies locally if pre-built docker image is not available"
+  conda create -n cosyvoice2 python=3.10 -y
+  conda activate cosyvoice2
+  # install verl
+  git clone https://github.com/yuekaizhang/verl.git -b thread
+  cd verl
+  USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
+  pip install --no-deps -e .
+  cd -
+  # install requirements
+  pip install -r requirements.txt
+  pip install -U nvidia-pytriton
+  git clone https://github.com/yuekaizhang/PytritonSenseVoice.git && cd PytritonSenseVoice && pip install -e .
+fi
 
 if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
   log "stage -1: download official CosyVoice2-0.5B LLM model and convert to huggingface compatible checkpoint"
   modelscope download --model iic/CosyVoice2-0.5B --local_dir $model_scope_model_path
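If you install locally via stage `-2` instead of using the Docker image, a quick way to confirm that the core Python dependencies resolved is to import them. The import names below are assumed from the pip installs above (for example, `nvidia-pytriton` provides the `pytriton` module):

```python
# Illustrative sanity check after a stage -2 local install.
import importlib

for name in ("torch", "vllm", "verl", "pytriton", "modelscope"):
    module = importlib.import_module(name)
    print(f"{name}: {getattr(module, '__version__', 'imported OK')}")
```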
@@ -24,13 +40,15 @@ if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
 
   # Or, you could use the following command to download the huggingface compatible checkpoint
   # huggingface-cli download --local-dir $sft_model_path yuekai/cosyvoice2_llm
 
+  # Note: we remove the lm_head's bias to make it compatible with the Qwen2.5-0.5B model in Transformers.
 fi
 
 data_dir=data/parquet_aishell3
 if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
   log "stage 0: prepare data into verl format"
   mkdir -p $data_dir
-  wget https://huggingface.co/datasets/SparkAudio/voxbox/resolve/main/metadata/aishell-3.jsonl -O data/aishell-3.jsonl
+  wget -O data/aishell-3.jsonl https://huggingface.co/datasets/SparkAudio/voxbox/resolve/main/metadata/aishell-3.jsonl
   # total 88035 samples
   head -n 80000 data/aishell-3.jsonl > data/train.jsonl
   tail -n 100 data/aishell-3.jsonl > data/test.jsonl
@@ -98,7 +116,8 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
     trainer.val_before_train=False
 fi
 
-step=400
+steps=(100 200 300 400 500)
+for step in ${steps[@]}; do
 llm_path=./checkpoints/cosyvoice2_grpo/$exp_name/global_step_${step}
 if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
   log "stage 3: merge the model"
@@ -111,7 +130,7 @@ fi
 if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
   log "stage 4: Test the model"
   dataset=zero_shot_zh
-  # dataset=test_zh
+  # dataset=test_zh seed_tts test_zh
   output_dir=./outputs_${exp_name}_${step}_${dataset}
 
   token2wav_path=/workspace/CosyVoice2-0.5B
@@ -127,12 +146,14 @@ if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
 
   bash scripts/compute_wer.sh $output_dir ${dataset}
 fi
+done
 
 if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
   log "stage 5: Convert the RL trained model to CosyVoice repo format"
   python3 huggingface_to_pretrained.py \
     --hf-cosyvoice2-llm-path $llm_path/merged_hf_model \
-    --pretrained-cosyvoice2-path /workspace/CosyVoice2-0.5B \
     --output-path /workspace/CosyVoice2-0.5B/llm-new.pt
   # You need to manually move the llm-new.pt to overwrite /workspace/CosyVoice2-0.5B/llm.pt
+  # However, we found that the RL trained model accuracy would slightly drop after this conversion.
+  # Please be careful or use the huggingface format inference code.
 fi
@@ -10,6 +10,7 @@ model_path=models/sherpa-onnx-paraformer-zh-2023-09-14
 if [ ! -d $model_path ]; then
   pip install sherpa-onnx
   wget -nc https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2023-09-14.tar.bz2
+  mkdir models
   tar xvf sherpa-onnx-paraformer-zh-2023-09-14.tar.bz2 -C models
 fi
 