CosyVoice2 LLM Reinforcement Learning Recipe
This recipe demonstrates how to fine-tune the CosyVoice2 large language model with reinforcement learning algorithms—specifically GRPO—using the veRL framework. Our experiments show that applying GRPO reduces the character error rate (CER) on the CosyVoice3 zero_shot_zh set from 4.08 % to 3.36 %.
Table of Contents
- Environment Setup
- Data Preparation
- Reward Function & ASR Server
- Training
- Evaluation
- Export Model
- Results
- Acknowledgement
Environment Setup
We recommend using the pre-built Docker image below. Alternatively, you can manually install the dependencies following the Dockerfile.
docker pull soar97/verl:app-verl0.4-vllm0.8.5-mcore0.12.2-te2.2
Data Preparation
prepare_data.py expects a JSON/JSONL file with at least the following schema:
{
"text": "An example sentence to be synthesized."
}
You can download the JSONL files from the metadata directory of the SparkAudio/voxbox dataset on Hugging Face.
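If you prefer to build the JSONL files from your own text corpus, a minimal sketch looks like the following (the sentences and the output path are placeholders):

```python
# Minimal sketch: write a JSONL file in the schema shown above.
# The sentences and output path are placeholders; point run.sh at wherever you save the files.
import json

sentences = [
    "今天天气怎么样?",
    "欢迎使用语音合成系统。",
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for text in sentences:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```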
Stage 0 converts raw JSONL files into the Parquet format expected by veRL. Create two JSONL files, train.jsonl and test.jsonl, then run:
bash run.sh 0 0
The script will then generate two Parquet files:
data/parquet_tiny/train.parquet
data/parquet_tiny/test.parquet
Each sample is automatically wrapped into a cosyvoice2-style prompt so that the LLM learns to output CosyVoice2 speech tokens.
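To see how the wrapping looks in practice, you can load one of the generated files and print a sample (a quick check, assuming pandas and pyarrow are installed):

```python
# Inspect the veRL training data produced by stage 0.
import pandas as pd

df = pd.read_parquet("data/parquet_tiny/train.parquet")
print(df.columns.tolist())  # columns consumed by veRL, including the prompt field
print(df.iloc[0])           # one sample with its cosyvoice2-style prompt
```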
Reward Function & ASR Server
To compute rewards we run a lightweight server that:
- Converts generated speech tokens back to a 16 kHz waveform with the CosyVoice2 pretrained U-Net model.
- Transcribes the waveform with SenseVoice ASR.
- Calculates the pinyin-level error rate relative to the ground-truth text and maps it to a score in the range [0, 1] (see the sketch after this list).
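A minimal sketch of that last step, assuming pypinyin for the pinyin conversion (the helper names are illustrative, not the recipe's actual code):

```python
# Illustrative sketch of a pinyin-level error-rate score, not the recipe's actual scoring code.
from pypinyin import lazy_pinyin


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Plain Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def pinyin_score(ground_truth: str, transcript: str) -> float:
    """Map the pinyin-level error rate to a reward in [0, 1]."""
    ref, hyp = lazy_pinyin(ground_truth), lazy_pinyin(transcript)
    error_rate = edit_distance(ref, hyp) / max(len(ref), 1)
    return max(0.0, 1.0 - error_rate)
```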
Start the server (stage 1) in a dedicated terminal or on a separate GPU:
bash run.sh 1 1
# Triton server listens on ports 8000/8001/8002
The custom reward implementation lives in reward_tts.py and calls the server to obtain the reward score.
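veRL loads that file and, by default, looks for a function named compute_score (the name is configurable through custom_reward_function.name). The sketch below shows only the general shape: the HTTP endpoint and payload are placeholders, the imported helper is the one sketched above, and the real reward_tts.py talks to the Triton server started in stage 1.

```python
# Rough sketch of the hook veRL expects from custom_reward_function.path; NOT the actual reward_tts.py.
import requests

from reward_sketch import pinyin_score  # hypothetical module holding the helper sketched above

REWARD_URL = "http://localhost:8000/score"  # placeholder endpoint, for illustration only


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Called by veRL for each generated sample; must return a scalar reward."""
    # Ask the server to synthesize the generated speech tokens and transcribe the audio.
    transcript = requests.post(REWARD_URL, json={"tokens": solution_str}).json()["text"]
    # Map the pinyin-level error rate to a score in [0, 1].
    return pinyin_score(ground_truth, transcript)
```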
Training
Run stage 2 to start GRPO training:
bash run.sh 2 2
Key CLI arguments passed to verl.trainer.main_ppo:
- `algorithm.adv_estimator=grpo` – use GRPO instead of PPO.
- `data.train_files=data/parquet_aishell3/train.parquet` and `data.val_files=data/parquet_aishell3/test.parquet` – the training and validation data.
- `custom_reward_function.path=reward_tts.py` – the custom reward function described above.
Adjust CUDA_VISIBLE_DEVICES, batch sizes, and other hyperparameters to match your hardware.
Evaluation
After training completes, collect the sharded FSDP weights and export a Hugging Face-style checkpoint (stage 3):
bash run.sh 3 3 # merges weights into $llm_path/merged_hf_model
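As a quick sanity check, the merged directory should load like a standard Hugging Face causal LM (a sketch; the path is a placeholder for your $llm_path):

```python
# Sanity-check that the merged checkpoint loads as a Hugging Face causal LM.
from transformers import AutoModelForCausalLM

model_dir = "exp/cosyvoice2_llm/merged_hf_model"  # placeholder; use your $llm_path/merged_hf_model
model = AutoModelForCausalLM.from_pretrained(model_dir)
print(model.config.architectures, f"{model.num_parameters():,} parameters")
```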
You can then evaluate the model on the CosyVoice3 zero-shot Chinese test set (stage 4):
bash run.sh 4 4
This command launches distributed inference via infer_dataset.py and computes WER with scripts/compute_wer.sh.
Tip: the script also supports the Seed-TTS test set by setting `dataset=test_zh`.
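For reference, the numbers reported below are character error rates between the ground-truth text and the ASR transcript of the synthesized audio. The snippet illustrates the metric itself (using jiwer, which is not necessarily what scripts/compute_wer.sh uses):

```python
# Illustration of the reported metric: character error rate between reference and hypothesis.
import jiwer

reference = "今天天气怎么样"
hypothesis = "今天天气怎样"
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")
```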
Export Model
To use the RL-trained model with the official CosyVoice repository:
bash run.sh 5 5
The script converts the Hugging Face checkpoint back into the format expected by the CosyVoice repository.
Results
| Model | Seed-TTS test_zh CER | CosyVoice3 zero_shot_zh CER | Comment |
|---|---|---|---|
| CosyVoice2 LLM (official) | 1.45 % | 4.08 % | See the paper |
| CosyVoice2 LLM + GRPO | 1.37 % | 3.36 % | See the decoding results |
Acknowledgement
This work was inspired by the implementation in ch-tts-llasa-rl-grpo.