update readme

This commit is contained in:
yuekaiz
2025-09-03 17:42:14 +08:00
parent e04699c6da
commit 633b991290
5 changed files with 143 additions and 77 deletions

View File

@@ -1,15 +1,17 @@
## Serving CosyVoice with NVIDIA Triton Inference Server

Contributed by Yuekai Zhang (NVIDIA).

### Quick Start

Launch the service directly with Docker Compose:
```sh
docker compose up
```

### Build the Docker Image

To build the image from scratch:
```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```
@@ -21,71 +23,89 @@ docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_di
```

### Understanding `run.sh`

The `run.sh` script orchestrates the entire workflow through numbered stages. You can run a subset of stages with:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>`: The stage to start from (0-5).
- `<stop_stage>`: The stage to stop after (0-5).

**Stages:**

- **Stage 0**: Downloads the `cosyvoice-2 0.5B` model from HuggingFace.
- **Stage 1**: Converts the HuggingFace checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- **Stage 2**: Creates the Triton model repository and configures the model files. The configuration is adjusted based on whether `Decoupled=True` (streaming) or `Decoupled=False` (offline) will be used.
- **Stage 3**: Launches the Triton Inference Server.
- **Stage 4**: Runs the single-utterance HTTP client for testing.
- **Stage 5**: Runs the gRPC benchmark client.
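For reference, each stage in `run.sh` is guarded by a simple range check, so any contiguous subset of stages can run on its own. A minimal sketch of the pattern, assuming the two positional arguments map to `$1` and `$2` as the usage line above suggests (the echo lines are placeholders; see `run.sh` for the real stage bodies):
```sh
stage=$1
stop_stage=$2

# A stage runs only when it falls inside [start_stage, stop_stage]
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
  echo "Stage 0: download the model"
fi

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
  echo "Stage 1: build the TensorRT-LLM engines"
fi
```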
### Export Models to TensorRT-LLM and Launch the Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# This command runs stages 0, 1, 2, and 3
bash run.sh 0 3
```

> [!TIP]
> Both streaming and offline (non-streaming) TTS modes are supported. For streaming TTS, set `Decoupled=True`. For offline TTS, set `Decoupled=False`. You need to rerun stage 2 if you switch between modes.
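To switch modes, edit `DECOUPLED_MODE` in `run.sh` (it is set in stage 2) and rerun stages 2-3. A sketch, assuming `DECOUPLED_MODE` is assigned on a single line as in the current script (the `sed` edit is illustrative; editing the file by hand works just as well):
```sh
# Switch from streaming (Decoupled=True) to offline (Decoupled=False)
sed -i 's/^DECOUPLED_MODE=True/DECOUPLED_MODE=False/' run.sh

# Rebuild the model repository (stage 2) and relaunch the Triton server (stage 3)
bash run.sh 2 3
```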
### Single-Utterance HTTP Client

Sends a single HTTP inference request. This is intended for testing the offline TTS mode (`Decoupled=False`):
```sh
bash run.sh 4 4
```

### Benchmark with a Dataset

To benchmark the running Triton server, pass `streaming` or `offline` as the third argument:
```sh
bash run.sh 5 5 # [streaming|offline]
# You can also customize parameters such as the number of tasks and the dataset split:
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts_cosy2 --split-name test_zh --mode [streaming|offline]
```

> [!TIP]
> It is recommended to run the benchmark multiple times to get stable results after the initial server warm-up.
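Since the first requests after server start-up include warm-up costs, it can help to script one discarded warm-up pass followed by several timed runs. A sketch using the same client flags that `run.sh` passes (the log directory names are arbitrary):
```sh
# Warm-up pass (results discarded)
bash run.sh 5 5

# Three timed benchmark runs with separate log directories
for i in 1 2 3; do
  python3 client_grpc.py \
    --server-addr localhost \
    --model-name cosyvoice2 \
    --num-tasks 4 \
    --mode streaming \
    --huggingface-dataset yuekai/seed_tts_cosy2 \
    --split-name test_zh \
    --log-dir ./log_benchmark_run_${i}
done
```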
### Benchmark Results

The following results were obtained by decoding on a single L20 GPU with 26 prompt audio/target text pairs from the [yuekai/seed_tts](https://huggingface.co/datasets/yuekai/seed_tts) dataset (approximately 170 seconds of audio):

**Streaming TTS (First Chunk Latency)**

| Mode | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|---|---|---|---|---|
| Streaming, Decoupled=True | 1 | 220.43 | 218.07 | 0.1237 |
| Streaming, Decoupled=True | 2 | 476.97 | 369.25 | 0.1022 |
| Streaming, Decoupled=True | 4 | 1107.34 | 1243.75 | 0.0922 |

**Offline TTS (Full Sentence Latency)**

| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|---|---|---|---|---|---|
| Offline, Decoupled=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 1 | 758.04 | 615.79 | 0.0891 |
| Offline, Decoupled=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 2 | 1025.93 | 901.68 | 0.0657 |
| Offline, Decoupled=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 4 | 1914.13 | 1783.58 | 0.0610 |
### OpenAI-Compatible Server

To launch an OpenAI-compatible API service, run the following commands:
```sh
git clone https://github.com/yuekaizhang/Triton-OpenAI-Speech.git
cd Triton-OpenAI-Speech
pip install -r requirements.txt

# After the Triton service is running, start the FastAPI bridge:
python3 tts_server.py --url http://localhost:8000 --ref_audios_dir ./ref_audios/ --port 10086 --default_sample_rate 24000

# Test the service with curl:
bash test/test_cosyvoice.sh
```

> [!NOTE]
> Currently, only the offline TTS mode is compatible with the OpenAI-compatible server.
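As an illustration, a request against the bridge might look like the following. This is only a hypothetical sketch: the `/v1/audio/speech` route and the payload fields follow the OpenAI audio API convention, and the `voice` value is a placeholder; consult `test/test_cosyvoice.sh` for the exact request format the bridge expects.
```sh
# Hypothetical request; endpoint path, fields, and voice name are assumptions
curl http://localhost:10086/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cosyvoice2",
        "input": "Hello, this is a test of the OpenAI-compatible TTS bridge.",
        "voice": "prompt_audio",
        "response_format": "wav"
      }' \
  --output output.wav
```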
### Acknowledgements

This work originates from the NVIDIA CISI project. For more multimodal resources, please see [mair-hub](https://github.com/nvidia-china-sae/mair-hub).

View File

@@ -32,7 +32,7 @@ import triton_python_backend_utils as pb_utils
import os
import numpy as np
import s3tokenizer

torch.set_num_threads(1)

ORIGINAL_VOCAB_SIZE = 151663

View File

@@ -28,6 +28,8 @@ import json
import math
import os
import re
import threading
import time
from typing import Dict, List, Tuple, Optional, Union

import numpy as np
@@ -42,6 +44,7 @@ import torchaudio
from matcha.utils.audio import mel_spectrogram

torch.set_num_threads(1)


class TritonPythonModel:
    """Triton Python model for Spark TTS.
@@ -62,6 +65,8 @@ class TritonPythonModel:
        parameters = self.model_config['parameters']
        model_params = {k: v["string_value"] for k, v in parameters.items()}
        self.logger.log_info(f"model_params:{model_params}")
        self.dynamic_chunk_strategy = model_params.get("dynamic_chunk_strategy", "exponential")  # "exponential" or "time_based"
        self.logger.log_info(f"Using dynamic chunk strategy: {self.dynamic_chunk_strategy}")

        # Initialize tokenizer
        llm_tokenizer_dir = model_params["llm_tokenizer_dir"]
@@ -72,6 +77,10 @@ class TritonPythonModel:
        self.device = torch.device("cuda")
        self.decoupled = pb_utils.using_decoupled_model_transaction_policy(self.model_config)

        self.token_frame_rate = 25
        self.flow_pre_lookahead_len = 3
        self.token_hop_len = 15

    def forward_llm(self, input_ids):
        """
        Prepares the response from the language model based on the provided
@@ -99,7 +108,7 @@ class TritonPythonModel:
""" """
# convert input_ids to numpy, with shape [1, sequence_length] # convert input_ids to numpy, with shape [1, sequence_length]
input_ids = input_ids.cpu().numpy() input_ids = input_ids.cpu().numpy()
max_tokens = 1024 max_tokens = 750
input_dict = { input_dict = {
"request_output_len": np.array([[max_tokens]], dtype=np.int32), "request_output_len": np.array([[max_tokens]], dtype=np.int32),
"end_id": np.array([[self.eos_token_id]], dtype=np.int32), "end_id": np.array([[self.eos_token_id]], dtype=np.int32),
@@ -109,6 +118,7 @@ class TritonPythonModel:
"runtime_top_k": np.array([[50]], dtype=np.int32), "runtime_top_k": np.array([[50]], dtype=np.int32),
"temperature": np.array([[0.8]], dtype=np.float32), "temperature": np.array([[0.8]], dtype=np.float32),
"repetition_penalty": np.array([[1.1]], dtype=np.float32), "repetition_penalty": np.array([[1.1]], dtype=np.float32),
"random_seed": np.array([[42]], dtype=np.uint64),
"input_ids": input_ids, "input_ids": input_ids,
"input_lengths": np.array([[input_ids.shape[1]]], dtype=np.int32), "input_lengths": np.array([[input_ids.shape[1]]], dtype=np.int32),
} }
@@ -139,7 +149,6 @@ class TritonPythonModel:
                # Get actual output IDs up to the sequence length
                actual_output_ids = output_ids[0][0][:seq_lens[0][0]]

                yield actual_output_ids
        else:
@@ -290,6 +299,15 @@ class TritonPythonModel:
        speech_feat = speech_feat.unsqueeze(dim=0)
        return speech_feat

    def _llm_gen_thread(self, generated_ids_iter, semantic_token_ids_arr, llm_is_done_flag):
        for generated_ids in generated_ids_iter:
            generated_ids = generated_ids.tolist()
            if len(generated_ids) == 0:
                break
            semantic_token_ids_arr.extend(generated_ids)
        llm_is_done_flag[0] = True

    def execute(self, requests):
        """Execute inference on the batched requests.
@@ -322,9 +340,7 @@ class TritonPythonModel:
            flow_prompt_speech_token_len = prompt_speech_tokens.shape[-1]

            reference_text = pb_utils.get_input_tensor_by_name(request, "reference_text").as_numpy()
            reference_text = reference_text[0][0].decode('utf-8')
@@ -340,47 +356,75 @@ class TritonPythonModel:
            # Generate semantic tokens with LLM
            generated_ids_iter = self.forward_llm(input_ids)
            prompt_spk_embedding = self.forward_speaker_embedding(wav_tensor)

            if self.decoupled:
                response_sender = request.get_response_sender()
                semantic_token_ids_arr = []
                llm_is_done_flag = [False]
                llm_thread = threading.Thread(
                    target=self._llm_gen_thread,
                    args=(generated_ids_iter, semantic_token_ids_arr, llm_is_done_flag)
                )
                llm_thread.start()

                token_offset, chunk_index = 0, 0
                start_time = time.time()
                this_token_hop_len = self.token_hop_len
                while True:
                    pending_num = len(semantic_token_ids_arr) - token_offset
                    if llm_is_done_flag[0]:
                        break

                    if pending_num >= this_token_hop_len + self.flow_pre_lookahead_len:
                        this_tts_speech_token = semantic_token_ids_arr[:token_offset + this_token_hop_len + self.flow_pre_lookahead_len]
                        this_tts_speech_token = torch.tensor(this_tts_speech_token).unsqueeze(dim=0).to(torch.int32).to(self.device)
                        sub_tts_speech = self.forward_token2wav(prompt_speech_tokens, prompt_speech_feat, prompt_spk_embedding, this_tts_speech_token, request_id, token_offset, False)

                        # Prepare response to send
                        audio_tensor = pb_utils.Tensor.from_dlpack("waveform", to_dlpack(sub_tts_speech))
                        inference_response = pb_utils.InferenceResponse(output_tensors=[audio_tensor])
                        response_sender.send(inference_response)

                        token_offset += this_token_hop_len
                        self.logger.log_info(f"chunk_index: {chunk_index}, current_token_hop_len: {this_token_hop_len}")
                        if self.dynamic_chunk_strategy == "exponential":
                            this_token_hop_len = self.token_frame_rate * (2 ** chunk_index)
                        elif self.dynamic_chunk_strategy == "time_based":
                            # see https://github.com/qi-hua/async_cosyvoice/blob/main/model.py#L306
                            cost_time = time.time() - start_time
                            duration = token_offset / self.token_frame_rate
                            if chunk_index > 0 and cost_time > 0:
                                avg_chunk_processing_time = cost_time / (chunk_index + 1)
                                if avg_chunk_processing_time > 0:
                                    multiples = (duration - cost_time) / avg_chunk_processing_time
                                    self.logger.log_info(f"multiples: {multiples}")
                                    next_pending_num = len(semantic_token_ids_arr) - token_offset
                                    if multiples > 4:
                                        this_token_hop_len = (next_pending_num // self.token_hop_len + 1) * self.token_hop_len
                                    elif multiples > 2:
                                        this_token_hop_len = (next_pending_num // self.token_hop_len) * self.token_hop_len
                                    else:
                                        this_token_hop_len = self.token_hop_len
                            this_token_hop_len = max(self.token_hop_len, this_token_hop_len)
                        chunk_index += 1
                    else:
                        time.sleep(0.02)

                this_tts_speech_token = torch.tensor(semantic_token_ids_arr).unsqueeze(dim=0).to(torch.int32).to(self.device)
                sub_tts_speech = self.forward_token2wav(prompt_speech_tokens, prompt_speech_feat, prompt_spk_embedding, this_tts_speech_token, request_id, token_offset, True)
                audio_tensor = pb_utils.Tensor.from_dlpack("waveform", to_dlpack(sub_tts_speech))
                inference_response = pb_utils.InferenceResponse(output_tensors=[audio_tensor])
                response_sender.send(inference_response)
                llm_thread.join()
                response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
                self.logger.log_info("send tritonserver_response_complete_final to end")
            else:

View File

@@ -47,11 +47,11 @@ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(level
logger = logging.getLogger(__name__)

ORIGINAL_VOCAB_SIZE = 151663
torch.set_num_threads(1)


class CosyVoice2:
    def __init__(self, model_dir, load_jit=False, load_trt=False, fp16=False, trt_concurrent=1, device='cuda'):
        self.model_dir = model_dir
        self.fp16 = fp16
@@ -61,7 +61,7 @@ class CosyVoice2:
            raise ValueError('{} not found!'.format(hyper_yaml_path))
        with open(hyper_yaml_path, 'r') as f:
            configs = load_hyperpyyaml(f, overrides={'qwen_pretrain_path': os.path.join(model_dir, 'CosyVoice-BlankEN')})
        self.model = CosyVoice2Model(configs['flow'], configs['hift'], fp16, device)
        self.model.load('{}/flow.pt'.format(model_dir), '{}/hift.pt'.format(model_dir))
        if load_jit:
            self.model.load_jit('{}/flow.encoder.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'))
@@ -77,8 +77,9 @@ class CosyVoice2Model:
    def __init__(self,
                 flow: torch.nn.Module,
                 hift: torch.nn.Module,
                 fp16: bool = False,
                 device: str = 'cuda'):
        self.device = device
        self.flow = flow
        self.hift = hift
        self.fp16 = fp16
@@ -179,11 +180,11 @@ class TritonPythonModel:
        model_dir = model_params["model_dir"]

        # Initialize device and vocoder
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        logger.info(f"Initializing vocoder from {model_dir} on {self.device}")
        self.token2wav_model = CosyVoice2(
            model_dir, load_jit=False, load_trt=True, fp16=True, device=self.device
        )
        logger.info("Token2Wav initialized successfully")
@@ -224,7 +225,6 @@ class TritonPythonModel:
        else:
            stream = False
        request_id = request.request_id()
        audio_hat = self.token2wav_model.model.token2wav(token=target_speech_tokens,
                                                         prompt_token=prompt_speech_tokens,
                                                         prompt_feat=prompt_speech_feat,
@@ -234,7 +234,6 @@
                                                         stream=stream,
                                                         finalize=finalize)
        if finalize:
            self.token2wav_model.model.hift_cache_dict.pop(request_id)
        else:

View File

@@ -60,6 +60,7 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
  cp -r ./model_repo/audio_tokenizer $model_repo
  cp -r ./model_repo/tensorrt_llm $model_repo
  cp -r ./model_repo/token2wav $model_repo
  cp -r ./model_repo/speaker_embedding $model_repo

  ENGINE_PATH=$trt_engines_dir
  MAX_QUEUE_DELAY_MICROSECONDS=0
@@ -67,11 +68,12 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
  LLM_TOKENIZER_DIR=$huggingface_model_local_dir
  BLS_INSTANCE_NUM=4
  TRITON_MAX_BATCH_SIZE=16
  DECOUPLED_MODE=True # True for streaming, False for offline

  python3 scripts/fill_template.py -i ${model_repo}/token2wav/config.pbtxt model_dir:${MODEL_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS}
  python3 scripts/fill_template.py -i ${model_repo}/audio_tokenizer/config.pbtxt model_dir:${MODEL_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS}
  python3 scripts/fill_template.py -i ${model_repo}/${cosyvoice2_dir}/config.pbtxt model_dir:${MODEL_DIR},bls_instance_num:${BLS_INSTANCE_NUM},llm_tokenizer_dir:${LLM_TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS}
  python3 scripts/fill_template.py -i ${model_repo}/speaker_embedding/config.pbtxt model_dir:${MODEL_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS}
  python3 scripts/fill_template.py -i ${model_repo}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},encoder_input_features_data_type:TYPE_FP16,logits_datatype:TYPE_FP32
fi
@@ -82,7 +84,7 @@ if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
fi

if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
  echo "Single request test http, only work for offline TTS mode"
  python3 client_http.py \
    --reference-audio ./assets/prompt_audio.wav \
    --reference-text "吃燕窝就选燕之屋本节目由26年专注高品质燕窝的燕之屋冠名播出。豆奶牛奶换着喝营养更均衡本节目由豆本豆豆奶特约播出。" \
@@ -92,15 +94,16 @@ fi
if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
  echo "Running benchmark client grpc"
  num_task=1
  mode=streaming
  BLS_INSTANCE_NUM=4
  python3 client_grpc.py \
    --server-addr localhost \
    --model-name cosyvoice2 \
    --num-tasks $num_task \
    --mode $mode \
    --huggingface-dataset yuekai/seed_tts_cosy2 \
    --log-dir ./log_concurrent_tasks_${num_task}_${mode}_bls_${BLS_INSTANCE_NUM}
fi