232 Commits

Author SHA1 Message Date
雾聪
0a44bca692 update requirements.txt 2025-03-01 21:13:50 +08:00
雾聪
d94908f9d7 update func inference_vllm 2025-03-01 19:12:13 +08:00
雾聪
11dbd88947 update func inference 2025-03-01 19:10:20 +08:00
雾聪
9a4aebb0ea add func inference_bistream_vllm 2025-03-01 18:50:19 +08:00
雾聪
54e9384fb1 update export_codec_vllm 2025-02-26 20:25:14 +08:00
雾聪
f280558bcb update func export_codec_vllm 2025-02-26 16:48:21 +08:00
雾聪
f6a18ee07a update vllm_codec_engine 2025-02-25 19:40:30 +08:00
雾聪
4df0683a37 add vllm_codec_engine 2025-02-25 17:43:33 +08:00
雾聪
c37c00ff94 add func export_codec_vllm 2025-02-25 15:01:29 +08:00
Xiang Lyu
fd45708e4b Merge pull request #977 from hanasay/main
Convert audio to mono while extracting speech tokens
2025-02-16 12:51:04 +08:00
hanasay
296ed4f526 Convert audio to mono while extracting speech tokens
modified:     tools/extract_speech_token.py
2025-02-14 15:25:45 +08:00
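The change in PR #977 downmixes multi-channel prompt audio before speech-token extraction. A minimal sketch of that idea, assuming the script loads audio with torchaudio; this is illustrative, not the actual patch to tools/extract_speech_token.py:

```python
# Illustrative mono-downmix before speech-token extraction (not the actual patch).
import torch
import torchaudio

def load_mono_16k(wav_path: str) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)
    if waveform.shape[0] > 1:
        # Average channels so stereo/multi-channel input becomes mono.
        waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
    return waveform
```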
Xiang Lyu
95e99e0417 Merge pull request #940 from c4fun/fastapi-cosyvoice2
Add an inference_instruct2 route that supports cosyvoice2 by default in the fastapi server
2025-02-07 16:30:27 +08:00
c4fun
ba6d8c07ba revert the main function to original and only preserve the inference endpoint 2025-02-07 16:28:30 +08:00
Xiang Lyu
da3f129977 Merge pull request #936 from huyyxy/main
feat(docker): update CUDA base image to 12.4.1 for TensorRT support
2025-02-06 11:39:53 +08:00
c4fun
2889c25863 support cosyvoice2 by default in the fastapi server 2025-01-27 20:51:57 +08:00
huyyxy
aa65200713 feat(docker): update CUDA base image to 12.4.1 for TensorRT support
- Upgrade base image from nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 to nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
- Enable CUDA 12.4 runtime environment
- Ensure TensorRT dependency compatibility
- Validation steps:
  - Verify CUDA version via nvidia-smi after build
  - Test import tensorrt in container without errors

Closes #935
2025-01-26 12:33:50 +08:00
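A small in-container check along the lines of the validation steps listed above, assuming torch and the tensorrt Python package are installed in the image:

```python
# Quick sanity check inside the CUDA 12.4.1 container (illustrative only).
import torch
import tensorrt as trt

print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime seen by torch:", torch.version.cuda)
print("TensorRT version:", trt.__version__)
```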
Xiang Lyu
86e26f54c7 Merge pull request #930 from sd0ric4/main
feat: add POST endpoints to resolve browser error about GET request wi…
2025-01-26 10:17:15 +08:00
sd0ric4
f1c214377c fix: add POST endpoints to resolve browser error about GET request with body 2025-01-24 19:41:14 +08:00
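Browsers and many HTTP clients reject GET requests that carry a body, which is what PR #930 works around by adding POST routes. A hedged FastAPI sketch of such a route; the path and form fields are hypothetical, not the repository's exact API:

```python
# Hypothetical POST route; field names are illustrative, not the repo's exact API.
from fastapi import FastAPI, Form

app = FastAPI()

@app.post("/inference_sft")
async def inference_sft(tts_text: str = Form(...), spk_id: str = Form(...)):
    # ... run synthesis here and stream the audio back ...
    return {"status": "ok", "tts_text": tts_text, "spk_id": spk_id}
```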
Xiang Lyu
369ea80bd4 Merge pull request #926 from Vinkle-hzt/main
fix bistream extra token
2025-01-23 22:02:25 +08:00
huzetao.hzt
69518b2bde fix bistream extra token 2025-01-23 19:08:18 +08:00
Xiang Lyu
276cfa02b6 Merge pull request #925 from FunAudioLLM/dev/lyuxiang.lx
fix pitch computation
2025-01-23 15:45:42 +08:00
lyuxiang.lx
190840b8dc fix pitch computation 2025-01-23 15:44:03 +08:00
lyuxiang.lx
c6c3f27ecc fix typo 2025-01-23 11:27:10 +08:00
Xiang Lyu
49761d2474 Merge pull request #924 from FunAudioLLM/dev/lyuxiang.lx
add llm bistream
2025-01-23 10:19:21 +08:00
lyuxiang.lx
07e477519b add llm bistream 2025-01-23 10:12:06 +08:00
Xiang Lyu
41c5e8cd6d Merge pull request #887 from Wauplin/patch-1
Fix diffusers / huggingface_hub compatibility in requirements.txt
2025-01-15 18:23:05 +08:00
Lucain
66ceaff472 Fix diffusers / huggingface_hub compatibility in requirements.txt
As mentioned in https://github.com/FunAudioLLM/CosyVoice/issues/516#issuecomment-2592067949 and https://github.com/FunAudioLLM/CosyVoice/issues/527#issuecomment-2592067100, it is more future-proof to upgrade `diffusers` version rather than downgrading `huggingface_hub` to an old one. This will also fix the `cannot import name 'cached_download' from 'huggingface_hub'` issue without relying on outdated packages.

Sorry again for the inconvenience 🙏
2025-01-15 10:21:08 +01:00
Xiang Lyu
07a314767f Merge pull request #884 from FunAudioLLM/dev/lyuxiang.lx
update
2025-01-14 22:56:21 +08:00
lyuxiang.lx
0b75c3a03f update 2025-01-14 22:55:13 +08:00
Xiang Lyu
b4dea3d64a Merge pull request #878 from FunAudioLLM/dev/lyuxiang.lx
update
2025-01-13 10:31:12 +08:00
lyuxiang.lx
43f9e9ab20 update 2025-01-13 10:30:13 +08:00
Xiang Lyu
025f6f0f7f Merge pull request #875 from lsby/main
fix docker python version
2025-01-13 10:27:49 +08:00
Xiang Lyu
69051d11ec Merge pull request #876 from FunAudioLLM/dev/lyuxiang.lx
fix bug
2025-01-12 21:21:25 +08:00
lyuxiang.lx
59fa786769 fix bug 2025-01-12 21:18:41 +08:00
hbybyyang
f38f594303 fix docker python version 2025-01-12 15:59:58 +08:00
Xiang Lyu
eb4d5d053f Merge pull request #868 from FunAudioLLM/dev/lyuxiang.lx
move prompt wav to asset
2025-01-10 17:53:32 +08:00
lyuxiang.lx
d450c32296 update 2025-01-10 17:52:25 +08:00
lyuxiang.lx
e84d72a4d9 move prompt wav to asset 2025-01-10 17:51:21 +08:00
Xiang Lyu
06e86619c2 Merge pull request #867 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2025-01-10 16:46:11 +08:00
lyuxiang.lx
e257c16796 Merge branch 'dev/lyuxiang.lx' of github.com:FunAudioLLM/CosyVoice into dev/lyuxiang.lx 2025-01-10 16:44:01 +08:00
lyuxiang.lx
87475ccf41 fix conflict 2025-01-10 16:43:31 +08:00
Xiang Lyu
8a1bce6c81 Merge pull request #865 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2025-01-10 14:18:21 +08:00
Xiang Lyu
b1e966309d Merge branch 'main' into dev/lyuxiang.lx 2025-01-10 14:16:34 +08:00
lyuxiang.lx
b95f18909e add empty cache 2025-01-10 14:14:32 +08:00
lyuxiang.lx
1cfc5dd077 add online trt export 2025-01-10 13:55:05 +08:00
lyuxiang.lx
d2e43fe6f4 update 2025-01-08 17:01:40 +08:00
Xiang Lyu
426c4001ca Merge pull request #842 from Vinkle-hzt/main
support online onnx to trt conversion
2025-01-08 11:21:42 +08:00
Xiang Lyu
92f1c659b9 Merge branch 'dev/lyuxiang.lx' into main 2025-01-08 11:21:06 +08:00
Xiang Lyu
ac75ae5184 Merge pull request #849 from FunAudioLLM/dev/lyuxiang.lx
update gradio
2025-01-08 11:12:31 +08:00
lyuxiang.lx
1e52c6071e update gradio 2025-01-08 11:11:16 +08:00
huzetao.hzt
2a0dd5447a add usage 2025-01-07 17:29:20 +08:00
huzetao.hzt
b6a1116d15 support online onnx to trt conversion 2025-01-07 17:20:06 +08:00
Xiang Lyu
5d12ced727 Merge pull request #832 from FunAudioLLM/dev/lyuxiang.lx
update
2025-01-04 13:45:33 +08:00
lyuxiang.lx
2ea414922a update 2025-01-04 13:44:41 +08:00
Xiang Lyu
f0b5fbb658 Merge pull request #831 from FunAudioLLM/dev/lyuxiang.lx
update readme
2025-01-04 13:41:07 +08:00
lyuxiang.lx
0b17753abe update readme 2025-01-04 13:39:40 +08:00
Xiang Lyu
9e156428e2 Merge pull request #821 from FunAudioLLM/dev/lyuxiang.lx
update
2025-01-02 12:35:14 +08:00
lyuxiang.lx
99ab0f4fcb fix lint 2025-01-02 12:32:43 +08:00
lyuxiang.lx
77d8cf13a3 update 2025-01-02 10:55:59 +08:00
Xiang Lyu
6b21f8e82c Merge pull request #813 from FunAudioLLM/dev/lyuxiang.lx
update
2024-12-31 11:02:30 +08:00
lyuxiang.lx
2745d47e92 update 2024-12-31 11:01:45 +08:00
Xiang Lyu
737d10191b Merge pull request #811 from FunAudioLLM/dev/lyuxiang.lx
update
2024-12-30 17:41:58 +08:00
lyuxiang.lx
d3b1a8e352 update 2024-12-30 17:40:39 +08:00
Xiang Lyu
88f6a8e2fa Merge pull request #810 from FunAudioLLM/dev/lyuxiang.lx
add some instruction and assert
2024-12-30 16:44:37 +08:00
lyuxiang.lx
b9ddcba5fd add some instruction and assert 2024-12-30 16:41:57 +08:00
Xiang Lyu
1f30317247 Merge pull request #809 from FunAudioLLM/dev/lyuxiang.lx
fix requirements
2024-12-30 14:50:59 +08:00
lyuxiang.lx
bfcbc73df8 fix requirements 2024-12-30 14:48:43 +08:00
Xiang Lyu
4d49b68207 Merge pull request #799 from ArtificialZeng/patch-1
Update README.md fix
2024-12-30 11:43:49 +08:00
Xiang Lyu
5aa3a46d96 Merge pull request #710 from 0xCAFEBABE0/bug_cpu_hang
fix(bug).when generating text that contains only punctuation marks or…
2024-12-30 10:55:56 +08:00
0xCAFEBABE0
b60c37b31a fix(bug).when generating text that contains only punctuation marks or whitespace characters, the CPU usage reaches 100%, and the process crashes. 2024-12-30 10:48:43 +08:00
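A sketch of the kind of guard this fix implies: skip synthesis when the input contains nothing but punctuation and whitespace. The helper name and punctuation set are assumptions for illustration:

```python
# Hypothetical guard against punctuation/whitespace-only input.
import string

CJK_PUNCTS = ",。!?、;:“”‘’()【】《》…"

def contains_speakable_text(text: str) -> bool:
    stripped = text.translate(str.maketrans('', '', string.punctuation + CJK_PUNCTS))
    return len(stripped.strip()) > 0

for chunk in ['你好!', '!?。', '   ']:
    print(repr(chunk), contains_speakable_text(chunk))
```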
Xiang Lyu
3d0458af31 Merge pull request #744 from garywill/patch-1
Fix typo in README.md
2024-12-29 17:20:54 +08:00
Xiang Lyu
0f6ff298dd Merge pull request #745 from garywill/patch-2
Rename misspelled list_avaliable_spks() to list_available_spks()
2024-12-29 17:20:17 +08:00
Xiang Lyu
877cf1c873 Merge pull request #756 from v3ucn/v2
Update frontend.py
2024-12-29 17:16:11 +08:00
Dr. Artificial曾小健
dec008e1b7 Update README.md fix 2024-12-28 00:01:59 +08:00
刘悦
178f4bbaf9 Update frontend.py 2024-12-19 21:30:20 +08:00
刘悦
5627adefb1 Update frontend.py
# NOTE(xcsong): 和默认参数不一致时,必须重新构图,要重新构图请务必指定 `overwrite_cache=True`
#               When the parameters differ from the defaults, it is mandatory to re-compose. To re-compose, please ensure you specify `overwrite_cache=True`.

https://github.com/wenet-e2e/WeTextProcessing
2024-12-19 21:26:30 +08:00
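The note quoted above comes from WeTextProcessing: when normalizer parameters differ from the defaults, the FST cache must be rebuilt with `overwrite_cache=True`. A minimal sketch, assuming the Normalizer constructor accepts `cache_dir` and `overwrite_cache` as documented in that repository:

```python
# Rebuild the WeTextProcessing FST cache when non-default parameters are used
# (a sketch based on the note above; see the WeTextProcessing docs for details).
from tn.chinese.normalizer import Normalizer

normalizer = Normalizer(cache_dir='tn_cache', overwrite_cache=True, remove_erhua=False)
print(normalizer.normalize('2024年12月19日'))
```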
志浩
d95aaea3c5 update paper url of cosyvoice2 2024-12-18 16:12:52 +08:00
garywill
bd4be3fc05 rename list_avaliable_spks() to list_available_spks()
Signed-off-by: garywill <garywill@disroot.org>
2024-12-18 11:19:43 +08:00
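PR #745 renames the misspelled method. If old callers needed to keep working, a deprecated alias could be left behind; this shim is hypothetical and only illustrates that option, the repository may simply rename the method outright:

```python
# Hypothetical backward-compatibility shim for the rename (not the actual change).
import warnings

class CosyVoiceLike:
    def list_available_spks(self):
        return ['中文女', '中文男']

    def list_avaliable_spks(self):  # old, misspelled name kept as a deprecated alias
        warnings.warn('list_avaliable_spks() is deprecated, use list_available_spks()',
                      DeprecationWarning, stacklevel=2)
        return self.list_available_spks()
```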
Garry W
7a969b10bb Fix typo in README.md 2024-12-18 03:16:37 +00:00
ShiLiang Zhang
87cec23fd0 Update README.md 2024-12-17 18:49:35 +08:00
ShiLiang Zhang
5b4ddd26fa Update README.md 2024-12-17 18:49:04 +08:00
Xiang Lyu
0d1e562f1d Merge pull request #736 from FunAudioLLM/dev/lyuxiang.lx
fix cosyvoice1.0 bug
2024-12-17 16:06:41 +08:00
lyuxiang.lx
b00d8a073c fix cosyvoice1.0 bug 2024-12-17 16:05:37 +08:00
Xiang Lyu
8a88446858 Merge pull request #735 from FunAudioLLM/dev/lyuxiang.lx
add text_frontend arg
2024-12-17 14:04:45 +08:00
lyuxiang.lx
26c774098d add text_frontend arg 2024-12-17 14:03:35 +08:00
Xiang Lyu
81edc83648 Merge pull request #728 from FunAudioLLM/dev/lyuxiang.lx
update
2024-12-16 15:34:26 +08:00
lyuxiang.lx
60b0416229 update 2024-12-16 15:32:30 +08:00
Xiang Lyu
32e6684025 Merge pull request #725 from FunAudioLLM/dev/lyuxiang.lx
add trt bash script
2024-12-16 14:39:24 +08:00
lyuxiang.lx
8ec41faf91 add trt bash script 2024-12-16 14:35:41 +08:00
Xiang Lyu
6c93fe86c5 Merge pull request #722 from FunAudioLLM/dev/lyuxiang.lx
update
2024-12-16 14:08:20 +08:00
lyuxiang.lx
8266566144 update 2024-12-16 14:07:39 +08:00
Xiang Lyu
091e5c4ed8 Merge pull request #721 from FunAudioLLM/dev/lyuxiang.lx
update readme
2024-12-16 14:06:03 +08:00
lyuxiang.lx
1298d90e48 update readme 2024-12-16 14:05:00 +08:00
0xCAFEBABE0
bcc58cb4cb Update common.py 2024-12-16 13:57:38 +08:00
0xCAFEBABE0
1d8d94de82 Update common.py 2024-12-16 13:56:28 +08:00
0xCAFEBABE0
0993ec5f08 Merge branch 'main' into bug_cpu_hang 2024-12-16 13:54:53 +08:00
Xiang Lyu
c4688b68eb Merge pull request #719 from FunAudioLLM/dev/lyuxiang.lx
add instruct usage
2024-12-16 11:16:33 +08:00
lyuxiang.lx
d43a0171d4 add instruct usage 2024-12-16 11:15:51 +08:00
Xiang Lyu
c4c8050532 Merge pull request #717 from FunAudioLLM/dev/lyuxiang.lx
fix lint
2024-12-16 10:38:31 +08:00
lyuxiang.lx
3581caec76 fix lint 2024-12-16 10:37:10 +08:00
Xiang Lyu
94d6ce1006 Merge pull request #715 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-12-16 09:55:27 +08:00
lyuxiang.lx
ac70560364 fix lint 2024-12-16 09:54:24 +08:00
lyuxiang.lx
6b5931dc70 update 2024-12-16 09:42:08 +08:00
志浩
7c561b6a7f add pdf url 2024-12-14 18:18:37 +08:00
ShiLiang Zhang
30851ede4b Update README.md 2024-12-14 16:04:00 +08:00
ShiLiang Zhang
fc3ff075ec Update README.md 2024-12-14 16:03:21 +08:00
ShiLiang Zhang
e982eecc27 Update README.md 2024-12-14 16:02:00 +08:00
ShiLiang Zhang
66fbcf6ac2 Update README.md 2024-12-14 16:01:44 +08:00
ShiLiang Zhang
07d23ab08b Update README.md 2024-12-14 16:01:20 +08:00
ShiLiang Zhang
0bcd2318c7 Update README.md 2024-12-14 16:00:53 +08:00
ShiLiang Zhang
c9b047fbda Update README.md 2024-12-14 15:53:36 +08:00
tramphero
66ae73e409 add cosyvoice2 readme 2024-12-14 15:53:08 +08:00
tramphero
1e853ed080 add cosyvoice2 readme 2024-12-14 15:46:02 +08:00
lyuxiang.lx
6b9cebea14 update 2024-12-13 15:33:36 +08:00
lyuxiang.lx
9b8f28aa32 update 2024-12-12 22:48:31 +08:00
lyuxiang.lx
2511a49a72 update 2024-12-12 18:48:25 +08:00
0xCAFEBABE0
014fed4405 fix(bug).when generating text that contains only punctuation marks or whitespace characters, the CPU usage reaches 100%, and the process crashes. 2024-12-12 16:55:43 +08:00
0xCAFEBABE0
f56c2583e8 fix(bug).when generating text that contains only punctuation marks or whitespace characters, the CPU usage reaches 100%, and the process crashes. 2024-12-12 16:53:30 +08:00
0xCAFEBABE0
84015697c2 fix(bug).when generating text that contains only punctuation marks or whitespace characters, the CPU usage reaches 100%, and the process crashes. 2024-12-12 16:49:39 +08:00
lyuxiang.lx
c693039d14 update 2024-12-12 16:46:28 +08:00
lyuxiang.lx
2345ce6be2 update 2024-12-12 15:43:17 +08:00
lyuxiang.lx
0bf706c26f update 2024-12-11 16:37:15 +08:00
lyuxiang.lx
3e381002d7 add cosyvoice2 2024-12-11 16:14:19 +08:00
lyuxiang.lx
cde3cec6fa add qwen lm 2024-12-10 14:50:20 +08:00
Xiang Lyu
07352a50b3 Merge pull request #670 from FunAudioLLM/dev/lyuxiang.lx
fix bug
2024-11-22 15:12:50 +08:00
lyuxiang.lx
dc3f6432ba fix bug 2024-11-22 15:11:55 +08:00
Xiang Lyu
d6dbdfbf31 Merge pull request #653 from FunAudioLLM/dev/lyuxiang.lx
resume training
2024-11-15 16:43:44 +08:00
lyuxiang.lx
c3dfd23399 resume training 2024-11-15 16:41:17 +08:00
Xiang Lyu
7701325969 Merge pull request #639 from FunAudioLLM/dev/lyuxiang.lx
use stream read to save memory
2024-11-11 22:17:21 +08:00
lyuxiang.lx
5ed5bb15c8 use stream read to save memory 2024-11-11 22:14:52 +08:00
Xiang Lyu
6d22d0b76f Merge pull request #618 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-11-05 09:36:51 +08:00
lyuxiang.lx
487701c98c fix huggingface-hub version 2024-11-05 09:34:51 +08:00
Xiang Lyu
3914b54c82 Merge pull request #603 from oongrider/f_free_cache_dict
release flow_cache_dict to prevent potential memory leaks in long-running processes
2024-11-04 11:18:17 +08:00
lyuxiang.lx
a2ece33477 fix hifigan yaml bug 2024-11-01 16:17:01 +08:00
boostarea
3411e1f599 feat.release flow_cache_dict to prevent potential memory leaks in long-running processes. 2024-10-28 12:02:07 +08:00
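The idea behind this change is to drop per-request cache entries once synthesis finishes so a long-running service does not accumulate stale tensors. A sketch of that pattern; the dict and key names are hypothetical, not the repository's internals:

```python
# Illustrative per-request cache release (names are hypothetical).
flow_cache_dict = {}

def synthesize(request_id: str, text: str) -> str:
    flow_cache_dict[request_id] = {'mel_cache': None}
    try:
        # ... incremental flow/vocoder inference that reads and updates the cache ...
        return f'audio for: {text}'
    finally:
        # Release the entry even if inference raises, preventing unbounded growth.
        flow_cache_dict.pop(request_id, None)
```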
Xiang Lyu
dfcd6d0a64 Merge pull request #504 from liutaocode/patch-1
Update model.py
2024-10-22 13:54:01 +08:00
lyuxiang.lx
16d66dc6a6 fix short tts_text bug 2024-10-22 12:56:29 +08:00
Xiang Lyu
d8f00f4793 Merge pull request #580 from FunAudioLLM/dev/lyuxiang.lx
fix hifigan init bug
2024-10-22 11:23:32 +08:00
lyuxiang.lx
0930b4a106 fix lint 2024-10-22 11:22:35 +08:00
lyuxiang.lx
d554db7e32 fix hifigan init bug 2024-10-22 11:20:47 +08:00
Xiang Lyu
027e1ccb82 Merge pull request #511 from FunAudioLLM/dev/lyuxiang.lx
remove unnecessary f0 loss in discriminator
2024-10-18 16:03:15 +08:00
lyuxiang.lx
d1f7c1c9d7 remove unnecessary code 2024-10-18 16:00:14 +08:00
lyuxiang.lx
5bd5dfecab remove unnecessary f0 loss in discriminator 2024-10-18 15:58:19 +08:00
Tao Liu
18b9a8c844 Update model.py
The torch.tensor() function does not have a dim parameter
2024-10-17 11:53:09 +08:00
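As the commit message notes, `torch.tensor()` builds a tensor from data and has no `dim` argument. To combine existing tensors along a dimension, `torch.stack` or `torch.concat` are the usual tools:

```python
# torch.tensor() has no `dim` parameter; stack/concat combine tensors along a dimension.
import torch

chunks = [torch.zeros(1, 4), torch.ones(1, 4)]
stacked = torch.stack(chunks, dim=0)   # shape (2, 1, 4): adds a new leading dimension
joined = torch.concat(chunks, dim=1)   # shape (1, 8): joins along an existing dimension
print(stacked.shape, joined.shape)
```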
Xiang Lyu
0f19b97c5a Merge pull request #497 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-10-16 15:35:27 +08:00
lyuxiang.lx
a4db3db8ed update flow cache 2024-10-16 15:24:47 +08:00
Xiang Lyu
ace734def8 Merge pull request #455 from boji123/bj_dev_stream_fix_promptcache
[debug] support flow cache, for sharper tts_mel output (handle prompt bug)
2024-10-16 14:12:40 +08:00
Xiang Lyu
5157baf166 Merge pull request #496 from FunAudioLLM/dev/lyuxiang.lx
fix typo
2024-10-16 13:58:08 +08:00
lyuxiang.lx
6b7286eb62 fix typo 2024-10-16 13:57:27 +08:00
Xiang Lyu
21ddaeccf2 Merge pull request #495 from FunAudioLLM/dev/lyuxiang.lx
fix bug
2024-10-16 13:31:15 +08:00
lyuxiang.lx
7e6d60c24c fix bug 2024-10-16 13:30:13 +08:00
Xiang Lyu
de76577a9f Merge pull request #494 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-10-16 13:07:25 +08:00
lyuxiang.lx
29507bc77a update lint 2024-10-16 13:06:31 +08:00
lyuxiang.lx
555efd0301 update readme 2024-10-16 12:59:57 +08:00
lyuxiang.lx
ea7d709fbb update hifigan 2024-10-16 12:43:55 +08:00
lyuxiang.lx
73784974ce update hifigan 2024-10-16 12:15:49 +08:00
lyuxiang.lx
789ee9e5e7 add hifigan train 2024-10-16 11:37:32 +08:00
lyuxiang.lx
cb200b21c5 add hifigan train code 2024-10-09 17:36:42 +08:00
boji123
8130abb5ea [debug] handle cache with prompt 2024-09-29 19:12:30 +08:00
boji123
c9acce1482 [debug] support flow cache, for sharper tts_mel output 2024-09-29 16:26:11 +08:00
Xiang Lyu
d49259855b Merge pull request #453 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-09-29 14:54:31 +08:00
Xiang Lyu
67f298d94a Merge pull request #451 from zhuzizyf/bug-fix-split_paragraph
Bug fix for split_paragraph
2024-09-29 14:51:18 +08:00
Xiang Lyu
4a1ec98304 Merge pull request #450 from zhuzizyf/optimize-TN
Optimize TN processing
2024-09-29 14:49:32 +08:00
zhuyunfeng
0b76dfa1eb Update frontend_utils.py
Fix typo
2024-09-29 14:41:33 +08:00
zhuyunfeng
74a449ad1f Update frontend_utils.py
Fix typo
2024-09-29 14:26:33 +08:00
zhuyunfeng
9c0aa1918b Update frontend_utils.py
"Fix the bug in `split_paragraph` where the last sentence of synthesized text with multiple paragraphs loses punctuation, causing it to be lost."
2024-09-29 14:19:31 +08:00
zhuyunfeng
2c1877a5d4 Update frontend.py
Fix potential issues caused by ending with a Chinese comma
2024-09-29 13:29:28 +08:00
lyuxiang.lx
abc6f70ace update 25hz yaml 2024-09-29 10:42:57 +08:00
lyuxiang.lx
ffa28e3bbd update token args 2024-09-29 10:35:10 +08:00
lyuxiang.lx
8555ab4ded update readme 2024-09-26 15:02:27 +08:00
Xiang Lyu
69202008ce Merge pull request #436 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-09-26 15:00:22 +08:00
lyuxiang.lx
ba3d9693da load jit to device 2024-09-26 14:55:03 +08:00
lyuxiang.lx
06934c38c7 update vc code 2024-09-26 14:46:24 +08:00
lyuxiang.lx
d52358f6c5 update readme 2024-09-26 12:10:52 +08:00
lyuxiang.lx
72b89a52fb update vc/tts code 2024-09-26 11:53:10 +08:00
lyuxiang.lx
49015f63e6 add vc code 2024-09-26 10:49:22 +08:00
lyuxiang.lx
ed87445540 add 25hz text tokenizer 2024-09-25 11:24:02 +08:00
Xiang Lyu
f6d44af146 Merge pull request #424 from zhoupc2015/main
Change the fastapi server bind IP from 127.0.0.1 to 0.0.0.0 so that Docker port mapping works
2024-09-23 16:06:30 +08:00
zhoupc
7a2014bee3 Change the fastapi server bind IP from 127.0.0.1 to 0.0.0.0 so that Docker port mapping works 2024-09-23 15:39:37 +08:00
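Binding to 127.0.0.1 makes the server reachable only from inside the container, so `docker run -p` mappings never connect; binding to 0.0.0.0 accepts the mapped traffic. A minimal uvicorn sketch (the `server:app` module path is illustrative):

```python
# Bind the FastAPI app to all interfaces so Docker port mapping works (sketch).
import uvicorn

if __name__ == '__main__':
    uvicorn.run('server:app', host='0.0.0.0', port=50000)
```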
Xiang Lyu
95051e5761 Merge pull request #408 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-09-19 18:07:11 +08:00
lyuxiang.lx
f65eca6723 add speech fade in out 2024-09-19 18:02:42 +08:00
Xiang Lyu
cd26f11859 Merge pull request #379 from boji123/bj_dev_stream_fix
[debug] fix badcase, add fade on speech output
2024-09-19 17:32:04 +08:00
Xiang Lyu
28f1353324 Merge pull request #404 from FunAudioLLM/dev/lyuxiang.lx
Dev/lyuxiang.lx
2024-09-18 17:43:45 +08:00
lyuxiang.lx
f6b5c42823 fix flake 2024-09-18 17:42:54 +08:00
lyuxiang.lx
ff8e63567a use thread pool in tools 2024-09-18 17:41:15 +08:00
Xiang Lyu
2665b06e95 Merge pull request #356 from MiXaiLL76/main
Implemented fast processing of extract_embedding
2024-09-18 16:16:46 +08:00
Xiang Lyu
2898d5a851 Merge pull request #403 from FunAudioLLM/dev/lyuxiang.lx
update tempo change
2024-09-18 16:14:40 +08:00
lyuxiang.lx
e19e80fcd8 update tempo change 2024-09-18 16:13:40 +08:00
Xiang Lyu
f517d3627a Merge pull request #378 from dongwon00kim/docker
Add dockerfile for python 3.8, cuda 11.8 and torch 2.0.1 version
2024-09-18 11:51:57 +08:00
liubaiji
9e0b99e48e [feature] fix badcase, add fade on speech output 2024-09-11 10:41:50 +08:00
liubaiji
df653f1e98 [refactor] modify fade_in_out func to a common form 2024-09-11 10:36:32 +08:00
Dongwon Kim
c6b16c06e8 Add dockerfile for python 3.8, cuda 11.8 and torch 2.0.1 version 2024-09-11 11:13:22 +09:00
Xiang Lyu
c901a12789 Merge pull request #360 from FunAudioLLM/dev/lyuxiang.lx
set onnx to false as last chunk rtf unstable
2024-09-06 17:12:23 +08:00
lyuxiang.lx
122df8c420 set onnx to false as last chunk rtf unstable 2024-09-06 17:10:54 +08:00
MiXaiLL76
1d05ae5fd3 fix 2024-09-06 11:40:27 +03:00
MiXaiLL76
73271d46f9 Implementing concurrent.futures 2024-09-06 11:08:11 +03:00
MiXaiLL76
7b3e285bca add threading 2024-09-05 14:32:37 +03:00
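PR #356 parallelizes `extract_embedding` with `concurrent.futures`. A sketch of that pattern; `extract_one()` and the `utt2wav` mapping are hypothetical stand-ins for the script's real code:

```python
# Thread-pool pattern behind the extract_embedding speedup (illustrative names).
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_one(utt: str, wav_path: str):
    # ... load audio and run the speaker-embedding model here ...
    return utt, [0.0] * 192

utt2wav = {'utt1': 'a.wav', 'utt2': 'b.wav'}
utt2embedding = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(extract_one, utt, wav) for utt, wav in utt2wav.items()]
    for future in as_completed(futures):
        utt, embedding = future.result()
        utt2embedding[utt] = embedding
```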
lyuxiang.lx
bcda6d807c add prompt constraint 2024-09-05 17:09:07 +08:00
lyuxiang.lx
4d6a55243c update workflow 2024-09-05 16:25:25 +08:00
Xiang Lyu
33a585374a Merge pull request #354 from FunAudioLLM/dev/lyuxiang.lx
fix readme
2024-09-05 16:16:42 +08:00
lyuxiang.lx
90433f5373 fix lint 2024-09-05 16:15:34 +08:00
lyuxiang.lx
eeebc45313 fix readme 2024-09-05 14:39:42 +08:00
lyuxiang.lx
9100813b79 add lint workflow 2024-09-05 14:37:21 +08:00
Xiang Lyu
7f5e391041 Merge pull request #353 from FunAudioLLM/inference_streaming
onnx and fastapi
2024-09-05 14:28:16 +08:00
lyuxiang.lx
e141634da1 remove unnecessary code 2024-09-05 14:26:12 +08:00
lyuxiang.lx
11eacb810e add uvicorn requirements 2024-09-05 14:12:27 +08:00
lyuxiang.lx
7555afb90a update fastapi 2024-09-05 14:09:40 +08:00
lyuxiang.lx
2ce724045b add onnx export 2024-09-04 18:15:33 +08:00
lyuxiang.lx
7795445ed9 update roadmap 2024-09-04 16:36:18 +08:00
Xiang Lyu
d8197de4cc Merge pull request #347 from hexisyztem/inference_streaming
export onnx
2024-09-03 11:25:15 +08:00
禾息
752103a307 Merge remote-tracking branch 'origin/inference_streaming' into inference_streaming 2024-09-03 11:13:25 +08:00
禾息
a801416805 mirror modify 2024-09-03 11:07:47 +08:00
禾息
fadb22086f export onnx 2024-09-03 11:06:24 +08:00
Xiang Lyu
d2dea3d928 Merge pull request #330 from hexisyztem/flow_tensorrt
Flow tensorrt
2024-08-30 14:20:25 +08:00
Xiang Lyu
ee988420f3 Merge branch 'inference_streaming' into flow_tensorrt 2024-08-30 14:20:06 +08:00
禾息
18599be8d5 mirror modify 2024-08-30 14:15:24 +08:00
zhoubofan.zbf
29408360fb fix bug 2024-08-30 13:43:54 +08:00
禾息
6e7f5b922a update 2024-08-30 13:14:44 +08:00
zhoubofan.zbf
53a3c1b17f update 2024-08-30 00:47:40 +08:00
Xiang Lyu
20e0715dac Merge pull request #327 from FunAudioLLM/inference_streaming
Inference streaming
2024-08-29 23:51:21 +08:00
Xiang Lyu
662012999a Merge branch 'main' into inference_streaming 2024-08-29 23:48:02 +08:00
lyuxiang.lx
1ab3186799 revert trt TODO 2024-08-29 23:35:19 +08:00
zhoubofan.zbf
5f21aef786 add flow decoder tensorrt infer 2024-08-29 23:35:07 +08:00
lyuxiang.lx
1d881df8b2 fix vocoder speech overlap 2024-08-29 19:10:08 +08:00
lyuxiang.lx
f1e374a9bb add trt script TODO 2024-08-29 10:44:04 +08:00
lyuxiang.lx
8b097f7625 add export script 2024-08-28 19:37:35 +08:00
Xiang Lyu
dcc943db43 Merge pull request #323 from FunAudioLLM/hengwu.zty/zh_example
change dropout_rate to float type
2024-08-28 19:00:45 +08:00
lyuxiang.lx
9ab298dd49 add llm export script 2024-08-28 18:21:11 +08:00
Xiang Lyu
bb690d9d1e Merge pull request #293 from FunAudioLLM/hengwu.zty/zh_example
add chinese example
2024-08-19 10:04:31 +08:00
lyuxiang.lx
f4e70e222c update stream code 2024-08-07 17:09:23 +08:00
lyuxiang.lx
02f941d348 update model inference 2024-07-24 19:18:09 +08:00
lyuxiang.lx
a13411c561 add stream code 2024-07-23 00:02:30 +08:00
64 changed files with 62671 additions and 694 deletions

.github/workflows/lint.yml (vendored, new file, 56 lines)

@@ -0,0 +1,56 @@
name: Lint
on:
pull_request:
push:
jobs:
quick-checks:
runs-on: ubuntu-latest
steps:
- name: Fetch CosyVoice
uses: actions/checkout@v1
- name: Checkout PR tip
run: |
set -eux
if [[ "${{ github.event_name }}" == "pull_request" ]]; then
# We are on a PR, so actions/checkout leaves us on a merge commit.
# Check out the actual tip of the branch.
git checkout ${{ github.event.pull_request.head.sha }}
fi
echo ::set-output name=commit_sha::$(git rev-parse HEAD)
id: get_pr_tip
- name: Ensure no tabs
run: |
(! git grep -I -l $'\t' -- . ':(exclude)*.txt' ':(exclude)*.svg' ':(exclude)**Makefile' ':(exclude)**/contrib/**' ':(exclude)third_party' ':(exclude).gitattributes' ':(exclude).gitmodules' || (echo "The above files have tabs; please convert them to spaces"; false))
- name: Ensure no trailing whitespace
run: |
(! git grep -I -n $' $' -- . ':(exclude)*.txt' ':(exclude)third_party' ':(exclude).gitattributes' ':(exclude).gitmodules' || (echo "The above files have trailing whitespace; please remove them"; false))
flake8-py3:
runs-on: ubuntu-latest
steps:
- name: Setup Python
uses: actions/setup-python@v1
with:
python-version: 3.9
architecture: x64
- name: Fetch CosyVoice
uses: actions/checkout@v1
- name: Checkout PR tip
run: |
set -eux
if [[ "${{ github.event_name }}" == "pull_request" ]]; then
# We are on a PR, so actions/checkout leaves us on a merge commit.
# Check out the actual tip of the branch.
git checkout ${{ github.event.pull_request.head.sha }}
fi
echo ::set-output name=commit_sha::$(git rev-parse HEAD)
id: get_pr_tip
- name: Run flake8
run: |
set -eux
pip install flake8==3.8.2 flake8-bugbear flake8-comprehensions flake8-executable flake8-pyi==20.5.0 mccabe pycodestyle==2.6.0 pyflakes==2.2.0
flake8 --version
flake8 --max-line-length 180 --ignore B006,B008,B905,C408,E402,E731,E741,W503,W504 --exclude ./third_party/,./runtime/python/grpc/cosyvoice_pb2*py
if [ $? != 0 ]; then exit 1; fi

.github/workflows/stale-issues.yml (vendored, new file, 22 lines)

@@ -0,0 +1,22 @@
name: Close inactive issues
on:
schedule:
- cron: "30 1 * * *"
jobs:
close-issues:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v5
with:
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"
stale-issue-message: "This issue is stale because it has been open for 30 days with no activity."
close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale."
days-before-pr-stale: -1
days-before-pr-close: -1
repo-token: ${{ secrets.GITHUB_TOKEN }}

.gitignore (vendored, 5 lines changed)

@@ -43,7 +43,10 @@ compile_commands.json
# train/inference files # train/inference files
*.wav *.wav
*.m4a
*.aac
*.pt *.pt
pretrained_models/* pretrained_models/*
*_pb2_grpc.py *_pb2_grpc.py
*_pb2.py *_pb2.py
*.tar

README.md (170 lines changed)

@@ -1,38 +1,51 @@
# CosyVoice [![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)](https://github.com/Akshay090/svg-banners)
## 👉🏻 [CosyVoice Demos](https://fun-audio-llm.github.io/) 👈🏻
[[CosyVoice Paper](https://fun-audio-llm.github.io/pdf/CosyVoice_v1.pdf)][[CosyVoice Studio](https://www.modelscope.cn/studios/iic/CosyVoice-300M)][[CosyVoice Code](https://github.com/FunAudioLLM/CosyVoice)]
For `SenseVoice`, visit [SenseVoice repo](https://github.com/FunAudioLLM/SenseVoice) and [SenseVoice space](https://www.modelscope.cn/studios/iic/SenseVoice). ## 👉🏻 CosyVoice 👈🏻
**CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/spaces/FunAudioLLM/CosyVoice2-0.5B)
**CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/studios/iic/CosyVoice-300M)
## Highlight🔥
**CosyVoice 2.0** has been released! Compared to version 1.0, the new version offers more accurate, more stable, faster, and better speech generation capabilities.
### Multilingual
- **Supported Language**: Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
- **Crosslingual & Mixlingual**Support zero-shot voice cloning for cross-lingual and code-switching scenarios.
### Ultra-Low Latency
- **Bidirectional Streaming Support**: CosyVoice 2.0 integrates offline and streaming modeling technologies.
- **Rapid First Packet Synthesis**: Achieves latency as low as 150ms while maintaining high-quality audio output.
### High Accuracy
- **Improved Pronunciation**: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
- **Benchmark Achievements**: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
### Strong Stability
- **Consistency in Timbre**: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
- **Cross-language Synthesis**: Marked improvements compared to version 1.0.
### Natural Experience
- **Enhanced Prosody and Sound Quality**: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
- **Emotional and Dialectal Flexibility**: Now supports more granular emotional controls and accent adjustments.
## Roadmap ## Roadmap
- [x] 2024/12
- [x] 25hz cosyvoice 2.0 released
- [x] 2024/09
- [x] 25hz cosyvoice base model
- [x] 25hz cosyvoice voice conversion model
- [x] 2024/08
- [x] Repetition Aware Sampling(RAS) inference for llm stability
- [x] Streaming inference mode support, including kv cache and sdpa for rtf optimization
- [x] 2024/07 - [x] 2024/07
- [x] Flow matching training support - [x] Flow matching training support
- [x] WeTextProcessing support when ttsfrd is not avaliable - [x] WeTextProcessing support when ttsfrd is not available
- [x] Fastapi server and client - [x] Fastapi server and client
- [ ] 2024/08
- [ ] Repetition Aware Sampling(RAS) inference for llm stability
- [ ] Streaming inference mode support, including kv cache and sdpa for rtf optimization
- [ ] 2024/09
- [ ] 50hz llm model which supports 10 language
- [ ] 2024/10
- [ ] 50hz llama based llm model which supports lora finetune
- [ ] TBD
- [ ] Support more instruction mode
- [ ] Voice conversion
- [ ] Music generation
- [ ] Training script sample based on Mandarin
- [ ] CosyVoice-500M trained with more multi-lingual data
- [ ] More...
## Install ## Install
@@ -50,7 +63,7 @@ git submodule update --init --recursive
- Create Conda env: - Create Conda env:
``` sh ``` sh
conda create -n cosyvoice python=3.8 conda create -n cosyvoice -y python=3.10
conda activate cosyvoice conda activate cosyvoice
# pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platform. # pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platform.
conda install -y -c conda-forge pynini==2.1.5 conda install -y -c conda-forge pynini==2.1.5
@@ -65,14 +78,14 @@ sudo yum install sox sox-devel
**Model download** **Model download**
We strongly recommend that you download our pretrained `CosyVoice-300M` `CosyVoice-300M-SFT` `CosyVoice-300M-Instruct` model and `CosyVoice-ttsfrd` resource. We strongly recommend that you download our pretrained `CosyVoice2-0.5B` `CosyVoice-300M` `CosyVoice-300M-SFT` `CosyVoice-300M-Instruct` model and `CosyVoice-ttsfrd` resource.
If you are expert in this field, and you are only interested in training your own CosyVoice model from scratch, you can skip this step.
``` python ``` python
# SDK模型下载 # SDK模型下载
from modelscope import snapshot_download from modelscope import snapshot_download
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M') snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-25Hz', local_dir='pretrained_models/CosyVoice-300M-25Hz')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT') snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct') snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd') snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
@@ -81,64 +94,100 @@ snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice
``` sh ``` sh
# git模型下载请确保已安装git lfs # git模型下载请确保已安装git lfs
mkdir -p pretrained_models mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-25Hz.git pretrained_models/CosyVoice-300M-25Hz
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
``` ```
Optionaly, you can unzip `ttsfrd` resouce and install `ttsfrd` package for better text normalization performance. Optionally, you can unzip `ttsfrd` resouce and install `ttsfrd` package for better text normalization performance.
Notice that this step is not necessary. If you do not install `ttsfrd` package, we will use WeTextProcessing by default. Notice that this step is not necessary. If you do not install `ttsfrd` package, we will use WeTextProcessing by default.
``` sh ``` sh
cd pretrained_models/CosyVoice-ttsfrd/ cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d . unzip resource.zip -d .
pip install ttsfrd-0.3.6-cp38-cp38-linux_x86_64.whl pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
``` ```
**Basic Usage** **Basic Usage**
For zero_shot/cross_lingual inference, please use `CosyVoice-300M` model. We strongly recommend using `CosyVoice2-0.5B` for better performance.
For sft inference, please use `CosyVoice-300M-SFT` model. Follow code below for detailed usage of each model.
For instruct inference, please use `CosyVoice-300M-Instruct` model.
First, add `third_party/Matcha-TTS` to your `PYTHONPATH`.
``` sh
export PYTHONPATH=third_party/Matcha-TTS
```
``` python ``` python
from cosyvoice.cli.cosyvoice import CosyVoice import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav from cosyvoice.utils.file_utils import load_wav
import torchaudio import torchaudio
```
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT') **CosyVoice2 Usage**
```python
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
# NOTE if you want to reproduce the results on https://funaudiollm.github.io/cosyvoice2, please add text_frontend=False during inference
# zero_shot usage
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L248
for i, j in enumerate(cosyvoice.inference_cross_lingual('在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。', prompt_speech_16k, stream=False)):
torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# instruct usage
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# bistream usage, you can use generator as input, this is useful when using text llm model as input
# NOTE you should still have some basic sentence split logic because llm can not handle arbitrary sentence length
def text_generator():
yield '收到好友从远方寄来的生日礼物,'
yield '那份意外的惊喜与深深的祝福'
yield '让我心中充满了甜蜜的快乐,'
yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
**CosyVoice Usage**
```python
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT', load_jit=False, load_trt=False, fp16=False)
# sft usage # sft usage
print(cosyvoice.list_avaliable_spks()) print(cosyvoice.list_available_spks())
output = cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女') # change stream=True for chunk stream inference
torchaudio.save('sft.wav', output['tts_speech'], 22050) for i, j in enumerate(cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女', stream=False)):
torchaudio.save('sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M') cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M') # or change to pretrained_models/CosyVoice-300M-25Hz for 25Hz inference
# zero_shot usage, <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean # zero_shot usage, <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000) prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
output = cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k) for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
torchaudio.save('zero_shot.wav', output['tts_speech'], 22050) torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# cross_lingual usage # cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000) prompt_speech_16k = load_wav('./asset/cross_lingual_prompt.wav', 16000)
output = cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k) for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
torchaudio.save('cross_lingual.wav', output['tts_speech'], 22050) torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# vc usage
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
source_speech_16k = load_wav('./asset/cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_vc(source_speech_16k, prompt_speech_16k, stream=False)):
torchaudio.save('vc_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct') cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
# instruct usage, support <laughter></laughter><strong></strong>[laughter][breath] # instruct usage, support <laughter></laughter><strong></strong>[laughter][breath]
output = cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.') for i, j in enumerate(cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.', stream=False)):
torchaudio.save('instruct.wav', output['tts_speech'], 22050) torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
``` ```
**Start web demo** **Start web demo**
You can use our web demo page to get familiar with CosyVoice quickly. You can use our web demo page to get familiar with CosyVoice quickly.
We support sft/zero_shot/cross_lingual/instruct inference in web demo.
Please see the demo website for details. Please see the demo website for details.
@@ -150,12 +199,11 @@ python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
**Advanced Usage** **Advanced Usage**
For advanced user, we have provided train and inference scripts in `examples/libritts/cosyvoice/run.sh`. For advanced user, we have provided train and inference scripts in `examples/libritts/cosyvoice/run.sh`.
You can get familiar with CosyVoice following this recipie.
**Build for deployment** **Build for deployment**
Optionally, if you want to use grpc for service deployment, Optionally, if you want service deployment,
you can run following steps. Otherwise, you can just ignore this step. you can run following steps.
``` sh ``` sh
cd runtime/python cd runtime/python
@@ -163,10 +211,10 @@ docker build -t cosyvoice:v1.0 .
# change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference # change iic/CosyVoice-300M to iic/CosyVoice-300M-Instruct if you want to use instruct inference
# for grpc usage # for grpc usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity" docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity"
python3 grpc/client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct> cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
# for fastapi usage # for fastapi usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && MODEL_DIR=iic/CosyVoice-300M fastapi dev --port 50000 server.py && sleep infinity" docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity"
python3 fastapi/client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct> cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
``` ```
## Discussion & Communication ## Discussion & Communication


@@ -0,0 +1,92 @@
# Copyright (c) 2020 Mobvoi Inc (Di Wu)
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import glob
import yaml
import torch
def get_args():
parser = argparse.ArgumentParser(description='average model')
parser.add_argument('--dst_model', required=True, help='averaged model')
parser.add_argument('--src_path',
required=True,
help='src model path for average')
parser.add_argument('--val_best',
action="store_true",
help='averaged model')
parser.add_argument('--num',
default=5,
type=int,
help='nums for averaged model')
args = parser.parse_args()
print(args)
return args
def main():
args = get_args()
val_scores = []
if args.val_best:
yamls = glob.glob('{}/*.yaml'.format(args.src_path))
yamls = [
f for f in yamls
if not (os.path.basename(f).startswith('train')
or os.path.basename(f).startswith('init'))
]
for y in yamls:
with open(y, 'r') as f:
dic_yaml = yaml.load(f, Loader=yaml.BaseLoader)
loss = float(dic_yaml['loss_dict']['loss'])
epoch = int(dic_yaml['epoch'])
step = int(dic_yaml['step'])
tag = dic_yaml['tag']
val_scores += [[epoch, step, loss, tag]]
sorted_val_scores = sorted(val_scores,
key=lambda x: x[2],
reverse=False)
print("best val (epoch, step, loss, tag) = " +
str(sorted_val_scores[:args.num]))
path_list = [
args.src_path + '/epoch_{}_whole.pt'.format(score[0])
for score in sorted_val_scores[:args.num]
]
print(path_list)
avg = {}
num = args.num
assert num == len(path_list)
for path in path_list:
print('Processing {}'.format(path))
states = torch.load(path, map_location=torch.device('cpu'))
for k in states.keys():
if k not in avg.keys():
avg[k] = states[k].clone()
else:
avg[k] += states[k]
# average
for k in avg.keys():
if avg[k] is not None:
# pytorch 1.6 use true_divide instead of /=
avg[k] = torch.true_divide(avg[k], num)
print('Saving to {}'.format(args.dst_model))
torch.save(avg, args.dst_model)
if __name__ == '__main__':
main()


@@ -0,0 +1,91 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import argparse
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
import os
import sys
import torch
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../..'.format(ROOT_DIR))
sys.path.append('{}/../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
def get_args():
parser = argparse.ArgumentParser(description='export your model for deployment')
parser.add_argument('--model_dir',
type=str,
default='pretrained_models/CosyVoice-300M',
help='local path')
args = parser.parse_args()
print(args)
return args
def get_optimized_script(model, preserved_attrs=[]):
script = torch.jit.script(model)
if preserved_attrs != []:
script = torch.jit.freeze(script, preserved_attrs=preserved_attrs)
else:
script = torch.jit.freeze(script)
script = torch.jit.optimize_for_inference(script)
return script
def main():
args = get_args()
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(levelname)s %(message)s')
torch._C._jit_set_fusion_strategy([('STATIC', 1)])
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)
try:
model = CosyVoice(args.model_dir)
except Exception:
try:
model = CosyVoice2(args.model_dir)
except Exception:
raise TypeError('no valid model_type!')
if not isinstance(model, CosyVoice2):
# 1. export llm text_encoder
llm_text_encoder = model.model.llm.text_encoder
script = get_optimized_script(llm_text_encoder)
script.save('{}/llm.text_encoder.fp32.zip'.format(args.model_dir))
script = get_optimized_script(llm_text_encoder.half())
script.save('{}/llm.text_encoder.fp16.zip'.format(args.model_dir))
# 2. export llm llm
llm_llm = model.model.llm.llm
script = get_optimized_script(llm_llm, ['forward_chunk'])
script.save('{}/llm.llm.fp32.zip'.format(args.model_dir))
script = get_optimized_script(llm_llm.half(), ['forward_chunk'])
script.save('{}/llm.llm.fp16.zip'.format(args.model_dir))
# 3. export flow encoder
flow_encoder = model.model.flow.encoder
script = get_optimized_script(flow_encoder)
script.save('{}/flow.encoder.fp32.zip'.format(args.model_dir))
script = get_optimized_script(flow_encoder.half())
script.save('{}/flow.encoder.fp16.zip'.format(args.model_dir))
if __name__ == '__main__':
main()


@@ -0,0 +1,116 @@
# Copyright (c) 2024 Antgroup Inc (authors: Zhoubofan, hexisyztem@icloud.com)
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import argparse
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
import os
import sys
import onnxruntime
import random
import torch
from tqdm import tqdm
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../..'.format(ROOT_DIR))
sys.path.append('{}/../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
def get_dummy_input(batch_size, seq_len, out_channels, device):
x = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
mask = torch.ones((batch_size, 1, seq_len), dtype=torch.float32, device=device)
mu = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
t = torch.rand((batch_size), dtype=torch.float32, device=device)
spks = torch.rand((batch_size, out_channels), dtype=torch.float32, device=device)
cond = torch.rand((batch_size, out_channels, seq_len), dtype=torch.float32, device=device)
return x, mask, mu, t, spks, cond
def get_args():
parser = argparse.ArgumentParser(description='export your model for deployment')
parser.add_argument('--model_dir',
type=str,
default='pretrained_models/CosyVoice-300M',
help='local path')
args = parser.parse_args()
print(args)
return args
def main():
args = get_args()
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(levelname)s %(message)s')
try:
model = CosyVoice(args.model_dir)
except Exception:
try:
model = CosyVoice2(args.model_dir)
except Exception:
raise TypeError('no valid model_type!')
# 1. export flow decoder estimator
estimator = model.model.flow.decoder.estimator
device = model.model.device
batch_size, seq_len = 2, 256
out_channels = model.model.flow.decoder.estimator.out_channels
x, mask, mu, t, spks, cond = get_dummy_input(batch_size, seq_len, out_channels, device)
torch.onnx.export(
estimator,
(x, mask, mu, t, spks, cond),
'{}/flow.decoder.estimator.fp32.onnx'.format(args.model_dir),
export_params=True,
opset_version=18,
do_constant_folding=True,
input_names=['x', 'mask', 'mu', 't', 'spks', 'cond'],
output_names=['estimator_out'],
dynamic_axes={
'x': {2: 'seq_len'},
'mask': {2: 'seq_len'},
'mu': {2: 'seq_len'},
'cond': {2: 'seq_len'},
'estimator_out': {2: 'seq_len'},
}
)
# 2. test computation consistency
option = onnxruntime.SessionOptions()
option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
option.intra_op_num_threads = 1
providers = ['CUDAExecutionProvider' if torch.cuda.is_available() else 'CPUExecutionProvider']
estimator_onnx = onnxruntime.InferenceSession('{}/flow.decoder.estimator.fp32.onnx'.format(args.model_dir),
sess_options=option, providers=providers)
for _ in tqdm(range(10)):
x, mask, mu, t, spks, cond = get_dummy_input(batch_size, random.randint(16, 512), out_channels, device)
output_pytorch = estimator(x, mask, mu, t, spks, cond)
ort_inputs = {
'x': x.cpu().numpy(),
'mask': mask.cpu().numpy(),
'mu': mu.cpu().numpy(),
't': t.cpu().numpy(),
'spks': spks.cpu().numpy(),
'cond': cond.cpu().numpy()
}
output_onnx = estimator_onnx.run(None, ort_inputs)[0]
torch.testing.assert_allclose(output_pytorch, torch.from_numpy(output_onnx).to(device), rtol=1e-2, atol=1e-4)
if __name__ == "__main__":
main()


@@ -0,0 +1,10 @@
#!/bin/bash
# Copyright 2024 Alibaba Inc. All Rights Reserved.
# download tensorrt from https://developer.nvidia.com/tensorrt/download/10x, check your system and cuda for compatibability
# for example for linux + cuda12.4, you can download https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
TRT_DIR=<YOUR_TRT_DIR>
MODEL_DIR=<COSYVOICE2_MODEL_DIR>
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$TRT_DIR/lib:/usr/local/cuda/lib64
$TRT_DIR/bin/trtexec --onnx=$MODEL_DIR/flow.decoder.estimator.fp32.onnx --saveEngine=$MODEL_DIR/flow.decoder.estimator.fp32.mygpu.plan --minShapes=x:2x80x4,mask:2x1x4,mu:2x80x4,cond:2x80x4 --optShapes=x:2x80x193,mask:2x1x193,mu:2x80x193,cond:2x80x193 --maxShapes=x:2x80x6800,mask:2x1x6800,mu:2x80x6800,cond:2x80x6800 --inputIOFormats=fp32:chw,fp32:chw,fp32:chw,fp32:chw,fp32:chw,fp32:chw --outputIOFormats=fp32:chw
$TRT_DIR/bin/trtexec --onnx=$MODEL_DIR/flow.decoder.estimator.fp32.onnx --saveEngine=$MODEL_DIR/flow.decoder.estimator.fp16.mygpu.plan --fp16 --minShapes=x:2x80x4,mask:2x1x4,mu:2x80x4,cond:2x80x4 --optShapes=x:2x80x193,mask:2x1x193,mu:2x80x193,cond:2x80x193 --maxShapes=x:2x80x6800,mask:2x1x6800,mu:2x80x6800,cond:2x80x6800 --inputIOFormats=fp16:chw,fp16:chw,fp16:chw,fp16:chw,fp16:chw,fp16:chw --outputIOFormats=fp16:chw
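Once trtexec has produced the `.plan` files above, they can be deserialized with the TensorRT Python API. A minimal sketch, not the repository's actual engine-loading code:

```python
# Deserialize a plan file produced by the trtexec commands above (illustrative only).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open('flow.decoder.estimator.fp16.mygpu.plan', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
print('engine loaded:', engine is not None)
```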


@@ -18,16 +18,15 @@ import argparse
import logging import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING) logging.getLogger('matplotlib').setLevel(logging.WARNING)
import os import os
import torch import torch
from torch.utils.data import DataLoader
import torchaudio
from hyperpyyaml import load_hyperpyyaml
from tqdm import tqdm
from cosyvoice.cli.model import CosyVoiceModel
from cosyvoice.dataset.dataset import Dataset
def get_args():
parser = argparse.ArgumentParser(description='inference with your model')
parser.add_argument('--config', required=True, help='config file')
@@ -66,7 +65,8 @@ def main():
model = CosyVoiceModel(configs['llm'], configs['flow'], configs['hift'])
model.load(args.llm_model, args.flow_model, args.hifigan_model)
test_dataset = Dataset(args.prompt_data, data_pipeline=configs['data_pipeline'], mode='inference', shuffle=False, partition=False,
tts_file=args.tts_text, prompt_utt2data=args.prompt_utt2data)
test_data_loader = DataLoader(test_dataset, batch_size=None, num_workers=0)
del configs
@@ -74,13 +74,11 @@ def main():
fn = os.path.join(args.result_dir, 'wav.scp')
f = open(fn, 'w')
with torch.no_grad():
for _, batch in tqdm(enumerate(test_data_loader)):
utts = batch["utts"]
assert len(utts) == 1, "inference mode only support batchsize 1"
text_token = batch["text_token"].to(device)
text_token_len = batch["text_token_len"].to(device)
tts_index = batch["tts_index"]
tts_text_token = batch["tts_text_token"].to(device)
tts_text_token_len = batch["tts_text_token_len"].to(device)
@@ -100,10 +98,13 @@ def main():
'flow_prompt_speech_token': speech_token, 'flow_prompt_speech_token_len': speech_token_len,
'prompt_speech_feat': speech_feat, 'prompt_speech_feat_len': speech_feat_len,
'llm_embedding': utt_embedding, 'flow_embedding': utt_embedding}
tts_speeches = []
for model_output in model.tts(**model_input):
tts_speeches.append(model_output['tts_speech'])
tts_speeches = torch.concat(tts_speeches, dim=1)
tts_key = '{}_{}'.format(utts[0], tts_index[0])
tts_fn = os.path.join(args.result_dir, '{}.wav'.format(tts_key))
torchaudio.save(tts_fn, tts_speeches, sample_rate=22050)
f.write('{} {}\n'.format(tts_key, tts_fn))
f.flush()
f.close()
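The hunk above replaces the single-shot model.inference() call with the generator-style model.tts(), so callers now collect chunks before writing a wav. A minimal sketch of that consumption pattern, assuming model and model_input are prepared exactly as in the script above (the helper name synthesize_to_wav is hypothetical):

import torch
import torchaudio

def synthesize_to_wav(model, model_input, out_path, sample_rate=22050):
    # model.tts() is a generator; each yielded dict carries one audio chunk under 'tts_speech'
    chunks = [out['tts_speech'] for out in model.tts(**model_input)]
    speech = torch.concat(chunks, dim=1)  # concatenate chunks along the time axis
    torchaudio.save(out_path, speech, sample_rate=sample_rate)
    return speech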

View File

@@ -18,6 +18,7 @@ import datetime
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
from copy import deepcopy
import os
import torch
import torch.distributed as dist
import deepspeed
@@ -67,13 +68,17 @@ def get_args():
action='store_true',
default=False,
help='Use pinned memory buffers used for reading')
parser.add_argument('--use_amp',
action='store_true',
default=False,
help='Use automatic mixed precision training')
parser.add_argument('--deepspeed.save_states',
dest='save_states',
default='model_only',
choices=['model_only', 'model+optimizer'],
help='save model/optimizer states')
parser.add_argument('--timeout',
default=60,
type=int,
help='timeout (in seconds) of cosyvoice_join.')
parser = deepspeed.add_config_arguments(parser)
@@ -86,10 +91,16 @@ def main():
args = get_args()
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(levelname)s %(message)s')
# gan train has some special initialization logic
gan = True if args.model == 'hifigan' else False
override_dict = {k: None for k in ['llm', 'flow', 'hift', 'hifigan'] if k != args.model}
if gan is True:
override_dict.pop('hift')
with open(args.config, 'r') as f:
configs = load_hyperpyyaml(f, overrides=override_dict)
if gan is True:
configs['train_conf'] = configs['train_conf_gan']
configs['train_conf'].update(vars(args))
# Init env for ddp
@@ -97,7 +108,7 @@ def main():
# Get dataset & dataloader
train_dataset, cv_dataset, train_data_loader, cv_data_loader = \
init_dataset_and_dataloader(args, configs, gan)
# Do some sanity checks and save config to args.model_dir
configs = check_modify_and_save_config(args, configs)
@@ -107,30 +118,53 @@ def main():
# load checkpoint
model = configs[args.model]
start_step, start_epoch = 0, -1
if args.checkpoint is not None:
if os.path.exists(args.checkpoint):
state_dict = torch.load(args.checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
if 'step' in state_dict:
start_step = state_dict['step']
if 'epoch' in state_dict:
start_epoch = state_dict['epoch']
else:
logging.warning('checkpoint {} do not exsist!'.format(args.checkpoint))
# Dispatch model from cpu to gpu
model = wrap_cuda_model(args, model)
# Get optimizer & scheduler
model, optimizer, scheduler, optimizer_d, scheduler_d = init_optimizer_and_scheduler(args, configs, model, gan)
scheduler.set_step(start_step)
if scheduler_d is not None:
scheduler_d.set_step(start_step)
# Save init checkpoints
info_dict = deepcopy(configs['train_conf'])
info_dict['step'] = start_step
info_dict['epoch'] = start_epoch
save_model(model, 'init', info_dict)
# Get executor
executor = Executor(gan=gan)
executor.step = start_step
# Init scaler, used for pytorch amp mixed precision training
scaler = torch.cuda.amp.GradScaler() if args.use_amp else None
print('start step {} start epoch {}'.format(start_step, start_epoch))
# Start training loop
for epoch in range(start_epoch + 1, info_dict['max_epoch']):
executor.epoch = epoch
train_dataset.set_epoch(epoch)
dist.barrier()
group_join = dist.new_group(backend="gloo", timeout=datetime.timedelta(seconds=args.timeout))
if gan is True:
executor.train_one_epoc_gan(model, optimizer, scheduler, optimizer_d, scheduler_d, train_data_loader, cv_data_loader,
writer, info_dict, scaler, group_join)
else:
executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, scaler, group_join)
dist.destroy_process_group(group_join)
if __name__ == '__main__':
main()
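The training entry point now resumes from 'step' and 'epoch' entries found in the checkpoint dict, fast-forwards the schedulers, and restarts the epoch loop at start_epoch + 1. A minimal sketch of writing such a resumable checkpoint, assuming weights and the two bookkeeping keys live in the same dict (save_resumable_checkpoint is a hypothetical helper; the repo's own save_model may store more):

import torch

def save_resumable_checkpoint(model, step, epoch, path):
    # the loader above calls load_state_dict(state_dict, strict=False)
    # and then reads optional 'step' / 'epoch' keys from the same dict
    state_dict = dict(model.state_dict())
    state_dict['step'] = step
    state_dict['epoch'] = epoch
    torch.save(state_dict, path)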

View File

@@ -12,72 +12,173 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
from typing import Generator
from tqdm import tqdm
from hyperpyyaml import load_hyperpyyaml
from modelscope import snapshot_download
import torch
from cosyvoice.cli.frontend import CosyVoiceFrontEnd
from cosyvoice.cli.model import CosyVoiceModel, CosyVoice2Model
from cosyvoice.utils.file_utils import logging
from cosyvoice.utils.class_utils import get_model_type
class CosyVoice:
def __init__(self, model_dir, load_jit=False, load_trt=False, fp16=False):
self.instruct = True if '-Instruct' in model_dir else False
self.model_dir = model_dir
self.fp16 = fp16
if not os.path.exists(model_dir):
model_dir = snapshot_download(model_dir)
with open('{}/cosyvoice.yaml'.format(model_dir), 'r') as f:
configs = load_hyperpyyaml(f)
assert get_model_type(configs) != CosyVoice2Model, 'do not use {} for CosyVoice initialization!'.format(model_dir)
self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
configs['feat_extractor'],
'{}/campplus.onnx'.format(model_dir),
'{}/speech_tokenizer_v1.onnx'.format(model_dir),
'{}/spk2info.pt'.format(model_dir),
configs['allowed_special'])
self.sample_rate = configs['sample_rate']
if torch.cuda.is_available() is False and (load_jit is True or load_trt is True or fp16 is True):
load_jit, load_trt, fp16 = False, False, False
logging.warning('no cuda device, set load_jit/load_trt/fp16 to False')
self.model = CosyVoiceModel(configs['llm'], configs['flow'], configs['hift'], fp16)
self.model.load('{}/llm.pt'.format(model_dir),
'{}/flow.pt'.format(model_dir),
'{}/hift.pt'.format(model_dir))
self.vllm_codec_engine = None
if load_jit:
self.model.load_jit('{}/llm.text_encoder.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/llm.llm.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/flow.encoder.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'))
if load_trt:
self.model.load_trt('{}/flow.decoder.estimator.{}.mygpu.plan'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/flow.decoder.estimator.fp32.onnx'.format(model_dir),
self.fp16)
del configs
def list_available_spks(self):
spks = list(self.frontend.spk2info.keys())
return spks
def inference_sft(self, tts_text, spk_id, stream=False, speed=1.0, text_frontend=True):
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_sft(i, spk_id)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_zero_shot(self, tts_text, prompt_text, prompt_speech_16k, stream=False, speed=1.0, text_frontend=True):
prompt_text = self.frontend.text_normalize(prompt_text, split=False, text_frontend=text_frontend)
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
if (not isinstance(i, Generator)) and len(i) < 0.5 * len(prompt_text):
logging.warning('synthesis text {} too short than prompt text {}, this may lead to bad performance'.format(i, prompt_text))
model_input = self.frontend.frontend_zero_shot(i, prompt_text, prompt_speech_16k, self.sample_rate)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_cross_lingual(self, tts_text, prompt_speech_16k, stream=False, speed=1.0, text_frontend=True):
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_cross_lingual(i, prompt_speech_16k, self.sample_rate)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_instruct(self, tts_text, spk_id, instruct_text, stream=False, speed=1.0, text_frontend=True):
assert isinstance(self.model, CosyVoiceModel), 'inference_instruct is only implemented for CosyVoice!'
if self.instruct is False:
raise ValueError('{} do not support instruct inference'.format(self.model_dir))
instruct_text = self.frontend.text_normalize(instruct_text, split=False, text_frontend=text_frontend)
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_instruct(i, spk_id, instruct_text)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
def inference_vc(self, source_speech_16k, prompt_speech_16k, stream=False, speed=1.0):
model_input = self.frontend.frontend_vc(source_speech_16k, prompt_speech_16k, self.sample_rate)
start_time = time.time()
for model_output in self.model.vc(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
class CosyVoice2(CosyVoice):
def __init__(self, model_dir, load_jit=False, load_trt=False, fp16=False, use_vllm=False):
self.instruct = True if '-Instruct' in model_dir else False
self.model_dir = model_dir
self.fp16 = fp16
if not os.path.exists(model_dir):
model_dir = snapshot_download(model_dir)
with open('{}/cosyvoice.yaml'.format(model_dir), 'r') as f:
configs = load_hyperpyyaml(f, overrides={'qwen_pretrain_path': os.path.join(model_dir, 'CosyVoice-BlankEN')})
assert get_model_type(configs) == CosyVoice2Model, 'do not use {} for CosyVoice2 initialization!'.format(model_dir)
self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
configs['feat_extractor'],
'{}/campplus.onnx'.format(model_dir),
'{}/speech_tokenizer_v2.onnx'.format(model_dir),
'{}/spk2info.pt'.format(model_dir),
configs['allowed_special'])
self.sample_rate = configs['sample_rate']
if torch.cuda.is_available() is False and (load_jit is True or load_trt is True or fp16 is True):
load_jit, load_trt, fp16 = False, False, False
logging.warning('no cuda device, set load_jit/load_trt/fp16 to False')
self.model = CosyVoice2Model(configs['llm'], configs['flow'], configs['hift'], fp16)
self.model.load('{}/llm.pt'.format(model_dir),
'{}/flow.pt'.format(model_dir),
'{}/hift.pt'.format(model_dir))
self.vllm_codec_engine = None
if use_vllm:
from vllm import EngineArgs, LLMEngine
self.model.export_codec_vllm(''.join([model_dir, '/codec_vllm_model']))
engine_args = EngineArgs(model=''.join([model_dir, '/codec_vllm_model']),
skip_tokenizer_init=True,
gpu_memory_utilization=0.2)
self.vllm_codec_engine = LLMEngine.from_engine_args(engine_args)
self.model.vllm_codec_engine = self.vllm_codec_engine
if load_jit:
self.model.load_jit('{}/flow.encoder.{}.zip'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'))
if load_trt:
self.model.load_trt('{}/flow.decoder.estimator.{}.mygpu.plan'.format(model_dir, 'fp16' if self.fp16 is True else 'fp32'),
'{}/flow.decoder.estimator.fp32.onnx'.format(model_dir),
self.fp16)
del configs
def inference_instruct(self, *args, **kwargs):
raise NotImplementedError('inference_instruct is not implemented for CosyVoice2!')
def inference_instruct2(self, tts_text, instruct_text, prompt_speech_16k, stream=False, speed=1.0, text_frontend=True):
assert isinstance(self.model, CosyVoice2Model), 'inference_instruct2 is only implemented for CosyVoice2!'
for i in tqdm(self.frontend.text_normalize(tts_text, split=True, text_frontend=text_frontend)):
model_input = self.frontend.frontend_instruct2(i, instruct_text, prompt_speech_16k, self.sample_rate)
start_time = time.time()
logging.info('synthesis text {}'.format(i))
for model_output in self.model.tts(**model_input, stream=stream, speed=speed):
speech_len = model_output['tts_speech'].shape[1] / self.sample_rate
logging.info('yield speech len {}, rtf {}'.format(speech_len, (time.time() - start_time) / speech_len))
yield model_output
start_time = time.time()
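With these changes every inference_* method on CosyVoice and CosyVoice2 is a generator that yields audio chunks instead of returning one concatenated tensor. A minimal usage sketch, assuming a local CosyVoice2 model directory, a 16 kHz prompt wav, and the repository's load_wav helper (the paths and texts are placeholders):

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
prompt_speech_16k = load_wav('./prompt.wav', 16000)  # placeholder prompt audio
text = 'The birthday gift from a faraway friend filled me with sweet happiness.'
for idx, out in enumerate(cosyvoice.inference_zero_shot(text, 'hope you like it', prompt_speech_16k, stream=True)):
    # each yielded dict holds one chunk of synthesized audio at cosyvoice.sample_rate
    torchaudio.save('zero_shot_{}.wav'.format(idx), out['tts_speech'], cosyvoice.sample_rate)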

View File

@@ -12,6 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import partial
from typing import Generator
import json
import onnxruntime
import torch
import numpy as np
@@ -30,7 +32,8 @@ except ImportError:
from tn.chinese.normalizer import Normalizer as ZhNormalizer
from tn.english.normalizer import Normalizer as EnNormalizer
use_ttsfrd = False
from cosyvoice.utils.file_utils import logging
from cosyvoice.utils.frontend_utils import contains_chinese, replace_blank, replace_corner_mark, remove_bracket, spell_out_number, split_paragraph, is_only_punctuation
class CosyVoiceFrontEnd:
@@ -41,7 +44,6 @@ class CosyVoiceFrontEnd:
campplus_model: str,
speech_tokenizer_model: str,
spk2info: str = '',
allowed_special: str = 'all'):
self.tokenizer = get_tokenizer()
self.feat_extractor = feat_extractor
@@ -50,34 +52,51 @@ class CosyVoiceFrontEnd:
option.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
option.intra_op_num_threads = 1
self.campplus_session = onnxruntime.InferenceSession(campplus_model, sess_options=option, providers=["CPUExecutionProvider"])
self.speech_tokenizer_session = onnxruntime.InferenceSession(speech_tokenizer_model, sess_options=option,
providers=["CUDAExecutionProvider" if torch.cuda.is_available() else
"CPUExecutionProvider"])
if os.path.exists(spk2info):
self.spk2info = torch.load(spk2info, map_location=self.device)
else:
self.spk2info = {}
self.allowed_special = allowed_special
self.use_ttsfrd = use_ttsfrd
if self.use_ttsfrd:
self.frd = ttsfrd.TtsFrontendEngine()
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
assert self.frd.initialize('{}/../../pretrained_models/CosyVoice-ttsfrd/resource'.format(ROOT_DIR)) is True, \
'failed to initialize ttsfrd resource'
self.frd.set_lang_type('pinyinvg')
else:
self.zh_tn_model = ZhNormalizer(remove_erhua=False, full_to_half=False, overwrite_cache=True)
self.en_tn_model = EnNormalizer()
self.inflect_parser = inflect.engine()
def _extract_text_token(self, text):
if isinstance(text, Generator):
logging.info('get tts_text generator, will return _extract_text_token_generator!')
# NOTE add a dummy text_token_len for compatibility
return self._extract_text_token_generator(text), torch.tensor([0], dtype=torch.int32).to(self.device)
else:
text_token = self.tokenizer.encode(text, allowed_special=self.allowed_special)
text_token = torch.tensor([text_token], dtype=torch.int32).to(self.device)
text_token_len = torch.tensor([text_token.shape[1]], dtype=torch.int32).to(self.device)
return text_token, text_token_len
def _extract_text_token_generator(self, text_generator):
for text in text_generator:
text_token, _ = self._extract_text_token(text)
for i in range(text_token.shape[1]):
yield text_token[:, i: i + 1]
def _extract_speech_token(self, speech):
assert speech.shape[1] / 16000 <= 30, 'do not support extract speech token for audio longer than 30s'
feat = whisper.log_mel_spectrogram(speech, n_mels=128)
speech_token = self.speech_tokenizer_session.run(None,
{self.speech_tokenizer_session.get_inputs()[0].name:
feat.detach().cpu().numpy(),
self.speech_tokenizer_session.get_inputs()[1].name:
np.array([feat.shape[2]], dtype=np.int32)})[0].flatten().tolist()
speech_token = torch.tensor([speech_token], dtype=torch.int32).to(self.device)
speech_token_len = torch.tensor([speech_token.shape[1]], dtype=torch.int32).to(self.device)
return speech_token, speech_token_len
@@ -88,7 +107,8 @@ class CosyVoiceFrontEnd:
dither=0,
sample_frequency=16000)
feat = feat - feat.mean(dim=0, keepdim=True)
embedding = self.campplus_session.run(None,
{self.campplus_session.get_inputs()[0].name: feat.unsqueeze(dim=0).cpu().numpy()})[0].flatten().tolist()
embedding = torch.tensor([embedding]).to(self.device)
return embedding
@@ -98,35 +118,35 @@ class CosyVoiceFrontEnd:
speech_feat_len = torch.tensor([speech_feat.shape[1]], dtype=torch.int32).to(self.device)
return speech_feat, speech_feat_len
def text_normalize(self, text, split=True, text_frontend=True):
if isinstance(text, Generator):
logging.info('get tts_text generator, will skip text_normalize!')
return [text]
if text_frontend is False:
return [text] if split is True else text
text = text.strip()
if self.use_ttsfrd:
texts = [i["text"] for i in json.loads(self.frd.do_voicegen_frd(text))["sentences"]]
text = ''.join(texts)
else:
if contains_chinese(text):
text = self.zh_tn_model.normalize(text)
text = text.replace("\n", "")
text = replace_blank(text)
text = replace_corner_mark(text)
text = text.replace(".", "")
text = text.replace(" - ", "")
text = remove_bracket(text)
text = re.sub(r'[,、]+$', '', text)
texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "zh", token_max_n=80,
token_min_n=60, merge_len=20, comma_split=False))
else:
text = self.en_tn_model.normalize(text)
text = spell_out_number(text, self.inflect_parser)
texts = list(split_paragraph(text, partial(self.tokenizer.encode, allowed_special=self.allowed_special), "en", token_max_n=80,
token_min_n=60, merge_len=20, comma_split=False))
texts = [i for i in texts if not is_only_punctuation(i)]
return texts if split is True else text
def frontend_sft(self, tts_text, spk_id):
tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
@@ -134,12 +154,17 @@ class CosyVoiceFrontEnd:
model_input = {'text': tts_text_token, 'text_len': tts_text_token_len, 'llm_embedding': embedding, 'flow_embedding': embedding}
return model_input
def frontend_zero_shot(self, tts_text, prompt_text, prompt_speech_16k, resample_rate):
tts_text_token, tts_text_token_len = self._extract_text_token(tts_text)
prompt_text_token, prompt_text_token_len = self._extract_text_token(prompt_text)
prompt_speech_resample = torchaudio.transforms.Resample(orig_freq=16000, new_freq=resample_rate)(prompt_speech_16k)
speech_feat, speech_feat_len = self._extract_speech_feat(prompt_speech_resample)
speech_token, speech_token_len = self._extract_speech_token(prompt_speech_16k)
if resample_rate == 24000:
# cosyvoice2, force speech_feat % speech_token = 2
token_len = min(int(speech_feat.shape[1] / 2), speech_token.shape[1])
speech_feat, speech_feat_len[:] = speech_feat[:, :2 * token_len], 2 * token_len
speech_token, speech_token_len[:] = speech_token[:, :token_len], token_len
embedding = self._extract_spk_embedding(prompt_speech_16k)
model_input = {'text': tts_text_token, 'text_len': tts_text_token_len,
'prompt_text': prompt_text_token, 'prompt_text_len': prompt_text_token_len,
@@ -149,8 +174,8 @@ class CosyVoiceFrontEnd:
'llm_embedding': embedding, 'flow_embedding': embedding}
return model_input
def frontend_cross_lingual(self, tts_text, prompt_speech_16k, resample_rate):
model_input = self.frontend_zero_shot(tts_text, '', prompt_speech_16k, resample_rate)
# in cross lingual mode, we remove prompt in llm
del model_input['prompt_text']
del model_input['prompt_text_len']
@@ -166,3 +191,21 @@ class CosyVoiceFrontEnd:
model_input['prompt_text'] = instruct_text_token
model_input['prompt_text_len'] = instruct_text_token_len
return model_input
def frontend_instruct2(self, tts_text, instruct_text, prompt_speech_16k, resample_rate):
model_input = self.frontend_zero_shot(tts_text, instruct_text + '<|endofprompt|>', prompt_speech_16k, resample_rate)
del model_input['llm_prompt_speech_token']
del model_input['llm_prompt_speech_token_len']
return model_input
def frontend_vc(self, source_speech_16k, prompt_speech_16k, resample_rate):
prompt_speech_token, prompt_speech_token_len = self._extract_speech_token(prompt_speech_16k)
prompt_speech_resample = torchaudio.transforms.Resample(orig_freq=16000, new_freq=resample_rate)(prompt_speech_16k)
prompt_speech_feat, prompt_speech_feat_len = self._extract_speech_feat(prompt_speech_resample)
embedding = self._extract_spk_embedding(prompt_speech_16k)
source_speech_token, source_speech_token_len = self._extract_speech_token(source_speech_16k)
model_input = {'source_speech_token': source_speech_token, 'source_speech_token_len': source_speech_token_len,
'flow_prompt_speech_token': prompt_speech_token, 'flow_prompt_speech_token_len': prompt_speech_token_len,
'prompt_speech_feat': prompt_speech_feat, 'prompt_speech_feat_len': prompt_speech_feat_len,
'flow_embedding': embedding}
return model_input
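The frontend now also accepts a Python generator as tts_text: text_normalize passes it through untouched, _extract_text_token returns a token generator with a dummy length, and CosyVoice2Model consumes it through inference_bistream. A minimal sketch of streaming text in, continuing the cosyvoice / prompt_speech_16k setup sketched above:

def text_generator():
    # text arrives piece by piece, e.g. from an upstream LLM
    yield 'The birthday gift from a faraway friend, '
    yield 'with its unexpected surprise and deep blessing, '
    yield 'filled my heart with sweet happiness.'

chunks = []
for out in cosyvoice.inference_zero_shot(text_generator(), 'hope you like it', prompt_speech_16k, stream=True):
    chunks.append(out['tts_speech'])  # handle each audio chunk as soon as it is produced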

View File

@@ -11,50 +11,454 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from typing import Generator
import torch
import numpy as np
import threading
import time
from torch import nn
from torch.nn import functional as F
from contextlib import nullcontext
import uuid
from cosyvoice.utils.common import fade_in_out
from cosyvoice.utils.file_utils import convert_onnx_to_trt
class CosyVoiceModel:
def __init__(self,
llm: torch.nn.Module,
flow: torch.nn.Module,
hift: torch.nn.Module,
fp16: bool):
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.llm = llm
self.flow = flow
self.hift = hift
self.fp16 = fp16
self.llm.fp16 = fp16
self.flow.fp16 = fp16
if self.fp16 is True:
self.llm.half()
self.flow.half()
self.token_min_hop_len = 2 * self.flow.input_frame_rate
self.token_max_hop_len = 4 * self.flow.input_frame_rate
self.token_overlap_len = 20
# here we fix set flow.decoder.estimator.static_chunk_size = 0 for compatibability
self.flow.decoder.estimator.static_chunk_size = 0
# mel fade in out
self.mel_overlap_len = int(self.token_overlap_len / self.flow.input_frame_rate * 22050 / 256)
self.mel_window = np.hamming(2 * self.mel_overlap_len)
# hift cache
self.mel_cache_len = 20
self.source_cache_len = int(self.mel_cache_len * 256)
# speech fade in out
self.speech_window = np.hamming(2 * self.source_cache_len)
# rtf and decoding related
self.stream_scale_factor = 1
assert self.stream_scale_factor >= 1, 'stream_scale_factor should be greater than 1, change it according to your actual rtf'
self.llm_context = torch.cuda.stream(torch.cuda.Stream(self.device)) if torch.cuda.is_available() else nullcontext()
self.lock = threading.Lock()
# dict used to store session related variable
self.tts_speech_token_dict = {}
self.llm_end_dict = {}
self.mel_overlap_dict = {}
self.flow_cache_dict = {}
self.hift_cache_dict = {}
self.vllm_codec_engine = None
def load(self, llm_model, flow_model, hift_model):
self.llm.load_state_dict(torch.load(llm_model, map_location=self.device), strict=True)
self.llm.to(self.device).eval()
self.flow.load_state_dict(torch.load(flow_model, map_location=self.device), strict=True)
self.flow.to(self.device).eval()
# in case hift_model is a hifigan model
hift_state_dict = {k.replace('generator.', ''): v for k, v in torch.load(hift_model, map_location=self.device).items()}
self.hift.load_state_dict(hift_state_dict, strict=True)
self.hift.to(self.device).eval()
def load_jit(self, llm_text_encoder_model, llm_llm_model, flow_encoder_model):
llm_text_encoder = torch.jit.load(llm_text_encoder_model, map_location=self.device)
self.llm.text_encoder = llm_text_encoder
llm_llm = torch.jit.load(llm_llm_model, map_location=self.device)
self.llm.llm = llm_llm
flow_encoder = torch.jit.load(flow_encoder_model, map_location=self.device)
self.flow.encoder = flow_encoder
def load_trt(self, flow_decoder_estimator_model, flow_decoder_onnx_model, fp16):
assert torch.cuda.is_available(), 'tensorrt only supports gpu!'
if not os.path.exists(flow_decoder_estimator_model):
convert_onnx_to_trt(flow_decoder_estimator_model, flow_decoder_onnx_model, fp16)
if os.path.getsize(flow_decoder_estimator_model) == 0:
raise ValueError('{} is empty file, delete it and export again!'.format(flow_decoder_estimator_model))
del self.flow.decoder.estimator
import tensorrt as trt
with open(flow_decoder_estimator_model, 'rb') as f:
self.flow.decoder.estimator_engine = trt.Runtime(trt.Logger(trt.Logger.INFO)).deserialize_cuda_engine(f.read())
if self.flow.decoder.estimator_engine is None:
raise ValueError('failed to load trt {}'.format(flow_decoder_estimator_model))
self.flow.decoder.estimator = self.flow.decoder.estimator_engine.create_execution_context()
def llm_job(self, text, prompt_text, llm_prompt_speech_token, llm_embedding, uuid):
with self.llm_context:
if isinstance(text, Generator):
assert isinstance(self, CosyVoice2Model), 'streaming input text is only implemented for CosyVoice2!'
if self.vllm_codec_engine is None:
for i in self.llm.inference_bistream(text=text,
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device)):
self.tts_speech_token_dict[uuid].append(i)
else:
for i in self.llm.inference_bistream_vllm(text=text,
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device),
vllm_codec_engine=self.vllm_codec_engine):
self.tts_speech_token_dict[uuid].append(i)
else:
for i in self.llm.inference(text=text.to(self.device),
text_len=torch.tensor([text.shape[1]], dtype=torch.int32).to(self.device),
prompt_text=prompt_text.to(self.device),
prompt_text_len=torch.tensor([prompt_text.shape[1]], dtype=torch.int32).to(self.device),
prompt_speech_token=llm_prompt_speech_token.to(self.device),
prompt_speech_token_len=torch.tensor([llm_prompt_speech_token.shape[1]], dtype=torch.int32).to(self.device),
embedding=llm_embedding.to(self.device),
vllm_codec_engine=self.vllm_codec_engine):
self.tts_speech_token_dict[uuid].append(i)
self.llm_end_dict[uuid] = True
def token2wav(self, token, prompt_token, prompt_feat, embedding, uuid, finalize=False, speed=1.0):
tts_mel, flow_cache = self.flow.inference(token=token.to(self.device),
token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
prompt_token=prompt_token.to(self.device),
prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(self.device),
prompt_feat=prompt_feat.to(self.device),
prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(self.device),
embedding=embedding.to(self.device),
flow_cache=self.flow_cache_dict[uuid])
self.flow_cache_dict[uuid] = flow_cache
# mel overlap fade in out
if self.mel_overlap_dict[uuid].shape[2] != 0:
tts_mel = fade_in_out(tts_mel, self.mel_overlap_dict[uuid], self.mel_window)
# append hift cache
if self.hift_cache_dict[uuid] is not None:
hift_cache_mel, hift_cache_source = self.hift_cache_dict[uuid]['mel'], self.hift_cache_dict[uuid]['source']
tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
else:
hift_cache_source = torch.zeros(1, 1, 0)
# keep overlap mel and hift cache
if finalize is False:
self.mel_overlap_dict[uuid] = tts_mel[:, :, -self.mel_overlap_len:]
tts_mel = tts_mel[:, :, :-self.mel_overlap_len]
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
self.hift_cache_dict[uuid] = {'mel': tts_mel[:, :, -self.mel_cache_len:],
'source': tts_source[:, :, -self.source_cache_len:],
'speech': tts_speech[:, -self.source_cache_len:]}
tts_speech = tts_speech[:, :-self.source_cache_len]
else:
if speed != 1.0:
assert self.hift_cache_dict[uuid] is None, 'speed change only support non-stream inference mode'
tts_mel = F.interpolate(tts_mel, size=int(tts_mel.shape[2] / speed), mode='linear')
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
return tts_speech
def tts(self, text, flow_embedding, llm_embedding=torch.zeros(0, 192),
prompt_text=torch.zeros(1, 0, dtype=torch.int32),
llm_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
flow_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
prompt_speech_feat=torch.zeros(1, 0, 80), stream=False, speed=1.0, **kwargs):
# this_uuid is used to track variables related to this inference thread
this_uuid = str(uuid.uuid1())
with self.lock:
self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = [], False
self.hift_cache_dict[this_uuid] = None
self.mel_overlap_dict[this_uuid] = torch.zeros(1, 80, 0)
self.flow_cache_dict[this_uuid] = torch.zeros(1, 80, 0, 2)
p = threading.Thread(target=self.llm_job, args=(text, prompt_text, llm_prompt_speech_token, llm_embedding, this_uuid))
p.start()
if stream is True:
token_hop_len = self.token_min_hop_len
while True:
time.sleep(0.1)
if len(self.tts_speech_token_dict[this_uuid]) >= token_hop_len + self.token_overlap_len:
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_hop_len + self.token_overlap_len]) \
.unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=False)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict[this_uuid] = self.tts_speech_token_dict[this_uuid][token_hop_len:]
# increase token_hop_len for better speech quality
token_hop_len = min(self.token_max_hop_len, int(token_hop_len * self.stream_scale_factor))
if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) < token_hop_len + self.token_overlap_len:
break
p.join()
# deal with remain tokens, make sure inference remain token len equals token_hop_len when cache_speech is not None
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=True)
yield {'tts_speech': this_tts_speech.cpu()}
else:
# deal with all tokens
p.join()
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=True,
speed=speed)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict.pop(this_uuid)
self.llm_end_dict.pop(this_uuid)
self.mel_overlap_dict.pop(this_uuid)
self.hift_cache_dict.pop(this_uuid)
self.flow_cache_dict.pop(this_uuid)
torch.cuda.empty_cache()
def vc(self, source_speech_token, flow_prompt_speech_token, prompt_speech_feat, flow_embedding, stream=False, speed=1.0, **kwargs):
# this_uuid is used to track variables related to this inference thread
this_uuid = str(uuid.uuid1())
with self.lock:
self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = source_speech_token.flatten().tolist(), True
self.hift_cache_dict[this_uuid] = None
self.mel_overlap_dict[this_uuid] = torch.zeros(1, 80, 0)
self.flow_cache_dict[this_uuid] = torch.zeros(1, 80, 0, 2)
if stream is True:
token_hop_len = self.token_min_hop_len
while True:
if len(self.tts_speech_token_dict[this_uuid]) >= token_hop_len + self.token_overlap_len:
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_hop_len + self.token_overlap_len]) \
.unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=False)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict[this_uuid] = self.tts_speech_token_dict[this_uuid][token_hop_len:]
# increase token_hop_len for better speech quality
token_hop_len = min(self.token_max_hop_len, int(token_hop_len * self.stream_scale_factor))
if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) < token_hop_len + self.token_overlap_len:
break
# deal with remain tokens, make sure inference remain token len equals token_hop_len when cache_speech is not None
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=True)
yield {'tts_speech': this_tts_speech.cpu()}
else:
# deal with all tokens
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
finalize=True,
speed=speed)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict.pop(this_uuid)
self.llm_end_dict.pop(this_uuid)
self.mel_overlap_dict.pop(this_uuid)
self.hift_cache_dict.pop(this_uuid)
torch.cuda.empty_cache()
class CosyVoice2Model(CosyVoiceModel):
def __init__(self,
llm: torch.nn.Module,
flow: torch.nn.Module,
hift: torch.nn.Module,
fp16: bool):
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.llm = llm
self.flow = flow
self.hift = hift
self.fp16 = fp16
self.llm.fp16 = fp16
self.flow.fp16 = fp16
if self.fp16 is True:
self.llm.half()
self.flow.half()
self.token_hop_len = 2 * self.flow.input_frame_rate
# here we fix flow encoder/decoder decoding_chunk_size, in the future we will send it as arguments, or use cache
self.flow.encoder.static_chunk_size = 2 * self.flow.input_frame_rate
self.flow.decoder.estimator.static_chunk_size = 2 * self.flow.input_frame_rate * self.flow.token_mel_ratio
# hift cache
self.mel_cache_len = 8
self.source_cache_len = int(self.mel_cache_len * 480)
# speech fade in out
self.speech_window = np.hamming(2 * self.source_cache_len)
# rtf and decoding related
self.stream_scale_factor = 1
self.llm_context = torch.cuda.stream(torch.cuda.Stream(self.device)) if torch.cuda.is_available() else nullcontext()
self.lock = threading.Lock()
# dict used to store session related variable
self.tts_speech_token_dict = {}
self.llm_end_dict = {}
self.hift_cache_dict = {}
self.vllm_codec_engine = None
def load_jit(self, flow_encoder_model):
flow_encoder = torch.jit.load(flow_encoder_model, map_location=self.device)
self.flow.encoder = flow_encoder
def export_codec_vllm(self, model_path):
if os.path.exists(model_path):
return
pad_to = DEFAULT_VOCAB_PADDING_SIZE = 64
vocab_size = self.llm.speech_embedding.num_embeddings
feature_size = self.llm.speech_embedding.embedding_dim
pad_vocab_size = ((vocab_size + pad_to - 1) // pad_to) * pad_to
dtype = torch.bfloat16
# lm_head
new_lm_head = nn.Linear(in_features=feature_size, out_features=pad_vocab_size, bias=True)
with torch.no_grad():
new_lm_head.weight[:vocab_size] = self.llm.llm_decoder.weight
new_lm_head.bias[:vocab_size] = self.llm.llm_decoder.bias
new_lm_head.weight[vocab_size:] = 0
new_lm_head.bias[vocab_size:] = 0
self.llm.llm.model.lm_head = new_lm_head
new_codec_embed = nn.Linear(in_features=feature_size, out_features=pad_vocab_size)
# embed_tokens
embed_tokens = self.llm.llm.model.model.embed_tokens
with torch.no_grad():
new_codec_embed.weight[:vocab_size] = self.llm.speech_embedding.weight
new_codec_embed.weight[vocab_size:] = 0
self.llm.llm.model.set_input_embeddings(new_codec_embed)
self.llm.llm.model.to(self.device)
self.llm.llm.model.to(dtype)
tmp_vocab_size = self.llm.llm.model.config.vocab_size
tmp_tie_embedding = self.llm.llm.model.config.tie_word_embeddings
del self.llm.llm.model.generation_config.eos_token_id
del self.llm.llm.model.config.bos_token_id
del self.llm.llm.model.config.eos_token_id
self.llm.llm.model.config.vocab_size = pad_vocab_size
self.llm.llm.model.config.tie_word_embeddings = False
self.llm.llm.model.config.use_bias = True
self.llm.llm.model.save_pretrained(model_path)
self.llm.llm.model.config.vocab_size = tmp_vocab_size
self.llm.llm.model.config.tie_word_embeddings = tmp_tie_embedding
self.llm.llm.model.set_input_embeddings(embed_tokens)
def token2wav(self, token, prompt_token, prompt_feat, embedding, uuid, token_offset, finalize=False, speed=1.0):
tts_mel, _ = self.flow.inference(token=token.to(self.device),
token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
prompt_token=prompt_token.to(self.device),
prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(self.device),
prompt_feat=prompt_feat.to(self.device),
prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(self.device),
embedding=embedding.to(self.device),
finalize=finalize)
tts_mel = tts_mel[:, :, token_offset * self.flow.token_mel_ratio:]
# append hift cache
if self.hift_cache_dict[uuid] is not None:
hift_cache_mel, hift_cache_source = self.hift_cache_dict[uuid]['mel'], self.hift_cache_dict[uuid]['source']
tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
else:
hift_cache_source = torch.zeros(1, 1, 0)
# keep overlap mel and hift cache
if finalize is False:
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
self.hift_cache_dict[uuid] = {'mel': tts_mel[:, :, -self.mel_cache_len:],
'source': tts_source[:, :, -self.source_cache_len:],
'speech': tts_speech[:, -self.source_cache_len:]}
tts_speech = tts_speech[:, :-self.source_cache_len]
else:
if speed != 1.0:
assert self.hift_cache_dict[uuid] is None, 'speed change only support non-stream inference mode'
tts_mel = F.interpolate(tts_mel, size=int(tts_mel.shape[2] / speed), mode='linear')
tts_speech, tts_source = self.hift.inference(speech_feat=tts_mel, cache_source=hift_cache_source)
if self.hift_cache_dict[uuid] is not None:
tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
return tts_speech
def tts(self, text, flow_embedding, llm_embedding=torch.zeros(0, 192),
prompt_text=torch.zeros(1, 0, dtype=torch.int32),
llm_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
flow_prompt_speech_token=torch.zeros(1, 0, dtype=torch.int32),
prompt_speech_feat=torch.zeros(1, 0, 80), stream=False, speed=1.0, **kwargs):
# this_uuid is used to track variables related to this inference thread
this_uuid = str(uuid.uuid1())
with self.lock:
self.tts_speech_token_dict[this_uuid], self.llm_end_dict[this_uuid] = [], False
self.hift_cache_dict[this_uuid] = None
p = threading.Thread(target=self.llm_job, args=(text, prompt_text, llm_prompt_speech_token, llm_embedding, this_uuid))
p.start()
if stream is True:
token_offset = 0
while True:
time.sleep(0.1)
if len(self.tts_speech_token_dict[this_uuid]) - token_offset >= self.token_hop_len + self.flow.pre_lookahead_len:
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid][:token_offset + self.token_hop_len + self.flow.pre_lookahead_len]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
token_offset=token_offset,
finalize=False)
token_offset += self.token_hop_len
yield {'tts_speech': this_tts_speech.cpu()}
if self.llm_end_dict[this_uuid] is True and len(self.tts_speech_token_dict[this_uuid]) - token_offset < self.token_hop_len + self.flow.pre_lookahead_len:
break
p.join()
# deal with remain tokens, make sure inference remain token len equals token_hop_len when cache_speech is not None
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
token_offset=token_offset,
finalize=True)
yield {'tts_speech': this_tts_speech.cpu()}
else:
# deal with all tokens
p.join()
this_tts_speech_token = torch.tensor(self.tts_speech_token_dict[this_uuid]).unsqueeze(dim=0)
this_tts_speech = self.token2wav(token=this_tts_speech_token,
prompt_token=flow_prompt_speech_token,
prompt_feat=prompt_speech_feat,
embedding=flow_embedding,
uuid=this_uuid,
token_offset=0,
finalize=True,
speed=speed)
yield {'tts_speech': this_tts_speech.cpu()}
with self.lock:
self.tts_speech_token_dict.pop(this_uuid)
self.llm_end_dict.pop(this_uuid)
torch.cuda.empty_cache()
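The streaming path above advances in fixed token hops and keeps a small mel/source cache so consecutive chunks can be cross-faded. The cache sizes follow from the 22.05 kHz output and the vocoder's 256-sample hop; a small sketch of that bookkeeping, assuming CosyVoice's 25 Hz speech-token rate (input_frame_rate = 25 is an assumption taken from the model configs, not shown in this diff):

import numpy as np

input_frame_rate = 25                                   # speech tokens per second (assumed)
token_min_hop_len = 2 * input_frame_rate                # 50 tokens, i.e. about 2 s of audio per streamed chunk
token_overlap_len = 20                                  # tokens shared between consecutive chunks
mel_overlap_len = int(token_overlap_len / input_frame_rate * 22050 / 256)   # mel frames used for fade-in/out
mel_cache_len, source_cache_len = 20, 20 * 256          # vocoder cache in mel frames / waveform samples
speech_window = np.hamming(2 * source_cache_len)        # cross-fade window applied at chunk joins
print(mel_overlap_len, source_cache_len)                # -> 68 5120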

View File

@@ -126,6 +126,7 @@ class DataList(IterableDataset):
def Dataset(data_list_file,
data_pipeline,
mode='train',
gan=False,
shuffle=True,
partition=True,
tts_file='',
@@ -148,13 +149,16 @@ def Dataset(data_list_file,
tts_data = json.load(f)
utt2lists = read_json_lists(prompt_utt2data)
# filter unnecessary file in inference mode
lists = list({utt2lists[utt] for utt in tts_data.keys() if utt2lists[utt] in lists})
dataset = DataList(lists,
shuffle=shuffle,
partition=partition)
if mode == 'inference':
# map partial arg to parquet_opener func in inference mode
data_pipeline[0] = partial(data_pipeline[0], tts_data=tts_data)
if gan is True:
# map partial arg to padding func in gan mode
data_pipeline[-1] = partial(data_pipeline[-1], gan=gan)
for func in data_pipeline:
dataset = Processor(dataset, func, mode=mode)
return dataset
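Dataset() now takes a gan flag and forwards it to the last stage of the data pipeline (padding), which is what the hifigan training path in train.py relies on. A minimal sketch of building such a dataset, assuming a data list file and a configs dict loaded from the training yaml (both are placeholders):

from cosyvoice.dataset.dataset import Dataset

# gan=True makes the padding stage emit the extra fields GAN training needs
train_dataset = Dataset('data/train.data.list', data_pipeline=configs['data_pipeline'],
                        mode='train', gan=True, shuffle=True, partition=True)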

View File

@@ -20,10 +20,10 @@ import torch
import torchaudio import torchaudio
from torch.nn.utils.rnn import pad_sequence from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F import torch.nn.functional as F
import pyworld as pw
torchaudio.set_audio_backend('soundfile')
AUDIO_FORMAT_SETS = set(['flac', 'mp3', 'm4a', 'ogg', 'opus', 'wav', 'wma']) AUDIO_FORMAT_SETS = {'flac', 'mp3', 'm4a', 'ogg', 'opus', 'wav', 'wma'}
def parquet_opener(data, mode='train', tts_data={}): def parquet_opener(data, mode='train', tts_data={}):
@@ -40,20 +40,22 @@ def parquet_opener(data, mode='train', tts_data={}):
assert 'src' in sample assert 'src' in sample
url = sample['src'] url = sample['src']
try: try:
df = pq.read_table(url).to_pandas() for df in pq.ParquetFile(url).iter_batches(batch_size=64):
for i in range(len(df)): df = df.to_pandas()
if mode == 'inference' and df.loc[i, 'utt'] not in tts_data: for i in range(len(df)):
continue if mode == 'inference' and df.loc[i, 'utt'] not in tts_data:
sample.update(dict(df.loc[i])) continue
if mode == 'train': sample.update(dict(df.loc[i]))
# NOTE do not return sample directly, must initialize a new dict if mode == 'train':
yield {**sample} # NOTE do not return sample directly, must initialize a new dict
else: yield {**sample}
for index, text in enumerate(tts_data[df.loc[i, 'utt']]): else:
yield {**sample, 'tts_index': index, 'tts_text': text} for index, text in enumerate(tts_data[df.loc[i, 'utt']]):
yield {**sample, 'tts_index': index, 'tts_text': text}
except Exception as ex: except Exception as ex:
logging.warning('Failed to open {}, ex info {}'.format(url, ex)) logging.warning('Failed to open {}, ex info {}'.format(url, ex))
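Reading the parquet file with ParquetFile(...).iter_batches instead of read_table keeps only one 64-row batch in memory at a time. A standalone sketch of the same access pattern (the path and batch size are placeholders):

import pyarrow.parquet as pq

def iter_rows(path, batch_size=64):
    # stream a parquet file row by row without materializing the full table
    for batch in pq.ParquetFile(path).iter_batches(batch_size=batch_size):
        df = batch.to_pandas()
        for i in range(len(df)):
            yield dict(df.loc[i])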
def filter(data, def filter(data,
max_length=10240, max_length=10240,
min_length=10, min_length=10,
@@ -84,6 +86,7 @@ def filter(data,
""" """
for sample in data: for sample in data:
sample['speech'], sample['sample_rate'] = torchaudio.load(BytesIO(sample['audio_data'])) sample['speech'], sample['sample_rate'] = torchaudio.load(BytesIO(sample['audio_data']))
sample['speech'] = sample['speech'].mean(dim=0, keepdim=True)
del sample['audio_data'] del sample['audio_data']
# sample['wav'] is torch.Tensor, we have 100 frames every second # sample['wav'] is torch.Tensor, we have 100 frames every second
num_frames = sample['speech'].size(1) / sample['sample_rate'] * 100 num_frames = sample['speech'].size(1) / sample['sample_rate'] * 100
@@ -133,6 +136,27 @@ def resample(data, resample_rate=22050, min_sample_rate=16000, mode='train'):
yield sample yield sample
def truncate(data, truncate_length=24576, mode='train'):
""" Truncate data.
Args:
data: Iterable[{key, wav, label, sample_rate}]
truncate_length: truncate length
Returns:
Iterable[{key, wav, label, sample_rate}]
"""
for sample in data:
waveform = sample['speech']
if waveform.shape[1] > truncate_length:
start = random.randint(0, waveform.shape[1] - truncate_length)
waveform = waveform[:, start: start + truncate_length]
else:
waveform = torch.concat([waveform, torch.zeros(1, truncate_length - waveform.shape[1])], dim=1)
sample['speech'] = waveform
yield sample
def compute_fbank(data, def compute_fbank(data,
feat_extractor, feat_extractor,
mode='train'): mode='train'):
@@ -152,7 +176,31 @@ def compute_fbank(data,
waveform = sample['speech'] waveform = sample['speech']
mat = feat_extractor(waveform).squeeze(dim=0).transpose(0, 1) mat = feat_extractor(waveform).squeeze(dim=0).transpose(0, 1)
sample['speech_feat'] = mat sample['speech_feat'] = mat
del sample['speech'] yield sample
def compute_f0(data, sample_rate, hop_size, mode='train'):
""" Extract f0
Args:
data: Iterable[{key, wav, label, sample_rate}]
Returns:
Iterable[{key, feat, label}]
"""
frame_period = hop_size * 1000 / sample_rate
for sample in data:
assert 'sample_rate' in sample
assert 'speech' in sample
assert 'utt' in sample
assert 'text_token' in sample
waveform = sample['speech']
_f0, t = pw.harvest(waveform.squeeze(dim=0).numpy().astype('double'), sample_rate, frame_period=frame_period)
if sum(_f0 != 0) < 5: # this happens when the algorithm fails
_f0, t = pw.dio(waveform.squeeze(dim=0).numpy().astype('double'), sample_rate, frame_period=frame_period) # if harvest fails, try dio
f0 = pw.stonemask(waveform.squeeze(dim=0).numpy().astype('double'), _f0, t, sample_rate)
f0 = F.interpolate(torch.from_numpy(f0).view(1, 1, -1), size=sample['speech_feat'].shape[0], mode='linear').view(-1)
sample['pitch_feat'] = f0
yield sample yield sample
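compute_f0 extracts pitch with pyworld's harvest, falls back to dio when almost no voiced frames are found, refines the contour with stonemask, and then linearly interpolates it to the mel-frame count. A minimal sketch on a synthetic tone, assuming the 22050 Hz / 256-hop setup used elsewhere in this diff; the mel-frame count is a placeholder for sample['speech_feat'].shape[0].

import numpy as np
import torch
import torch.nn.functional as F
import pyworld as pw

sample_rate, hop_size = 22050, 256
frame_period = hop_size * 1000 / sample_rate  # analysis step in milliseconds

wav = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate).astype('double')  # 1 s of 220 Hz

_f0, t = pw.harvest(wav, sample_rate, frame_period=frame_period)
if (_f0 != 0).sum() < 5:                      # harvest found almost nothing voiced, try dio
    _f0, t = pw.dio(wav, sample_rate, frame_period=frame_period)
f0 = pw.stonemask(wav, _f0, t, sample_rate)   # refine the raw contour

num_mel_frames = 86                           # placeholder for the fbank frame count
f0 = F.interpolate(torch.from_numpy(f0).view(1, 1, -1), size=num_mel_frames, mode='linear').view(-1)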
@@ -308,7 +356,7 @@ def batch(data, batch_type='static', batch_size=16, max_frames_in_batch=12000, m
logging.fatal('Unsupported batch type {}'.format(batch_type)) logging.fatal('Unsupported batch type {}'.format(batch_type))
def padding(data, use_spk_embedding, mode='train'): def padding(data, use_spk_embedding, mode='train', gan=False):
""" Padding the data into training data """ Padding the data into training data
Args: Args:
@@ -324,6 +372,9 @@ def padding(data, use_spk_embedding, mode='train'):
order = torch.argsort(speech_feat_len, descending=True) order = torch.argsort(speech_feat_len, descending=True)
utts = [sample[i]['utt'] for i in order] utts = [sample[i]['utt'] for i in order]
speech = [sample[i]['speech'].squeeze(dim=0) for i in order]
speech_len = torch.tensor([i.size(0) for i in speech], dtype=torch.int32)
speech = pad_sequence(speech, batch_first=True, padding_value=0)
speech_token = [torch.tensor(sample[i]['speech_token']) for i in order] speech_token = [torch.tensor(sample[i]['speech_token']) for i in order]
speech_token_len = torch.tensor([i.size(0) for i in speech_token], dtype=torch.int32) speech_token_len = torch.tensor([i.size(0) for i in speech_token], dtype=torch.int32)
speech_token = pad_sequence(speech_token, speech_token = pad_sequence(speech_token,
@@ -342,6 +393,8 @@ def padding(data, use_spk_embedding, mode='train'):
spk_embedding = torch.stack([sample[i]['spk_embedding'] for i in order], dim=0) spk_embedding = torch.stack([sample[i]['spk_embedding'] for i in order], dim=0)
batch = { batch = {
"utts": utts, "utts": utts,
"speech": speech,
"speech_len": speech_len,
"speech_token": speech_token, "speech_token": speech_token,
"speech_token_len": speech_token_len, "speech_token_len": speech_token_len,
"speech_feat": speech_feat, "speech_feat": speech_feat,
@@ -352,6 +405,19 @@ def padding(data, use_spk_embedding, mode='train'):
"utt_embedding": utt_embedding, "utt_embedding": utt_embedding,
"spk_embedding": spk_embedding, "spk_embedding": spk_embedding,
} }
if gan is True:
# in gan train, we need pitch_feat
pitch_feat = [sample[i]['pitch_feat'] for i in order]
pitch_feat_len = torch.tensor([i.size(0) for i in pitch_feat], dtype=torch.int32)
pitch_feat = pad_sequence(pitch_feat,
batch_first=True,
padding_value=0)
batch["pitch_feat"] = pitch_feat
batch["pitch_feat_len"] = pitch_feat_len
else:
# only gan train needs speech, delete it to save memory
del batch["speech"]
del batch["speech_len"]
if mode == 'inference': if mode == 'inference':
tts_text = [sample[i]['tts_text'] for i in order] tts_text = [sample[i]['tts_text'] for i in order]
tts_index = [sample[i]['tts_index'] for i in order] tts_index = [sample[i]['tts_index'] for i in order]

cosyvoice/flow/decoder.py Executable file → Normal file

@@ -13,16 +13,83 @@
# limitations under the License. # limitations under the License.
import torch import torch
import torch.nn as nn import torch.nn as nn
import torch.nn.functional as F
from einops import pack, rearrange, repeat from einops import pack, rearrange, repeat
from cosyvoice.utils.common import mask_to_bias
from cosyvoice.utils.mask import add_optional_chunk_mask
from matcha.models.components.decoder import SinusoidalPosEmb, Block1D, ResnetBlock1D, Downsample1D, TimestepEmbedding, Upsample1D from matcha.models.components.decoder import SinusoidalPosEmb, Block1D, ResnetBlock1D, Downsample1D, TimestepEmbedding, Upsample1D
from matcha.models.components.transformer import BasicTransformerBlock from matcha.models.components.transformer import BasicTransformerBlock
class Transpose(torch.nn.Module):
def __init__(self, dim0: int, dim1: int):
super().__init__()
self.dim0 = dim0
self.dim1 = dim1
def forward(self, x: torch.Tensor):
x = torch.transpose(x, self.dim0, self.dim1)
return x
class CausalBlock1D(Block1D):
def __init__(self, dim: int, dim_out: int):
super(CausalBlock1D, self).__init__(dim, dim_out)
self.block = torch.nn.Sequential(
CausalConv1d(dim, dim_out, 3),
Transpose(1, 2),
nn.LayerNorm(dim_out),
Transpose(1, 2),
nn.Mish(),
)
def forward(self, x: torch.Tensor, mask: torch.Tensor):
output = self.block(x * mask)
return output * mask
class CausalResnetBlock1D(ResnetBlock1D):
def __init__(self, dim: int, dim_out: int, time_emb_dim: int, groups: int = 8):
super(CausalResnetBlock1D, self).__init__(dim, dim_out, time_emb_dim, groups)
self.block1 = CausalBlock1D(dim, dim_out)
self.block2 = CausalBlock1D(dim_out, dim_out)
class CausalConv1d(torch.nn.Conv1d):
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int,
stride: int = 1,
dilation: int = 1,
groups: int = 1,
bias: bool = True,
padding_mode: str = 'zeros',
device=None,
dtype=None
) -> None:
super(CausalConv1d, self).__init__(in_channels, out_channels,
kernel_size, stride,
padding=0, dilation=dilation,
groups=groups, bias=bias,
padding_mode=padding_mode,
device=device, dtype=dtype)
assert stride == 1
self.causal_padding = (kernel_size - 1, 0)
def forward(self, x: torch.Tensor):
x = F.pad(x, self.causal_padding)
x = super(CausalConv1d, self).forward(x)
return x
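CausalConv1d pads only on the left by kernel_size - 1 frames, so output frame t never sees inputs after t. A small self-contained check of that property (the class is re-declared locally so the snippet runs on its own):

import torch
import torch.nn.functional as F

class CausalConv1d(torch.nn.Conv1d):
    # left-padded Conv1d: output at frame t depends only on inputs <= t
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__(in_channels, out_channels, kernel_size, padding=0)
        self.causal_padding = (kernel_size - 1, 0)

    def forward(self, x):
        return super().forward(F.pad(x, self.causal_padding))

conv = CausalConv1d(1, 1, 3).eval()
x = torch.randn(1, 1, 10)
with torch.no_grad():
    y_full = conv(x)
    x_perturbed = x.clone()
    x_perturbed[:, :, 6:] = 0.0               # change only "future" frames
    y_perturbed = conv(x_perturbed)
assert torch.allclose(y_full[:, :, :6], y_perturbed[:, :, :6])  # frames 0..5 are unchanged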
class ConditionalDecoder(nn.Module): class ConditionalDecoder(nn.Module):
def __init__( def __init__(
self, self,
in_channels, in_channels,
out_channels, out_channels,
causal=False,
channels=(256, 256), channels=(256, 256),
dropout=0.05, dropout=0.05,
attention_head_dim=64, attention_head_dim=64,
@@ -39,7 +106,7 @@ class ConditionalDecoder(nn.Module):
channels = tuple(channels) channels = tuple(channels)
self.in_channels = in_channels self.in_channels = in_channels
self.out_channels = out_channels self.out_channels = out_channels
self.causal = causal
self.time_embeddings = SinusoidalPosEmb(in_channels) self.time_embeddings = SinusoidalPosEmb(in_channels)
time_embed_dim = channels[0] * 4 time_embed_dim = channels[0] * 4
self.time_mlp = TimestepEmbedding( self.time_mlp = TimestepEmbedding(
@@ -56,7 +123,8 @@ class ConditionalDecoder(nn.Module):
input_channel = output_channel input_channel = output_channel
output_channel = channels[i] output_channel = channels[i]
is_last = i == len(channels) - 1 is_last = i == len(channels) - 1
resnet = ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim) resnet = CausalResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim) if self.causal else \
ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
transformer_blocks = nn.ModuleList( transformer_blocks = nn.ModuleList(
[ [
BasicTransformerBlock( BasicTransformerBlock(
@@ -70,14 +138,16 @@ class ConditionalDecoder(nn.Module):
] ]
) )
downsample = ( downsample = (
Downsample1D(output_channel) if not is_last else nn.Conv1d(output_channel, output_channel, 3, padding=1) Downsample1D(output_channel) if not is_last else
CausalConv1d(output_channel, output_channel, 3) if self.causal else nn.Conv1d(output_channel, output_channel, 3, padding=1)
) )
self.down_blocks.append(nn.ModuleList([resnet, transformer_blocks, downsample])) self.down_blocks.append(nn.ModuleList([resnet, transformer_blocks, downsample]))
for i in range(num_mid_blocks): for _ in range(num_mid_blocks):
input_channel = channels[-1] input_channel = channels[-1]
out_channels = channels[-1] out_channels = channels[-1]
resnet = ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim) resnet = CausalResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim) if self.causal else \
ResnetBlock1D(dim=input_channel, dim_out=output_channel, time_emb_dim=time_embed_dim)
transformer_blocks = nn.ModuleList( transformer_blocks = nn.ModuleList(
[ [
@@ -99,7 +169,11 @@ class ConditionalDecoder(nn.Module):
input_channel = channels[i] * 2 input_channel = channels[i] * 2
output_channel = channels[i + 1] output_channel = channels[i + 1]
is_last = i == len(channels) - 2 is_last = i == len(channels) - 2
resnet = ResnetBlock1D( resnet = CausalResnetBlock1D(
dim=input_channel,
dim_out=output_channel,
time_emb_dim=time_embed_dim,
) if self.causal else ResnetBlock1D(
dim=input_channel, dim=input_channel,
dim_out=output_channel, dim_out=output_channel,
time_emb_dim=time_embed_dim, time_emb_dim=time_embed_dim,
@@ -119,14 +193,13 @@ class ConditionalDecoder(nn.Module):
upsample = ( upsample = (
Upsample1D(output_channel, use_conv_transpose=True) Upsample1D(output_channel, use_conv_transpose=True)
if not is_last if not is_last
else nn.Conv1d(output_channel, output_channel, 3, padding=1) else CausalConv1d(output_channel, output_channel, 3) if self.causal else nn.Conv1d(output_channel, output_channel, 3, padding=1)
) )
self.up_blocks.append(nn.ModuleList([resnet, transformer_blocks, upsample])) self.up_blocks.append(nn.ModuleList([resnet, transformer_blocks, upsample]))
self.final_block = Block1D(channels[-1], channels[-1]) self.final_block = CausalBlock1D(channels[-1], channels[-1]) if self.causal else Block1D(channels[-1], channels[-1])
self.final_proj = nn.Conv1d(channels[-1], self.out_channels, 1) self.final_proj = nn.Conv1d(channels[-1], self.out_channels, 1)
self.initialize_weights() self.initialize_weights()
def initialize_weights(self): def initialize_weights(self):
for m in self.modules(): for m in self.modules():
if isinstance(m, nn.Conv1d): if isinstance(m, nn.Conv1d):
@@ -159,7 +232,7 @@ class ConditionalDecoder(nn.Module):
_type_: _description_ _type_: _description_
""" """
t = self.time_embeddings(t) t = self.time_embeddings(t).to(t.dtype)
t = self.time_mlp(t) t = self.time_mlp(t)
x = pack([x, mu], "b * t")[0] x = pack([x, mu], "b * t")[0]
@@ -176,7 +249,9 @@ class ConditionalDecoder(nn.Module):
mask_down = masks[-1] mask_down = masks[-1]
x = resnet(x, mask_down, t) x = resnet(x, mask_down, t)
x = rearrange(x, "b c t -> b t c").contiguous() x = rearrange(x, "b c t -> b t c").contiguous()
attn_mask = torch.matmul(mask_down.transpose(1, 2).contiguous(), mask_down) # attn_mask = torch.matmul(mask_down.transpose(1, 2).contiguous(), mask_down)
attn_mask = add_optional_chunk_mask(x, mask_down.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks: for transformer_block in transformer_blocks:
x = transformer_block( x = transformer_block(
hidden_states=x, hidden_states=x,
@@ -193,7 +268,9 @@ class ConditionalDecoder(nn.Module):
for resnet, transformer_blocks in self.mid_blocks: for resnet, transformer_blocks in self.mid_blocks:
x = resnet(x, mask_mid, t) x = resnet(x, mask_mid, t)
x = rearrange(x, "b c t -> b t c").contiguous() x = rearrange(x, "b c t -> b t c").contiguous()
attn_mask = torch.matmul(mask_mid.transpose(1, 2).contiguous(), mask_mid) # attn_mask = torch.matmul(mask_mid.transpose(1, 2).contiguous(), mask_mid)
attn_mask = add_optional_chunk_mask(x, mask_mid.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks: for transformer_block in transformer_blocks:
x = transformer_block( x = transformer_block(
hidden_states=x, hidden_states=x,
@@ -208,7 +285,9 @@ class ConditionalDecoder(nn.Module):
x = pack([x[:, :, :skip.shape[-1]], skip], "b * t")[0] x = pack([x[:, :, :skip.shape[-1]], skip], "b * t")[0]
x = resnet(x, mask_up, t) x = resnet(x, mask_up, t)
x = rearrange(x, "b c t -> b t c").contiguous() x = rearrange(x, "b c t -> b t c").contiguous()
attn_mask = torch.matmul(mask_up.transpose(1, 2).contiguous(), mask_up) # attn_mask = torch.matmul(mask_up.transpose(1, 2).contiguous(), mask_up)
attn_mask = add_optional_chunk_mask(x, mask_up.bool(), False, False, 0, self.static_chunk_size, -1)
attn_mask = mask_to_bias(attn_mask == 1, x.dtype)
for transformer_block in transformer_blocks: for transformer_block in transformer_blocks:
x = transformer_block( x = transformer_block(
hidden_states=x, hidden_states=x,


@@ -33,8 +33,13 @@ class MaskedDiffWithXvec(torch.nn.Module):
encoder: torch.nn.Module = None, encoder: torch.nn.Module = None,
length_regulator: torch.nn.Module = None, length_regulator: torch.nn.Module = None,
decoder: torch.nn.Module = None, decoder: torch.nn.Module = None,
decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1, 'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine', 'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}), 'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64, 'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}}, decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1,
mel_feat_conf: Dict = {'n_fft': 1024, 'num_mels': 80, 'sampling_rate': 22050, 'hop_size': 256, 'win_size': 1024, 'fmin': 0, 'fmax': 8000}): 'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine',
'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}),
'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64,
'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}},
mel_feat_conf: Dict = {'n_fft': 1024, 'num_mels': 80, 'sampling_rate': 22050,
'hop_size': 256, 'win_size': 1024, 'fmin': 0, 'fmax': 8000}):
super().__init__() super().__init__()
self.input_size = input_size self.input_size = input_size
self.output_size = output_size self.output_size = output_size
@@ -104,7 +109,101 @@ class MaskedDiffWithXvec(torch.nn.Module):
prompt_token_len, prompt_token_len,
prompt_feat, prompt_feat,
prompt_feat_len, prompt_feat_len,
embedding): embedding,
flow_cache):
if self.fp16 is True:
prompt_feat = prompt_feat.half()
embedding = embedding.half()
assert token.shape[0] == 1
# xvec projection
embedding = F.normalize(embedding, dim=1)
embedding = self.spk_embed_affine_layer(embedding)
# concat text and prompt_text
token_len1, token_len2 = prompt_token.shape[1], token.shape[1]
token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len
mask = (~make_pad_mask(token_len)).unsqueeze(-1).to(embedding)
token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode
h, h_lengths = self.encoder(token, token_len)
h = self.encoder_proj(h)
mel_len1, mel_len2 = prompt_feat.shape[1], int(token_len2 / self.input_frame_rate * 22050 / 256)
h, h_lengths = self.length_regulator.inference(h[:, :token_len1], h[:, token_len1:], mel_len1, mel_len2, self.input_frame_rate)
# get conditions
conds = torch.zeros([1, mel_len1 + mel_len2, self.output_size], device=token.device).to(h.dtype)
conds[:, :mel_len1] = prompt_feat
conds = conds.transpose(1, 2)
mask = (~make_pad_mask(torch.tensor([mel_len1 + mel_len2]))).to(h)
feat, flow_cache = self.decoder(
mu=h.transpose(1, 2).contiguous(),
mask=mask.unsqueeze(1),
spks=embedding,
cond=conds,
n_timesteps=10,
prompt_len=mel_len1,
flow_cache=flow_cache
)
feat = feat[:, :, mel_len1:]
assert feat.shape[2] == mel_len2
return feat.float(), flow_cache
class CausalMaskedDiffWithXvec(torch.nn.Module):
def __init__(self,
input_size: int = 512,
output_size: int = 80,
spk_embed_dim: int = 192,
output_type: str = "mel",
vocab_size: int = 4096,
input_frame_rate: int = 50,
only_mask_loss: bool = True,
token_mel_ratio: int = 2,
pre_lookahead_len: int = 3,
encoder: torch.nn.Module = None,
decoder: torch.nn.Module = None,
decoder_conf: Dict = {'in_channels': 240, 'out_channel': 80, 'spk_emb_dim': 80, 'n_spks': 1,
'cfm_params': DictConfig({'sigma_min': 1e-06, 'solver': 'euler', 't_scheduler': 'cosine',
'training_cfg_rate': 0.2, 'inference_cfg_rate': 0.7, 'reg_loss_type': 'l1'}),
'decoder_params': {'channels': [256, 256], 'dropout': 0.0, 'attention_head_dim': 64,
'n_blocks': 4, 'num_mid_blocks': 12, 'num_heads': 8, 'act_fn': 'gelu'}},
mel_feat_conf: Dict = {'n_fft': 1024, 'num_mels': 80, 'sampling_rate': 22050,
'hop_size': 256, 'win_size': 1024, 'fmin': 0, 'fmax': 8000}):
super().__init__()
self.input_size = input_size
self.output_size = output_size
self.decoder_conf = decoder_conf
self.mel_feat_conf = mel_feat_conf
self.vocab_size = vocab_size
self.output_type = output_type
self.input_frame_rate = input_frame_rate
logging.info(f"input frame rate={self.input_frame_rate}")
self.input_embedding = nn.Embedding(vocab_size, input_size)
self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, output_size)
self.encoder = encoder
self.encoder_proj = torch.nn.Linear(self.encoder.output_size(), output_size)
self.decoder = decoder
self.only_mask_loss = only_mask_loss
self.token_mel_ratio = token_mel_ratio
self.pre_lookahead_len = pre_lookahead_len
@torch.inference_mode()
def inference(self,
token,
token_len,
prompt_token,
prompt_token_len,
prompt_feat,
prompt_feat_len,
embedding,
finalize):
if self.fp16 is True:
prompt_feat = prompt_feat.half()
embedding = embedding.half()
assert token.shape[0] == 1 assert token.shape[0] == 1
# xvec projection # xvec projection
embedding = F.normalize(embedding, dim=1) embedding = F.normalize(embedding, dim=1)
@@ -112,30 +211,29 @@ class MaskedDiffWithXvec(torch.nn.Module):
# concat text and prompt_text # concat text and prompt_text
token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len token, token_len = torch.concat([prompt_token, token], dim=1), prompt_token_len + token_len
mask = (~make_pad_mask(token_len)).float().unsqueeze(-1).to(embedding) mask = (~make_pad_mask(token_len)).unsqueeze(-1).to(embedding)
token = self.input_embedding(torch.clamp(token, min=0)) * mask token = self.input_embedding(torch.clamp(token, min=0)) * mask
# text encode # text encode
h, h_lengths = self.encoder(token, token_len) h, h_lengths = self.encoder(token, token_len)
if finalize is False:
h = h[:, :-self.pre_lookahead_len * self.token_mel_ratio]
mel_len1, mel_len2 = prompt_feat.shape[1], h.shape[1] - prompt_feat.shape[1]
h = self.encoder_proj(h) h = self.encoder_proj(h)
feat_len = (token_len / 50 * 22050 / 256).int()
h, h_lengths = self.length_regulator(h, feat_len)
# get conditions # get conditions
conds = torch.zeros([1, feat_len.max().item(), self.output_size], device=token.device) conds = torch.zeros([1, mel_len1 + mel_len2, self.output_size], device=token.device).to(h.dtype)
if prompt_feat.shape[1] != 0: conds[:, :mel_len1] = prompt_feat
for i, j in enumerate(prompt_feat_len):
conds[i, :j] = prompt_feat[i]
conds = conds.transpose(1, 2) conds = conds.transpose(1, 2)
mask = (~make_pad_mask(feat_len)).to(h) mask = (~make_pad_mask(torch.tensor([mel_len1 + mel_len2]))).to(h)
feat = self.decoder( feat, _ = self.decoder(
mu=h.transpose(1, 2).contiguous(), mu=h.transpose(1, 2).contiguous(),
mask=mask.unsqueeze(1), mask=mask.unsqueeze(1),
spks=embedding, spks=embedding,
cond=conds, cond=conds,
n_timesteps=10 n_timesteps=10
) )
if prompt_feat.shape[1] != 0: feat = feat[:, :, mel_len1:]
feat = feat[:, :, prompt_feat.shape[1]:] assert feat.shape[2] == mel_len2
return feat return feat.float(), None

cosyvoice/flow/flow_matching.py Executable file → Normal file

@@ -11,10 +11,12 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import threading
import torch import torch
import torch.nn.functional as F import torch.nn.functional as F
from matcha.models.components.flow_matching import BASECFM from matcha.models.components.flow_matching import BASECFM
class ConditionalCFM(BASECFM): class ConditionalCFM(BASECFM):
def __init__(self, in_channels, cfm_params, n_spks=1, spk_emb_dim=64, estimator: torch.nn.Module = None): def __init__(self, in_channels, cfm_params, n_spks=1, spk_emb_dim=64, estimator: torch.nn.Module = None):
super().__init__( super().__init__(
@@ -29,9 +31,10 @@ class ConditionalCFM(BASECFM):
in_channels = in_channels + (spk_emb_dim if n_spks > 0 else 0) in_channels = in_channels + (spk_emb_dim if n_spks > 0 else 0)
# Just change the architecture of the estimator here # Just change the architecture of the estimator here
self.estimator = estimator self.estimator = estimator
self.lock = threading.Lock()
@torch.inference_mode() @torch.inference_mode()
def forward(self, mu, mask, n_timesteps, temperature=1.0, spks=None, cond=None): def forward(self, mu, mask, n_timesteps, temperature=1.0, spks=None, cond=None, prompt_len=0, flow_cache=torch.zeros(1, 80, 0, 2)):
"""Forward diffusion """Forward diffusion
Args: Args:
@@ -49,11 +52,21 @@ class ConditionalCFM(BASECFM):
sample: generated mel-spectrogram sample: generated mel-spectrogram
shape: (batch_size, n_feats, mel_timesteps) shape: (batch_size, n_feats, mel_timesteps)
""" """
z = torch.randn_like(mu) * temperature
t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device) z = torch.randn_like(mu).to(mu.device).to(mu.dtype) * temperature
cache_size = flow_cache.shape[2]
# fix prompt and overlap part mu and z
if cache_size != 0:
z[:, :, :cache_size] = flow_cache[:, :, :, 0]
mu[:, :, :cache_size] = flow_cache[:, :, :, 1]
z_cache = torch.concat([z[:, :, :prompt_len], z[:, :, -34:]], dim=2)
mu_cache = torch.concat([mu[:, :, :prompt_len], mu[:, :, -34:]], dim=2)
flow_cache = torch.stack([z_cache, mu_cache], dim=-1)
t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device, dtype=mu.dtype)
if self.t_scheduler == 'cosine': if self.t_scheduler == 'cosine':
t_span = 1 - torch.cos(t_span * 0.5 * torch.pi) t_span = 1 - torch.cos(t_span * 0.5 * torch.pi)
return self.solve_euler(z, t_span=t_span, mu=mu, mask=mask, spks=spks, cond=cond) return self.solve_euler(z, t_span=t_span, mu=mu, mask=mask, spks=spks, cond=cond), flow_cache
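flow_cache stores the noise z and encoder output mu for the prompt region plus the last 34 frames of the previous chunk, so the overlapping region is denoised from identical inputs on the next call. A sketch of just that bookkeeping, with placeholder tensor sizes:

import torch

def update_flow_cache(z, mu, flow_cache, prompt_len, overlap=34):
    # reuse cached z/mu for the already-seen region, then cache prompt + tail for the next chunk
    cache_size = flow_cache.shape[2]
    if cache_size != 0:
        z[:, :, :cache_size] = flow_cache[:, :, :, 0]
        mu[:, :, :cache_size] = flow_cache[:, :, :, 1]
    z_cache = torch.concat([z[:, :, :prompt_len], z[:, :, -overlap:]], dim=2)
    mu_cache = torch.concat([mu[:, :, :prompt_len], mu[:, :, -overlap:]], dim=2)
    return z, mu, torch.stack([z_cache, mu_cache], dim=-1)

cache = torch.zeros(1, 80, 0, 2)              # empty cache, as in the new forward() default
z, mu = torch.randn(1, 80, 120), torch.randn(1, 80, 120)
z, mu, cache = update_flow_cache(z, mu, cache, prompt_len=50)
assert cache.shape == (1, 80, 50 + 34, 2)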
def solve_euler(self, x, t_span, mu, mask, spks, cond): def solve_euler(self, x, t_span, mu, mask, spks, cond):
""" """
@@ -71,30 +84,63 @@ class ConditionalCFM(BASECFM):
cond: Not used but kept for future purposes cond: Not used but kept for future purposes
""" """
t, _, dt = t_span[0], t_span[-1], t_span[1] - t_span[0] t, _, dt = t_span[0], t_span[-1], t_span[1] - t_span[0]
t = t.unsqueeze(dim=0)
# I am storing this because I can later plot it by putting a debugger here and saving it to a file # I am storing this because I can later plot it by putting a debugger here and saving it to a file
# Or in future might add like a return_all_steps flag # Or in future might add like a return_all_steps flag
sol = [] sol = []
# Do not use concat: it may change the memory format and cause TensorRT to infer wrong results!
x_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=x.dtype)
mask_in = torch.zeros([2, 1, x.size(2)], device=x.device, dtype=x.dtype)
mu_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=x.dtype)
t_in = torch.zeros([2], device=x.device, dtype=x.dtype)
spks_in = torch.zeros([2, 80], device=x.device, dtype=x.dtype)
cond_in = torch.zeros([2, 80, x.size(2)], device=x.device, dtype=x.dtype)
for step in range(1, len(t_span)): for step in range(1, len(t_span)):
dphi_dt = self.estimator(x, mask, mu, t, spks, cond)
# Classifier-Free Guidance inference introduced in VoiceBox # Classifier-Free Guidance inference introduced in VoiceBox
if self.inference_cfg_rate > 0: x_in[:] = x
cfg_dphi_dt = self.estimator( mask_in[:] = mask
x, mask, mu_in[0] = mu
torch.zeros_like(mu), t, t_in[:] = t.unsqueeze(0)
torch.zeros_like(spks) if spks is not None else None, spks_in[0] = spks
torch.zeros_like(cond) cond_in[0] = cond
) dphi_dt = self.forward_estimator(
dphi_dt = ((1.0 + self.inference_cfg_rate) * dphi_dt - x_in, mask_in,
self.inference_cfg_rate * cfg_dphi_dt) mu_in, t_in,
spks_in,
cond_in
)
dphi_dt, cfg_dphi_dt = torch.split(dphi_dt, [x.size(0), x.size(0)], dim=0)
dphi_dt = ((1.0 + self.inference_cfg_rate) * dphi_dt - self.inference_cfg_rate * cfg_dphi_dt)
x = x + dt * dphi_dt x = x + dt * dphi_dt
t = t + dt t = t + dt
sol.append(x) sol.append(x)
if step < len(t_span) - 1: if step < len(t_span) - 1:
dt = t_span[step + 1] - t dt = t_span[step + 1] - t
return sol[-1] return sol[-1].float()
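solve_euler now batches the conditional and unconditional estimator passes as rows 0 and 1 of x_in / mu_in / spks_in / cond_in, splits the output, and blends them with classifier-free guidance. The blend itself reduces to the one-liner below, with w standing for inference_cfg_rate:

import torch

def cfg_blend(cond_out, uncond_out, w):
    # classifier-free guidance: extrapolate away from the unconditional estimate
    return (1.0 + w) * cond_out - w * uncond_out

x = torch.randn(1, 80, 10)
assert torch.allclose(cfg_blend(x, torch.randn_like(x), 0.0), x)  # w = 0 keeps the conditional output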
def forward_estimator(self, x, mask, mu, t, spks, cond):
if isinstance(self.estimator, torch.nn.Module):
return self.estimator.forward(x, mask, mu, t, spks, cond)
else:
with self.lock:
self.estimator.set_input_shape('x', (2, 80, x.size(2)))
self.estimator.set_input_shape('mask', (2, 1, x.size(2)))
self.estimator.set_input_shape('mu', (2, 80, x.size(2)))
self.estimator.set_input_shape('t', (2,))
self.estimator.set_input_shape('spks', (2, 80))
self.estimator.set_input_shape('cond', (2, 80, x.size(2)))
# run trt engine
self.estimator.execute_v2([x.contiguous().data_ptr(),
mask.contiguous().data_ptr(),
mu.contiguous().data_ptr(),
t.contiguous().data_ptr(),
spks.contiguous().data_ptr(),
cond.contiguous().data_ptr(),
x.data_ptr()])
return x
def compute_loss(self, x1, mask, mu, spks=None, cond=None): def compute_loss(self, x1, mask, mu, spks=None, cond=None):
"""Computes diffusion loss """Computes diffusion loss
@@ -136,3 +182,36 @@ class ConditionalCFM(BASECFM):
pred = self.estimator(y, mask, mu, t.squeeze(), spks, cond) pred = self.estimator(y, mask, mu, t.squeeze(), spks, cond)
loss = F.mse_loss(pred * mask, u * mask, reduction="sum") / (torch.sum(mask) * u.shape[1]) loss = F.mse_loss(pred * mask, u * mask, reduction="sum") / (torch.sum(mask) * u.shape[1])
return loss, y return loss, y
class CausalConditionalCFM(ConditionalCFM):
def __init__(self, in_channels, cfm_params, n_spks=1, spk_emb_dim=64, estimator: torch.nn.Module = None):
super().__init__(in_channels, cfm_params, n_spks, spk_emb_dim, estimator)
self.rand_noise = torch.randn([1, 80, 50 * 300])
@torch.inference_mode()
def forward(self, mu, mask, n_timesteps, temperature=1.0, spks=None, cond=None):
"""Forward diffusion
Args:
mu (torch.Tensor): output of encoder
shape: (batch_size, n_feats, mel_timesteps)
mask (torch.Tensor): output_mask
shape: (batch_size, 1, mel_timesteps)
n_timesteps (int): number of diffusion steps
temperature (float, optional): temperature for scaling noise. Defaults to 1.0.
spks (torch.Tensor, optional): speaker ids. Defaults to None.
shape: (batch_size, spk_emb_dim)
cond: Not used but kept for future purposes
Returns:
sample: generated mel-spectrogram
shape: (batch_size, n_feats, mel_timesteps)
"""
z = self.rand_noise[:, :, :mu.size(2)].to(mu.device).to(mu.dtype) * temperature
# fix prompt and overlap part mu and z
t_span = torch.linspace(0, 1, n_timesteps + 1, device=mu.device, dtype=mu.dtype)
if self.t_scheduler == 'cosine':
t_span = 1 - torch.cos(t_span * 0.5 * torch.pi)
return self.solve_euler(z, t_span=t_span, mu=mu, mask=mask, spks=spks, cond=cond), None

cosyvoice/flow/length_regulator.py Executable file → Normal file

@@ -13,6 +13,7 @@
# limitations under the License. # limitations under the License.
from typing import Tuple from typing import Tuple
import torch.nn as nn import torch.nn as nn
import torch
from torch.nn import functional as F from torch.nn import functional as F
from cosyvoice.utils.mask import make_pad_mask from cosyvoice.utils.mask import make_pad_mask
@@ -43,7 +44,26 @@ class InterpolateRegulator(nn.Module):
def forward(self, x, ylens=None): def forward(self, x, ylens=None):
# x in (B, T, D) # x in (B, T, D)
mask = (~make_pad_mask(ylens)).to(x).unsqueeze(-1) mask = (~make_pad_mask(ylens)).to(x).unsqueeze(-1)
x = F.interpolate(x.transpose(1, 2).contiguous(), size=ylens.max(), mode='nearest') x = F.interpolate(x.transpose(1, 2).contiguous(), size=ylens.max(), mode='linear')
out = self.model(x).transpose(1, 2).contiguous() out = self.model(x).transpose(1, 2).contiguous()
olens = ylens olens = ylens
return out * mask, olens return out * mask, olens
def inference(self, x1, x2, mel_len1, mel_len2, input_frame_rate=50):
# in inference mode, interpolate prompt token and token (head/mid/tail) separately, so we can get a clear separation point of mel
# x in (B, T, D)
if x2.shape[1] > 40:
x2_head = F.interpolate(x2[:, :20].transpose(1, 2).contiguous(), size=int(20 / input_frame_rate * 22050 / 256), mode='linear')
x2_mid = F.interpolate(x2[:, 20:-20].transpose(1, 2).contiguous(), size=mel_len2 - int(20 / input_frame_rate * 22050 / 256) * 2,
mode='linear')
x2_tail = F.interpolate(x2[:, -20:].transpose(1, 2).contiguous(), size=int(20 / input_frame_rate * 22050 / 256), mode='linear')
x2 = torch.concat([x2_head, x2_mid, x2_tail], dim=2)
else:
x2 = F.interpolate(x2.transpose(1, 2).contiguous(), size=mel_len2, mode='linear')
if x1.shape[1] != 0:
x1 = F.interpolate(x1.transpose(1, 2).contiguous(), size=mel_len1, mode='linear')
x = torch.concat([x1, x2], dim=2)
else:
x = x2
out = self.model(x).transpose(1, 2).contiguous()
return out, mel_len1 + mel_len2
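inference interpolates the prompt and the head/mid/tail of the new tokens separately so the prompt/generated boundary in the mel stays sharp; the target lengths all come from the 50 token/s to 22050 Hz / 256-hop conversion used throughout this diff. A small sketch of that arithmetic with a placeholder token count:

input_frame_rate, sr, hop = 50, 22050, 256

def token_len_to_mel_len(token_len):
    return int(token_len / input_frame_rate * sr / hop)

token_len2 = 100                               # number of new (non-prompt) tokens
mel_len2 = token_len_to_mel_len(token_len2)
head = tail = token_len_to_mel_len(20)         # fixed 20-token head and tail, as in inference()
mid = mel_len2 - 2 * head                      # remainder goes to the middle chunk
assert head + mid + tail == mel_len2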


@@ -0,0 +1,140 @@
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import weight_norm
from typing import List, Optional, Tuple
from einops import rearrange
from torchaudio.transforms import Spectrogram
class MultipleDiscriminator(nn.Module):
def __init__(
self, mpd: nn.Module, mrd: nn.Module
):
super().__init__()
self.mpd = mpd
self.mrd = mrd
def forward(self, y: torch.Tensor, y_hat: torch.Tensor):
y_d_rs, y_d_gs, fmap_rs, fmap_gs = [], [], [], []
this_y_d_rs, this_y_d_gs, this_fmap_rs, this_fmap_gs = self.mpd(y.unsqueeze(dim=1), y_hat.unsqueeze(dim=1))
y_d_rs += this_y_d_rs
y_d_gs += this_y_d_gs
fmap_rs += this_fmap_rs
fmap_gs += this_fmap_gs
this_y_d_rs, this_y_d_gs, this_fmap_rs, this_fmap_gs = self.mrd(y, y_hat)
y_d_rs += this_y_d_rs
y_d_gs += this_y_d_gs
fmap_rs += this_fmap_rs
fmap_gs += this_fmap_gs
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class MultiResolutionDiscriminator(nn.Module):
def __init__(
self,
fft_sizes: Tuple[int, ...] = (2048, 1024, 512),
num_embeddings: Optional[int] = None,
):
"""
Multi-Resolution Discriminator module adapted from https://github.com/descriptinc/descript-audio-codec.
Additionally, it allows incorporating conditional information with a learned embeddings table.
Args:
fft_sizes (tuple[int]): Tuple of window lengths for FFT. Defaults to (2048, 1024, 512).
num_embeddings (int, optional): Number of embeddings. None means non-conditional discriminator.
Defaults to None.
"""
super().__init__()
self.discriminators = nn.ModuleList(
[DiscriminatorR(window_length=w, num_embeddings=num_embeddings) for w in fft_sizes]
)
def forward(
self, y: torch.Tensor, y_hat: torch.Tensor, bandwidth_id: torch.Tensor = None
) -> Tuple[List[torch.Tensor], List[torch.Tensor], List[List[torch.Tensor]], List[List[torch.Tensor]]]:
y_d_rs = []
y_d_gs = []
fmap_rs = []
fmap_gs = []
for d in self.discriminators:
y_d_r, fmap_r = d(x=y, cond_embedding_id=bandwidth_id)
y_d_g, fmap_g = d(x=y_hat, cond_embedding_id=bandwidth_id)
y_d_rs.append(y_d_r)
fmap_rs.append(fmap_r)
y_d_gs.append(y_d_g)
fmap_gs.append(fmap_g)
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class DiscriminatorR(nn.Module):
def __init__(
self,
window_length: int,
num_embeddings: Optional[int] = None,
channels: int = 32,
hop_factor: float = 0.25,
bands: Tuple[Tuple[float, float], ...] = ((0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)),
):
super().__init__()
self.window_length = window_length
self.hop_factor = hop_factor
self.spec_fn = Spectrogram(
n_fft=window_length, hop_length=int(window_length * hop_factor), win_length=window_length, power=None
)
n_fft = window_length // 2 + 1
bands = [(int(b[0] * n_fft), int(b[1] * n_fft)) for b in bands]
self.bands = bands
convs = lambda: nn.ModuleList(
[
weight_norm(nn.Conv2d(2, channels, (3, 9), (1, 1), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 9), (1, 2), padding=(1, 4))),
weight_norm(nn.Conv2d(channels, channels, (3, 3), (1, 1), padding=(1, 1))),
]
)
self.band_convs = nn.ModuleList([convs() for _ in range(len(self.bands))])
if num_embeddings is not None:
self.emb = torch.nn.Embedding(num_embeddings=num_embeddings, embedding_dim=channels)
torch.nn.init.zeros_(self.emb.weight)
self.conv_post = weight_norm(nn.Conv2d(channels, 1, (3, 3), (1, 1), padding=(1, 1)))
def spectrogram(self, x):
# Remove DC offset
x = x - x.mean(dim=-1, keepdims=True)
# Peak normalize the volume of input audio
x = 0.8 * x / (x.abs().max(dim=-1, keepdim=True)[0] + 1e-9)
x = self.spec_fn(x)
x = torch.view_as_real(x)
x = rearrange(x, "b f t c -> b c t f")
# Split into bands
x_bands = [x[..., b[0]: b[1]] for b in self.bands]
return x_bands
def forward(self, x: torch.Tensor, cond_embedding_id: torch.Tensor = None):
x_bands = self.spectrogram(x)
fmap = []
x = []
for band, stack in zip(x_bands, self.band_convs):
for i, layer in enumerate(stack):
band = layer(band)
band = torch.nn.functional.leaky_relu(band, 0.1)
if i > 0:
fmap.append(band)
x.append(band)
x = torch.cat(x, dim=-1)
if cond_embedding_id is not None:
emb = self.emb(cond_embedding_id)
h = (emb.view(1, -1, 1, 1) * x).sum(dim=1, keepdims=True)
else:
h = 0
x = self.conv_post(x)
fmap.append(x)
x += h
return x, fmap
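DiscriminatorR keeps the complex spectrogram (real/imaginary as two channels) and runs a separate conv stack per frequency band; the bands are given as fractions of the n_fft // 2 + 1 bins. A tiny sketch of the band-index computation for the 1024-point resolution:

window_length = 1024
n_bins = window_length // 2 + 1                # 513 frequency bins
bands = ((0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0))
band_bins = [(int(lo * n_bins), int(hi * n_bins)) for lo, hi in bands]
print(band_bins)                               # [(0, 51), (51, 128), (128, 256), (256, 384), (384, 513)]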

cosyvoice/hifigan/f0_predictor.py Executable file → Normal file

@@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
import torch import torch
import torch.nn as nn import torch.nn as nn
from torch.nn.utils import weight_norm from torch.nn.utils.parametrizations import weight_norm
class ConvRNNF0Predictor(nn.Module): class ConvRNNF0Predictor(nn.Module):


@@ -14,7 +14,7 @@
"""HIFI-GAN""" """HIFI-GAN"""
import typing as tp from typing import Dict, Optional, List
import numpy as np import numpy as np
from scipy.signal import get_window from scipy.signal import get_window
import torch import torch
@@ -23,7 +23,7 @@ import torch.nn.functional as F
from torch.nn import Conv1d from torch.nn import Conv1d
from torch.nn import ConvTranspose1d from torch.nn import ConvTranspose1d
from torch.nn.utils import remove_weight_norm from torch.nn.utils import remove_weight_norm
from torch.nn.utils import weight_norm from torch.nn.utils.parametrizations import weight_norm
from torch.distributions.uniform import Uniform from torch.distributions.uniform import Uniform
from cosyvoice.transformer.activation import Snake from cosyvoice.transformer.activation import Snake
@@ -38,13 +38,15 @@ This code is modified from https://github.com/jik876/hifi-gan
https://github.com/NVIDIA/BigVGAN https://github.com/NVIDIA/BigVGAN
""" """
class ResBlock(torch.nn.Module): class ResBlock(torch.nn.Module):
"""Residual block module in HiFiGAN/BigVGAN.""" """Residual block module in HiFiGAN/BigVGAN."""
def __init__( def __init__(
self, self,
channels: int = 512, channels: int = 512,
kernel_size: int = 3, kernel_size: int = 3,
dilations: tp.List[int] = [1, 3, 5], dilations: List[int] = [1, 3, 5],
): ):
super(ResBlock, self).__init__() super(ResBlock, self).__init__()
self.convs1 = nn.ModuleList() self.convs1 = nn.ModuleList()
@@ -100,6 +102,7 @@ class ResBlock(torch.nn.Module):
remove_weight_norm(self.convs1[idx]) remove_weight_norm(self.convs1[idx])
remove_weight_norm(self.convs2[idx]) remove_weight_norm(self.convs2[idx])
class SineGen(torch.nn.Module): class SineGen(torch.nn.Module):
""" Definition of sine generator """ Definition of sine generator
SineGen(samp_rate, harmonic_num = 0, SineGen(samp_rate, harmonic_num = 0,
@@ -231,13 +234,13 @@ class HiFTGenerator(nn.Module):
nsf_alpha: float = 0.1, nsf_alpha: float = 0.1,
nsf_sigma: float = 0.003, nsf_sigma: float = 0.003,
nsf_voiced_threshold: float = 10, nsf_voiced_threshold: float = 10,
upsample_rates: tp.List[int] = [8, 8], upsample_rates: List[int] = [8, 8],
upsample_kernel_sizes: tp.List[int] = [16, 16], upsample_kernel_sizes: List[int] = [16, 16],
istft_params: tp.Dict[str, int] = {"n_fft": 16, "hop_len": 4}, istft_params: Dict[str, int] = {"n_fft": 16, "hop_len": 4},
resblock_kernel_sizes: tp.List[int] = [3, 7, 11], resblock_kernel_sizes: List[int] = [3, 7, 11],
resblock_dilation_sizes: tp.List[tp.List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], resblock_dilation_sizes: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
source_resblock_kernel_sizes: tp.List[int] = [7, 11], source_resblock_kernel_sizes: List[int] = [7, 11],
source_resblock_dilation_sizes: tp.List[tp.List[int]] = [[1, 3, 5], [1, 3, 5]], source_resblock_dilation_sizes: List[List[int]] = [[1, 3, 5], [1, 3, 5]],
lrelu_slope: float = 0.1, lrelu_slope: float = 0.1,
audio_limit: float = 0.99, audio_limit: float = 0.99,
f0_predictor: torch.nn.Module = None, f0_predictor: torch.nn.Module = None,
@@ -286,8 +289,7 @@ class HiFTGenerator(nn.Module):
self.source_resblocks = nn.ModuleList() self.source_resblocks = nn.ModuleList()
downsample_rates = [1] + upsample_rates[::-1][:-1] downsample_rates = [1] + upsample_rates[::-1][:-1]
downsample_cum_rates = np.cumprod(downsample_rates) downsample_cum_rates = np.cumprod(downsample_rates)
for i, (u, k, d) in enumerate(zip(downsample_cum_rates[::-1], source_resblock_kernel_sizes, for i, (u, k, d) in enumerate(zip(downsample_cum_rates[::-1], source_resblock_kernel_sizes, source_resblock_dilation_sizes)):
source_resblock_dilation_sizes)):
if u == 1: if u == 1:
self.source_downs.append( self.source_downs.append(
Conv1d(istft_params["n_fft"] + 2, base_channels // (2 ** (i + 1)), 1, 1) Conv1d(istft_params["n_fft"] + 2, base_channels // (2 ** (i + 1)), 1, 1)
@@ -304,7 +306,7 @@ class HiFTGenerator(nn.Module):
self.resblocks = nn.ModuleList() self.resblocks = nn.ModuleList()
for i in range(len(self.ups)): for i in range(len(self.ups)):
ch = base_channels // (2**(i + 1)) ch = base_channels // (2**(i + 1))
for j, (k, d) in enumerate(zip(resblock_kernel_sizes, resblock_dilation_sizes)): for _, (k, d) in enumerate(zip(resblock_kernel_sizes, resblock_dilation_sizes)):
self.resblocks.append(ResBlock(ch, k, d)) self.resblocks.append(ResBlock(ch, k, d))
self.conv_post = weight_norm(Conv1d(ch, istft_params["n_fft"] + 2, 7, 1, padding=3)) self.conv_post = weight_norm(Conv1d(ch, istft_params["n_fft"] + 2, 7, 1, padding=3))
@@ -314,11 +316,19 @@ class HiFTGenerator(nn.Module):
self.stft_window = torch.from_numpy(get_window("hann", istft_params["n_fft"], fftbins=True).astype(np.float32)) self.stft_window = torch.from_numpy(get_window("hann", istft_params["n_fft"], fftbins=True).astype(np.float32))
self.f0_predictor = f0_predictor self.f0_predictor = f0_predictor
def _f02source(self, f0: torch.Tensor) -> torch.Tensor: def remove_weight_norm(self):
f0 = self.f0_upsamp(f0[:, None]).transpose(1, 2) # bs,n,t print('Removing weight norm...')
for l in self.ups:
har_source, _, _ = self.m_source(f0) remove_weight_norm(l)
return har_source.transpose(1, 2) for l in self.resblocks:
l.remove_weight_norm()
remove_weight_norm(self.conv_pre)
remove_weight_norm(self.conv_post)
self.m_source.remove_weight_norm()
for l in self.source_downs:
remove_weight_norm(l)
for l in self.source_resblocks:
l.remove_weight_norm()
def _stft(self, x): def _stft(self, x):
spec = torch.stft( spec = torch.stft(
@@ -332,13 +342,11 @@ class HiFTGenerator(nn.Module):
magnitude = torch.clip(magnitude, max=1e2) magnitude = torch.clip(magnitude, max=1e2)
real = magnitude * torch.cos(phase) real = magnitude * torch.cos(phase)
img = magnitude * torch.sin(phase) img = magnitude * torch.sin(phase)
inverse_transform = torch.istft(torch.complex(real, img), self.istft_params["n_fft"], self.istft_params["hop_len"], self.istft_params["n_fft"], window=self.stft_window.to(magnitude.device)) inverse_transform = torch.istft(torch.complex(real, img), self.istft_params["n_fft"], self.istft_params["hop_len"],
self.istft_params["n_fft"], window=self.stft_window.to(magnitude.device))
return inverse_transform return inverse_transform
def forward(self, x: torch.Tensor) -> torch.Tensor: def decode(self, x: torch.Tensor, s: torch.Tensor = torch.zeros(1, 1, 0)) -> torch.Tensor:
f0 = self.f0_predictor(x)
s = self._f02source(f0)
s_stft_real, s_stft_imag = self._stft(s.squeeze(1)) s_stft_real, s_stft_imag = self._stft(s.squeeze(1))
s_stft = torch.cat([s_stft_real, s_stft_imag], dim=1) s_stft = torch.cat([s_stft_real, s_stft_imag], dim=1)
@@ -372,20 +380,32 @@ class HiFTGenerator(nn.Module):
x = torch.clamp(x, -self.audio_limit, self.audio_limit) x = torch.clamp(x, -self.audio_limit, self.audio_limit)
return x return x
def remove_weight_norm(self): def forward(
print('Removing weight norm...') self,
for l in self.ups: batch: dict,
remove_weight_norm(l) device: torch.device,
for l in self.resblocks: ) -> Dict[str, Optional[torch.Tensor]]:
l.remove_weight_norm() speech_feat = batch['speech_feat'].transpose(1, 2).to(device)
remove_weight_norm(self.conv_pre) # mel->f0
remove_weight_norm(self.conv_post) f0 = self.f0_predictor(speech_feat)
self.source_module.remove_weight_norm() # f0->source
for l in self.source_downs: s = self.f0_upsamp(f0[:, None]).transpose(1, 2) # bs,n,t
remove_weight_norm(l) s, _, _ = self.m_source(s)
for l in self.source_resblocks: s = s.transpose(1, 2)
l.remove_weight_norm() # mel+source->speech
generated_speech = self.decode(x=speech_feat, s=s)
return generated_speech, f0
@torch.inference_mode() @torch.inference_mode()
def inference(self, mel: torch.Tensor) -> torch.Tensor: def inference(self, speech_feat: torch.Tensor, cache_source: torch.Tensor = torch.zeros(1, 1, 0)) -> torch.Tensor:
return self.forward(x=mel) # mel->f0
f0 = self.f0_predictor(speech_feat)
# f0->source
s = self.f0_upsamp(f0[:, None]).transpose(1, 2) # bs,n,t
s, _, _ = self.m_source(s)
s = s.transpose(1, 2)
# use cache_source to avoid glitch
if cache_source.shape[2] != 0:
s[:, :, :cache_source.shape[2]] = cache_source
generated_speech = self.decode(x=speech_feat, s=s)
return generated_speech, s
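In streaming use, inference overwrites the start of the freshly generated excitation s with cache_source from the previous chunk so the sine source stays phase-continuous across chunk boundaries. A hedged sketch of how a caller might thread that cache; hift and mel_chunks are placeholders, and the real pipeline additionally trims the overlapping audio, which is omitted here:

import torch

def stream_vocode(hift, mel_chunks):
    # hift: assumed HiFTGenerator instance; mel_chunks: iterable of (1, num_mels, T) tensors
    cache_source = torch.zeros(1, 1, 0)
    out = []
    for mel in mel_chunks:
        speech, source = hift.inference(speech_feat=mel, cache_source=cache_source)
        cache_source = source                  # reuse the excitation so the next chunk stays aligned
        out.append(speech)
    return torch.concat(out, dim=1)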


@@ -0,0 +1,67 @@
from typing import Dict, Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
from matcha.hifigan.models import feature_loss, generator_loss, discriminator_loss
from cosyvoice.utils.losses import tpr_loss, mel_loss
class HiFiGan(nn.Module):
def __init__(self, generator, discriminator, mel_spec_transform,
multi_mel_spectral_recon_loss_weight=45, feat_match_loss_weight=2.0,
tpr_loss_weight=1.0, tpr_loss_tau=0.04):
super(HiFiGan, self).__init__()
self.generator = generator
self.discriminator = discriminator
self.mel_spec_transform = mel_spec_transform
self.multi_mel_spectral_recon_loss_weight = multi_mel_spectral_recon_loss_weight
self.feat_match_loss_weight = feat_match_loss_weight
self.tpr_loss_weight = tpr_loss_weight
self.tpr_loss_tau = tpr_loss_tau
def forward(
self,
batch: dict,
device: torch.device,
) -> Dict[str, Optional[torch.Tensor]]:
if batch['turn'] == 'generator':
return self.forward_generator(batch, device)
else:
return self.forward_discriminator(batch, device)
def forward_generator(self, batch, device):
real_speech = batch['speech'].to(device)
pitch_feat = batch['pitch_feat'].to(device)
# 1. calculate generator outputs
generated_speech, generated_f0 = self.generator(batch, device)
# 2. calculate discriminator outputs
y_d_rs, y_d_gs, fmap_rs, fmap_gs = self.discriminator(real_speech, generated_speech)
# 3. calculate generator losses, feature loss, mel loss, tpr losses [Optional]
loss_gen, _ = generator_loss(y_d_gs)
loss_fm = feature_loss(fmap_rs, fmap_gs)
loss_mel = mel_loss(real_speech, generated_speech, self.mel_spec_transform)
if self.tpr_loss_weight != 0:
loss_tpr = tpr_loss(y_d_rs, y_d_gs, self.tpr_loss_tau)
else:
loss_tpr = torch.zeros(1).to(device)
loss_f0 = F.l1_loss(generated_f0, pitch_feat)
loss = loss_gen + self.feat_match_loss_weight * loss_fm + \
self.multi_mel_spectral_recon_loss_weight * loss_mel + \
self.tpr_loss_weight * loss_tpr + loss_f0
return {'loss': loss, 'loss_gen': loss_gen, 'loss_fm': loss_fm, 'loss_mel': loss_mel, 'loss_tpr': loss_tpr, 'loss_f0': loss_f0}
def forward_discriminator(self, batch, device):
real_speech = batch['speech'].to(device)
# 1. calculate generator outputs
with torch.no_grad():
generated_speech, generated_f0 = self.generator(batch, device)
# 2. calculate discriminator outputs
y_d_rs, y_d_gs, fmap_rs, fmap_gs = self.discriminator(real_speech, generated_speech)
# 3. calculate discriminator losses, tpr losses [Optional]
loss_disc, _, _ = discriminator_loss(y_d_rs, y_d_gs)
if self.tpr_loss_weight != 0:
loss_tpr = tpr_loss(y_d_rs, y_d_gs, self.tpr_loss_tau)
else:
loss_tpr = torch.zeros(1).to(device)
loss = loss_disc + self.tpr_loss_weight * loss_tpr
return {'loss': loss, 'loss_disc': loss_disc, 'loss_tpr': loss_tpr}
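HiFiGan.forward dispatches on batch['turn'], so one training step alternates a generator update and a discriminator update on the same batch. A minimal sketch of that alternation; the optimizers and the surrounding trainer loop are placeholders, not the repository's executor:

def gan_step(hifigan, batch, device, optimizer_g, optimizer_d):
    batch['turn'] = 'generator'
    info = hifigan(batch, device)              # generator + feature-matching + mel + tpr + f0 losses
    optimizer_g.zero_grad()
    info['loss'].backward()
    optimizer_g.step()

    batch['turn'] = 'discriminator'
    info = hifigan(batch, device)              # discriminator (+ optional tpr) losses
    optimizer_d.zero_grad()
    info['loss'].backward()
    optimizer_d.step()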


@@ -11,14 +11,16 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from typing import Dict, Optional, Union from typing import Dict, Optional, Callable, List, Generator
import torch import torch
from torch import nn from torch import nn
import torch.nn.functional as F import torch.nn.functional as F
from transformers import Qwen2ForCausalLM
from torch.nn.utils.rnn import pad_sequence, unpad_sequence from torch.nn.utils.rnn import pad_sequence, unpad_sequence
from cosyvoice.utils.common import IGNORE_ID from cosyvoice.utils.common import IGNORE_ID
from cosyvoice.transformer.label_smoothing_loss import LabelSmoothingLoss from cosyvoice.transformer.label_smoothing_loss import LabelSmoothingLoss
from cosyvoice.utils.common import th_accuracy from cosyvoice.utils.common import th_accuracy
from cosyvoice.utils.file_utils import logging
class TransformerLM(torch.nn.Module): class TransformerLM(torch.nn.Module):
@@ -31,6 +33,7 @@ class TransformerLM(torch.nn.Module):
speech_token_size: int, speech_token_size: int,
text_encoder: torch.nn.Module, text_encoder: torch.nn.Module,
llm: torch.nn.Module, llm: torch.nn.Module,
sampling: Callable,
length_normalized_loss: bool = True, length_normalized_loss: bool = True,
lsm_weight: float = 0.0, lsm_weight: float = 0.0,
spk_embed_dim: int = 192, spk_embed_dim: int = 192,
@@ -63,6 +66,9 @@ class TransformerLM(torch.nn.Module):
self.speech_embedding = torch.nn.Embedding(speech_token_size, llm_input_size) self.speech_embedding = torch.nn.Embedding(speech_token_size, llm_input_size)
self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, llm_input_size) self.spk_embed_affine_layer = torch.nn.Linear(spk_embed_dim, llm_input_size)
# 4. sampling method
self.sampling = sampling
def encode( def encode(
self, self,
text: torch.Tensor, text: torch.Tensor,
@@ -76,7 +82,8 @@ class TransformerLM(torch.nn.Module):
def pad_unpad_sequence(self, sos_eos_emb, embedding, text_token, text_token_len, task_id_emb, speech_token, speech_token_len): def pad_unpad_sequence(self, sos_eos_emb, embedding, text_token, text_token_len, task_id_emb, speech_token, speech_token_len):
text_token = unpad_sequence(text_token, text_token_len.cpu(), batch_first=True) text_token = unpad_sequence(text_token, text_token_len.cpu(), batch_first=True)
speech_token = unpad_sequence(speech_token, speech_token_len.cpu(), batch_first=True) speech_token = unpad_sequence(speech_token, speech_token_len.cpu(), batch_first=True)
lm_input = [torch.concat([sos_eos_emb.squeeze(dim=0), embedding[i], text_token[i], task_id_emb.squeeze(dim=0), speech_token[i]], dim=0) for i in range(len(text_token))] lm_input = [torch.concat([sos_eos_emb.squeeze(dim=0), embedding[i], text_token[i], task_id_emb.squeeze(dim=0), speech_token[i]], dim=0)
for i in range(len(text_token))]
lm_input_len = torch.tensor([i.size(0) for i in lm_input], dtype=torch.int32) lm_input_len = torch.tensor([i.size(0) for i in lm_input], dtype=torch.int32)
lm_input = pad_sequence(lm_input, batch_first=True, padding_value=IGNORE_ID) lm_input = pad_sequence(lm_input, batch_first=True, padding_value=IGNORE_ID)
return lm_input, lm_input_len return lm_input, lm_input_len
@@ -100,7 +107,8 @@ class TransformerLM(torch.nn.Module):
embedding = batch['embedding'].to(device) embedding = batch['embedding'].to(device)
# 1. prepare llm_target # 1. prepare llm_target
lm_target = [torch.tensor([IGNORE_ID] * (2 + text_token_len[i]) + speech_token[i, :speech_token_len[i]].tolist() + [self.speech_token_size]) for i in range(text_token.size(0))] lm_target = [torch.tensor([IGNORE_ID] * (2 + text_token_len[i]) + speech_token[i, :speech_token_len[i]].tolist() +
[self.speech_token_size]) for i in range(text_token.size(0))]
lm_target = pad_sequence(lm_target, batch_first=True, padding_value=IGNORE_ID).to(device) lm_target = pad_sequence(lm_target, batch_first=True, padding_value=IGNORE_ID).to(device)
# 1. encode text_token # 1. encode text_token
@@ -120,7 +128,8 @@ class TransformerLM(torch.nn.Module):
speech_token = self.speech_embedding(speech_token) speech_token = self.speech_embedding(speech_token)
# 5. unpad and pad # 5. unpad and pad
lm_input, lm_input_len = self.pad_unpad_sequence(sos_eos_emb, embedding, text_token, text_token_len, task_id_emb, speech_token, speech_token_len) lm_input, lm_input_len = self.pad_unpad_sequence(sos_eos_emb, embedding, text_token, text_token_len,
task_id_emb, speech_token, speech_token_len)
# 6. run lm forward # 6. run lm forward
lm_output, lm_output_mask = self.llm(lm_input, lm_input_len.to(device)) lm_output, lm_output_mask = self.llm(lm_input, lm_input_len.to(device))
@@ -132,16 +141,18 @@ class TransformerLM(torch.nn.Module):
def sampling_ids( def sampling_ids(
self, self,
weighted_scores: torch.Tensor, weighted_scores: torch.Tensor,
sampling: Union[bool, int, float] = True, decoded_tokens: List,
beam_size: int = 1, sampling: int,
ignore_eos: bool = True, ignore_eos: bool = True,
): ):
num_trials, max_trials = 0, 100
while True: while True:
prob, indices = weighted_scores.softmax(dim=-1).topk(sampling) top_ids = self.sampling(weighted_scores, decoded_tokens, sampling)
top_ids = prob.multinomial(beam_size, replacement=True)
top_ids = indices[top_ids]
if (not ignore_eos) or (self.speech_token_size not in top_ids): if (not ignore_eos) or (self.speech_token_size not in top_ids):
break break
num_trials += 1
if num_trials > max_trials:
raise RuntimeError('sampling reaches max_trials {} and still get eos when ignore_eos is True, check your input!'.format(max_trials))
return top_ids return top_ids
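sampling_ids now delegates the token choice to an injected self.sampling callable with signature (weighted_scores, decoded_tokens, sampling) and retries up to 100 times while EOS must be suppressed. The project's own sampler is not part of this diff, so the callable below is only a hypothetical top-k implementation illustrating the contract (it reproduces the logic that was removed from sampling_ids):

import torch

def topk_sampling(weighted_scores, decoded_tokens, sampling):
    # hypothetical sampler: draw one id from the top-`sampling` scores; the history argument is unused here
    prob, indices = weighted_scores.softmax(dim=-1).topk(sampling)
    return indices[prob.multinomial(1, replacement=True)]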
@torch.inference_mode() @torch.inference_mode()
@@ -154,11 +165,13 @@ class TransformerLM(torch.nn.Module):
prompt_speech_token: torch.Tensor, prompt_speech_token: torch.Tensor,
prompt_speech_token_len: torch.Tensor, prompt_speech_token_len: torch.Tensor,
embedding: torch.Tensor, embedding: torch.Tensor,
beam_size: int = 1,
sampling: int = 25, sampling: int = 25,
max_token_text_ratio: float = 20, max_token_text_ratio: float = 20,
min_token_text_ratio: float = 2, min_token_text_ratio: float = 2,
) -> torch.Tensor: ) -> Generator[torch.Tensor, None, None]:
if self.fp16 is True:
embedding = embedding.half()
device = text.device device = text.device
text = torch.concat([prompt_text, text], dim=1) text = torch.concat([prompt_text, text], dim=1)
text_len += prompt_text_len text_len += prompt_text_len
@@ -173,7 +186,7 @@ class TransformerLM(torch.nn.Module):
embedding = self.spk_embed_affine_layer(embedding) embedding = self.spk_embed_affine_layer(embedding)
embedding = embedding.unsqueeze(dim=1) embedding = embedding.unsqueeze(dim=1)
else: else:
embedding = torch.zeros(1, 0, self.llm_input_size).to(device) embedding = torch.zeros(1, 0, self.llm_input_size, dtype=text.dtype).to(device).to(text.dtype)
# 3. concat llm_input # 3. concat llm_input
sos_eos_emb = self.llm_embedding.weight[self.sos_eos].reshape(1, 1, -1) sos_eos_emb = self.llm_embedding.weight[self.sos_eos].reshape(1, 1, -1)
@@ -181,7 +194,7 @@ class TransformerLM(torch.nn.Module):
if prompt_speech_token_len != 0: if prompt_speech_token_len != 0:
prompt_speech_token_emb = self.speech_embedding(prompt_speech_token) prompt_speech_token_emb = self.speech_embedding(prompt_speech_token)
else: else:
prompt_speech_token_emb = torch.zeros(1, 0, self.llm_input_size).to(device) prompt_speech_token_emb = torch.zeros(1, 0, self.llm_input_size, dtype=text.dtype).to(device)
lm_input = torch.concat([sos_eos_emb, embedding, text, task_id_emb, prompt_speech_token_emb], dim=1) lm_input = torch.concat([sos_eos_emb, embedding, text, task_id_emb, prompt_speech_token_emb], dim=1)
# 4. cal min/max_length # 4. cal min/max_length
@@ -193,14 +206,385 @@ class TransformerLM(torch.nn.Module):
offset = 0 offset = 0
att_cache, cnn_cache = torch.zeros((0, 0, 0, 0), device=lm_input.device), torch.zeros((0, 0, 0, 0), device=lm_input.device) att_cache, cnn_cache = torch.zeros((0, 0, 0, 0), device=lm_input.device), torch.zeros((0, 0, 0, 0), device=lm_input.device)
for i in range(max_len): for i in range(max_len):
y_pred, att_cache, cnn_cache = self.llm.forward_chunk(lm_input, offset=0, required_cache_size=-1, att_cache=att_cache, cnn_cache=cnn_cache, y_pred, att_cache, cnn_cache = self.llm.forward_chunk(lm_input, offset=offset, required_cache_size=-1,
att_mask=torch.tril(torch.ones((1, lm_input.shape[1], lm_input.shape[1]), device=lm_input.device)).to(torch.bool)) att_cache=att_cache, cnn_cache=cnn_cache,
att_mask=torch.tril(torch.ones((1, lm_input.shape[1], lm_input.shape[1]),
device=lm_input.device)).to(torch.bool))
logp = self.llm_decoder(y_pred[:, -1]).log_softmax(dim=-1) logp = self.llm_decoder(y_pred[:, -1]).log_softmax(dim=-1)
top_ids = self.sampling_ids(logp.squeeze(dim=0), sampling, beam_size, ignore_eos=True if i < min_len else False).item() # force continue decode first token
if i == 0:
logp[:, self.speech_token_size] = -float('inf')
top_ids = self.sampling_ids(logp.squeeze(dim=0), out_tokens, sampling, ignore_eos=True if i < min_len else False).item()
if top_ids == self.speech_token_size: if top_ids == self.speech_token_size:
break break
# in stream mode, yield token one by one
yield top_ids
out_tokens.append(top_ids) out_tokens.append(top_ids)
offset += lm_input.size(1) offset += lm_input.size(1)
lm_input = self.speech_embedding.weight[top_ids].reshape(1, 1, -1) lm_input = self.speech_embedding.weight[top_ids].reshape(1, 1, -1)
return torch.tensor([out_tokens], dtype=torch.int64, device=device)
class Qwen2Encoder(torch.nn.Module):
def __init__(self, pretrain_path):
super().__init__()
self.model = Qwen2ForCausalLM.from_pretrained(pretrain_path)
def forward_one_step(self, xs, masks, cache=None):
input_masks = masks[:, -1, :]
outs = self.model(
inputs_embeds=xs,
attention_mask=input_masks,
output_hidden_states=True,
return_dict=True,
use_cache=True,
past_key_values=cache,
)
xs = outs.hidden_states[-1]
new_cache = outs.past_key_values
return xs, new_cache
class Qwen2LM(TransformerLM):
def __init__(
self,
llm_input_size: int,
llm_output_size: int,
speech_token_size: int,
llm: torch.nn.Module,
sampling: Callable,
length_normalized_loss: bool = True,
lsm_weight: float = 0.0,
mix_ratio: List[int] = [5, 15],
):
torch.nn.Module.__init__(self)
self.llm_input_size = llm_input_size
self.llm_output_size = llm_output_size
self.speech_token_size = speech_token_size
# 2. build speech token language model related modules
self.sos_eos = 0
self.task_id = 1
self.fill_token = 2
self.llm_embedding = torch.nn.Embedding(2, llm_input_size)
self.llm = llm
self.llm_decoder = nn.Linear(llm_output_size, speech_token_size + 3)
self.criterion_ce = LabelSmoothingLoss(
size=speech_token_size + 3,
padding_idx=IGNORE_ID,
smoothing=lsm_weight,
normalize_length=length_normalized_loss,
)
# 3. [Optional] build speech token related modules
self.speech_embedding = torch.nn.Embedding(speech_token_size + 3, llm_input_size)
# 4. sampling method
self.sampling = sampling
self.mix_ratio = mix_ratio
@torch.inference_mode()
def inference(
self,
text: torch.Tensor,
text_len: torch.Tensor,
prompt_text: torch.Tensor,
prompt_text_len: torch.Tensor,
prompt_speech_token: torch.Tensor,
prompt_speech_token_len: torch.Tensor,
embedding: torch.Tensor,
sampling: int = 25,
max_token_text_ratio: float = 20,
min_token_text_ratio: float = 2,
vllm_codec_engine=None,
) -> Generator[torch.Tensor, None, None]:
device = text.device
text = torch.concat([prompt_text, text], dim=1)
text_len += prompt_text_len
text = self.llm.model.model.embed_tokens(text)
# 3. concat llm_input
sos_eos_emb = self.llm_embedding.weight[self.sos_eos].reshape(1, 1, -1)
task_id_emb = self.llm_embedding.weight[self.task_id].reshape(1, 1, -1)
if prompt_speech_token_len != 0:
prompt_speech_token_emb = self.speech_embedding(prompt_speech_token)
else:
prompt_speech_token_emb = torch.zeros(1, 0, self.llm_input_size, dtype=text.dtype).to(device)
lm_input = torch.concat([sos_eos_emb, text, task_id_emb, prompt_speech_token_emb], dim=1)
# 4. cal min/max_length
min_len = int((text_len - prompt_text_len) * min_token_text_ratio)
max_len = int((text_len - prompt_text_len) * max_token_text_ratio)
# 5. step by step decode
if vllm_codec_engine is None:
out_tokens = []
cache = None
for i in range(max_len):
y_pred, cache = self.llm.forward_one_step(lm_input,
masks=torch.tril(torch.ones((1, lm_input.shape[1], lm_input.shape[1]), device=lm_input.device)).to(torch.bool),
cache=cache)
logp = self.llm_decoder(y_pred[:, -1]).log_softmax(dim=-1)
top_ids = self.sampling_ids(logp.squeeze(dim=0), out_tokens, sampling, ignore_eos=True if i < min_len else False).item()
if top_ids == self.speech_token_size:
break
if top_ids > self.speech_token_size:
continue
# in stream mode, yield token one by one
yield top_ids
out_tokens.append(top_ids)
lm_input = self.speech_embedding.weight[top_ids].reshape(1, 1, -1)
else:
from vllm import SamplingParams, RequestOutput
import uuid
sampling_params = SamplingParams(top_k=sampling,
stop_token_ids=[6561, 6563],
min_tokens=min_len,
max_tokens=max_len)
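# NOTE (editor's assumption, not part of the diff): the hard-coded stop ids appear to be this model's
# speech_token_size (6561, the eos id) and speech_token_size + 2 (6563, the fill token);
# if the codebook size ever changes, stop_token_ids has to change with it.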
request_id = uuid.uuid4()
vllm_codec_engine.add_request(request_id,
{"prompt_embeds": lm_input.squeeze(0).to(torch.bfloat16).to(device)},
sampling_params)
while True:
speech_token_break = False
request_outputs: List[RequestOutput] = vllm_codec_engine.step()
for request_output in request_outputs:
if str(request_output.request_id) != str(request_id):
continue
# print(f"request output: {request_output}")
top_ids = list(request_output.outputs[0].token_ids)[-1]
if top_ids == self.speech_token_size:
speech_token_break = True
break
if top_ids > self.speech_token_size:
continue
yield top_ids
if not vllm_codec_engine.has_unfinished_requests() or speech_token_break:
break
@torch.inference_mode()
def inference_bistream(
self,
text: Generator,
prompt_text: torch.Tensor,
prompt_text_len: torch.Tensor,
prompt_speech_token: torch.Tensor,
prompt_speech_token_len: torch.Tensor,
embedding: torch.Tensor,
sampling: int = 25,
max_token_text_ratio: float = 20,
min_token_text_ratio: float = 2,
) -> Generator[torch.Tensor, None, None]:
device = prompt_text.device
# 1. prepare input
sos_eos_emb = self.llm_embedding.weight[self.sos_eos].reshape(1, 1, -1)
task_id_emb = self.llm_embedding.weight[self.task_id].reshape(1, 1, -1)
if prompt_speech_token_len != 0:
prompt_speech_token_emb = self.speech_embedding(prompt_speech_token)
else:
prompt_speech_token_emb = torch.zeros(1, 0, self.llm_input_size, dtype=prompt_text.dtype).to(device)
lm_input = torch.concat([sos_eos_emb], dim=1)
# 2. iterate text
out_tokens = []
cache = None
# NOTE init prompt_text as text_cache as it is basically impossible prompt_speech_token/prompt_text < 15/5
text_cache = self.llm.model.model.embed_tokens(prompt_text)
next_fill_index = -1
for this_text in text:
text_cache = torch.concat([text_cache, self.llm.model.model.embed_tokens(this_text)], dim=1)
# prompt_speech_token_emb not empty, try append to lm_input
while prompt_speech_token_emb.size(1) != 0:
if text_cache.size(1) >= self.mix_ratio[0]:
lm_input_text, lm_input_speech = text_cache[:, :self.mix_ratio[0]], prompt_speech_token_emb[:, :self.mix_ratio[1]]
logging.info('append {} text token {} speech token'.format(lm_input_text.size(1), lm_input_speech.size(1)))
lm_input = torch.concat([lm_input, lm_input_text, lm_input_speech], dim=1)
text_cache, prompt_speech_token_emb = text_cache[:, self.mix_ratio[0]:], prompt_speech_token_emb[:, self.mix_ratio[1]:]
else:
logging.info('not enough text token to decode, wait for more')
break
# no prompt_speech_token_emb remain, can decode some speech token
if prompt_speech_token_emb.size(1) == 0:
if (len(out_tokens) != 0 and out_tokens[-1] == self.speech_token_size + 2) or (len(out_tokens) == 0 and lm_input.size(1) == 1):
logging.info('get fill token, need to append more text token')
if text_cache.size(1) >= self.mix_ratio[0]:
lm_input_text = text_cache[:, :self.mix_ratio[0]]
logging.info('append {} text token'.format(lm_input_text.size(1)))
if len(out_tokens) != 0 and out_tokens[-1] == self.speech_token_size + 2:
lm_input = lm_input_text
else:
lm_input = torch.concat([lm_input, lm_input_text], dim=1)
text_cache = text_cache[:, self.mix_ratio[0]:]
else:
logging.info('not enough text token to decode, wait for more')
continue
while True:
seq_len = lm_input.shape[1] if cache is None else lm_input.shape[1] + cache[0][0].size(2)
y_pred, cache = self.llm.forward_one_step(lm_input,
masks=torch.tril(torch.ones((1, seq_len, seq_len), device=lm_input.device)).to(torch.bool),
cache=cache)
logp = self.llm_decoder(y_pred[:, -1]).log_softmax(dim=-1)
if next_fill_index != -1 and len(out_tokens) == next_fill_index:
top_ids = self.speech_token_size + 2
next_fill_index += (self.mix_ratio[1] + 1)
else:
top_ids = self.sampling_ids(logp.squeeze(dim=0), out_tokens, sampling, ignore_eos=True).item()
if top_ids == self.speech_token_size + 2:
next_fill_index = len(out_tokens) + self.mix_ratio[1] + 1
logging.info('fill_token index {} next fill_token index {}'.format(len(out_tokens), next_fill_index))
out_tokens.append(top_ids)
if top_ids >= self.speech_token_size:
if top_ids == self.speech_token_size + 2:
break
else:
raise ValueError('should not get token {}'.format(top_ids))
yield top_ids
lm_input = self.speech_embedding.weight[top_ids].reshape(1, 1, -1)
# 3. final decode
lm_input = torch.concat([lm_input, text_cache, task_id_emb], dim=1)
logging.info('no more text token, decode until met eos')
while True:
seq_len = lm_input.shape[1] if cache is None else lm_input.shape[1] + cache[0][0].size(2)
y_pred, cache = self.llm.forward_one_step(lm_input,
masks=torch.tril(torch.ones((1, seq_len, seq_len), device=lm_input.device)).to(torch.bool),
cache=cache)
logp = self.llm_decoder(y_pred[:, -1]).log_softmax(dim=-1)
top_ids = self.sampling_ids(logp.squeeze(dim=0), out_tokens, sampling, ignore_eos=False).item()
out_tokens.append(top_ids)
if top_ids >= self.speech_token_size:
if top_ids == self.speech_token_size:
break
else:
raise ValueError('should not get token {}'.format(top_ids))
# in stream mode, yield token one by one
yield top_ids
lm_input = self.speech_embedding.weight[top_ids].reshape(1, 1, -1)
@torch.inference_mode()
def inference_bistream_vllm(
self,
text: Generator,
prompt_text: torch.Tensor,
prompt_text_len: torch.Tensor,
prompt_speech_token: torch.Tensor,
prompt_speech_token_len: torch.Tensor,
embedding: torch.Tensor,
sampling: int = 25,
max_token_text_ratio: float = 20,
min_token_text_ratio: float = 2,
vllm_codec_engine=None,
) -> Generator[torch.Tensor, None, None]:
device = prompt_text.device
# 1. prepare input
sos_eos_emb = self.llm_embedding.weight[self.sos_eos].reshape(1, 1, -1)
task_id_emb = self.llm_embedding.weight[self.task_id].reshape(1, 1, -1)
if prompt_speech_token_len != 0:
prompt_speech_token_emb = self.speech_embedding(prompt_speech_token)
else:
prompt_speech_token_emb = torch.zeros(1, 0, self.llm_input_size, dtype=prompt_text.dtype).to(device)
lm_input = torch.concat([sos_eos_emb], dim=1)
# 2. iterate text
out_tokens = []
cache = None
# NOTE init prompt_text as text_cache as it is basically impossible prompt_speech_token/prompt_text < 15/5
text_cache = self.llm.model.model.embed_tokens(prompt_text)
next_fill_index = -1
from vllm import SamplingParams, RequestOutput
import uuid
sampling_params = SamplingParams(top_k=sampling,
stop_token_ids=[6561, 6563],
max_tokens=10000)
for this_text in text:
text_cache = torch.concat([text_cache, self.llm.model.model.embed_tokens(this_text)], dim=1)
# prompt_speech_token_emb not empty, try append to lm_input
while prompt_speech_token_emb.size(1) != 0:
if text_cache.size(1) >= self.mix_ratio[0]:
lm_input_text, lm_input_speech = text_cache[:, :self.mix_ratio[0]], prompt_speech_token_emb[:, :self.mix_ratio[1]]
logging.info('append {} text token {} speech token'.format(lm_input_text.size(1), lm_input_speech.size(1)))
lm_input = torch.concat([lm_input, lm_input_text, lm_input_speech], dim=1)
text_cache, prompt_speech_token_emb = text_cache[:, self.mix_ratio[0]:], prompt_speech_token_emb[:, self.mix_ratio[1]:]
else:
logging.info('not enough text token to decode, wait for more')
break
# no prompt_speech_token_emb remain, can decode some speech token
if prompt_speech_token_emb.size(1) == 0:
if (len(out_tokens) != 0 and out_tokens[-1] == self.speech_token_size + 2) or (len(out_tokens) == 0 and lm_input.size(1) == 1):
logging.info('get fill token, need to append more text token')
if text_cache.size(1) >= self.mix_ratio[0]:
lm_input_text = text_cache[:, :self.mix_ratio[0]]
logging.info('append {} text token'.format(lm_input_text.size(1)))
if vllm_codec_engine is None and len(out_tokens) != 0 and out_tokens[-1] == self.speech_token_size + 2:
lm_input = lm_input_text
else:
lm_input = torch.concat([lm_input, lm_input_text], dim=1)
text_cache = text_cache[:, self.mix_ratio[0]:]
else:
logging.info('not enough text token to decode, wait for more')
continue
request_id = uuid.uuid4()
vllm_codec_engine.add_request(request_id,
{"prompt_embeds": lm_input.squeeze(0).to(torch.bfloat16).to(device)},
sampling_params)
## generator
while True:
speech_token_break = False
request_outputs: List[RequestOutput] = vllm_codec_engine.step()
for request_output in request_outputs:
if str(request_output.request_id) != str(request_id):
continue
# print(f"request output: {request_output}")
out_token = list(request_output.outputs[0].token_ids)[-1]
if next_fill_index != -1 and len(out_tokens) == next_fill_index:
top_ids = self.speech_token_size + 2
next_fill_index += (self.mix_ratio[1] + 1)
else:
top_ids = out_token
if top_ids == self.speech_token_size + 2:
next_fill_index = len(out_tokens) + self.mix_ratio[1] + 1
logging.info('fill_token index {} next fill_token index {}'.format(len(out_tokens), next_fill_index))
out_tokens.append(top_ids)
if top_ids >= self.speech_token_size:
if top_ids == self.speech_token_size + 2:
speech_token_break = True
break
else:
raise ValueError('should not get token {}'.format(top_ids))
yield top_ids
token_embedding = self.speech_embedding.weight[top_ids].reshape(1, 1, -1)
lm_input = torch.concat([lm_input, token_embedding], dim=1)
if not vllm_codec_engine.has_unfinished_requests() or speech_token_break:
break
# 3. final decode
lm_input = torch.concat([lm_input, text_cache, task_id_emb], dim=1)
logging.info('no more text token, decode until met eos')
request_id = uuid.uuid4()
vllm_codec_engine.add_request(request_id,
{"prompt_embeds": lm_input.squeeze(0).to(torch.bfloat16).to(device)},
sampling_params)
## generator
while True:
speech_token_break = False
request_outputs: List[RequestOutput] = vllm_codec_engine.step()
for request_output in request_outputs:
if str(request_output.request_id) != str(request_id):
continue
# print(f"request output: {request_output}")
top_ids = list(request_output.outputs[0].token_ids)[-1]
out_tokens.append(top_ids)
if top_ids >= self.speech_token_size:
if top_ids == self.speech_token_size:
speech_token_break = True
break
else:
raise ValueError('should not get token {}'.format(top_ids))
# in stream mode, yield token one by one
yield top_ids
if not vllm_codec_engine.has_unfinished_requests() or speech_token_break:
break

File diff suppressed because it is too large

@@ -0,0 +1,279 @@
import base64
import os
from functools import lru_cache
from typing import Optional
import torch
from transformers import AutoTokenizer
from whisper.tokenizer import Tokenizer
import tiktoken
LANGUAGES = {
"en": "english",
"zh": "chinese",
"de": "german",
"es": "spanish",
"ru": "russian",
"ko": "korean",
"fr": "french",
"ja": "japanese",
"pt": "portuguese",
"tr": "turkish",
"pl": "polish",
"ca": "catalan",
"nl": "dutch",
"ar": "arabic",
"sv": "swedish",
"it": "italian",
"id": "indonesian",
"hi": "hindi",
"fi": "finnish",
"vi": "vietnamese",
"he": "hebrew",
"uk": "ukrainian",
"el": "greek",
"ms": "malay",
"cs": "czech",
"ro": "romanian",
"da": "danish",
"hu": "hungarian",
"ta": "tamil",
"no": "norwegian",
"th": "thai",
"ur": "urdu",
"hr": "croatian",
"bg": "bulgarian",
"lt": "lithuanian",
"la": "latin",
"mi": "maori",
"ml": "malayalam",
"cy": "welsh",
"sk": "slovak",
"te": "telugu",
"fa": "persian",
"lv": "latvian",
"bn": "bengali",
"sr": "serbian",
"az": "azerbaijani",
"sl": "slovenian",
"kn": "kannada",
"et": "estonian",
"mk": "macedonian",
"br": "breton",
"eu": "basque",
"is": "icelandic",
"hy": "armenian",
"ne": "nepali",
"mn": "mongolian",
"bs": "bosnian",
"kk": "kazakh",
"sq": "albanian",
"sw": "swahili",
"gl": "galician",
"mr": "marathi",
"pa": "punjabi",
"si": "sinhala",
"km": "khmer",
"sn": "shona",
"yo": "yoruba",
"so": "somali",
"af": "afrikaans",
"oc": "occitan",
"ka": "georgian",
"be": "belarusian",
"tg": "tajik",
"sd": "sindhi",
"gu": "gujarati",
"am": "amharic",
"yi": "yiddish",
"lo": "lao",
"uz": "uzbek",
"fo": "faroese",
"ht": "haitian creole",
"ps": "pashto",
"tk": "turkmen",
"nn": "nynorsk",
"mt": "maltese",
"sa": "sanskrit",
"lb": "luxembourgish",
"my": "myanmar",
"bo": "tibetan",
"tl": "tagalog",
"mg": "malagasy",
"as": "assamese",
"tt": "tatar",
"haw": "hawaiian",
"ln": "lingala",
"ha": "hausa",
"ba": "bashkir",
"jw": "javanese",
"su": "sundanese",
"yue": "cantonese",
"minnan": "minnan",
"wuyu": "wuyu",
"dialect": "dialect",
"zh/en": "zh/en",
"en/zh": "en/zh",
}
# language code lookup by name, with a few language aliases
TO_LANGUAGE_CODE = {
**{language: code for code, language in LANGUAGES.items()},
"burmese": "my",
"valencian": "ca",
"flemish": "nl",
"haitian": "ht",
"letzeburgesch": "lb",
"pushto": "ps",
"panjabi": "pa",
"moldavian": "ro",
"moldovan": "ro",
"sinhalese": "si",
"castilian": "es",
"mandarin": "zh",
}
AUDIO_EVENT = {
"ASR": "ASR",
"AED": "AED",
"SER": "SER",
"Speech": "Speech",
"/Speech": "/Speech",
"BGM": "BGM",
"/BGM": "/BGM",
"Laughter": "Laughter",
"/Laughter": "/Laughter",
"Applause": "Applause",
"/Applause": "/Applause",
}
EMOTION = {
"HAPPY": "HAPPY",
"SAD": "SAD",
"ANGRY": "ANGRY",
"NEUTRAL": "NEUTRAL",
}
TTS_Vocal_Token = {
"TTS/B": "TTS/B",
"TTS/O": "TTS/O",
"TTS/Q": "TTS/Q",
"TTS/A": "TTS/A",
"TTS/CO": "TTS/CO",
"TTS/CL": "TTS/CL",
"TTS/H": "TTS/H",
**{f"TTS/SP{i:02d}": f"TTS/SP{i:02d}" for i in range(1, 14)}
}
@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99):
vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
ranks = {
base64.b64decode(token): int(rank)
for token, rank in (line.split() for line in open(vocab_path) if line)
}
n_vocab = len(ranks)
special_tokens = {}
specials = [
"<|endoftext|>",
"<|startoftranscript|>",
*[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
*[f"<|{audio_event}|>" for audio_event in list(AUDIO_EVENT.keys())],
*[f"<|{emotion}|>" for emotion in list(EMOTION.keys())],
"<|translate|>",
"<|transcribe|>",
"<|startoflm|>",
"<|startofprev|>",
"<|nospeech|>",
"<|notimestamps|>",
*[f"<|SPECIAL_TOKEN_{i}|>" for i in range(1, 31)], # register special tokens for ASR
*[f"<|{tts}|>" for tts in list(TTS_Vocal_Token.keys())], # register special tokens for TTS
*[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
]
for token in specials:
special_tokens[token] = n_vocab
n_vocab += 1
return tiktoken.Encoding(
name=os.path.basename(vocab_path),
explicit_n_vocab=n_vocab,
pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
mergeable_ranks=ranks,
special_tokens=special_tokens,
)
@lru_cache(maxsize=None)
def get_tokenizer(
multilingual: bool,
*,
num_languages: int = 99,
language: Optional[str] = None,
task: Optional[str] = None, # Literal["transcribe", "translate", None]
) -> Tokenizer:
if language is not None:
language = language.lower()
if language not in LANGUAGES:
if language in TO_LANGUAGE_CODE:
language = TO_LANGUAGE_CODE[language]
else:
raise ValueError(f"Unsupported language: {language}")
if multilingual:
encoding_name = "multilingual_zh_ja_yue_char_del"
language = language or "en"
task = task or "transcribe"
else:
encoding_name = "gpt2"
language = None
task = None
encoding = get_encoding(name=encoding_name, num_languages=num_languages)
return Tokenizer(
encoding=encoding, num_languages=num_languages, language=language, task=task
)
class QwenTokenizer():
def __init__(self, token_path, skip_special_tokens=True):
super().__init__()
# NOTE: non-chat model, all these special tokens keep randomly initialized.
special_tokens = {
'eos_token': '<|endoftext|>',
'pad_token': '<|endoftext|>',
'additional_special_tokens': [
'<|im_start|>', '<|im_end|>', '<|endofprompt|>',
'[breath]', '<strong>', '</strong>', '[noise]',
'[laughter]', '[cough]', '[clucking]', '[accent]',
'[quick_breath]',
"<laughter>", "</laughter>",
"[hissing]", "[sigh]", "[vocalized-noise]",
"[lipsmack]", "[mn]"
]
}
self.special_tokens = special_tokens
self.tokenizer = AutoTokenizer.from_pretrained(token_path)
self.tokenizer.add_special_tokens(special_tokens)
self.skip_special_tokens = skip_special_tokens
def encode(self, text, **kwargs):
tokens = self.tokenizer([text], return_tensors="pt")
tokens = tokens["input_ids"][0].cpu().tolist()
return tokens
def decode(self, tokens):
tokens = torch.tensor(tokens, dtype=torch.int64)
text = self.tokenizer.batch_decode([tokens], skip_special_tokens=self.skip_special_tokens)[0]
return text
@lru_cache(maxsize=None)
def get_qwen_tokenizer(
token_path: str,
skip_special_tokens: bool
) -> QwenTokenizer:
return QwenTokenizer(token_path=token_path, skip_special_tokens=skip_special_tokens)
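A minimal usage sketch of get_qwen_tokenizer above (editor's addition, not part of the diff); the token_path is illustrative and assumes a locally downloaded Qwen tokenizer directory.

tokenizer = get_qwen_tokenizer(token_path='pretrained_models/CosyVoice2-0.5B/CosyVoice-BlankEN', skip_special_tokens=True)
ids = tokenizer.encode('hello world [breath]')
print(ids, tokenizer.decode(ids))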


@@ -222,7 +222,7 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
        torch.nn.init.xavier_uniform_(self.pos_bias_u)
        torch.nn.init.xavier_uniform_(self.pos_bias_v)

-    def rel_shift(self, x):
+    def rel_shift(self, x: torch.Tensor) -> torch.Tensor:
        """Compute relative positional encoding.

        Args:
@@ -233,10 +233,14 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
            torch.Tensor: Output tensor.
        """
-        zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)
+        zero_pad = torch.zeros((x.size()[0], x.size()[1], x.size()[2], 1),
+                               device=x.device,
+                               dtype=x.dtype)
        x_padded = torch.cat([zero_pad, x], dim=-1)

-        x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))
+        x_padded = x_padded.view(x.size()[0],
+                                 x.size()[1],
+                                 x.size(3) + 1, x.size(2))
        x = x_padded[:, :, 1:].view_as(x)[
            :, :, :, : x.size(-1) // 2 + 1
        ]  # only keep the positions from 0 to time2


@@ -174,7 +174,7 @@ class TransformerDecoder(torch.nn.Module):
                                       memory_mask)
        return x

-    @torch.jit.ignore(drop=True)
+    @torch.jit.unused
    def forward_layers_checkpointed(self, x: torch.Tensor,
                                    tgt_mask: torch.Tensor,
                                    memory: torch.Tensor,


@@ -212,7 +212,7 @@ class EspnetRelPositionalEncoding(torch.nn.Module):
    """

-    def __init__(self, d_model, dropout_rate, max_len=5000):
+    def __init__(self, d_model: int, dropout_rate: float, max_len: int = 5000):
        """Construct an PositionalEncoding object."""
        super(EspnetRelPositionalEncoding, self).__init__()
        self.d_model = d_model
@@ -221,7 +221,7 @@ class EspnetRelPositionalEncoding(torch.nn.Module):
        self.pe = None
        self.extend_pe(torch.tensor(0.0).expand(1, max_len))

-    def extend_pe(self, x):
+    def extend_pe(self, x: torch.Tensor):
        """Reset the positional encodings."""
        if self.pe is not None:
            # self.pe contains both positive and negative parts
@@ -253,7 +253,8 @@ class EspnetRelPositionalEncoding(torch.nn.Module):
        pe = torch.cat([pe_positive, pe_negative], dim=1)
        self.pe = pe.to(device=x.device, dtype=x.dtype)

-    def forward(self, x: torch.Tensor, offset: Union[int, torch.Tensor] = 0):
+    def forward(self, x: torch.Tensor, offset: Union[int, torch.Tensor] = 0) \
+            -> Tuple[torch.Tensor, torch.Tensor]:
        """Add positional encoding.

        Args:
@@ -288,6 +289,6 @@ class EspnetRelPositionalEncoding(torch.nn.Module):
        """
        pos_emb = self.pe[
            :,
-            self.pe.size(1) // 2 - size + 1 : self.pe.size(1) // 2 + size,
+            self.pe.size(1) // 2 - size + 1: self.pe.size(1) // 2 + size,
        ]
        return pos_emb


@@ -169,7 +169,7 @@ class BaseEncoder(torch.nn.Module):
            xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
        return xs

-    @torch.jit.ignore(drop=True)
+    @torch.jit.unused
    def forward_layers_checkpointed(self, xs: torch.Tensor,
                                    chunk_masks: torch.Tensor,
                                    pos_emb: torch.Tensor,
@@ -180,6 +180,7 @@ class BaseEncoder(torch.nn.Module):
                                    mask_pad)
        return xs

+    @torch.jit.export
    def forward_chunk(
        self,
        xs: torch.Tensor,
@@ -270,6 +271,7 @@ class BaseEncoder(torch.nn.Module):
        return (xs, r_att_cache, r_cnn_cache)

+    @torch.jit.unused
    def forward_chunk_by_chunk(
        self,
        xs: torch.Tensor,


@@ -49,8 +49,8 @@ class TransformerEncoderLayer(nn.Module):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
-        self.norm1 = nn.LayerNorm(size, eps=1e-5)
-        self.norm2 = nn.LayerNorm(size, eps=1e-5)
+        self.norm1 = nn.LayerNorm(size, eps=1e-12)
+        self.norm2 = nn.LayerNorm(size, eps=1e-12)
        self.dropout = nn.Dropout(dropout_rate)
        self.size = size
        self.normalize_before = normalize_before
@@ -142,17 +142,17 @@ class ConformerEncoderLayer(nn.Module):
        self.feed_forward = feed_forward
        self.feed_forward_macaron = feed_forward_macaron
        self.conv_module = conv_module
-        self.norm_ff = nn.LayerNorm(size, eps=1e-5)  # for the FNN module
-        self.norm_mha = nn.LayerNorm(size, eps=1e-5)  # for the MHA module
+        self.norm_ff = nn.LayerNorm(size, eps=1e-12)  # for the FNN module
+        self.norm_mha = nn.LayerNorm(size, eps=1e-12)  # for the MHA module
        if feed_forward_macaron is not None:
-            self.norm_ff_macaron = nn.LayerNorm(size, eps=1e-5)
+            self.norm_ff_macaron = nn.LayerNorm(size, eps=1e-12)
            self.ff_scale = 0.5
        else:
            self.ff_scale = 1.0
        if self.conv_module is not None:
-            self.norm_conv = nn.LayerNorm(size, eps=1e-5)  # for the CNN module
+            self.norm_conv = nn.LayerNorm(size, eps=1e-12)  # for the CNN module
            self.norm_final = nn.LayerNorm(
-                size, eps=1e-5)  # for the final output of the block
+                size, eps=1e-12)  # for the final output of the block
        self.dropout = nn.Dropout(dropout_rate)
        self.size = size
        self.normalize_before = normalize_before


@@ -0,0 +1,318 @@
# Copyright (c) 2021 Mobvoi Inc (Binbin Zhang, Di Wu)
# 2022 Xingchen Song (sxc19@mails.tsinghua.edu.cn)
# 2024 Alibaba Inc (Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from ESPnet(https://github.com/espnet/espnet)
"""Encoder definition."""
from typing import Tuple
import torch
from torch import nn
from torch.nn import functional as F
from cosyvoice.transformer.convolution import ConvolutionModule
from cosyvoice.transformer.encoder_layer import ConformerEncoderLayer
from cosyvoice.transformer.positionwise_feed_forward import PositionwiseFeedForward
from cosyvoice.utils.class_utils import (
COSYVOICE_EMB_CLASSES,
COSYVOICE_SUBSAMPLE_CLASSES,
COSYVOICE_ATTENTION_CLASSES,
COSYVOICE_ACTIVATION_CLASSES,
)
from cosyvoice.utils.mask import make_pad_mask
from cosyvoice.utils.mask import add_optional_chunk_mask
class Upsample1D(nn.Module):
"""A 1D upsampling layer with an optional convolution.
Parameters:
channels (`int`):
number of channels in the inputs and outputs.
use_conv (`bool`, default `False`):
option to use a convolution.
use_conv_transpose (`bool`, default `False`):
option to use a convolution transpose.
out_channels (`int`, optional):
number of output channels. Defaults to `channels`.
"""
def __init__(self, channels: int, out_channels: int, stride: int = 2):
super().__init__()
self.channels = channels
self.out_channels = out_channels
self.stride = stride
# In this mode, first repeat interpolate, than conv with stride=1
self.conv = nn.Conv1d(self.channels, self.out_channels, stride * 2 + 1, stride=1, padding=0)
def forward(self, inputs: torch.Tensor, input_lengths: torch.Tensor):
outputs = F.interpolate(inputs, scale_factor=float(self.stride), mode="nearest")
outputs = F.pad(outputs, (self.stride * 2, 0), value=0.0)
outputs = self.conv(outputs)
return outputs, input_lengths * self.stride
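A quick shape sketch for Upsample1D (editor's addition, not part of the diff): nearest-neighbour interpolation doubles the time axis, and the left pad of stride * 2 plus a kernel of stride * 2 + 1 keeps that doubled length, so a 50-frame input comes out as 100 frames.

up = Upsample1D(channels=512, out_channels=512, stride=2)
x = torch.randn(1, 512, 50)                       # (batch, channels, time)
y, y_len = up(x, torch.tensor([50]))
print(y.shape, y_len)                             # torch.Size([1, 512, 100]) tensor([100])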
class PreLookaheadLayer(nn.Module):
def __init__(self, channels: int, pre_lookahead_len: int = 1):
super().__init__()
self.channels = channels
self.pre_lookahead_len = pre_lookahead_len
self.conv1 = nn.Conv1d(
channels, channels,
kernel_size=pre_lookahead_len + 1,
stride=1, padding=0,
)
self.conv2 = nn.Conv1d(
channels, channels,
kernel_size=3, stride=1, padding=0,
)
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
"""
inputs: (batch_size, seq_len, channels)
"""
outputs = inputs.transpose(1, 2).contiguous()
# look ahead
outputs = F.pad(outputs, (0, self.pre_lookahead_len), mode='constant', value=0.0)
outputs = F.leaky_relu(self.conv1(outputs))
# outputs
outputs = F.pad(outputs, (2, 0), mode='constant', value=0.0)
outputs = self.conv2(outputs)
outputs = outputs.transpose(1, 2).contiguous()
# residual connection
outputs = outputs + inputs
return outputs
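Similarly, a shape sketch for PreLookaheadLayer (editor's addition): the right pad of pre_lookahead_len for conv1 and the left pad of 2 for conv2 keep the sequence length unchanged, so the residual connection lines up.

layer = PreLookaheadLayer(channels=512, pre_lookahead_len=3)
x = torch.randn(1, 50, 512)                       # (batch, time, channels)
print(layer(x).shape)                             # torch.Size([1, 50, 512])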
class UpsampleConformerEncoder(torch.nn.Module):
def __init__(
self,
input_size: int,
output_size: int = 256,
attention_heads: int = 4,
linear_units: int = 2048,
num_blocks: int = 6,
dropout_rate: float = 0.1,
positional_dropout_rate: float = 0.1,
attention_dropout_rate: float = 0.0,
input_layer: str = "conv2d",
pos_enc_layer_type: str = "rel_pos",
normalize_before: bool = True,
static_chunk_size: int = 0,
use_dynamic_chunk: bool = False,
global_cmvn: torch.nn.Module = None,
use_dynamic_left_chunk: bool = False,
positionwise_conv_kernel_size: int = 1,
macaron_style: bool = True,
selfattention_layer_type: str = "rel_selfattn",
activation_type: str = "swish",
use_cnn_module: bool = True,
cnn_module_kernel: int = 15,
causal: bool = False,
cnn_module_norm: str = "batch_norm",
key_bias: bool = True,
gradient_checkpointing: bool = False,
):
"""
Args:
input_size (int): input dim
output_size (int): dimension of attention
attention_heads (int): the number of heads of multi head attention
linear_units (int): the hidden units number of position-wise feed
forward
num_blocks (int): the number of decoder blocks
dropout_rate (float): dropout rate
attention_dropout_rate (float): dropout rate in attention
positional_dropout_rate (float): dropout rate after adding
positional encoding
input_layer (str): input layer type.
optional [linear, conv2d, conv2d6, conv2d8]
pos_enc_layer_type (str): Encoder positional encoding layer type.
opitonal [abs_pos, scaled_abs_pos, rel_pos, no_pos]
normalize_before (bool):
True: use layer_norm before each sub-block of a layer.
False: use layer_norm after each sub-block of a layer.
static_chunk_size (int): chunk size for static chunk training and
decoding
use_dynamic_chunk (bool): whether use dynamic chunk size for
training or not, You can only use fixed chunk(chunk_size > 0)
or dyanmic chunk size(use_dynamic_chunk = True)
global_cmvn (Optional[torch.nn.Module]): Optional GlobalCMVN module
use_dynamic_left_chunk (bool): whether use dynamic left chunk in
dynamic chunk training
key_bias: whether use bias in attention.linear_k, False for whisper models.
gradient_checkpointing: rerunning a forward-pass segment for each
checkpointed segment during backward.
"""
super().__init__()
self._output_size = output_size
self.global_cmvn = global_cmvn
self.embed = COSYVOICE_SUBSAMPLE_CLASSES[input_layer](
input_size,
output_size,
dropout_rate,
COSYVOICE_EMB_CLASSES[pos_enc_layer_type](output_size,
positional_dropout_rate),
)
self.normalize_before = normalize_before
self.after_norm = torch.nn.LayerNorm(output_size, eps=1e-5)
self.static_chunk_size = static_chunk_size
self.use_dynamic_chunk = use_dynamic_chunk
self.use_dynamic_left_chunk = use_dynamic_left_chunk
self.gradient_checkpointing = gradient_checkpointing
activation = COSYVOICE_ACTIVATION_CLASSES[activation_type]()
# self-attention module definition
encoder_selfattn_layer_args = (
attention_heads,
output_size,
attention_dropout_rate,
key_bias,
)
# feed-forward module definition
positionwise_layer_args = (
output_size,
linear_units,
dropout_rate,
activation,
)
# convolution module definition
convolution_layer_args = (output_size, cnn_module_kernel, activation,
cnn_module_norm, causal)
self.pre_lookahead_layer = PreLookaheadLayer(channels=512, pre_lookahead_len=3)
self.encoders = torch.nn.ModuleList([
ConformerEncoderLayer(
output_size,
COSYVOICE_ATTENTION_CLASSES[selfattention_layer_type](
*encoder_selfattn_layer_args),
PositionwiseFeedForward(*positionwise_layer_args),
PositionwiseFeedForward(
*positionwise_layer_args) if macaron_style else None,
ConvolutionModule(
*convolution_layer_args) if use_cnn_module else None,
dropout_rate,
normalize_before,
) for _ in range(num_blocks)
])
self.up_layer = Upsample1D(channels=512, out_channels=512, stride=2)
self.up_embed = COSYVOICE_SUBSAMPLE_CLASSES[input_layer](
input_size,
output_size,
dropout_rate,
COSYVOICE_EMB_CLASSES[pos_enc_layer_type](output_size,
positional_dropout_rate),
)
self.up_encoders = torch.nn.ModuleList([
ConformerEncoderLayer(
output_size,
COSYVOICE_ATTENTION_CLASSES[selfattention_layer_type](
*encoder_selfattn_layer_args),
PositionwiseFeedForward(*positionwise_layer_args),
PositionwiseFeedForward(
*positionwise_layer_args) if macaron_style else None,
ConvolutionModule(
*convolution_layer_args) if use_cnn_module else None,
dropout_rate,
normalize_before,
) for _ in range(4)
])
def output_size(self) -> int:
return self._output_size
def forward(
self,
xs: torch.Tensor,
xs_lens: torch.Tensor,
decoding_chunk_size: int = 0,
num_decoding_left_chunks: int = -1,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Embed positions in tensor.
Args:
xs: padded input tensor (B, T, D)
xs_lens: input length (B)
decoding_chunk_size: decoding chunk size for dynamic chunk
0: default for training, use random dynamic chunk.
<0: for decoding, use full chunk.
>0: for decoding, use fixed chunk size as set.
num_decoding_left_chunks: number of left chunks, this is for decoding,
the chunk size is decoding_chunk_size.
>=0: use num_decoding_left_chunks
<0: use all left chunks
Returns:
encoder output tensor xs, and subsampled masks
xs: padded output tensor (B, T' ~= T/subsample_rate, D)
masks: torch.Tensor batch padding mask after subsample
(B, 1, T' ~= T/subsample_rate)
NOTE(xcsong):
We pass the `__call__` method of the modules instead of `forward` to the
checkpointing API because `__call__` attaches all the hooks of the module.
https://discuss.pytorch.org/t/any-different-between-model-input-and-model-forward-input/3690/2
"""
T = xs.size(1)
masks = ~make_pad_mask(xs_lens, T).unsqueeze(1) # (B, 1, T)
if self.global_cmvn is not None:
xs = self.global_cmvn(xs)
xs, pos_emb, masks = self.embed(xs, masks)
mask_pad = masks # (B, 1, T/subsample_rate)
chunk_masks = add_optional_chunk_mask(xs, masks,
self.use_dynamic_chunk,
self.use_dynamic_left_chunk,
decoding_chunk_size,
self.static_chunk_size,
num_decoding_left_chunks)
# lookahead + conformer encoder
xs = self.pre_lookahead_layer(xs)
xs = self.forward_layers(xs, chunk_masks, pos_emb, mask_pad)
# upsample + conformer encoder
xs = xs.transpose(1, 2).contiguous()
xs, xs_lens = self.up_layer(xs, xs_lens)
xs = xs.transpose(1, 2).contiguous()
T = xs.size(1)
masks = ~make_pad_mask(xs_lens, T).unsqueeze(1) # (B, 1, T)
xs, pos_emb, masks = self.up_embed(xs, masks)
mask_pad = masks # (B, 1, T/subsample_rate)
chunk_masks = add_optional_chunk_mask(xs, masks,
self.use_dynamic_chunk,
self.use_dynamic_left_chunk,
decoding_chunk_size,
self.static_chunk_size * self.up_layer.stride,
num_decoding_left_chunks)
xs = self.forward_up_layers(xs, chunk_masks, pos_emb, mask_pad)
if self.normalize_before:
xs = self.after_norm(xs)
# Here we assume the mask is not changed in encoder layers, so just
# return the masks before encoder layers, and the masks will be used
# for cross attention with decoder later
return xs, masks
def forward_layers(self, xs: torch.Tensor, chunk_masks: torch.Tensor,
pos_emb: torch.Tensor,
mask_pad: torch.Tensor) -> torch.Tensor:
for layer in self.encoders:
xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
return xs
def forward_up_layers(self, xs: torch.Tensor, chunk_masks: torch.Tensor,
pos_emb: torch.Tensor,
mask_pad: torch.Tensor) -> torch.Tensor:
for layer in self.up_encoders:
xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
return xs


@@ -32,6 +32,10 @@ from cosyvoice.transformer.attention import (MultiHeadedAttention,
                                             RelPositionMultiHeadedAttention)
from cosyvoice.transformer.embedding import EspnetRelPositionalEncoding
from cosyvoice.transformer.subsampling import LegacyLinearNoSubsampling
+from cosyvoice.llm.llm import TransformerLM, Qwen2LM
+from cosyvoice.flow.flow import MaskedDiffWithXvec, CausalMaskedDiffWithXvec
+from cosyvoice.hifigan.generator import HiFTGenerator
+from cosyvoice.cli.model import CosyVoiceModel, CosyVoice2Model

COSYVOICE_ACTIVATION_CLASSES = {
@@ -68,3 +72,12 @@ COSYVOICE_ATTENTION_CLASSES = {
    "selfattn": MultiHeadedAttention,
    "rel_selfattn": RelPositionMultiHeadedAttention,
}

+def get_model_type(configs):
+    # NOTE CosyVoice2Model inherits CosyVoiceModel
+    if isinstance(configs['llm'], TransformerLM) and isinstance(configs['flow'], MaskedDiffWithXvec) and isinstance(configs['hift'], HiFTGenerator):
+        return CosyVoiceModel
+    if isinstance(configs['llm'], Qwen2LM) and isinstance(configs['flow'], CausalMaskedDiffWithXvec) and isinstance(configs['hift'], HiFTGenerator):
+        return CosyVoice2Model
+    raise TypeError('No valid model type found!')
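A hedged usage sketch for get_model_type (editor's addition): it assumes the configs dict comes from a HyperPyYAML file that instantiates the llm/flow/hift modules, which is how the CosyVoice CLI builds its configs; the path below is illustrative.

from hyperpyyaml import load_hyperpyyaml

with open('pretrained_models/CosyVoice2-0.5B/cosyvoice.yaml', 'r') as f:   # illustrative config path
    configs = load_hyperpyyaml(f)
model_cls = get_model_type(configs)               # CosyVoiceModel or CosyVoice2Model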


@@ -15,8 +15,10 @@
# Modified from ESPnet(https://github.com/espnet/espnet)
"""Unility functions for Transformer."""
+import random
from typing import List
+import numpy as np
import torch

IGNORE_ID = -1
@@ -101,3 +103,64 @@ def init_weights(m, mean=0.0, std=0.01):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        m.weight.data.normal_(mean, std)
# Repetition Aware Sampling in VALL-E 2
def ras_sampling(weighted_scores, decoded_tokens, sampling, top_p=0.8, top_k=25, win_size=10, tau_r=0.1):
top_ids = nucleus_sampling(weighted_scores, top_p=top_p, top_k=top_k)
rep_num = (torch.tensor(decoded_tokens[-win_size:]).to(weighted_scores.device) == top_ids).sum().item()
if rep_num >= win_size * tau_r:
top_ids = random_sampling(weighted_scores, decoded_tokens, sampling)
return top_ids
def nucleus_sampling(weighted_scores, top_p=0.8, top_k=25):
prob, indices = [], []
cum_prob = 0.0
sorted_value, sorted_idx = weighted_scores.softmax(dim=0).sort(descending=True, stable=True)
for i in range(len(sorted_idx)):
# sampling both top-p and numbers.
if cum_prob < top_p and len(prob) < top_k:
cum_prob += sorted_value[i]
prob.append(sorted_value[i])
indices.append(sorted_idx[i])
else:
break
prob = torch.tensor(prob).to(weighted_scores)
indices = torch.tensor(indices, dtype=torch.long).to(weighted_scores.device)
top_ids = indices[prob.multinomial(1, replacement=True)]
return top_ids
def random_sampling(weighted_scores, decoded_tokens, sampling):
top_ids = weighted_scores.softmax(dim=0).multinomial(1, replacement=True)
return top_ids
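A minimal sketch of the repetition aware sampling above (editor's addition, not part of the diff): when the nucleus-sampled token already fills the recent window beyond win_size * tau_r occurrences, the guard falls back to plain random sampling over the full distribution.

torch.manual_seed(0)
scores = torch.randn(6564)        # fake per-token scores; the vocabulary size here is only illustrative
decoded = [42] * 10               # pretend token 42 was emitted 10 times in a row
top_ids = ras_sampling(scores, decoded, sampling=25)
# whenever the nucleus pick equals 42, the window check (win_size=10, tau_r=0.1) reroutes to random_sampling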
def fade_in_out(fade_in_mel, fade_out_mel, window):
device = fade_in_mel.device
fade_in_mel, fade_out_mel = fade_in_mel.cpu(), fade_out_mel.cpu()
mel_overlap_len = int(window.shape[0] / 2)
if fade_in_mel.device == torch.device('cpu'):
fade_in_mel = fade_in_mel.clone()
fade_in_mel[..., :mel_overlap_len] = fade_in_mel[..., :mel_overlap_len] * window[:mel_overlap_len] + \
fade_out_mel[..., -mel_overlap_len:] * window[mel_overlap_len:]
return fade_in_mel.to(device)
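An illustration of fade_in_out (editor's addition): the first half of window scales the head of the incoming chunk up while the second half scales the tail of the previous chunk down, cross-fading over half the window length; the Hamming window below is only an example.

window = torch.hamming_window(20)
prev_mel, next_mel = torch.randn(1, 80, 30), torch.randn(1, 80, 30)
out = fade_in_out(next_mel, prev_mel, window)     # first 10 frames of next_mel are blended with prev_mel's tail
print(out.shape)                                  # torch.Size([1, 80, 30])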
def set_all_random_seed(seed):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
def mask_to_bias(mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
assert mask.dtype == torch.bool
assert dtype in [torch.float32, torch.bfloat16, torch.float16]
mask = mask.to(dtype)
# attention mask bias
# NOTE(Mddct): torch.finfo jit issues
# chunk_masks = (1.0 - chunk_masks) * torch.finfo(dtype).min
mask = (1.0 - mask) * -1.0e+10
return mask
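A tiny check for mask_to_bias (editor's addition): True positions stay at 0.0 and False positions become a large negative bias that can be added directly to attention logits.

m = torch.tensor([[True, False, True]])
print(mask_to_bias(m, torch.float32))             # tensor([[ 0.0000e+00, -1.0000e+10,  0.0000e+00]])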


@@ -25,13 +25,14 @@ from cosyvoice.utils.train_utils import update_parameter_and_lr, log_per_step, l

class Executor:

-    def __init__(self):
+    def __init__(self, gan: bool = False):
+        self.gan = gan
        self.step = 0
        self.epoch = 0
        self.rank = int(os.environ.get('RANK', 0))
        self.device = torch.device('cuda:{}'.format(self.rank))

-    def train_one_epoc(self, model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join):
+    def train_one_epoc(self, model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, scaler, group_join):
        ''' Train one epoch
        '''
@@ -64,13 +65,72 @@ class Executor:
                    context = nullcontext

                with context():
-                    info_dict = batch_forward(model, batch_dict, info_dict)
-                    info_dict = batch_backward(model, info_dict)
+                    info_dict = batch_forward(model, batch_dict, scaler, info_dict)
+                    info_dict = batch_backward(model, scaler, info_dict)
-                    info_dict = update_parameter_and_lr(model, optimizer, scheduler, info_dict)
+                    info_dict = update_parameter_and_lr(model, optimizer, scheduler, scaler, info_dict)
                    log_per_step(writer, info_dict)
                # NOTE specify save_per_step in cosyvoice.yaml if you want to enable step save
-                if info_dict['save_per_step'] > 0 and (self.step + 1) % info_dict['save_per_step'] == 0 and (batch_idx + 1) % info_dict["accum_grad"] == 0:
+                if info_dict['save_per_step'] > 0 and (self.step + 1) % info_dict['save_per_step'] == 0 and \
+                        (batch_idx + 1) % info_dict["accum_grad"] == 0:
dist.barrier()
self.cv(model, cv_data_loader, writer, info_dict, on_batch_end=False)
model.train()
if (batch_idx + 1) % info_dict["accum_grad"] == 0:
self.step += 1
dist.barrier()
self.cv(model, cv_data_loader, writer, info_dict, on_batch_end=True)
def train_one_epoc_gan(self, model, optimizer, scheduler, optimizer_d, scheduler_d, train_data_loader, cv_data_loader,
writer, info_dict, scaler, group_join):
''' Train one epoch
'''
lr = optimizer.param_groups[0]['lr']
logging.info('Epoch {} TRAIN info lr {} rank {}'.format(self.epoch, lr, self.rank))
logging.info('using accumulate grad, new batch size is {} times'
' larger than before'.format(info_dict['accum_grad']))
# A context manager to be used in conjunction with an instance of
# torch.nn.parallel.DistributedDataParallel to be able to train
# with uneven inputs across participating processes.
model.train()
model_context = model.join if info_dict['train_engine'] == 'torch_ddp' else nullcontext
with model_context():
for batch_idx, batch_dict in enumerate(train_data_loader):
info_dict["tag"] = "TRAIN"
info_dict["step"] = self.step
info_dict["epoch"] = self.epoch
info_dict["batch_idx"] = batch_idx
if cosyvoice_join(group_join, info_dict):
break
# Disable gradient synchronizations across DDP processes.
# Within this context, gradients will be accumulated on module
# variables, which will later be synchronized.
if info_dict['train_engine'] == 'torch_ddp' and (batch_idx + 1) % info_dict["accum_grad"] != 0:
context = model.no_sync
# Used for single gpu training and DDP gradient synchronization
# processes.
else:
context = nullcontext
with context():
batch_dict['turn'] = 'discriminator'
info_dict = batch_forward(model, batch_dict, scaler, info_dict)
info_dict = batch_backward(model, scaler, info_dict)
info_dict = update_parameter_and_lr(model, optimizer_d, scheduler_d, scaler, info_dict)
optimizer.zero_grad()
log_per_step(writer, info_dict)
with context():
batch_dict['turn'] = 'generator'
info_dict = batch_forward(model, batch_dict, scaler, info_dict)
info_dict = batch_backward(model, scaler, info_dict)
info_dict = update_parameter_and_lr(model, optimizer, scheduler, scaler, info_dict)
optimizer_d.zero_grad()
log_per_step(writer, info_dict)
# NOTE specify save_per_step in cosyvoice.yaml if you want to enable step save
if info_dict['save_per_step'] > 0 and (self.step + 1) % info_dict['save_per_step'] == 0 and \
(batch_idx + 1) % info_dict["accum_grad"] == 0:
                    dist.barrier()
                    self.cv(model, cv_data_loader, writer, info_dict, on_batch_end=False)
                    model.train()
@@ -95,7 +155,9 @@ class Executor:
            num_utts = len(batch_dict["utts"])
            total_num_utts += num_utts

-            info_dict = batch_forward(model, batch_dict, info_dict)
+            if self.gan is True:
+                batch_dict['turn'] = 'generator'
+            info_dict = batch_forward(model, batch_dict, None, info_dict)

            for k, v in info_dict['loss_dict'].items():
                if k not in total_loss_dict:


@@ -1,5 +1,5 @@
# Copyright (c) 2021 Mobvoi Inc. (authors: Binbin Zhang)
-#               2024 Alibaba Inc (authors: Xiang Lyu)
+#               2024 Alibaba Inc (authors: Xiang Lyu, Zetao Hu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,6 +15,10 @@
import json
import torchaudio
+import logging
+logging.getLogger('matplotlib').setLevel(logging.WARNING)
+logging.basicConfig(level=logging.DEBUG,
+                    format='%(asctime)s %(levelname)s %(message)s')


def read_lists(list_file):
@@ -24,6 +28,7 @@ def read_lists(list_file):
        lists.append(line.strip())
    return lists


def read_json_lists(list_file):
    lists = read_lists(list_file)
    results = {}
@@ -32,22 +37,53 @@ def read_json_lists(list_file):
        results.update(json.load(fin))
    return results


def load_wav(wav, target_sr):
-    speech, sample_rate = torchaudio.load(wav)
+    speech, sample_rate = torchaudio.load(wav, backend='soundfile')
    speech = speech.mean(dim=0, keepdim=True)
    if sample_rate != target_sr:
        assert sample_rate > target_sr, 'wav sample rate {} must be greater than {}'.format(sample_rate, target_sr)
        speech = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sr)(speech)
    return speech


-def speed_change(waveform, sample_rate, speed_factor: str):
-    effects = [
-        ["tempo", speed_factor],  # speed_factor
-        ["rate", f"{sample_rate}"]
-    ]
-    augmented_waveform, new_sample_rate = torchaudio.sox_effects.apply_effects_tensor(
-        waveform,
-        sample_rate,
-        effects
-    )
-    return augmented_waveform, new_sample_rate
+def convert_onnx_to_trt(trt_model, onnx_model, fp16):
+    import tensorrt as trt
+    _min_shape = [(2, 80, 4), (2, 1, 4), (2, 80, 4), (2,), (2, 80), (2, 80, 4)]
+    _opt_shape = [(2, 80, 193), (2, 1, 193), (2, 80, 193), (2,), (2, 80), (2, 80, 193)]
+    _max_shape = [(2, 80, 6800), (2, 1, 6800), (2, 80, 6800), (2,), (2, 80), (2, 80, 6800)]
+    input_names = ["x", "mask", "mu", "t", "spks", "cond"]
+
+    logging.info("Converting onnx to trt...")
+    network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+    logger = trt.Logger(trt.Logger.INFO)
+    builder = trt.Builder(logger)
+    network = builder.create_network(network_flags)
+    parser = trt.OnnxParser(network, logger)
+    config = builder.create_builder_config()
+    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 33)  # 8GB
+    if fp16:
+        config.set_flag(trt.BuilderFlag.FP16)
+    profile = builder.create_optimization_profile()
+    # load onnx model
+    with open(onnx_model, "rb") as f:
+        if not parser.parse(f.read()):
+            for error in range(parser.num_errors):
+                print(parser.get_error(error))
+            raise ValueError('failed to parse {}'.format(onnx_model))
+    # set input shapes
+    for i in range(len(input_names)):
+        profile.set_shape(input_names[i], _min_shape[i], _opt_shape[i], _max_shape[i])
+    tensor_dtype = trt.DataType.HALF if fp16 else trt.DataType.FLOAT
+    # set input and output data type
+    for i in range(network.num_inputs):
+        input_tensor = network.get_input(i)
+        input_tensor.dtype = tensor_dtype
+    for i in range(network.num_outputs):
+        output_tensor = network.get_output(i)
+        output_tensor.dtype = tensor_dtype
+    config.add_optimization_profile(profile)
+    engine_bytes = builder.build_serialized_network(network, config)
+    # save trt engine
+    with open(trt_model, "wb") as f:
+        f.write(engine_bytes)


@@ -13,8 +13,10 @@
# limitations under the License.
import re
+import regex

chinese_char_pattern = re.compile(r'[\u4e00-\u9fff]+')

# whether contain chinese character
def contains_chinese(text):
    return bool(chinese_char_pattern.search(text))
@@ -79,6 +81,13 @@ def split_paragraph(text: str, tokenize, lang="zh", token_max_n=80, token_min_n=
    pounc = ['.', '?', '!', ';', ':']
    if comma_split:
        pounc.extend(['，', ','])
+    if text[-1] not in pounc:
+        if lang == "zh":
+            text += "。"
+        else:
+            text += "."
    st = 0
    utts = []
    for i, c in enumerate(text):
@@ -91,11 +100,7 @@ def split_paragraph(text: str, tokenize, lang="zh", token_max_n=80, token_min_n=
                st = i + 2
            else:
                st = i + 1
-    if len(utts) == 0:
-        if lang == "zh":
-            utts.append(text + '。')
-        else:
-            utts.append(text + '.')
    final_utts = []
    cur_utt = ""
    for utt in utts:
@@ -123,3 +128,9 @@ def replace_blank(text: str):
        else:
            out_str.append(c)
    return "".join(out_str)


+def is_only_punctuation(text):
+    # Regular expression: Match strings that consist only of punctuation marks or are empty.
+    punctuation_pattern = r'^[\p{P}\p{S}]*$'
+    return bool(regex.fullmatch(punctuation_pattern, text))
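A short example of is_only_punctuation (editor's addition), presumably used to drop text chunks with nothing speakable in them:

print(is_only_punctuation('...!?'))   # True
print(is_only_punctuation(''))        # True, the pattern also matches the empty string
print(is_only_punctuation('hi.'))     # False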

cosyvoice/utils/losses.py

@@ -0,0 +1,20 @@
import torch
import torch.nn.functional as F
def tpr_loss(disc_real_outputs, disc_generated_outputs, tau):
loss = 0
for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
m_DG = torch.median((dr - dg))
L_rel = torch.mean((((dr - dg) - m_DG) ** 2)[dr < dg + m_DG])
loss += tau - F.relu(tau - L_rel)
return loss
def mel_loss(real_speech, generated_speech, mel_transforms):
loss = 0
for transform in mel_transforms:
mel_r = transform(real_speech)
mel_g = transform(generated_speech)
loss += F.l1_loss(mel_g, mel_r)
return loss
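A hedged sketch of driving mel_loss (editor's addition, not part of the diff); the MelSpectrogram settings are illustrative and not the project's actual GAN training configuration.

import torchaudio

transforms = [torchaudio.transforms.MelSpectrogram(sample_rate=24000, n_fft=n, hop_length=n // 4, n_mels=80)
              for n in (512, 1024)]               # illustrative multi-resolution mel transforms
real, fake = torch.randn(2, 24000), torch.randn(2, 24000)
print(mel_loss(real, fake, transforms))           # scalar: L1 distance summed over the resolutions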


@@ -15,6 +15,7 @@
# limitations under the License.
import torch
+from cosyvoice.utils.file_utils import logging

'''
def subsequent_mask(
        size: int,
@@ -86,7 +87,7 @@ def subsequent_mask(
    return mask


-def subsequent_chunk_mask(
+def subsequent_chunk_mask_deprecated(
        size: int,
        chunk_size: int,
        num_left_chunks: int = -1,
@@ -124,6 +125,41 @@ def subsequent_chunk_mask(
    return ret
def subsequent_chunk_mask(
size: int,
chunk_size: int,
num_left_chunks: int = -1,
device: torch.device = torch.device("cpu"),
) -> torch.Tensor:
"""Create mask for subsequent steps (size, size) with chunk size,
this is for streaming encoder
Args:
size (int): size of mask
chunk_size (int): size of chunk
num_left_chunks (int): number of left chunks
<0: use full chunk
>=0: use num_left_chunks
device (torch.device): "cpu" or "cuda" or torch.Tensor.device
Returns:
torch.Tensor: mask
Examples:
>>> subsequent_chunk_mask(4, 2)
[[1, 1, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 1],
[1, 1, 1, 1]]
"""
# NOTE this modified implementation meets onnx export requirements, but it doesn't support num_left_chunks
# actually this is not needed after we have inference cache implemented, will remove it later
pos_idx = torch.arange(size, device=device)
block_value = (torch.div(pos_idx, chunk_size, rounding_mode='trunc') + 1) * chunk_size
ret = pos_idx.unsqueeze(0) < block_value.unsqueeze(1)
return ret
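A quick check (editor's addition, not part of the diff) that the rewritten mask reproduces the docstring example:

print(subsequent_chunk_mask(4, 2).int())
# tensor([[1, 1, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 1],
#         [1, 1, 1, 1]], dtype=torch.int32)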
def add_optional_chunk_mask(xs: torch.Tensor,
                            masks: torch.Tensor,
                            use_dynamic_chunk: bool,
@@ -195,6 +231,10 @@ def add_optional_chunk_mask(xs: torch.Tensor,
        chunk_masks = masks & chunk_masks  # (B, L, L)
    else:
        chunk_masks = masks
+    assert chunk_masks.dtype == torch.bool
+    if (chunk_masks.sum(dim=-1) == 0).sum().item() != 0:
+        logging.warning('get chunk_masks all false at some timestep, force set to true, make sure they are masked in future computation!')
+        chunk_masks[chunk_masks.sum(dim=-1) == 0] = True
    return chunk_masks


@@ -567,8 +567,7 @@ class NoamAnnealing(_LRScheduler):
                 min_lr=0.0,
                 last_epoch=-1):
        self._normalize = d_model**(-0.5)
-        assert not (warmup_steps is not None
-                    and warmup_ratio is not None), \
+        assert not (warmup_steps is not None and warmup_ratio is not None), \
            "Either use particular number of step or ratio"
        assert warmup_ratio is None or max_steps is not None, \
            "If there is a ratio, there should be a total steps"


@@ -14,7 +14,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import torch
@@ -51,9 +50,10 @@ def init_distributed(args):
    return world_size, local_rank, rank


def init_dataset_and_dataloader(args, configs, gan):
    data_pipeline = configs['data_pipeline_gan'] if gan is True else configs['data_pipeline']
    train_dataset = Dataset(args.train_data, data_pipeline=data_pipeline, mode='train', gan=gan, shuffle=True, partition=True)
    cv_dataset = Dataset(args.cv_data, data_pipeline=data_pipeline, mode='train', gan=gan, shuffle=False, partition=False)

    # do not use persistent_workers=True, as whisper tokenizer opens tiktoken file each time when the for loop starts
    train_data_loader = DataLoader(train_dataset,
@@ -69,7 +69,6 @@ def init_dataset_and_dataloader(args, configs):
    return train_dataset, cv_dataset, train_data_loader, cv_data_loader


def check_modify_and_save_config(args, configs):
    if args.train_engine == "torch_ddp":
        configs['train_conf']["dtype"] = 'fp32'
@@ -84,7 +83,8 @@ def check_modify_and_save_config(args, configs):
            configs['train_conf']["dtype"] = "fp32"
        assert ds_configs["train_micro_batch_size_per_gpu"] == 1
        # if use deepspeed, override ddp config
        configs['train_conf']['save_per_step'] = int(configs['train_conf']['save_per_step'] *
                                                      configs['train_conf']['accum_grad'] / ds_configs["gradient_accumulation_steps"])
        configs['train_conf']['accum_grad'] = ds_configs["gradient_accumulation_steps"]
        configs['train_conf']['grad_clip'] = ds_configs["gradient_clipping"]
        configs['train_conf']['log_interval'] = ds_configs["steps_per_print"]
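The deepspeed branch rescales save_per_step so the checkpoint cadence stays consistent once deepspeed's gradient_accumulation_steps replaces the config's accum_grad. A worked example with made-up numbers:

```python
# Hypothetical numbers, only to illustrate the override above.
save_per_step = 2000   # value from train_conf (defined relative to train_conf's accum_grad)
accum_grad = 2         # train_conf['accum_grad']
ds_accum_grad = 4      # ds_configs["gradient_accumulation_steps"], which wins under deepspeed

# same rescaling as in the diff
save_per_step = int(save_per_step * accum_grad / ds_accum_grad)
print(save_per_step)   # 1000
```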
@@ -108,38 +108,80 @@ def wrap_cuda_model(args, model):
    return model


def init_optimizer_and_scheduler(args, configs, model, gan):
    if gan is False:
        if configs['train_conf']['optim'] == 'adam':
            optimizer = optim.Adam(model.parameters(), **configs['train_conf']['optim_conf'])
        elif configs['train_conf']['optim'] == 'adamw':
            optimizer = optim.AdamW(model.parameters(), **configs['train_conf']['optim_conf'])
        else:
            raise ValueError("unknown optimizer: " + configs['train_conf'])

        if configs['train_conf']['scheduler'] == 'warmuplr':
            scheduler_type = WarmupLR
            scheduler = WarmupLR(optimizer, **configs['train_conf']['scheduler_conf'])
        elif configs['train_conf']['scheduler'] == 'NoamHoldAnnealing':
            scheduler_type = NoamHoldAnnealing
            scheduler = NoamHoldAnnealing(optimizer, **configs['train_conf']['scheduler_conf'])
        elif configs['train_conf']['scheduler'] == 'constantlr':
            scheduler_type = ConstantLR
            scheduler = ConstantLR(optimizer)
        else:
            raise ValueError("unknown scheduler: " + configs['train_conf'])

        # use deepspeed optimizer for speedup
        if args.train_engine == "deepspeed":
            def scheduler(opt):
                return scheduler_type(opt, **configs['train_conf']['scheduler_conf'])
            model, optimizer, _, scheduler = deepspeed.initialize(
                args=args,
                model=model,
                optimizer=None,
                lr_scheduler=scheduler,
                model_parameters=model.parameters())

        optimizer_d, scheduler_d = None, None
    else:
        # currently we wrap generator and discriminator in one model, so we cannot use deepspeed
        if configs['train_conf']['optim'] == 'adam':
            optimizer = optim.Adam(model.module.generator.parameters(), **configs['train_conf']['optim_conf'])
        elif configs['train_conf']['optim'] == 'adamw':
            optimizer = optim.AdamW(model.module.generator.parameters(), **configs['train_conf']['optim_conf'])
        else:
            raise ValueError("unknown optimizer: " + configs['train_conf'])

        if configs['train_conf']['scheduler'] == 'warmuplr':
            scheduler_type = WarmupLR
            scheduler = WarmupLR(optimizer, **configs['train_conf']['scheduler_conf'])
        elif configs['train_conf']['scheduler'] == 'NoamHoldAnnealing':
            scheduler_type = NoamHoldAnnealing
            scheduler = NoamHoldAnnealing(optimizer, **configs['train_conf']['scheduler_conf'])
        elif configs['train_conf']['scheduler'] == 'constantlr':
            scheduler_type = ConstantLR
            scheduler = ConstantLR(optimizer)
        else:
            raise ValueError("unknown scheduler: " + configs['train_conf'])

        if configs['train_conf']['optim_d'] == 'adam':
            optimizer_d = optim.Adam(model.module.discriminator.parameters(), **configs['train_conf']['optim_conf'])
        elif configs['train_conf']['optim_d'] == 'adamw':
            optimizer_d = optim.AdamW(model.module.discriminator.parameters(), **configs['train_conf']['optim_conf'])
        else:
            raise ValueError("unknown optimizer: " + configs['train_conf'])

        if configs['train_conf']['scheduler_d'] == 'warmuplr':
            scheduler_type = WarmupLR
            scheduler_d = WarmupLR(optimizer_d, **configs['train_conf']['scheduler_conf'])
        elif configs['train_conf']['scheduler_d'] == 'NoamHoldAnnealing':
            scheduler_type = NoamHoldAnnealing
            scheduler_d = NoamHoldAnnealing(optimizer_d, **configs['train_conf']['scheduler_conf'])
        elif configs['train_conf']['scheduler'] == 'constantlr':
            scheduler_type = ConstantLR
            scheduler_d = ConstantLR(optimizer_d)
        else:
            raise ValueError("unknown scheduler: " + configs['train_conf'])
    return model, optimizer, scheduler, optimizer_d, scheduler_d
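The GAN branch returns a second optimizer/scheduler pair for the discriminator. The GAN executor itself is not part of this hunk; the toy below (not the repo's code, with stand-in nn.Linear modules and LSGAN-style losses) only illustrates why two optimizers are needed, alternating a discriminator update and a generator update:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# toy stand-ins; in the repo the generator/discriminator are HiFTGenerator / MultipleDiscriminator
generator = nn.Linear(16, 16)
discriminator = nn.Linear(16, 1)
optimizer = optim.Adam(generator.parameters(), lr=2e-4)
optimizer_d = optim.Adam(discriminator.parameters(), lr=2e-4)

real = torch.randn(8, 16)
fake = generator(torch.randn(8, 16))

# discriminator step: push real scores towards 1, fake scores towards 0 (fake is detached)
loss_d = torch.mean((discriminator(real) - 1) ** 2) + torch.mean(discriminator(fake.detach()) ** 2)
optimizer_d.zero_grad()
loss_d.backward()
optimizer_d.step()

# generator step: make a fresh fake look real to the (now fixed) discriminator
loss_g = torch.mean((discriminator(generator(torch.randn(8, 16))) - 1) ** 2)
optimizer.zero_grad()
loss_g.backward()
optimizer.step()
```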
def init_summarywriter(args):
@@ -157,7 +199,7 @@ def save_model(model, model_name, info_dict):
    if info_dict["train_engine"] == "torch_ddp":
        if rank == 0:
            torch.save({**model.module.state_dict(), 'epoch': info_dict['epoch'], 'step': info_dict['step']}, save_model_path)
    else:
        with torch.no_grad():
            model.save_checkpoint(save_dir=model_dir,
@@ -193,7 +235,7 @@ def cosyvoice_join(group_join, info_dict):
    return False


def batch_forward(model, batch, scaler, info_dict):
    device = int(os.environ.get('LOCAL_RANK', 0))
    dtype = info_dict["dtype"]
@@ -205,7 +247,7 @@ def batch_forward(model, batch, info_dict):
        dtype = torch.float32
    if info_dict['train_engine'] == 'torch_ddp':
        autocast = torch.cuda.amp.autocast(enabled=scaler is not None)
    else:
        autocast = torch.cuda.amp.autocast(enabled=True, dtype=dtype, cache_enabled=False)
@@ -214,27 +256,41 @@ def batch_forward(model, batch, info_dict):
    return info_dict


def batch_backward(model, scaler, info_dict):
    if info_dict["train_engine"] == "deepspeed":
        scaled_loss = model.backward(info_dict['loss_dict']['loss'])
    else:
        scaled_loss = info_dict['loss_dict']['loss'] / info_dict['accum_grad']
        if scaler is not None:
            scaler.scale(scaled_loss).backward()
        else:
            scaled_loss.backward()
    info_dict['loss_dict']['loss'] = scaled_loss
    return info_dict


def update_parameter_and_lr(model, optimizer, scheduler, scaler, info_dict):
    grad_norm = 0.0
    if info_dict['train_engine'] == "deepspeed":
        info_dict["is_gradient_accumulation_boundary"] = model.is_gradient_accumulation_boundary()
        model.step()
        grad_norm = model.get_global_grad_norm()
    elif (info_dict['batch_idx'] + 1) % info_dict["accum_grad"] == 0:
        # Use mixed precision training
        if scaler is not None:
            scaler.unscale_(optimizer)
            grad_norm = clip_grad_norm_(model.parameters(), info_dict['grad_clip'])
            # We don't strictly need to check the gradient here: if it contains
            # inf/nan values, scaler.step will skip optimizer.step() on its own.
            if torch.isfinite(grad_norm):
                scaler.step(optimizer)
            scaler.update()
        else:
            grad_norm = clip_grad_norm_(model.parameters(), info_dict['grad_clip'])
            if torch.isfinite(grad_norm):
                optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
    info_dict["lr"] = optimizer.param_groups[0]['lr']

docker/Dockerfile (new file, 51 lines)
View File

@@ -0,0 +1,51 @@
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
ARG VENV_NAME="cosyvoice"
ENV VENV=$VENV_NAME
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
SHELL ["/bin/bash", "--login", "-c"]
RUN apt-get update -y --fix-missing
RUN apt-get install -y git build-essential curl wget ffmpeg unzip git-lfs sox libsox-dev && \
    apt-get clean && \
    git lfs install
# ==================================================================
# conda install and conda forge channel as default
# ------------------------------------------------------------------
# Install miniforge
RUN wget --quiet https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O ~/miniforge.sh && \
/bin/bash ~/miniforge.sh -b -p /opt/conda && \
rm ~/miniforge.sh && \
ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
echo "source /opt/conda/etc/profile.d/conda.sh" >> /opt/nvidia/entrypoint.d/100.conda.sh && \
echo "source /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
echo "conda activate ${VENV}" >> /opt/nvidia/entrypoint.d/110.conda_default_env.sh && \
echo "conda activate ${VENV}" >> $HOME/.bashrc
ENV PATH /opt/conda/bin:$PATH
RUN conda config --add channels conda-forge && \
conda config --set channel_priority strict
# ------------------------------------------------------------------
# ~conda
# ==================================================================
RUN conda create -y -n ${VENV} python=3.10
ENV CONDA_DEFAULT_ENV=${VENV}
ENV PATH /opt/conda/bin:/opt/conda/envs/${VENV}/bin:$PATH
WORKDIR /workspace
ENV PYTHONPATH="${PYTHONPATH}:/workspace/CosyVoice:/workspace/CosyVoice/third_party/Matcha-TTS"
RUN git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
RUN conda activate ${VENV} && conda install -y -c conda-forge pynini==2.1.5
RUN conda activate ${VENV} && cd CosyVoice && \
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
WORKDIR /workspace/CosyVoice

View File

@@ -18,7 +18,7 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
text_encoder_input_size: !ref <text_encoder_input_size>
llm_input_size: !ref <llm_input_size>
llm_output_size: !ref <llm_output_size>
text_token_size: 51866 # change to 60515 if you want to train with CosyVoice-300M-25Hz recipe
speech_token_size: 4096
length_normalized_loss: True
lsm_weight: 0
@@ -31,7 +31,7 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
num_blocks: 3
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
normalize_before: True
input_layer: 'linear'
pos_enc_layer_type: 'rel_pos_espnet'
@@ -49,11 +49,16 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
num_blocks: 7
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: 'linear_legacy'
pos_enc_layer_type: 'rel_pos_espnet'
selfattention_layer_type: 'rel_selfattn'
static_chunk_size: 1
sampling: !name:cosyvoice.utils.common.ras_sampling
top_p: 0.8
top_k: 25
win_size: 10
tau_r: 0.1

flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
input_size: 512
@@ -61,7 +66,7 @@ flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
spk_embed_dim: !ref <spk_embed_dim>
output_type: 'mel'
vocab_size: 4096
input_frame_rate: 50 # change to 25 if you want to train with CosyVoice-300M-25Hz recipe
only_mask_loss: True
encoder: !new:cosyvoice.transformer.encoder.ConformerEncoder
output_size: 512
@@ -97,7 +102,7 @@ flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
in_channels: 320
out_channels: 80
channels: [256, 256]
dropout: 0.0
attention_head_dim: 64
n_blocks: 4
num_mid_blocks: 8
@@ -128,9 +133,28 @@ hift: !new:cosyvoice.hifigan.generator.HiFTGenerator
in_channels: 80
cond_channels: 512

# gan related module
mel_spec_transform1: !name:matcha.utils.audio.mel_spectrogram
n_fft: 1024
num_mels: 80
sampling_rate: !ref <sample_rate>
hop_size: 256
win_size: 1024
fmin: 0
fmax: null
center: False
hifigan: !new:cosyvoice.hifigan.hifigan.HiFiGan
generator: !ref <hift>
discriminator: !new:cosyvoice.hifigan.discriminator.MultipleDiscriminator
mpd: !new:matcha.hifigan.models.MultiPeriodDiscriminator
mrd: !new:cosyvoice.hifigan.discriminator.MultiResolutionDiscriminator
mel_spec_transform: [
!ref <mel_spec_transform1>
]

# processor functions
parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
get_tokenizer: !name:whisper.tokenizer.get_tokenizer # change to !name:cosyvoice.tokenizer.tokenizer.get_tokenizer if you want to train with CosyVoice-300M-25Hz recipe
multilingual: True
num_languages: 100
language: 'en'
@@ -146,6 +170,8 @@ filter: !name:cosyvoice.dataset.processor.filter
token_min_length: 1
resample: !name:cosyvoice.dataset.processor.resample
resample_rate: !ref <sample_rate>
truncate: !name:cosyvoice.dataset.processor.truncate
truncate_length: 24576 # must be a multiple of hop_size
feat_extractor: !name:matcha.utils.audio.mel_spectrogram
n_fft: 1024
num_mels: 80
@@ -157,6 +183,9 @@ feat_extractor: !name:matcha.utils.audio.mel_spectrogram
center: False
compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
feat_extractor: !ref <feat_extractor>
compute_f0: !name:cosyvoice.dataset.processor.compute_f0
sample_rate: !ref <sample_rate>
hop_size: 256
parse_embedding: !name:cosyvoice.dataset.processor.parse_embedding
normalize: True
shuffle: !name:cosyvoice.dataset.processor.shuffle
@@ -182,8 +211,22 @@ data_pipeline: [
!ref <batch>,
!ref <padding>,
]
data_pipeline_gan: [
!ref <parquet_opener>,
!ref <tokenize>,
!ref <filter>,
!ref <resample>,
!ref <truncate>,
!ref <compute_fbank>,
!ref <compute_f0>,
!ref <parse_embedding>,
!ref <shuffle>,
!ref <sort>,
!ref <batch>,
!ref <padding>,
]

# llm flow train conf
train_conf:
optim: adam
optim_conf:
@@ -195,4 +238,20 @@ train_conf:
grad_clip: 5
accum_grad: 2
log_interval: 100
save_per_step: -1

# gan train conf
train_conf_gan:
optim: adam
optim_conf:
lr: 0.0002 # use small lr for gan training
scheduler: constantlr
optim_d: adam
optim_conf_d:
lr: 0.0002 # use small lr for gan training
scheduler_d: constantlr
max_epoch: 200
grad_clip: 5
accum_grad: 1 # in gan training, accum_grad must be 1
log_interval: 100
save_per_step: -1
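These recipe files are hyperpyyaml documents: `!new:` instantiates classes, `!name:` binds callables, and `!ref` cross-references other keys. A hedged sketch of how a training entry point could resolve them and pick the GAN-specific pipeline (the config path is illustrative, and loading requires the repo's dependencies to be importable):

```python
from hyperpyyaml import load_hyperpyyaml

gan = True                                       # e.g. when training the hifigan module
with open('conf/cosyvoice.yaml', 'r') as f:      # illustrative path
    configs = load_hyperpyyaml(f)

# !new: entries come back as instantiated objects, !name: entries as callables
data_pipeline = configs['data_pipeline_gan'] if gan else configs['data_pipeline']
train_conf = configs['train_conf_gan'] if gan else configs['train_conf']
print(type(configs['hifigan']).__name__, len(data_pipeline), train_conf['accum_grad'])
```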

View File

@@ -18,7 +18,7 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
text_encoder_input_size: !ref <text_encoder_input_size>
llm_input_size: !ref <llm_input_size>
llm_output_size: !ref <llm_output_size>
text_token_size: 51866 # change to 60515 if you want to train with CosyVoice-300M-25Hz recipe
speech_token_size: 4096
length_normalized_loss: True
lsm_weight: 0
@@ -31,7 +31,7 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
normalize_before: True
input_layer: 'linear'
pos_enc_layer_type: 'rel_pos_espnet'
@@ -49,11 +49,16 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
num_blocks: 14
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: 'linear_legacy'
pos_enc_layer_type: 'rel_pos_espnet'
selfattention_layer_type: 'rel_selfattn'
static_chunk_size: 1
sampling: !name:cosyvoice.utils.common.ras_sampling
top_p: 0.8
top_k: 25
win_size: 10
tau_r: 0.1

flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
input_size: 512
@@ -61,7 +66,7 @@ flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
spk_embed_dim: !ref <spk_embed_dim>
output_type: 'mel'
vocab_size: 4096
input_frame_rate: 50 # change to 25 if you want to train with CosyVoice-300M-25Hz recipe
only_mask_loss: True
encoder: !new:cosyvoice.transformer.encoder.ConformerEncoder
output_size: 512
@@ -97,7 +102,7 @@ flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
in_channels: 320
out_channels: 80
channels: [256, 256]
dropout: 0.0
attention_head_dim: 64
n_blocks: 4
num_mid_blocks: 12
@@ -128,9 +133,28 @@ hift: !new:cosyvoice.hifigan.generator.HiFTGenerator
in_channels: 80
cond_channels: 512

# gan related module
mel_spec_transform1: !name:matcha.utils.audio.mel_spectrogram
n_fft: 1024
num_mels: 80
sampling_rate: !ref <sample_rate>
hop_size: 256
win_size: 1024
fmin: 0
fmax: null
center: False
hifigan: !new:cosyvoice.hifigan.hifigan.HiFiGan
generator: !ref <hift>
discriminator: !new:cosyvoice.hifigan.discriminator.MultipleDiscriminator
mpd: !new:matcha.hifigan.models.MultiPeriodDiscriminator
mrd: !new:cosyvoice.hifigan.discriminator.MultiResolutionDiscriminator
mel_spec_transform: [
!ref <mel_spec_transform1>
]

# processor functions
parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
get_tokenizer: !name:whisper.tokenizer.get_tokenizer # change to !name:cosyvoice.tokenizer.tokenizer.get_tokenizer if you want to train with CosyVoice-300M-25Hz recipe
multilingual: True
num_languages: 100
language: 'en'
@@ -146,6 +170,8 @@ filter: !name:cosyvoice.dataset.processor.filter
token_min_length: 1
resample: !name:cosyvoice.dataset.processor.resample
resample_rate: !ref <sample_rate>
truncate: !name:cosyvoice.dataset.processor.truncate
truncate_length: 24576 # must be a multiple of hop_size
feat_extractor: !name:matcha.utils.audio.mel_spectrogram
n_fft: 1024
num_mels: 80
@@ -157,6 +183,9 @@ feat_extractor: !name:matcha.utils.audio.mel_spectrogram
center: False
compute_fbank: !name:cosyvoice.dataset.processor.compute_fbank
feat_extractor: !ref <feat_extractor>
compute_f0: !name:cosyvoice.dataset.processor.compute_f0
sample_rate: !ref <sample_rate>
hop_size: 256
parse_embedding: !name:cosyvoice.dataset.processor.parse_embedding
normalize: True
shuffle: !name:cosyvoice.dataset.processor.shuffle
@@ -165,7 +194,7 @@ sort: !name:cosyvoice.dataset.processor.sort
sort_size: 500 # sort_size should be less than shuffle_size
batch: !name:cosyvoice.dataset.processor.batch
batch_type: 'dynamic'
max_frames_in_batch: 2000 # change to 1400 in gan train on v100 16g
padding: !name:cosyvoice.dataset.processor.padding
use_spk_embedding: False # change to True during sft
@@ -182,8 +211,22 @@ data_pipeline: [
!ref <batch>,
!ref <padding>,
]
data_pipeline_gan: [
!ref <parquet_opener>,
!ref <tokenize>,
!ref <filter>,
!ref <resample>,
!ref <truncate>,
!ref <compute_fbank>,
!ref <compute_f0>,
!ref <parse_embedding>,
!ref <shuffle>,
!ref <sort>,
!ref <batch>,
!ref <padding>,
]

# llm flow train conf
train_conf:
optim: adam
optim_conf:
@@ -195,4 +238,20 @@ train_conf:
grad_clip: 5
accum_grad: 2
log_interval: 100
save_per_step: -1

# gan train conf
train_conf_gan:
optim: adam
optim_conf:
lr: 0.0002 # use small lr for gan training
scheduler: constantlr
optim_d: adam
optim_conf_d:
lr: 0.0002 # use small lr for gan training
scheduler_d: constantlr
max_epoch: 200
grad_clip: 5
accum_grad: 1 # in gan training, accum_grad must be 1
log_interval: 100
save_per_step: -1

View File

@@ -7,6 +7,7 @@ from tqdm import tqdm
logger = logging.getLogger()


def main():
    wavs = list(glob.glob('{}/*/*/*wav'.format(args.src_dir)))
@@ -41,6 +42,7 @@ def main():
        f.write('{} {}\n'.format(k, ' '.join(v)))
    return


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--src_dir',

View File

@@ -83,9 +83,9 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
  fi
  cat data/{train-clean-100,train-clean-360,train-other-500}/parquet/data.list > data/train.data.list
  cat data/{dev-clean,dev-other}/parquet/data.list > data/dev.data.list
  for model in llm flow hifigan; do
    torchrun --nnodes=1 --nproc_per_node=$num_gpus \
      --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
      cosyvoice/bin/train.py \
      --train_engine $train_engine \
      --config conf/cosyvoice.yaml \
@@ -99,7 +99,28 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
      --num_workers ${num_workers} \
      --prefetch ${prefetch} \
      --pin_memory \
      --use_amp \
      --deepspeed_config ./conf/ds_stage2.json \
      --deepspeed.save_states model+optimizer
  done
fi

# average model
average_num=5
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
  for model in llm flow hifigan; do
    decode_checkpoint=`pwd`/exp/cosyvoice/$model/$train_engine/${model}.pt
    echo "do model average and final checkpoint is $decode_checkpoint"
    python cosyvoice/bin/average_model.py \
      --dst_model $decode_checkpoint \
      --src_path `pwd`/exp/cosyvoice/$model/$train_engine \
      --num ${average_num} \
      --val_best
  done
fi

if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
  echo "Export your model for inference speedup. Remember to copy your llm or flow model to model_dir first."
  python cosyvoice/bin/export_jit.py --model_dir $pretrained_model_dir
  python cosyvoice/bin/export_onnx.py --model_dir $pretrained_model_dir
fi
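Stage 6 averages the best N checkpoints into a single model file. Not the repo's average_model.py, but a minimal sketch of the underlying idea (file names are illustrative; non-tensor entries such as the new 'epoch'/'step' fields are skipped):

```python
import torch

ckpt_paths = ['epoch_7.pt', 'epoch_8.pt', 'epoch_9.pt']   # illustrative file names
avg, n = None, len(ckpt_paths)
for p in ckpt_paths:
    state = torch.load(p, map_location='cpu')
    tensors = {k: v for k, v in state.items() if torch.is_tensor(v)}
    if avg is None:
        avg = {k: v.clone().float() for k, v in tensors.items()}
    else:
        for k in avg:
            avg[k] += tensors[k].float()
for k in avg:
    avg[k] /= n
torch.save(avg, 'llm.pt')   # the averaged checkpoint used for decoding/export
```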

View File

@@ -0,0 +1 @@
../../../cosyvoice

View File

@@ -0,0 +1 @@
../../../tools

View File

@@ -18,7 +18,7 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
text_encoder_input_size: !ref <text_encoder_input_size>
llm_input_size: !ref <llm_input_size>
llm_output_size: !ref <llm_output_size>
text_token_size: 51866 # change to 60515 if you want to train with CosyVoice-300M-25Hz recipe
speech_token_size: 4096
length_normalized_loss: True
lsm_weight: 0
@@ -54,6 +54,11 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
pos_enc_layer_type: 'rel_pos_espnet'
selfattention_layer_type: 'rel_selfattn'
static_chunk_size: 1
sampling: !name:cosyvoice.utils.common.ras_sampling
top_p: 0.8
top_k: 25
win_size: 10
tau_r: 0.1

flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
input_size: 512
@@ -61,7 +66,7 @@ flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
spk_embed_dim: !ref <spk_embed_dim>
output_type: 'mel'
vocab_size: 4096
input_frame_rate: 50 # change to 25 if you want to train with CosyVoice-300M-25Hz recipe
only_mask_loss: True
encoder: !new:cosyvoice.transformer.encoder.ConformerEncoder
output_size: 512
@@ -130,7 +135,7 @@ hift: !new:cosyvoice.hifigan.generator.HiFTGenerator
# processor functions
parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
get_tokenizer: !name:whisper.tokenizer.get_tokenizer # change to !name:cosyvoice.tokenizer.tokenizer.get_tokenizer if you want to train with CosyVoice-300M-25Hz recipe
multilingual: True
num_languages: 100
language: 'en'

View File

@@ -18,7 +18,7 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
text_encoder_input_size: !ref <text_encoder_input_size>
llm_input_size: !ref <llm_input_size>
llm_output_size: !ref <llm_output_size>
text_token_size: 51866 # change to 60515 if you want to train with CosyVoice-300M-25Hz recipe
speech_token_size: 4096
length_normalized_loss: True
lsm_weight: 0
@@ -54,6 +54,11 @@ llm: !new:cosyvoice.llm.llm.TransformerLM
pos_enc_layer_type: 'rel_pos_espnet'
selfattention_layer_type: 'rel_selfattn'
static_chunk_size: 1
sampling: !name:cosyvoice.utils.common.ras_sampling
top_p: 0.8
top_k: 25
win_size: 10
tau_r: 0.1

flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
input_size: 512
@@ -61,7 +66,7 @@ flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec
spk_embed_dim: !ref <spk_embed_dim>
output_type: 'mel'
vocab_size: 4096
input_frame_rate: 50 # change to 25 if you want to train with CosyVoice-300M-25Hz recipe
only_mask_loss: True
encoder: !new:cosyvoice.transformer.encoder.ConformerEncoder
output_size: 512
@@ -130,7 +135,7 @@ hift: !new:cosyvoice.hifigan.generator.HiFTGenerator
# processor functions
parquet_opener: !name:cosyvoice.dataset.processor.parquet_opener
get_tokenizer: !name:whisper.tokenizer.get_tokenizer # change to !name:cosyvoice.tokenizer.tokenizer.get_tokenizer if you want to train with CosyVoice-300M-25Hz recipe
multilingual: True
num_languages: 100
language: 'en'

View File

@@ -6,6 +6,7 @@ from tqdm import tqdm
logger = logging.getLogger()


def main():
    utt2wav, utt2text, utt2spk, spk2utt = {}, {}, {}, {}
    with open(os.path.join(args.src_dir, "TRANS.txt"), "r") as f:
@@ -40,6 +41,7 @@ def main():
        f.write('{} {}\n'.format(k, ' '.join(v)))
    return


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--src_dir',

View File

@@ -83,7 +83,7 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
  fi
  cp data/train/parquet/data.list data/train.data.list
  cp data/dev/parquet/data.list data/dev.data.list
  for model in llm flow; do
    torchrun --nnodes=1 --nproc_per_node=$num_gpus \
      --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
      cosyvoice/bin/train.py \
@@ -102,4 +102,10 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
      --deepspeed_config ./conf/ds_stage2.json \
      --deepspeed.save_states model+optimizer
  done
fi

if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
  echo "Export your model for inference speedup. Remember to copy your llm or flow model to model_dir first."
  python cosyvoice/bin/export_jit.py --model_dir $pretrained_model_dir
  python cosyvoice/bin/export_onnx.py --model_dir $pretrained_model_dir
fi

View File

@@ -1,9 +1,10 @@
--extra-index-url https://download.pytorch.org/whl/cu121
--extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ # https://github.com/microsoft/onnxruntime/issues/21684
conformer==0.3.2
deepspeed==0.14.2; sys_platform == 'linux'
diffusers==0.29.0
gdown==5.1.0
gradio==5.4.0
grpcio==1.57.0
grpcio-tools==1.57.0
hydra-core==1.3.2
@@ -15,17 +16,24 @@ matplotlib==3.7.5
modelscope==1.15.0
networkx==3.1
omegaconf==2.3.0
onnx==1.16.0
onnxruntime-gpu==1.18.0; sys_platform == 'linux'
onnxruntime==1.18.0; sys_platform == 'darwin' or sys_platform == 'windows'
openai-whisper==20231117
protobuf==4.25
pydantic==2.7.0
pyworld==0.3.4
rich==13.7.1
soundfile==0.12.1
tensorboard==2.14.0
tensorrt-cu12==10.0.1; sys_platform == 'linux'
tensorrt-cu12-bindings==10.0.1; sys_platform == 'linux'
tensorrt-cu12-libs==10.0.1; sys_platform == 'linux'
torch==2.4.0
torchaudio==2.4.0
transformers==4.40.1
uvicorn==0.30.0
wget==3.2
fastapi==0.115.6
fastapi-cli==0.0.4
WeTextProcessing==1.0.3

View File

@@ -1,56 +1,69 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import requests
import torch
import torchaudio
import numpy as np
def main():
    url = "http://{}:{}/inference_{}".format(args.host, args.port, args.mode)
    if args.mode == 'sft':
        payload = {
            'tts_text': args.tts_text,
            'spk_id': args.spk_id
        }
        response = requests.request("GET", url, data=payload, stream=True)
    elif args.mode == 'zero_shot':
        payload = {
            'tts_text': args.tts_text,
            'prompt_text': args.prompt_text
        }
        files = [('prompt_wav', ('prompt_wav', open(args.prompt_wav, 'rb'), 'application/octet-stream'))]
        response = requests.request("GET", url, data=payload, files=files, stream=True)
    elif args.mode == 'cross_lingual':
        payload = {
            'tts_text': args.tts_text,
        }
        files = [('prompt_wav', ('prompt_wav', open(args.prompt_wav, 'rb'), 'application/octet-stream'))]
        response = requests.request("GET", url, data=payload, files=files, stream=True)
    else:
        payload = {
            'tts_text': args.tts_text,
            'spk_id': args.spk_id,
            'instruct_text': args.instruct_text
        }
        response = requests.request("GET", url, data=payload, stream=True)
    tts_audio = b''
    for r in response.iter_content(chunk_size=16000):
        tts_audio += r
    tts_speech = torch.from_numpy(np.array(np.frombuffer(tts_audio, dtype=np.int16))).unsqueeze(dim=0)
    logging.info('save response to {}'.format(args.tts_wav))
    torchaudio.save(args.tts_wav, tts_speech, target_sr)
    logging.info('get response')
if __name__ == "__main__": if __name__ == "__main__":
parser = argparse.ArgumentParser() parser = argparse.ArgumentParser()
parser.add_argument('--api_base', parser.add_argument('--host',
type=str, type=str,
default='http://127.0.0.1:6006') default='0.0.0.0')
parser.add_argument('--port',
type=int,
default='50000')
parser.add_argument('--mode', parser.add_argument('--mode',
default='sft', default='sft',
choices=['sft', 'zero_shot', 'cross_lingual', 'instruct'], choices=['sft', 'zero_shot', 'cross_lingual', 'instruct'],
@@ -66,10 +79,11 @@ if __name__ == "__main__":
                        default='希望你以后能够做的比我还好呦。')
    parser.add_argument('--prompt_wav',
                        type=str,
                        default='../../../asset/zero_shot_prompt.wav')
    parser.add_argument('--instruct_text',
                        type=str,
                        default='Theo \'Crimson\', is a fiery, passionate rebel leader. \
                                 Fights with fervor for justice, but struggles with impulsiveness.')
    parser.add_argument('--tts_wav',
                        type=str,
                        default='demo.wav')

View File

@@ -1,119 +1,101 @@
# Copyright (c) 2024 Alibaba Inc (authors: Xiang Lyu)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import argparse
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
from fastapi import FastAPI, UploadFile, Form, File
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
import uvicorn
import numpy as np
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../../..'.format(ROOT_DIR))
sys.path.append('{}/../../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav

app = FastAPI()
# set cross region allowance
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"])


def generate_data(model_output):
    for i in model_output:
        tts_audio = (i['tts_speech'].numpy() * (2 ** 15)).astype(np.int16).tobytes()
        yield tts_audio


@app.get("/inference_sft")
@app.post("/inference_sft")
async def inference_sft(tts_text: str = Form(), spk_id: str = Form()):
    model_output = cosyvoice.inference_sft(tts_text, spk_id)
    return StreamingResponse(generate_data(model_output))


@app.get("/inference_zero_shot")
@app.post("/inference_zero_shot")
async def inference_zero_shot(tts_text: str = Form(), prompt_text: str = Form(), prompt_wav: UploadFile = File()):
    prompt_speech_16k = load_wav(prompt_wav.file, 16000)
    model_output = cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k)
    return StreamingResponse(generate_data(model_output))


@app.get("/inference_cross_lingual")
@app.post("/inference_cross_lingual")
async def inference_cross_lingual(tts_text: str = Form(), prompt_wav: UploadFile = File()):
    prompt_speech_16k = load_wav(prompt_wav.file, 16000)
    model_output = cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k)
    return StreamingResponse(generate_data(model_output))


@app.get("/inference_instruct")
@app.post("/inference_instruct")
async def inference_instruct(tts_text: str = Form(), spk_id: str = Form(), instruct_text: str = Form()):
    model_output = cosyvoice.inference_instruct(tts_text, spk_id, instruct_text)
    return StreamingResponse(generate_data(model_output))


@app.get("/inference_instruct2")
@app.post("/inference_instruct2")
async def inference_instruct2(tts_text: str = Form(), instruct_text: str = Form(), prompt_wav: UploadFile = File()):
    prompt_speech_16k = load_wav(prompt_wav.file, 16000)
    model_output = cosyvoice.inference_instruct2(tts_text, instruct_text, prompt_speech_16k)
    return StreamingResponse(generate_data(model_output))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--port',
                        type=int,
                        default=50000)
    parser.add_argument('--model_dir',
                        type=str,
                        default='iic/CosyVoice-300M',
                        help='local path or modelscope repo id')
    args = parser.parse_args()
    try:
        cosyvoice = CosyVoice(args.model_dir)
    except Exception:
        try:
            cosyvoice = CosyVoice2(args.model_dir)
        except Exception:
            raise TypeError('no valid model_type!')
    uvicorn.run(app, host="0.0.0.0", port=args.port)
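The endpoints stream raw 16-bit little-endian PCM with no WAV header; the sample rate depends on the loaded model (22050 Hz for CosyVoice-300M, 24000 Hz for CosyVoice2 — treat these as assumptions and check your model). A minimal consumer, not from the repo, that wraps the stream in a WAV container:

```python
import wave
import requests

payload = {'tts_text': '你好，欢迎使用。', 'spk_id': '中文女'}   # spk_id must exist in the loaded SFT model
resp = requests.post('http://127.0.0.1:50000/inference_sft', data=payload, stream=True)

with wave.open('demo.wav', 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # int16 samples
    w.setframerate(22050)      # assumed CosyVoice-300M rate; adjust for your model
    for chunk in resp.iter_content(chunk_size=16000):
        w.writeframes(chunk)
```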

View File

@@ -61,8 +61,11 @@ def main():
        request.instruct_request.CopyFrom(instruct_request)

    response = stub.Inference(request)
    tts_audio = b''
    for r in response:
        tts_audio += r.tts_audio
    tts_speech = torch.from_numpy(np.array(np.frombuffer(tts_audio, dtype=np.int16))).unsqueeze(dim=0)
    logging.info('save response to {}'.format(args.tts_wav))
    torchaudio.save(args.tts_wav, tts_speech, target_sr)
    logging.info('get response')
@@ -90,10 +93,11 @@ if __name__ == "__main__":
                        default='希望你以后能够做的比我还好呦。')
    parser.add_argument('--prompt_wav',
                        type=str,
                        default='../../../asset/zero_shot_prompt.wav')
    parser.add_argument('--instruct_text',
                        type=str,
                        default='Theo \'Crimson\', is a fiery, passionate rebel leader. \
                                 Fights with fervor for justice, but struggles with impulsiveness.')
    parser.add_argument('--tts_wav',
                        type=str,
                        default='demo.wav')

View File

@@ -4,7 +4,7 @@ package cosyvoice;
option go_package = "protos/";

service CosyVoice{
    rpc Inference(Request) returns (stream Response) {}
}

message Request{
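Because Inference now returns `stream Response`, the generated Python stubs must be regenerated or an older client will still expect a unary reply. A hedged sketch using grpcio-tools (the proto file name and output paths are taken to be the repo's, run from the proto's directory):

```python
from grpc_tools import protoc

# equivalent to: python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. cosyvoice.proto
protoc.main((
    '',                      # argv[0] placeholder, ignored by protoc
    '-I.',
    '--python_out=.',
    '--grpc_python_out=.',
    'cosyvoice.proto',
))
```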

View File

@@ -13,9 +13,6 @@
# limitations under the License.
import os
import sys
from concurrent import futures
import argparse
import cosyvoice_pb2
@@ -25,14 +22,24 @@ logging.getLogger('matplotlib').setLevel(logging.WARNING)
import grpc
import torch
import numpy as np
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append('{}/../../..'.format(ROOT_DIR))
sys.path.append('{}/../../../third_party/Matcha-TTS'.format(ROOT_DIR))
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(message)s')


class CosyVoiceServiceImpl(cosyvoice_pb2_grpc.CosyVoiceServicer):
    def __init__(self, args):
        try:
            self.cosyvoice = CosyVoice(args.model_dir)
        except Exception:
            try:
                self.cosyvoice = CosyVoice2(args.model_dir)
            except Exception:
                raise TypeError('no valid model_type!')
        logging.info('grpc service initialized')

    def Inference(self, request, context):
@@ -43,7 +50,9 @@ class CosyVoiceServiceImpl(cosyvoice_pb2_grpc.CosyVoiceServicer):
            logging.info('get zero_shot inference request')
            prompt_speech_16k = torch.from_numpy(np.array(np.frombuffer(request.zero_shot_request.prompt_audio, dtype=np.int16))).unsqueeze(dim=0)
            prompt_speech_16k = prompt_speech_16k.float() / (2**15)
            model_output = self.cosyvoice.inference_zero_shot(request.zero_shot_request.tts_text,
                                                               request.zero_shot_request.prompt_text,
                                                               prompt_speech_16k)
        elif request.HasField('cross_lingual_request'):
            logging.info('get cross_lingual inference request')
            prompt_speech_16k = torch.from_numpy(np.array(np.frombuffer(request.cross_lingual_request.prompt_audio, dtype=np.int16))).unsqueeze(dim=0)
@@ -51,12 +60,16 @@ class CosyVoiceServiceImpl(cosyvoice_pb2_grpc.CosyVoiceServicer):
            model_output = self.cosyvoice.inference_cross_lingual(request.cross_lingual_request.tts_text, prompt_speech_16k)
        else:
            logging.info('get instruct inference request')
            model_output = self.cosyvoice.inference_instruct(request.instruct_request.tts_text,
                                                              request.instruct_request.spk_id,
                                                              request.instruct_request.instruct_text)
        logging.info('send inference response')
        for i in model_output:
            response = cosyvoice_pb2.Response()
            response.tts_audio = (i['tts_speech'].numpy() * (2 ** 15)).astype(np.int16).tobytes()
            yield response


def main():
    grpcServer = grpc.server(futures.ThreadPoolExecutor(max_workers=args.max_conc), maximum_concurrent_rpcs=args.max_conc)

View File

@@ -13,14 +13,50 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from concurrent.futures import ThreadPoolExecutor, as_completed
import onnxruntime
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi
from tqdm import tqdm


def single_job(utt):
    audio, sample_rate = torchaudio.load(utt2wav[utt])
    if sample_rate != 16000:
        audio = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio)
    feat = kaldi.fbank(audio,
                       num_mel_bins=80,
                       dither=0,
                       sample_frequency=16000)
    feat = feat - feat.mean(dim=0, keepdim=True)
    embedding = ort_session.run(None, {ort_session.get_inputs()[0].name: feat.unsqueeze(dim=0).cpu().numpy()})[0].flatten().tolist()
    return utt, embedding


def main(args):
    all_task = [executor.submit(single_job, utt) for utt in utt2wav.keys()]
    utt2embedding, spk2embedding = {}, {}
    for future in tqdm(as_completed(all_task)):
        utt, embedding = future.result()
        utt2embedding[utt] = embedding
        spk = utt2spk[utt]
        if spk not in spk2embedding:
            spk2embedding[spk] = []
        spk2embedding[spk].append(embedding)
    for k, v in spk2embedding.items():
        spk2embedding[k] = torch.tensor(v).mean(dim=0).tolist()
    torch.save(utt2embedding, "{}/utt2embedding.pt".format(args.dir))
    torch.save(spk2embedding, "{}/spk2embedding.pt".format(args.dir))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dir", type=str)
    parser.add_argument("--onnx_path", type=str)
    parser.add_argument("--num_thread", type=int, default=8)
    args = parser.parse_args()

    utt2wav, utt2spk = {}, {}
    with open('{}/wav.scp'.format(args.dir)) as f:
        for l in f:
@@ -36,34 +72,6 @@ def main(args):
    option.intra_op_num_threads = 1
    providers = ["CPUExecutionProvider"]
    ort_session = onnxruntime.InferenceSession(args.onnx_path, sess_options=option, providers=providers)
    executor = ThreadPoolExecutor(max_workers=args.num_thread)

    main(args)

View File

@@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
+from concurrent.futures import ThreadPoolExecutor, as_completed
import logging
import torch
from tqdm import tqdm
@@ -22,7 +23,39 @@ import torchaudio
import whisper
+
+
+def single_job(utt):
+    audio, sample_rate = torchaudio.load(utt2wav[utt], backend='soundfile')
+    if sample_rate != 16000:
+        audio = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio)
+    # Convert audio to mono
+    if audio.shape[0] > 1:
+        audio = audio.mean(dim=0, keepdim=True)
+    if audio.shape[1] / 16000 > 30:
+        logging.warning('do not support extract speech token for audio longer than 30s')
+        speech_token = []
+    else:
+        feat = whisper.log_mel_spectrogram(audio, n_mels=128)
+        speech_token = ort_session.run(None, {ort_session.get_inputs()[0].name: feat.detach().cpu().numpy(),
+                                              ort_session.get_inputs()[1].name: np.array([feat.shape[2]], dtype=np.int32)})[0].flatten().tolist()
+    return utt, speech_token


def main(args):
+    all_task = [executor.submit(single_job, utt) for utt in utt2wav.keys()]
+    utt2speech_token = {}
+    for future in tqdm(as_completed(all_task)):
+        utt, speech_token = future.result()
+        utt2speech_token[utt] = speech_token
+    torch.save(utt2speech_token, '{}/utt2speech_token.pt'.format(args.dir))
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--dir", type=str)
+    parser.add_argument("--onnx_path", type=str)
+    parser.add_argument("--num_thread", type=int, default=8)
+    args = parser.parse_args()
    utt2wav = {}
    with open('{}/wav.scp'.format(args.dir)) as f:
        for l in f:
@@ -34,28 +67,6 @@ def main(args):
    option.intra_op_num_threads = 1
    providers = ["CUDAExecutionProvider"]
    ort_session = onnxruntime.InferenceSession(args.onnx_path, sess_options=option, providers=providers)
+    executor = ThreadPoolExecutor(max_workers=args.num_thread)
-    utt2speech_token = {}
-    for utt in tqdm(utt2wav.keys()):
-        audio, sample_rate = torchaudio.load(utt2wav[utt])
-        if sample_rate != 16000:
-            audio = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio)
-        if audio.shape[1] / 16000 > 30:
-            logging.warning('do not support extract speech token for audio longer than 30s')
-            speech_token = []
-        else:
-            feat = whisper.log_mel_spectrogram(audio, n_mels=128)
-            speech_token = ort_session.run(None, {ort_session.get_inputs()[0].name: feat.detach().cpu().numpy(),
-                                                  ort_session.get_inputs()[1].name: np.array([feat.shape[2]], dtype=np.int32)})[0].flatten().tolist()
-        utt2speech_token[utt] = speech_token
-    torch.save(utt2speech_token, '{}/utt2speech_token.pt'.format(args.dir))
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--dir',
-                        type=str)
-    parser.add_argument('--onnx_path',
-                        type=str)
-    args = parser.parse_args()
    main(args)
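Besides the same thread-pool refactor, this file now downmixes multi-channel audio to mono before computing the Whisper log-mel features, and it still skips clips longer than 30 s. A small sketch of just that preprocessing, using a synthetic waveform instead of a real file; tensor shapes follow torchaudio's (channels, samples) convention.

import torch
import torchaudio

# Synthetic 2-channel, 3-second clip at 44.1 kHz standing in for torchaudio.load(...)
sample_rate = 44100
audio = torch.randn(2, sample_rate * 3)

if sample_rate != 16000:
    audio = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio)

# Downmix to mono: average over the channel dimension, keep a (1, samples) shape
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)

duration = audio.shape[1] / 16000
if duration > 30:
    print('clip longer than 30s, the script would store an empty token list')
else:
    print('ok to tokenize, duration = {:.2f}s, shape = {}'.format(duration, tuple(audio.shape)))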

tools/make_parquet_list.py

@@ -53,6 +53,7 @@ def job(utt_list, parquet_file, utt2parquet_file, spk2parquet_file):
        json.dump({k: parquet_file for k in list(set(spk_list))}, f, ensure_ascii=False, indent=2)
    logging.info('spend time {}'.format(time.time() - start_time))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_utts_per_parquet',

webui.py

@@ -13,9 +13,6 @@
# limitations under the License.
import os
import sys
-ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
-sys.path.append('{}/third_party/Matcha-TTS'.format(ROOT_DIR))
import argparse
import gradio as gr
import numpy as np
@@ -23,15 +20,20 @@ import torch
import torchaudio
import random
import librosa
+ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
+sys.path.append('{}/third_party/Matcha-TTS'.format(ROOT_DIR))
+from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
+from cosyvoice.utils.file_utils import load_wav, logging
+from cosyvoice.utils.common import set_all_random_seed
-import logging
-logging.getLogger('matplotlib').setLevel(logging.WARNING)
-from cosyvoice.cli.cosyvoice import CosyVoice
-from cosyvoice.utils.file_utils import load_wav, speed_change
-logging.basicConfig(level=logging.DEBUG,
-                    format='%(asctime)s %(levelname)s %(message)s')
+inference_mode_list = ['预训练音色', '3s极速复刻', '跨语种复刻', '自然语言控制']
+instruct_dict = {'预训练音色': '1. 选择预训练音色\n2. 点击生成音频按钮',
+                 '3s极速复刻': '1. 选择prompt音频文件或录入prompt音频注意不超过30s若同时提供优先选择prompt音频文件\n2. 输入prompt文本\n3. 点击生成音频按钮',
+                 '跨语种复刻': '1. 选择prompt音频文件或录入prompt音频注意不超过30s若同时提供优先选择prompt音频文件\n2. 点击生成音频按钮',
+                 '自然语言控制': '1. 选择预训练音色\n2. 输入instruct文本\n3. 点击生成音频按钮'}
+stream_mode_list = [('否', False), ('是', True)]
+max_val = 0.8


def generate_seed():
    seed = random.randint(1, 100000000)
@@ -40,13 +42,7 @@ def generate_seed():
"value": seed "value": seed
} }
def set_all_random_seed(seed):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
max_val = 0.8
def postprocess(speech, top_db=60, hop_length=220, win_length=440): def postprocess(speech, top_db=60, hop_length=220, win_length=440):
speech, _ = librosa.effects.trim( speech, _ = librosa.effects.trim(
speech, top_db=top_db, speech, top_db=top_db,
@@ -55,18 +51,16 @@ def postprocess(speech, top_db=60, hop_length=220, win_length=440):
    )
    if speech.abs().max() > max_val:
        speech = speech / speech.abs().max() * max_val
-    speech = torch.concat([speech, torch.zeros(1, int(target_sr * 0.2))], dim=1)
+    speech = torch.concat([speech, torch.zeros(1, int(cosyvoice.sample_rate * 0.2))], dim=1)
    return speech
-inference_mode_list = ['预训练音色', '3s极速复刻', '跨语种复刻', '自然语言控制']
-instruct_dict = {'预训练音色': '1. 选择预训练音色\n2. 点击生成音频按钮',
-                 '3s极速复刻': '1. 选择prompt音频文件或录入prompt音频注意不超过30s若同时提供优先选择prompt音频文件\n2. 输入prompt文本\n3. 点击生成音频按钮',
-                 '跨语种复刻': '1. 选择prompt音频文件或录入prompt音频注意不超过30s若同时提供优先选择prompt音频文件\n2. 点击生成音频按钮',
-                 '自然语言控制': '1. 选择预训练音色\n2. 输入instruct文本\n3. 点击生成音频按钮'}


def change_instruction(mode_checkbox_group):
    return instruct_dict[mode_checkbox_group]


-def generate_audio(tts_text, mode_checkbox_group, sft_dropdown, prompt_text, prompt_wav_upload, prompt_wav_record, instruct_text, seed, speed_factor):
+def generate_audio(tts_text, mode_checkbox_group, sft_dropdown, prompt_text, prompt_wav_upload, prompt_wav_record, instruct_text,
+                   seed, stream, speed):
    if prompt_wav_upload is not None:
        prompt_wav = prompt_wav_upload
    elif prompt_wav_record is not None:
@@ -75,86 +69,87 @@ def generate_audio(tts_text, mode_checkbox_group, sft_dropdown, prompt_text, pro
        prompt_wav = None
    # if instruct mode, please make sure that model is iic/CosyVoice-300M-Instruct and not cross_lingual mode
    if mode_checkbox_group in ['自然语言控制']:
-        if cosyvoice.frontend.instruct is False:
+        if cosyvoice.instruct is False:
            gr.Warning('您正在使用自然语言控制模式, {}模型不支持此模式, 请使用iic/CosyVoice-300M-Instruct模型'.format(args.model_dir))
-            return (target_sr, default_data)
+            yield (cosyvoice.sample_rate, default_data)
        if instruct_text == '':
            gr.Warning('您正在使用自然语言控制模式, 请输入instruct文本')
-            return (target_sr, default_data)
+            yield (cosyvoice.sample_rate, default_data)
        if prompt_wav is not None or prompt_text != '':
            gr.Info('您正在使用自然语言控制模式, prompt音频/prompt文本会被忽略')
    # if cross_lingual mode, please make sure that model is iic/CosyVoice-300M and tts_text prompt_text are different language
    if mode_checkbox_group in ['跨语种复刻']:
-        if cosyvoice.frontend.instruct is True:
+        if cosyvoice.instruct is True:
            gr.Warning('您正在使用跨语种复刻模式, {}模型不支持此模式, 请使用iic/CosyVoice-300M模型'.format(args.model_dir))
-            return (target_sr, default_data)
+            yield (cosyvoice.sample_rate, default_data)
        if instruct_text != '':
            gr.Info('您正在使用跨语种复刻模式, instruct文本会被忽略')
        if prompt_wav is None:
            gr.Warning('您正在使用跨语种复刻模式, 请提供prompt音频')
-            return (target_sr, default_data)
+            yield (cosyvoice.sample_rate, default_data)
        gr.Info('您正在使用跨语种复刻模式, 请确保合成文本和prompt文本为不同语言')
    # if in zero_shot cross_lingual, please make sure that prompt_text and prompt_wav meets requirements
    if mode_checkbox_group in ['3s极速复刻', '跨语种复刻']:
        if prompt_wav is None:
            gr.Warning('prompt音频为空您是否忘记输入prompt音频')
-            return (target_sr, default_data)
+            yield (cosyvoice.sample_rate, default_data)
        if torchaudio.info(prompt_wav).sample_rate < prompt_sr:
            gr.Warning('prompt音频采样率{}低于{}'.format(torchaudio.info(prompt_wav).sample_rate, prompt_sr))
-            return (target_sr, default_data)
+            yield (cosyvoice.sample_rate, default_data)
    # sft mode only use sft_dropdown
    if mode_checkbox_group in ['预训练音色']:
        if instruct_text != '' or prompt_wav is not None or prompt_text != '':
            gr.Info('您正在使用预训练音色模式prompt文本/prompt音频/instruct文本会被忽略')
+        if sft_dropdown == '':
+            gr.Warning('没有可用的预训练音色!')
+            yield (cosyvoice.sample_rate, default_data)
    # zero_shot mode only use prompt_wav prompt text
    if mode_checkbox_group in ['3s极速复刻']:
        if prompt_text == '':
            gr.Warning('prompt文本为空您是否忘记输入prompt文本')
-            return (target_sr, default_data)
+            yield (cosyvoice.sample_rate, default_data)
        if instruct_text != '':
            gr.Info('您正在使用3s极速复刻模式预训练音色/instruct文本会被忽略')

    if mode_checkbox_group == '预训练音色':
        logging.info('get sft inference request')
        set_all_random_seed(seed)
-        output = cosyvoice.inference_sft(tts_text, sft_dropdown)
+        for i in cosyvoice.inference_sft(tts_text, sft_dropdown, stream=stream, speed=speed):
+            yield (cosyvoice.sample_rate, i['tts_speech'].numpy().flatten())
    elif mode_checkbox_group == '3s极速复刻':
        logging.info('get zero_shot inference request')
        prompt_speech_16k = postprocess(load_wav(prompt_wav, prompt_sr))
        set_all_random_seed(seed)
-        output = cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k)
+        for i in cosyvoice.inference_zero_shot(tts_text, prompt_text, prompt_speech_16k, stream=stream, speed=speed):
+            yield (cosyvoice.sample_rate, i['tts_speech'].numpy().flatten())
    elif mode_checkbox_group == '跨语种复刻':
        logging.info('get cross_lingual inference request')
        prompt_speech_16k = postprocess(load_wav(prompt_wav, prompt_sr))
        set_all_random_seed(seed)
-        output = cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k)
+        for i in cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=stream, speed=speed):
+            yield (cosyvoice.sample_rate, i['tts_speech'].numpy().flatten())
    else:
        logging.info('get instruct inference request')
        set_all_random_seed(seed)
-        output = cosyvoice.inference_instruct(tts_text, sft_dropdown, instruct_text)
+        for i in cosyvoice.inference_instruct(tts_text, sft_dropdown, instruct_text, stream=stream, speed=speed):
+            yield (cosyvoice.sample_rate, i['tts_speech'].numpy().flatten())
-    if speed_factor != 1.0:
-        try:
-            audio_data, sample_rate = speed_change(output["tts_speech"], target_sr, str(speed_factor))
-            audio_data = audio_data.numpy().flatten()
-        except Exception as e:
-            print(f"Failed to change speed of audio: \n{e}")
-    else:
-        audio_data = output['tts_speech'].numpy().flatten()
-    return (target_sr, audio_data)


def main():
    with gr.Blocks() as demo:
-        gr.Markdown("### 代码库 [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) 预训练模型 [CosyVoice-300M](https://www.modelscope.cn/models/iic/CosyVoice-300M) [CosyVoice-300M-Instruct](https://www.modelscope.cn/models/iic/CosyVoice-300M-Instruct) [CosyVoice-300M-SFT](https://www.modelscope.cn/models/iic/CosyVoice-300M-SFT)")
+        gr.Markdown("### 代码库 [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) \
+                    预训练模型 [CosyVoice-300M](https://www.modelscope.cn/models/iic/CosyVoice-300M) \
+                    [CosyVoice-300M-Instruct](https://www.modelscope.cn/models/iic/CosyVoice-300M-Instruct) \
+                    [CosyVoice-300M-SFT](https://www.modelscope.cn/models/iic/CosyVoice-300M-SFT)")
        gr.Markdown("#### 请输入需要合成的文本,选择推理模式,并按照提示步骤进行操作")
        tts_text = gr.Textbox(label="输入合成文本", lines=1, value="我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力。")
-        speed_factor = gr.Slider(minimum=0.25, maximum=4, step=0.05, label="语速调节", value=1.0, interactive=True)
        with gr.Row():
            mode_checkbox_group = gr.Radio(choices=inference_mode_list, label='选择推理模式', value=inference_mode_list[0])
            instruction_text = gr.Text(label="操作步骤", value=instruct_dict[inference_mode_list[0]], scale=0.5)
            sft_dropdown = gr.Dropdown(choices=sft_spk, label='选择预训练音色', value=sft_spk[0], scale=0.25)
+            stream = gr.Radio(choices=stream_mode_list, label='是否流式推理', value=stream_mode_list[0][1])
+            speed = gr.Number(value=1, label="速度调节(仅支持非流式推理)", minimum=0.5, maximum=2.0, step=0.1)
            with gr.Column(scale=0.25):
                seed_button = gr.Button(value="\U0001F3B2")
                seed = gr.Number(value=0, label="随机推理种子")
@@ -167,16 +162,18 @@ def main():
        generate_button = gr.Button("生成音频")
-        audio_output = gr.Audio(label="合成音频")
+        audio_output = gr.Audio(label="合成音频", autoplay=True, streaming=True)
        seed_button.click(generate_seed, inputs=[], outputs=seed)
        generate_button.click(generate_audio,
-                              inputs=[tts_text, mode_checkbox_group, sft_dropdown, prompt_text, prompt_wav_upload, prompt_wav_record, instruct_text, seed, speed_factor],
+                              inputs=[tts_text, mode_checkbox_group, sft_dropdown, prompt_text, prompt_wav_upload, prompt_wav_record, instruct_text,
+                                      seed, stream, speed],
                              outputs=[audio_output])
        mode_checkbox_group.change(fn=change_instruction, inputs=[mode_checkbox_group], outputs=[instruction_text])
    demo.queue(max_size=4, default_concurrency_limit=2)
    demo.launch(server_name='0.0.0.0', server_port=args.port)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--port',
@@ -184,11 +181,20 @@ if __name__ == '__main__':
                        default=8000)
    parser.add_argument('--model_dir',
                        type=str,
-                        default='iic/CosyVoice-300M',
+                        default='pretrained_models/CosyVoice2-0.5B',
                        help='local path or modelscope repo id')
    args = parser.parse_args()
-    cosyvoice = CosyVoice(args.model_dir)
-    sft_spk = cosyvoice.list_avaliable_spks()
-    prompt_sr, target_sr = 16000, 22050
-    default_data = np.zeros(target_sr)
+    try:
+        cosyvoice = CosyVoice(args.model_dir)
+    except Exception:
+        try:
+            cosyvoice = CosyVoice2(args.model_dir)
+        except Exception:
+            raise TypeError('no valid model_type!')
+    sft_spk = cosyvoice.list_available_spks()
+    if len(sft_spk) == 0:
+        sft_spk = ['']
+    prompt_sr = 16000
+    default_data = np.zeros(cosyvoice.sample_rate)
    main()
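The webui change turns generate_audio into a generator: each synthesized chunk is yielded as a (sample_rate, numpy_array) tuple, and the output gr.Audio component is created with streaming=True and autoplay=True so chunks start playing as they arrive. A stripped-down sketch of that wiring follows; fake_tts and its sine-wave chunks are purely illustrative stand-ins for the cosyvoice.inference_*(..., stream=True) generators.

import numpy as np
import gradio as gr


def fake_tts(text, sr=22050):
    # Dummy "synthesizer": yield one 0.5 s sine-wave chunk per input character,
    # each as a (sample_rate, waveform) tuple, the shape Gradio expects for streaming audio.
    t = np.arange(int(sr * 0.5)) / sr
    for i, _ in enumerate(text):
        freq = 220 + 20 * (i % 10)
        yield (sr, (0.3 * np.sin(2 * np.pi * freq * t)).astype(np.float32))


with gr.Blocks() as demo:
    text = gr.Textbox(label="text")
    audio = gr.Audio(label="output", streaming=True, autoplay=True)
    gr.Button("generate").click(fake_tts, inputs=[text], outputs=[audio])

if __name__ == '__main__':
    demo.queue().launch()

Because the handler is a generator, non-streaming inference still works: it simply yields a single full-length chunk, which is why the same callback serves both positions of the stream radio button.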