🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter
This is the official code implementation of 🍵 Matcha-TTS.
We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. Our method:
- Is probabilistic
- Has a compact memory footprint
- Sounds highly natural
- Is very fast to synthesise from
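As background on the training objective: conditional flow matching regresses a vector field along simple probability paths between Gaussian noise and data. The sketch below illustrates the straight-line (OT-CFM) interpolation and its regression target; the function name and shapes are our own illustration, not code from this repository.

```python
import numpy as np

def cfm_training_pair(x1, sigma_min=1e-4, rng=None):
    """Sample an (input, target) pair for conditional flow matching.

    x1: a data sample (e.g. a mel-spectrogram patch).
    Returns a random time t, the interpolated point x_t, and the
    vector-field regression target u_t along a straight (OT) path.
    """
    rng = rng or np.random.default_rng(0)
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    t = rng.uniform()                   # random time in [0, 1]
    # Straight-line conditional path from noise to data:
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Constant target vector field along that path:
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

# A text-conditioned network v(x_t, t, text) is trained with an
# MSE loss against u_t; at synthesis time its ODE is integrated
# from noise to a mel-spectrogram.
```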
Check out our demo page. Read our arXiv preprint for more details.
Installation
- Create an environment (suggested but optional)
conda create -n matcha_tts python=3.10 -y
conda activate matcha_tts
- Install Matcha-TTS from source using pip (we plan to publish it on PyPI in the future)
pip install git+https://github.com/shivammehta25/Matcha-TTS.git
- Run CLI / gradio app / jupyter notebook
# This will download the required models
matcha-tts --text "<INPUT TEXT>"
or
matcha-tts-app
or open synthesis.ipynb in Jupyter Notebook
CLI Arguments
- To synthesise from given text, run:
matcha-tts --text "<INPUT TEXT>"
- To synthesise from a file, run:
matcha-tts --file <PATH TO FILE>
- To batch synthesise from a file, run:
matcha-tts --file <PATH TO FILE> --batched
Additional arguments
- Speaking rate
matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.0
- Sampling temperature
matcha-tts --text "<INPUT TEXT>" --temperature 0.667
- Euler ODE solver steps
matcha-tts --text "<INPUT TEXT>" --steps 10
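To give intuition for the last two flags: synthesis integrates an ODE with a fixed-step Euler solver, where the step count trades speed for accuracy and the temperature scales the initial Gaussian noise. A toy Euler integrator (our own illustration, not the repository's solver) shows the step-count trade-off on a field with a known solution:

```python
import numpy as np

def euler_solve(v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps fixed Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# Toy linear field with known solution x(1) = x0 * exp(-1);
# the Euler error shrinks roughly like O(1 / n_steps):
v = lambda x, t: -x
x0 = np.array([1.0])  # in TTS, x0 would be temperature * Gaussian noise
errors = [abs(euler_solve(v, x0, n)[0] - np.exp(-1)) for n in (2, 10, 100)]
```

More steps give a more accurate trajectory at proportionally higher synthesis cost, which is why a small fixed step count (e.g. 10) is the speed/quality sweet spot.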
Citation information
If you find this work useful, please cite our paper:
@article{mehta2023matcha,
  title={Matcha-TTS: A fast TTS architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  journal={arXiv preprint arXiv:2309.03199},
  year={2023}
}
Train with your own dataset
Let's assume we are training with LJSpeech
- Download the dataset from here, extract it to data/LJSpeech-1.1, and prepare the filelists to point to the extracted data, as in point 5 of the setup in the Tacotron 2 repo.
- Clone and enter this repository
git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
- Install the package from source
pip install -e .
- Go to configs/data/ljspeech.yaml and change
train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
to the paths of your train and validation filelists.
- Generate normalisation statistics with the yaml file of dataset configuration
matcha-data-stats -i ljspeech.yaml
# Output:
# {'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}
Update these values in configs/data/ljspeech.yaml under the data_statistics key.
data_statistics: # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101
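For intuition, mel_mean and mel_std are simply the global mean and standard deviation pooled over every mel-spectrogram value in the training set (used to normalise the decoder targets). A hedged sketch of that computation (matcha-data-stats is the supported tool; the function name here is our own):

```python
import numpy as np

def mel_statistics(mels):
    """Global mean/std over a list of mel-spectrogram arrays.

    Spectrograms may differ in length, so all values are pooled
    into one flat vector before computing the statistics.
    """
    values = np.concatenate([m.ravel() for m in mels])
    return {"mel_mean": float(values.mean()), "mel_std": float(values.std())}
```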
- Run the training script
make train-ljspeech
or
python matcha/train.py experiment=ljspeech
- For a minimum-memory run
python matcha/train.py experiment=ljspeech_min_memory
- For multi-GPU training, run
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
Acknowledgements
Since this code builds on the Lightning-Hydra-Template, you get all the capabilities that come with it.
Other repositories I would like to acknowledge:
