🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter
This is the official code implementation of 🍵 Matcha-TTS.
We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. Our method:
- Is probabilistic
- Has a compact memory footprint
- Sounds highly natural
- Is very fast to synthesise from
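As background on the training objective: conditional flow matching regresses a vector field along simple probability paths between Gaussian noise and data. The sketch below illustrates the straight-line (OT-CFM) interpolation and its regression target; the function name and shapes are our own illustration, not code from this repository.

```python
import numpy as np

def cfm_training_pair(x1, sigma_min=1e-4, rng=None):
    """Sample an (input, target) pair for conditional flow matching.

    x1: a data sample (e.g. a mel-spectrogram patch).
    Returns a random time t, the interpolated point x_t, and the
    vector-field regression target u_t along a straight (OT) path.
    """
    rng = rng or np.random.default_rng(0)
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    t = rng.uniform()                   # random time in [0, 1]
    # Straight-line conditional path from noise to data:
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Constant target vector field along that path:
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

# A text-conditioned network v(x_t, t, text) is trained with an
# MSE loss against u_t; at synthesis time its ODE is integrated
# from noise to a mel-spectrogram.
```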
Check out our demo page. Read our arXiv preprint for more details.
Installation
- Create an environment (suggested but optional)
conda create -n matcha_tts python=3.10 -y
conda activate matcha_tts
- Install Matcha-TTS from source using pip (we plan to publish it on PyPI in the future)
pip install git+https://github.com/shivammehta25/Matcha-TTS.git
- Run CLI / gradio app / jupyter notebook
# This will download the required models
matcha-tts --text "<INPUT TEXT>"
or
matcha-tts-app
or open synthesis.ipynb in Jupyter Notebook
CLI Arguments
- To synthesise from given text, run:
matcha-tts --text "<INPUT TEXT>"
- To synthesise from a file, run:
matcha-tts --file <PATH TO FILE>
- To batch synthesise from a file, run:
matcha-tts --file <PATH TO FILE> --batched
Additional arguments
- Speaking rate
matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.0
- Sampling temperature
matcha-tts --text "<INPUT TEXT>" --temperature 0.667
- Euler ODE solver steps
matcha-tts --text "<INPUT TEXT>" --steps 10
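To give intuition for the last two flags: synthesis integrates an ODE with a fixed-step Euler solver, where the step count trades speed for accuracy and the temperature scales the initial Gaussian noise. A toy Euler integrator (our own illustration, not the repository's solver) shows the step-count trade-off on a field with a known solution:

```python
import numpy as np

def euler_solve(v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps fixed Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# Toy linear field with known solution x(1) = x0 * exp(-1);
# the Euler error shrinks roughly like O(1 / n_steps):
v = lambda x, t: -x
x0 = np.array([1.0])  # in TTS, x0 would be temperature * Gaussian noise
errors = [abs(euler_solve(v, x0, n)[0] - np.exp(-1)) for n in (2, 10, 100)]
```

More steps give a more accurate trajectory at proportionally higher synthesis cost, which is why a small fixed step count (e.g. 10) is the speed/quality sweet spot.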
Citation information
If you find this work useful, please cite our paper:
@article{mehta2023matcha,
  title={Matcha-TTS: A fast TTS architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  journal={arXiv preprint arXiv:2309.03199},
  year={2023}
}
Train with your own dataset
Let's assume we are training with LJSpeech
- Download the dataset from here, extract it to data/LJSpeech-1.1, and prepare the filelists to point to the extracted data, as in point 5 of the setup in the Tacotron 2 repo.
- Clone and enter this repository
git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
- Install the package from source
pip install -e .
- Go to configs/data/ljspeech.yaml and change
train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
to the paths of your train and validation filelists.
- Generate normalisation statistics with the yaml file of dataset configuration
matcha-data-stats -i ljspeech.yaml
# Output:
# {'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}
Update these values in configs/data/ljspeech.yaml under the data_statistics key.
data_statistics: # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101
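For intuition, mel_mean and mel_std are simply the global mean and standard deviation pooled over every mel-spectrogram value in the training set (used to normalise the decoder targets). A hedged sketch of that computation (matcha-data-stats is the supported tool; the function name here is our own):

```python
import numpy as np

def mel_statistics(mels):
    """Global mean/std over a list of mel-spectrogram arrays.

    Spectrograms may differ in length, so all values are pooled
    into one flat vector before computing the statistics.
    """
    values = np.concatenate([m.ravel() for m in mels])
    return {"mel_mean": float(values.mean()), "mel_std": float(values.std())}
```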
- Run the training script
make train-ljspeech
or
python matcha/train.py experiment=ljspeech
- For a minimum-memory run
python matcha/train.py experiment=ljspeech_min_memory
- For multi-GPU training, run
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
Acknowledgements
Since this code builds on the Lightning-Hydra-Template, you get all the capabilities that come with it.
Other repositories I would like to acknowledge:
