Matcha-TTS: A fast TTS architecture with conditional flow matching
Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter
We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. Our method:
- Is probabilistic
- Has a compact memory footprint
- Sounds highly natural
- Is very fast to synthesise from
See below for audio examples, or read our ICASSP 2024 paper for more details. Code is available in our GitHub repository, along with pre-trained models.
You can also try 🍵 Matcha-TTS in your browser on HuggingFace 🤗 spaces.
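As a rough illustration of what conditional flow matching trains on, the sketch below builds one training pair under the optimal-transport formulation of flow matching: a point on a near-straight path from noise to data, and the constant velocity along that path that a network would be asked to regress. This is a minimal sketch under stated assumptions, not the code from the repository; the function names, the value of `SIGMA_MIN`, and the array shapes are all hypothetical.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small constant; illustrative value only


def cfm_training_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for one conditional flow matching training example.

    x0 is a noise sample, x1 is a data sample (for TTS, a mel-spectrogram),
    and t is a time in [0, 1].
    """
    # Point on the near-straight path from noise (t = 0) to data (t = 1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Time derivative of that path: the target the network should predict.
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def cfm_loss(predicted_field, u_t):
    """Mean squared error between a predicted vector field and the target."""
    return float(np.mean((predicted_field - u_t) ** 2))


rng = np.random.default_rng(0)
x1 = rng.normal(size=(80, 100))  # stand-in for an 80-bin, 100-frame mel-spectrogram
x0 = rng.normal(size=x1.shape)   # Gaussian noise of the same shape
x_t, u_t = cfm_training_pair(x0, x1, t=0.3)

# A hypothetical "network" that always predicts zero, just to exercise the loss.
print("loss of an all-zero prediction:", cfm_loss(np.zeros_like(u_t), u_t))
```

Because the target velocity does not depend on `t`, the paths being learned are close to straight lines, which is the intuition behind needing only a few ODE solver steps at synthesis time.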
Stimuli from the listening test
The listening test used the following six sentences:
- Sentence 1: It had established periodic regular review of the status of four hundred individuals;
- Sentence 2: The narrative of these events is based largely on the recollections of the participants,
- Sentence 3: The jury did not believe him, and the verdict was for the defendants.
- Sentence 4: One by one the huge uprights of black timber were fitted together,
- Sentence 5: The position of this palmprint on the carton was parallel with the long axis of the box, and at right angles with the short axis;
- Sentence 6: The boy declared he saw no one, and accordingly passed through without paying the toll of a penny.
Click the buttons in the table to load and play the different stimuli.
| System | Condition | Sentence 1 | Sentence 2 | Sentence 3 | Sentence 4 | Sentence 5 | Sentence 6 |
|---|---|---|---|---|---|---|---|
| Vocoded speech | VOC | | | | | | |
| Matcha-TTS | MAT-10 | | | | | | |
| Matcha-TTS | MAT-4 | | | | | | |
| Matcha-TTS | MAT-2 | | | | | | |
| Grad-TTS | GRAD-10 | | | | | | |
| Grad-TTS | GRAD-4 | | | | | | |
| Grad-TTS+CFM | GCFM-4 | | | | | | |
| FastSpeech 2 | FS2 | | | | | | |
| VITS | VITS | | | | | | |
Effect of the number of ODE solver steps
| System | Sentence 1 | Sentence 2 | Sentence 3 |
|---|---|---|---|
| Matcha-TTS | | | |
| Grad-TTS | | | |
| Grad-TTS + CFM | | | |
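To give a feel for why the number of ODE solver steps matters, here is a minimal sketch of a fixed-step Euler solver integrating a vector field from t = 0 to t = 1. It assumes a first-order Euler scheme and uses a hypothetical toy field (`toy_field`, simple exponential decay) with a known exact solution; it is not the learned decoder of any of the systems above, and all names are illustrative.

```python
import numpy as np


def euler_solve(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)  # one first-order Euler update
    return x


def toy_field(x, t):
    """Hypothetical stand-in field: dx/dt = -x, so x(1) = x(0) * exp(-1)."""
    return -x


x_start = np.array([1.0, -2.0])
exact = x_start * np.exp(-1.0)

# Step counts mirroring the 2-, 4-, and 10-step conditions in the table above.
for n_steps in (2, 4, 10):
    approx = euler_solve(toy_field, x_start, n_steps)
    error = float(np.max(np.abs(approx - exact)))
    print(f"{n_steps:2d} steps: max abs error = {error:.4f}")
```

Each step costs one evaluation of the vector field, so fewer steps mean faster synthesis at the price of a coarser approximation of the ODE solution. How much quality is lost in practice depends on the learned field, which this toy example does not model.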
Citation information
@inproceedings{mehta2024matcha,
title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
booktitle={Proc. ICASSP},
year={2024}
}
