Matcha-TTS: A fast TTS architecture with conditional flow matching
Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter
We propose Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching to speed up ODE-based speech synthesis. Our method:
- Is probabilistic
- Has a compact memory footprint
- Sounds highly natural
- Is very fast to synthesise from
Please check out the audio examples below and read our arXiv preprint for more details. Code and pre-trained models will be made available shortly after the ICASSP deadline.
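As a rough illustration of what "ODE-based synthesis" involves, the sketch below integrates a vector field from noise towards a target with a fixed number of Euler steps. It is a minimal toy example, not the Matcha-TTS implementation: `eulerSolve` and `makeToyVectorField` are hypothetical names, the toy field stands in for a trained network, and the straight-line conditional path with a small `sigmaMin` is an assumption borrowed from the flow matching literature rather than something stated on this page.

```javascript
// Hypothetical sketch: fixed-step Euler integration of dx/dt = v(x, t)
// from t = 0 to t = 1. The number of steps sets the synthesis cost.
function eulerSolve(vectorField, x0, numSteps) {
  const dt = 1 / numSteps;
  let x = x0.slice();
  for (let step = 0; step < numSteps; step++) {
    const t = step * dt;
    const v = vectorField(x, t);
    x = x.map((xi, i) => xi + dt * v[i]);
  }
  return x;
}

// Toy stand-in for a trained network (an assumption for illustration only):
// the conditional vector field that moves any sample along a straight line
// towards one fixed target.
function makeToyVectorField(target, sigmaMin) {
  return (x, t) => {
    const denom = 1 - (1 - sigmaMin) * t;
    return x.map((xi, i) => (target[i] - (1 - sigmaMin) * xi) / denom);
  };
}

const target = [0.5, -1.0, 2.0];   // made-up "data" point
const noise = [1.2, 0.3, -0.7];    // made-up starting sample
const sigmaMin = 1e-4;
const field = makeToyVectorField(target, sigmaMin);

const twoSteps = eulerSolve(field, noise, 2);
const tenSteps = eulerSolve(field, noise, 10);
console.log(twoSteps, tenSteps);
```

With this toy field the trajectory is a straight line, so two Euler steps land on essentially the same point as ten. A learned field is only approximately straight, so the step count trades speed against accuracy.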
<script src="transcripts.js"></script>
Architecture
<script>
// Transcripts of the listening-test sentences, keyed by sentence number.
transcript_listening_test = {
  1: "It had established periodic regular review of the status of four hundred individuals;", // 4
  2: "The narrative of these events is based largely on the recollections of the participants,", // 3
  3: "The jury did not believe him, and the verdict was for the defendants.", // 7
  4: "One by one the huge uprights of black timber were fitted together,", // 19
  5: "The position of this palmprint on the carton was parallel with the long axis of the box, and at right angles with the short axis;", // 23
  6: "The boy declared he saw no one, and accordingly passed through without paying the toll of a penny." // 38
};

// Load a stimulus into the audio player with the given id, update the
// transcript and condition label displayed with it, and start playback.
function play_audio(filename, audio_id, condition_name, transcription) {
  // Declared with const so these no longer leak into the global scope.
  const audio = document.getElementById(audio_id);
  const audio_source = document.getElementById(audio_id + "-src");
  const block_quote = document.getElementById(audio_id + "-transcript");
  const stimulus_span = document.getElementById(audio_id + "-span");

  audio.pause(); // stop whatever is currently playing
  audio_source.src = filename; // point the source element at the new file
  block_quote.innerHTML = transcription;
  stimulus_span.innerHTML = condition_name;
  audio.load(); // reload so the new source takes effect
  audio.play();
}
</script>
Stimuli from the listening test
Click the buttons in the table to load and play the different stimuli.
Currently loaded stimulus: MAT-10 : Sentence 1
Transcription:
It had established periodic regular review of the status of four hundred individuals;