Gustav Eje Henter 5a7f220662 Update demo page
Tweak the text, mention ICASSP acceptance, and update citation information.
2024-01-14 15:43:13 +01:00

Matcha-TTS: A fast TTS architecture with conditional flow matching

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter

We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. Our method:

  • Is probabilistic
  • Has a compact memory footprint
  • Sounds highly natural
  • Is very fast to synthesise from
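
As a rough illustration of why straighter probability paths can allow fewer ODE solver steps, the sketch below integrates two scalar toy ODEs with a fixed-step Euler solver. It is not the Matcha-TTS implementation: the real system learns its vector field with a neural network, whereas here the field is the closed-form conditional field of the optimal-transport flow-matching path towards one known target. The names `eulerSolve`, `cfmField`, and `curvedField`, and the values of `x0`, `x1`, and `sigmaMin`, are assumptions made up for this example.

```javascript
// Fixed-step Euler solver for dx/dt = field(x, t), integrated from t = 0 to t = 1.
function eulerSolve(field, x0, numSteps) {
  const h = 1 / numSteps;
  let x = x0;
  for (let i = 0; i < numSteps; i++) {
    const t = i * h;
    x += h * field(x, t);
  }
  return x;
}

// Conditional vector field of the path x_t = (1 - (1 - sigmaMin) t) x0 + t x1
// towards a single known target x1 (toy scalar version, hypothetical names).
function cfmField(x1, sigmaMin) {
  return (x, t) => (x1 - (1 - sigmaMin) * x) / (1 - (1 - sigmaMin) * t);
}

// A curved toy field for contrast; it is not taken from any TTS system.
const curvedField = (x, t) => -x;

const x0 = 0.7;        // hypothetical noise sample
const x1 = -1.3;       // hypothetical target value
const sigmaMin = 1e-4; // hypothetical small constant

const cfmExact = sigmaMin * x0 + x1;   // endpoint of the straight path at t = 1
const curvedExact = x0 * Math.exp(-1); // exact solution of dx/dt = -x at t = 1

for (const n of [1, 2, 4, 10]) {
  const cfmErr = Math.abs(eulerSolve(cfmField(x1, sigmaMin), x0, n) - cfmExact);
  const curvedErr = Math.abs(eulerSolve(curvedField, x0, n) - curvedExact);
  console.log(
    `steps=${n}  straight-path error=${cfmErr.toExponential(2)}  ` +
    `curved-path error=${curvedErr.toExponential(2)}`
  );
}
```

Along the straight path the velocity is constant, so Euler integration reaches the endpoint (up to rounding) even with a single step, while the error on the curved toy path only shrinks as the step count grows. A learned model's trajectories are generally not perfectly straight, so this should be read as intuition rather than as a guarantee.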

See below for audio examples, or read our ICASSP 2024 paper for more details. Code is available in our GitHub repository, along with pre-trained models.

You can also try 🍵 Matcha-TTS in your browser on HuggingFace 🤗 spaces.

<style type="text/css">
  .tg { border-collapse: collapse; border-color: #9ABAD9; border-spacing: 0; }
  .tg td { background-color: #EBF5FF; border-color: #9ABAD9; border-style: solid; border-width: 1px; color: #444; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 0px 20px; word-break: normal; font-weight: bold; vertical-align: middle; text-align: center; white-space: nowrap; }
  .tg th { background-color: #409cff; border-color: #9ABAD9; border-style: solid; border-width: 1px; color: #fff; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 0px 20px; word-break: normal; vertical-align: middle; text-align: center; white-space: nowrap; margin: auto; }
  .tg th a { background-color: #409cff; color: #fff; text-decoration: none; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 0px 20px; word-break: normal; vertical-align: middle; text-align: center; white-space: nowrap; margin: auto; }
  .tg .tg-0pky { border-color: inherit; text-align: center; vertical-align: top; }
  td img { position: relative; margin: 0 auto; max-width: 650px; padding: 5px; border: 0px; }
  .tg .tg-fymr { border-color: inherit; font-weight: bold; text-align: center; vertical-align: top; }
  .slider { -webkit-appearance: none; width: 75%; height: 15px; border-radius: 5px; background: #d3d3d3; outline: none; opacity: 0.7; -webkit-transition: .2s; transition: opacity .2s; }
  .slider::-webkit-slider-thumb { -webkit-appearance: none; appearance: none; width: 25px; height: 25px; border-radius: 50%; background: #409cff; cursor: pointer; }
  .slider::-moz-range-thumb { width: 25px; height: 25px; border-radius: 50%; background: #409cff; cursor: pointer; }
  /* audio { width: 240px; } */
  .button-12 { display: flex; flex-direction: column; align-items: center; padding: 10px 54px; font-family: -apple-system, BlinkMacSystemFont, 'Roboto', sans-serif; font-weight: bold; border-radius: 6px; border: none; background: #6E6D70; box-shadow: 0px 0.5px 1px rgba(0, 0, 0, 0.1), inset 0px 0.5px 0.5px rgba(255, 255, 255, 0.5), 0px 0px 0px 0.5px rgba(0, 0, 0, 0.12); color: #DFDEDF; user-select: none; -webkit-user-select: none; touch-action: manipulation; }
  .button-12:focus { box-shadow: inset 0px 0.8px 0px -0.25px rgba(255, 255, 255, 0.2), 0px 0.5px 1px rgba(0, 0, 0, 0.1), 0px 0px 0px 3.5px rgba(58, 108, 217, 0.5); outline: 0; }
  audio { margin: 0.5em; }
</style>
<script src="transcripts.js"></script>
<script>
  // Transcripts of the six listening-test sentences, keyed by sentence number.
  var transcript_listening_test = {
    1: "It had established periodic regular review of the status of four hundred individuals;", // 4
    2: "The narrative of these events is based largely on the recollections of the participants,", // 3
    3: "The jury did not believe him, and the verdict was for the defendants.", // 7
    4: "One by one the huge uprights of black timber were fitted together,", // 19
    5: "The position of this palmprint on the carton was parallel with the long axis of the box, and at right angles with the short axis;", // 23
    6: "The boy declared he saw no one, and accordingly passed through without paying the toll of a penny." // 38
  };

  // Load a stimulus into the given audio player, update the displayed
  // condition name and transcription, and start playback.
  function play_audio(filename, audio_id, condition_name, transcription) {
    const audio = document.getElementById(audio_id);
    const audio_source = document.getElementById(audio_id + "-src");
    const block_quote = document.getElementById(audio_id + "-transcript");
    const stimulus_span = document.getElementById(audio_id + "-span");

    audio.pause();
    audio_source.src = filename;
    block_quote.innerHTML = transcription;
    stimulus_span.innerHTML = condition_name;
    audio.load();
    audio.play();
  }
</script>

Stimuli from the listening test

Click the buttons in the table to load and play the different stimuli.

Currently loaded stimulus: MAT-10 : Sentence 1

Audio player:

Transcription:

It had established periodic regular review of the status of four hundred individuals;

System           Condition   Sentence 1   Sentence 2   Sentence 3   Sentence 4   Sentence 5   Sentence 6
Vocoded speech   VOC
Matcha-TTS       MAT-10
                 MAT-4
                 MAT-2
Grad-TTS         GRAD-10
                 GRAD-4
Grad-TTS+CFM     GCFM-4
FastSpeech 2     FS2
VITS             VITS

Effect of the number of ODE solver steps

Steps:

<script>
  var itr_slider = document.getElementById("itr_slider");
  var itr_vals = document.getElementsByClassName("itr_val");

  // Slider position -> number of ODE solver steps
  var iterations = {
    1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 10,
    7: 15, 8: 20, 9: 25, 10: 50, 11: 100, 12: 500,
  };

  // Show the step count for the given slider position in every element of `classes`
  function updateVals(classes, value) {
    for (var i = 0; i < classes.length; i++) {
      classes[i].innerHTML = iterations[parseInt(value)];
    }
  }

  let systems = ["MAT", "GRAD", "GCFM"];

  // Initial slider position 6 corresponds to 10 steps
  updateVals(itr_vals, 6);

  itr_slider.oninput = function() {
    updateVals(itr_vals, this.value);
    let iteration = iterations[parseInt(this.value)];

    // Update sources
    for (let sent = 1; sent <= 3; sent++) {
      for (let system_idx = 0; system_idx < systems.length; system_idx++) {
        let audio = document.getElementById(systems[system_idx] + "_sent_" + sent);
        let audio_src = document.getElementById(systems[system_idx] + "_sent_src_" + sent);
        audio_src.src = "stimuli/number_of_ode_solver/" + systems[system_idx] + "-" + iteration + "_" + sent + ".wav";
        audio.load();
      }
    }
  };
</script>
System Sentence 1 Sentence 2 Sentence 3
Matcha-TTS
Grad-TTS
Grad-TTS + CFM
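
The slider does not select the step count directly: each of its 12 positions is looked up in a table of step counts, and the result is used to build the path of the audio file to load. The sketch below isolates that lookup as a standalone function. The names `STEP_COUNTS` and `stimulusPath` are hypothetical and do not appear in the page script; the mapping and the path pattern are copied from it.

```javascript
// Slider position -> number of ODE solver steps, as used by the page script.
const STEP_COUNTS = {
  1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 10,
  7: 15, 8: 20, 9: 25, 10: 50, 11: 100, 12: 500,
};

// Hypothetical helper: build the stimulus path for a system prefix
// ("MAT", "GRAD" or "GCFM"), a slider position (1-12) and a sentence number (1-3).
function stimulusPath(system, sliderValue, sentence) {
  const steps = STEP_COUNTS[parseInt(sliderValue, 10)];
  if (steps === undefined) {
    throw new RangeError("Unknown slider position: " + sliderValue);
  }
  return `stimuli/number_of_ode_solver/${system}-${steps}_${sentence}.wav`;
}

console.log(stimulusPath("MAT", "6", 1));   // stimuli/number_of_ode_solver/MAT-10_1.wav
console.log(stimulusPath("GCFM", "12", 3)); // stimuli/number_of_ode_solver/GCFM-500_3.wav
```

The spacing between positions is uneven, which lets a short slider cover everything from 1 to 500 steps while keeping fine resolution at the low step counts.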

Citation information

@inproceedings{mehta2024matcha,
  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2024}
}
