2023-09-06 02:15:42 +00:00
2023-09-06 02:04:14 +00:00

Matcha-TTS: A fast TTS architecture with conditional flow matching

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter

We propose Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching to speed up ODE-based speech synthesis. Our method:

  • Is probabilistic
  • Has a compact memory footprint
  • Sounds highly natural
  • Is very fast to synthesise from
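
To make the idea of conditional flow matching with ODE-based sampling more concrete, here is a minimal toy sketch in plain Python. It is not the Matcha-TTS implementation. It assumes the straight-line (optimal-transport) conditional path commonly used in the flow matching literature, an assumed constant `SIGMA_MIN`, and a fixed-step Euler solver. The helper names (`cfm_pair`, `cfm_loss`, `euler_sample`, `make_oracle_field`) are hypothetical, and the closed-form "oracle" field for a single known target stands in for a trained neural network.

```python
import random

SIGMA_MIN = 1e-4  # assumed value; this page does not state it


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point x_t on the straight path from noise x0 to data x1, and its target velocity u_t."""
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(vector_field, x0, x1, t):
    """Mean squared error between the predicted and the target velocity at time t."""
    x_t, u_t = cfm_pair(x0, x1, t)
    v_t = vector_field(x_t, t)
    return sum((v - u) ** 2 for v, u in zip(v_t, u_t)) / len(u_t)


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / n_steps
    for step in range(n_steps):
        v = vector_field(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_field(x1, sigma_min=SIGMA_MIN):
    """Closed-form field for one known target x1; stands in for a trained network."""
    def field(x, t):
        scale = 1.0 - (1.0 - sigma_min) * t
        return [(b - (1.0 - sigma_min) * xi) / scale for xi, b in zip(x, x1)]
    return field


random.seed(0)
target = [0.5, -1.0, 2.0]                         # toy stand-in for one acoustic frame
noise = [random.gauss(0.0, 1.0) for _ in target]  # starting point of the ODE
oracle = make_oracle_field(target)

loss = cfm_loss(oracle, noise, target, t=0.3)     # near zero for the oracle field
two_step = euler_sample(oracle, noise, n_steps=2)
ten_step = euler_sample(oracle, noise, n_steps=10)
```

In this toy setting the path from noise to target is a straight line, so two Euler steps land in essentially the same place as ten. A learned vector field averages over many targets and is not exactly straight, so in practice the number of solver steps trades quality against speed.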

Please check out the audio examples below and read our arXiv preprint for more details. Code and pre-trained models will be made available shortly after the ICASSP deadline.


Architecture

Architecture of Matcha-TTS

Stimuli from the evaluation test

  • Sentence 1: "It had established periodic regular review of the status of four hundred individuals;"
  • Sentence 2: "The narrative of these events is based largely on the recollections of the participants,"
  • Sentence 3: "The jury did not believe him, and the verdict was for the defendants."
  • Sentence 4: "One by one the huge uprights of black timber were fitted together,"
  • Sentence 5: "The position of this palmprint on the carton was parallel with the long axis of the box, and at right angles with the short axis;"
  • Sentence 6: "The boy declared he saw no one, and accordingly passed through without paying the toll of a penny."

Each condition below has one audio sample per sentence (Sentence 1 to Sentence 6).

Architecture     Condition
Vocoded          VOC
Matcha-TTS       MAT-10
                 MAT-4
                 MAT-2
Grad-TTS         GRAD-10
                 GRAD-4
Grad-TTS+CFM     GCFM-4
FastSpeech       FS2
VITS             VITS
