1 Commits

Author SHA1 Message Date
dependabot[bot]
0c0bcaf69f Bump diffusers from 0.25.0 to 0.27.2
Bumps [diffusers](https://github.com/huggingface/diffusers) from 0.25.0 to 0.27.2.
- [Release notes](https://github.com/huggingface/diffusers/releases)
- [Commits](https://github.com/huggingface/diffusers/compare/v0.25.0...v0.27.2)

---
updated-dependencies:
- dependency-name: diffusers
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-20 13:08:50 +00:00
18 changed files with 47 additions and 396 deletions


@@ -1,9 +1,9 @@
default_language_version: default_language_version:
python: python3.11 python: python3.10
repos: repos:
- repo: https://github.com/pre-commit/pre-commit-hooks - repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0 rev: v4.4.0
hooks: hooks:
# list of supported hooks: https://pre-commit.com/hooks.html # list of supported hooks: https://pre-commit.com/hooks.html
- id: trailing-whitespace - id: trailing-whitespace
@@ -18,28 +18,28 @@ repos:
# python code formatting # python code formatting
- repo: https://github.com/psf/black - repo: https://github.com/psf/black
rev: 23.12.1 rev: 23.9.1
hooks: hooks:
- id: black - id: black
args: [--line-length, "120"] args: [--line-length, "120"]
# python import sorting # python import sorting
- repo: https://github.com/PyCQA/isort - repo: https://github.com/PyCQA/isort
rev: 5.13.2 rev: 5.12.0
hooks: hooks:
- id: isort - id: isort
args: ["--profile", "black", "--filter-files"] args: ["--profile", "black", "--filter-files"]
# python upgrading syntax to newer version # python upgrading syntax to newer version
- repo: https://github.com/asottile/pyupgrade - repo: https://github.com/asottile/pyupgrade
rev: v3.15.0 rev: v3.14.0
hooks: hooks:
- id: pyupgrade - id: pyupgrade
args: [--py38-plus] args: [--py38-plus]
# python check (PEP8), programming errors and code complexity # python check (PEP8), programming errors and code complexity
- repo: https://github.com/PyCQA/flake8 - repo: https://github.com/PyCQA/flake8
rev: 7.0.0 rev: 6.1.0
hooks: hooks:
- id: flake8 - id: flake8
args: args:
@@ -54,6 +54,6 @@ repos:
# pylint # pylint
- repo: https://github.com/pycqa/pylint - repo: https://github.com/pycqa/pylint
rev: v3.0.3 rev: v3.0.0
hooks: hooks:
- id: pylint - id: pylint


@@ -17,7 +17,7 @@
</div> </div>
> This is the official code implementation of 🍵 Matcha-TTS [ICASSP 2024]. > This is the official code implementation of 🍵 Matcha-TTS.
We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS, that uses [conditional flow matching](https://arxiv.org/abs/2210.02747) (similar to [rectified flows](https://arxiv.org/abs/2209.03003)) to speed up ODE-based speech synthesis. Our method: We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS, that uses [conditional flow matching](https://arxiv.org/abs/2210.02747) (similar to [rectified flows](https://arxiv.org/abs/2209.03003)) to speed up ODE-based speech synthesis. Our method:
@@ -252,43 +252,6 @@ python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vo
This will write `.wav` audio files to the output directory. This will write `.wav` audio files to the output directory.
## Extract phoneme alignments from Matcha-TTS
If the dataset is structured as
```bash
data/
└── LJSpeech-1.1
├── metadata.csv
├── README
├── test.txt
├── train.txt
├── val.txt
└── wavs
```
Then you can extract the phoneme-level alignments from a trained Matcha-TTS model using:
```bash
python matcha/utils/get_durations_from_trained_model.py -i dataset_yaml -c <checkpoint>
```
Example:
```bash
python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt
```
or simply:
```bash
matcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt
```
---
## Train using extracted alignments
In the dataset config, turn on `load_durations`.
Example: `ljspeech.yaml`
```
load_durations: True
```
or see an example in configs/experiment/ljspeech_from_durations.yaml
## Citation information ## Citation information
If you use our code or otherwise find this work useful, please cite our paper: If you use our code or otherwise find this work useful, please cite our paper:


@@ -1,7 +1,7 @@
_target_: matcha.data.text_mel_datamodule.TextMelDataModule _target_: matcha.data.text_mel_datamodule.TextMelDataModule
name: ljspeech name: ljspeech
train_filelist_path: data/LJSpeech-1.1/train.txt train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/LJSpeech-1.1/val.txt valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
batch_size: 32 batch_size: 32
num_workers: 20 num_workers: 20
pin_memory: True pin_memory: True
@@ -19,4 +19,3 @@ data_statistics: # Computed for ljspeech dataset
mel_mean: -5.536622 mel_mean: -5.536622
mel_std: 2.116101 mel_std: 2.116101
seed: ${seed} seed: ${seed}
load_durations: false


@@ -1,19 +0,0 @@
# @package _global_
# to execute this experiment run:
# python train.py experiment=multispeaker
defaults:
- override /data: ljspeech.yaml
# all parameters below will be merged with parameters from default configurations set above
# this allows you to overwrite only specified parameters
tags: ["ljspeech"]
run_name: ljspeech
data:
load_durations: True
batch_size: 64


@@ -13,4 +13,3 @@ n_feats: 80
data_statistics: ${data.data_statistics} data_statistics: ${data.data_statistics}
out_size: null # Must be divisible by 4 out_size: null # Must be divisible by 4
prior_loss: true prior_loss: true
use_precomputed_durations: ${data.load_durations}


@@ -1 +1 @@
0.0.6.0 0.0.5.1


@@ -48,7 +48,7 @@ def plot_spectrogram_to_numpy(spectrogram, filename):
def process_text(i: int, text: str, device: torch.device): def process_text(i: int, text: str, device: torch.device):
print(f"[{i}] - Input text: {text}") print(f"[{i}] - Input text: {text}")
x = torch.tensor( x = torch.tensor(
intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0), intersperse(text_to_sequence(text, ["english_cleaners2"]), 0),
dtype=torch.long, dtype=torch.long,
device=device, device=device,
)[None] )[None]
@@ -326,13 +326,12 @@ def batched_synthesis(args, device, model, vocoder, denoiser, texts, spk):
for i, batch in enumerate(dataloader): for i, batch in enumerate(dataloader):
i = i + 1 i = i + 1
start_t = dt.datetime.now() start_t = dt.datetime.now()
b = batch["x"].shape[0]
output = model.synthesise( output = model.synthesise(
batch["x"].to(device), batch["x"].to(device),
batch["x_lengths"].to(device), batch["x_lengths"].to(device),
n_timesteps=args.steps, n_timesteps=args.steps,
temperature=args.temperature, temperature=args.temperature,
spks=spk.expand(b) if spk is not None else spk, spks=spk,
length_scale=args.speaking_rate, length_scale=args.speaking_rate,
) )
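The `process_text` hunk above passes the token ID list into `intersperse`, which pads the sequence with a blank ID (`0`). As a rough illustration only, assuming `intersperse` places the blank before, between, and after every token (the helper below is a hypothetical stand-in written for this note, not code from the repository):

```python
from typing import List


def intersperse_sketch(ids: List[int], blank: int = 0) -> List[int]:
    """Hypothetical stand-in for `intersperse`: blank before, between and after tokens."""
    out = [blank] * (2 * len(ids) + 1)  # n tokens become 2n + 1 entries
    out[1::2] = ids  # tokens occupy the odd positions
    return out


token_ids = [5, 9, 12]  # made-up symbol IDs, for illustration only
padded = intersperse_sketch(token_ids)
print(padded)  # [0, 5, 0, 9, 0, 12, 0]
```

Under this assumption the padded sequence always has odd length, which is why the call site needs a flat list of IDs: with the tuple-returning variant of `text_to_sequence`, the caller has to index `[0]` first.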


@@ -1,8 +1,6 @@
import random import random
from pathlib import Path
from typing import Any, Dict, Optional from typing import Any, Dict, Optional
import numpy as np
import torch import torch
import torchaudio as ta import torchaudio as ta
from lightning import LightningDataModule from lightning import LightningDataModule
@@ -41,7 +39,6 @@ class TextMelDataModule(LightningDataModule):
f_max, f_max,
data_statistics, data_statistics,
seed, seed,
load_durations,
): ):
super().__init__() super().__init__()
@@ -71,7 +68,6 @@ class TextMelDataModule(LightningDataModule):
self.hparams.f_max, self.hparams.f_max,
self.hparams.data_statistics, self.hparams.data_statistics,
self.hparams.seed, self.hparams.seed,
self.hparams.load_durations,
) )
self.validset = TextMelDataset( # pylint: disable=attribute-defined-outside-init self.validset = TextMelDataset( # pylint: disable=attribute-defined-outside-init
self.hparams.valid_filelist_path, self.hparams.valid_filelist_path,
@@ -87,7 +83,6 @@ class TextMelDataModule(LightningDataModule):
self.hparams.f_max, self.hparams.f_max,
self.hparams.data_statistics, self.hparams.data_statistics,
self.hparams.seed, self.hparams.seed,
self.hparams.load_durations,
) )
def train_dataloader(self): def train_dataloader(self):
@@ -114,7 +109,7 @@ class TextMelDataModule(LightningDataModule):
"""Clean up after fit or test.""" """Clean up after fit or test."""
pass # pylint: disable=unnecessary-pass pass # pylint: disable=unnecessary-pass
def state_dict(self): def state_dict(self): # pylint: disable=no-self-use
"""Extra things to save to checkpoint.""" """Extra things to save to checkpoint."""
return {} return {}
@@ -139,7 +134,6 @@ class TextMelDataset(torch.utils.data.Dataset):
f_max=8000, f_max=8000,
data_parameters=None, data_parameters=None,
seed=None, seed=None,
load_durations=False,
): ):
self.filepaths_and_text = parse_filelist(filelist_path) self.filepaths_and_text = parse_filelist(filelist_path)
self.n_spks = n_spks self.n_spks = n_spks
@@ -152,8 +146,6 @@ class TextMelDataset(torch.utils.data.Dataset):
self.win_length = win_length self.win_length = win_length
self.f_min = f_min self.f_min = f_min
self.f_max = f_max self.f_max = f_max
self.load_durations = load_durations
if data_parameters is not None: if data_parameters is not None:
self.data_parameters = data_parameters self.data_parameters = data_parameters
else: else:
@@ -172,29 +164,10 @@ class TextMelDataset(torch.utils.data.Dataset):
filepath, text = filepath_and_text[0], filepath_and_text[1] filepath, text = filepath_and_text[0], filepath_and_text[1]
spk = None spk = None
text, cleaned_text = self.get_text(text, add_blank=self.add_blank) text = self.get_text(text, add_blank=self.add_blank)
mel = self.get_mel(filepath) mel = self.get_mel(filepath)
durations = self.get_durations(filepath, text) if self.load_durations else None return {"x": text, "y": mel, "spk": spk}
return {"x": text, "y": mel, "spk": spk, "filepath": filepath, "x_text": cleaned_text, "durations": durations}
def get_durations(self, filepath, text):
filepath = Path(filepath)
data_dir, name = filepath.parent.parent, filepath.stem
try:
dur_loc = data_dir / "durations" / f"{name}.npy"
durs = torch.from_numpy(np.load(dur_loc).astype(int))
except FileNotFoundError as e:
raise FileNotFoundError(
f"Tried loading the durations but durations didn't exist at {dur_loc}, make sure you've generated the durations first using: python matcha/utils/get_durations_from_trained_model.py \n"
) from e
assert len(durs) == len(text), f"Length of durations {len(durs)} and text {len(text)} do not match"
return durs
def get_mel(self, filepath): def get_mel(self, filepath):
audio, sr = ta.load(filepath) audio, sr = ta.load(filepath)
@@ -214,11 +187,11 @@ class TextMelDataset(torch.utils.data.Dataset):
return mel return mel
def get_text(self, text, add_blank=True): def get_text(self, text, add_blank=True):
text_norm, cleaned_text = text_to_sequence(text, self.cleaners) text_norm = text_to_sequence(text, self.cleaners)
if self.add_blank: if self.add_blank:
text_norm = intersperse(text_norm, 0) text_norm = intersperse(text_norm, 0)
text_norm = torch.IntTensor(text_norm) text_norm = torch.IntTensor(text_norm)
return text_norm, cleaned_text return text_norm
def __getitem__(self, index): def __getitem__(self, index):
datapoint = self.get_datapoint(self.filepaths_and_text[index]) datapoint = self.get_datapoint(self.filepaths_and_text[index])
@@ -241,11 +214,8 @@ class TextMelBatchCollate:
y = torch.zeros((B, n_feats, y_max_length), dtype=torch.float32) y = torch.zeros((B, n_feats, y_max_length), dtype=torch.float32)
x = torch.zeros((B, x_max_length), dtype=torch.long) x = torch.zeros((B, x_max_length), dtype=torch.long)
durations = torch.zeros((B, x_max_length), dtype=torch.long)
y_lengths, x_lengths = [], [] y_lengths, x_lengths = [], []
spks = [] spks = []
filepaths, x_texts = [], []
for i, item in enumerate(batch): for i, item in enumerate(batch):
y_, x_ = item["y"], item["x"] y_, x_ = item["y"], item["x"]
y_lengths.append(y_.shape[-1]) y_lengths.append(y_.shape[-1])
@@ -253,22 +223,9 @@ class TextMelBatchCollate:
y[i, :, : y_.shape[-1]] = y_ y[i, :, : y_.shape[-1]] = y_
x[i, : x_.shape[-1]] = x_ x[i, : x_.shape[-1]] = x_
spks.append(item["spk"]) spks.append(item["spk"])
filepaths.append(item["filepath"])
x_texts.append(item["x_text"])
if item["durations"] is not None:
durations[i, : item["durations"].shape[-1]] = item["durations"]
y_lengths = torch.tensor(y_lengths, dtype=torch.long) y_lengths = torch.tensor(y_lengths, dtype=torch.long)
x_lengths = torch.tensor(x_lengths, dtype=torch.long) x_lengths = torch.tensor(x_lengths, dtype=torch.long)
spks = torch.tensor(spks, dtype=torch.long) if self.n_spks > 1 else None spks = torch.tensor(spks, dtype=torch.long) if self.n_spks > 1 else None
return { return {"x": x, "x_lengths": x_lengths, "y": y, "y_lengths": y_lengths, "spks": spks}
"x": x,
"x_lengths": x_lengths,
"y": y,
"y_lengths": y_lengths,
"spks": spks,
"filepaths": filepaths,
"x_texts": x_texts,
"durations": durations if not torch.eq(durations, 0).all() else None,
}


@@ -58,14 +58,13 @@ class BaseLightningClass(LightningModule, ABC):
y, y_lengths = batch["y"], batch["y_lengths"] y, y_lengths = batch["y"], batch["y_lengths"]
spks = batch["spks"] spks = batch["spks"]
dur_loss, prior_loss, diff_loss, *_ = self( dur_loss, prior_loss, diff_loss = self(
x=x, x=x,
x_lengths=x_lengths, x_lengths=x_lengths,
y=y, y=y,
y_lengths=y_lengths, y_lengths=y_lengths,
spks=spks, spks=spks,
out_size=self.out_size, out_size=self.out_size,
durations=batch["durations"],
) )
return { return {
"dur_loss": dur_loss, "dur_loss": dur_loss,

View File

@@ -35,7 +35,6 @@ class MatchaTTS(BaseLightningClass): # 🍵
optimizer=None, optimizer=None,
scheduler=None, scheduler=None,
prior_loss=True, prior_loss=True,
use_precomputed_durations=False,
): ):
super().__init__() super().__init__()
@@ -47,7 +46,6 @@ class MatchaTTS(BaseLightningClass): # 🍵
self.n_feats = n_feats self.n_feats = n_feats
self.out_size = out_size self.out_size = out_size
self.prior_loss = prior_loss self.prior_loss = prior_loss
self.use_precomputed_durations = use_precomputed_durations
if n_spks > 1: if n_spks > 1:
self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim) self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)
@@ -149,7 +147,7 @@ class MatchaTTS(BaseLightningClass): # 🍵
"rtf": rtf, "rtf": rtf,
} }
def forward(self, x, x_lengths, y, y_lengths, spks=None, out_size=None, cond=None, durations=None): def forward(self, x, x_lengths, y, y_lengths, spks=None, out_size=None, cond=None):
""" """
Computes 3 losses: Computes 3 losses:
1. duration loss: loss between predicted token durations and those extracted by Monotonic Alignment Search (MAS). 1. duration loss: loss between predicted token durations and those extracted by Monotonic Alignment Search (MAS).
@@ -181,20 +179,17 @@ class MatchaTTS(BaseLightningClass): # 🍵
y_mask = sequence_mask(y_lengths, y_max_length).unsqueeze(1).to(x_mask) y_mask = sequence_mask(y_lengths, y_max_length).unsqueeze(1).to(x_mask)
attn_mask = x_mask.unsqueeze(-1) * y_mask.unsqueeze(2) attn_mask = x_mask.unsqueeze(-1) * y_mask.unsqueeze(2)
if self.use_precomputed_durations: # Use MAS to find most likely alignment `attn` between text and mel-spectrogram
attn = generate_path(durations.squeeze(1), attn_mask.squeeze(1)) with torch.no_grad():
else: const = -0.5 * math.log(2 * math.pi) * self.n_feats
# Use MAS to find most likely alignment `attn` between text and mel-spectrogram factor = -0.5 * torch.ones(mu_x.shape, dtype=mu_x.dtype, device=mu_x.device)
with torch.no_grad(): y_square = torch.matmul(factor.transpose(1, 2), y**2)
const = -0.5 * math.log(2 * math.pi) * self.n_feats y_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), y)
factor = -0.5 * torch.ones(mu_x.shape, dtype=mu_x.dtype, device=mu_x.device) mu_square = torch.sum(factor * (mu_x**2), 1).unsqueeze(-1)
y_square = torch.matmul(factor.transpose(1, 2), y**2) log_prior = y_square - y_mu_double + mu_square + const
y_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), y)
mu_square = torch.sum(factor * (mu_x**2), 1).unsqueeze(-1)
log_prior = y_square - y_mu_double + mu_square + const
attn = monotonic_align.maximum_path(log_prior, attn_mask.squeeze(1)) attn = monotonic_align.maximum_path(log_prior, attn_mask.squeeze(1))
attn = attn.detach() # b, t_text, T_mel attn = attn.detach()
# Compute loss between predicted log-scaled durations and those obtained from MAS # Compute loss between predicted log-scaled durations and those obtained from MAS
# referred to as prior loss in the paper # referred to as prior loss in the paper
@@ -241,4 +236,4 @@ class MatchaTTS(BaseLightningClass): # 🍵
else: else:
prior_loss = 0 prior_loss = 0
return dur_loss, prior_loss, diff_loss, attn return dur_loss, prior_loss, diff_loss
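For readers following the MAS block in the `forward` hunk: the four terms appear to sum to the log-density of a unit-variance diagonal Gaussian, evaluated for every (text position, mel frame) pair. The sketch below assumes `mu_x` has shape `(b, n_feats, t_text)` and `y` has shape `(b, n_feats, T_mel)`, which is inferred from the matmul calls rather than stated in the diff; $D$ stands for `n_feats`, $i$ indexes text positions and $j$ mel frames.

```latex
\begin{align*}
\texttt{y\_square}_{ij}    &= -\tfrac{1}{2}\sum_{d=1}^{D} y_{dj}^{2} \\
\texttt{y\_mu\_double}_{ij} &= \sum_{d=1}^{D} 2\left(-\tfrac{1}{2}\mu_{di}\right) y_{dj}
                             = -\sum_{d=1}^{D} \mu_{di}\, y_{dj} \\
\texttt{mu\_square}_{i}    &= -\tfrac{1}{2}\sum_{d=1}^{D} \mu_{di}^{2} \\
\texttt{const}             &= -\tfrac{D}{2}\log(2\pi) \\[4pt]
\texttt{log\_prior}_{ij}   &= \texttt{y\_square}_{ij} - \texttt{y\_mu\_double}_{ij}
                              + \texttt{mu\_square}_{i} + \texttt{const} \\
                           &= -\tfrac{1}{2}\sum_{d=1}^{D}\left(y_{dj} - \mu_{di}\right)^{2}
                              - \tfrac{D}{2}\log(2\pi)
                            = \log \mathcal{N}\!\left(y_{j};\, \mu_{i},\, I\right)
\end{align*}
```

Expanding the square this way lets the whole `(t_text, T_mel)` score matrix be computed with two batched matrix multiplications instead of an explicit pairwise difference.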


@@ -21,7 +21,7 @@ def text_to_sequence(text, cleaner_names):
for symbol in clean_text: for symbol in clean_text:
symbol_id = _symbol_to_id[symbol] symbol_id = _symbol_to_id[symbol]
sequence += [symbol_id] sequence += [symbol_id]
return sequence, clean_text return sequence
def cleaned_text_to_sequence(cleaned_text): def cleaned_text_to_sequence(cleaned_text):


@@ -15,6 +15,7 @@ import logging
import re import re
import phonemizer import phonemizer
import piper_phonemize
from unidecode import unidecode from unidecode import unidecode
# To avoid excessive logging we set the log level of the phonemizer package to Critical # To avoid excessive logging we set the log level of the phonemizer package to Critical
@@ -105,17 +106,11 @@ def english_cleaners2(text):
return phonemes return phonemes
# I am removing this due to incompatibility with several versions of Python def english_cleaners_piper(text):
# However, if you want to use it, you can uncomment it """Pipeline for English text, including abbreviation expansion. + punctuation + stress"""
# and install piper-phonemize with the following command: text = convert_to_ascii(text)
# pip install piper-phonemize text = lowercase(text)
text = expand_abbreviations(text)
# import piper_phonemize phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="en-US")[0])
# def english_cleaners_piper(text): phonemes = collapse_whitespace(phonemes)
# """Pipeline for English text, including abbreviation expansion. + punctuation + stress""" return phonemes
# text = convert_to_ascii(text)
# text = lowercase(text)
# text = expand_abbreviations(text)
# phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="en-US")[0])
# phonemes = collapse_whitespace(phonemes)
# return phonemes


@@ -94,7 +94,6 @@ def main():
cfg["batch_size"] = args.batch_size cfg["batch_size"] = args.batch_size
cfg["train_filelist_path"] = str(os.path.join(root_path, cfg["train_filelist_path"])) cfg["train_filelist_path"] = str(os.path.join(root_path, cfg["train_filelist_path"]))
cfg["valid_filelist_path"] = str(os.path.join(root_path, cfg["valid_filelist_path"])) cfg["valid_filelist_path"] = str(os.path.join(root_path, cfg["valid_filelist_path"]))
cfg["load_durations"] = False
text_mel_datamodule = TextMelDataModule(**cfg) text_mel_datamodule = TextMelDataModule(**cfg)
text_mel_datamodule.setup() text_mel_datamodule.setup()


@@ -1,195 +0,0 @@
r"""
This file creates a pickle file in which the values needed for loading the dataset are stored, so the model can load it
when needed.
Parameters from hparam.py will be used
"""
import argparse
import json
import os
import sys
from pathlib import Path
import lightning
import numpy as np
import rootutils
import torch
from hydra import compose, initialize
from omegaconf import open_dict
from torch import nn
from tqdm.auto import tqdm
from matcha.cli import get_device
from matcha.data.text_mel_datamodule import TextMelDataModule
from matcha.models.matcha_tts import MatchaTTS
from matcha.utils.logging_utils import pylogger
from matcha.utils.utils import get_phoneme_durations
log = pylogger.get_pylogger(__name__)
def save_durations_to_folder(
attn: torch.Tensor, x_length: int, y_length: int, filepath: str, output_folder: Path, text: str
):
durations = attn.squeeze().sum(1)[:x_length].numpy()
durations_json = get_phoneme_durations(durations, text)
output = output_folder / Path(filepath).name.replace(".wav", ".npy")
with open(output.with_suffix(".json"), "w", encoding="utf-8") as f:
json.dump(durations_json, f, indent=4, ensure_ascii=False)
np.save(output, durations)
@torch.inference_mode()
def compute_durations(data_loader: torch.utils.data.DataLoader, model: nn.Module, device: torch.device, output_folder):
"""Generate durations from the model for each datapoint and save it in a folder
Args:
data_loader (torch.utils.data.DataLoader): Dataloader
model (nn.Module): MatchaTTS model
device (torch.device): GPU or CPU
"""
for batch in tqdm(data_loader, desc="🍵 Computing durations 🍵:"):
x, x_lengths = batch["x"], batch["x_lengths"]
y, y_lengths = batch["y"], batch["y_lengths"]
spks = batch["spks"]
x = x.to(device)
y = y.to(device)
x_lengths = x_lengths.to(device)
y_lengths = y_lengths.to(device)
spks = spks.to(device) if spks is not None else None
_, _, _, attn = model(
x=x,
x_lengths=x_lengths,
y=y,
y_lengths=y_lengths,
spks=spks,
)
attn = attn.cpu()
for i in range(attn.shape[0]):
save_durations_to_folder(
attn[i],
x_lengths[i].item(),
y_lengths[i].item(),
batch["filepaths"][i],
output_folder,
batch["x_texts"][i],
)
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"-i",
"--input-config",
type=str,
default="ljspeech.yaml",
help="The name of the yaml config file under configs/data",
)
parser.add_argument(
"-b",
"--batch-size",
type=int,
default="32",
help="Can have increased batch size for faster computation",
)
parser.add_argument(
"-f",
"--force",
action="store_true",
default=False,
required=False,
help="force overwrite the file",
)
parser.add_argument(
"-c",
"--checkpoint_path",
type=str,
required=True,
help="Path to the checkpoint file to load the model from",
)
parser.add_argument(
"-o",
"--output-folder",
type=str,
default=None,
help="Output folder to save the data statistics",
)
parser.add_argument(
"--cpu", action="store_true", help="Use CPU for inference, not recommended (default: use GPU if available)"
)
args = parser.parse_args()
with initialize(version_base="1.3", config_path="../../configs/data"):
cfg = compose(config_name=args.input_config, return_hydra_config=True, overrides=[])
root_path = rootutils.find_root(search_from=__file__, indicator=".project-root")
with open_dict(cfg):
del cfg["hydra"]
del cfg["_target_"]
cfg["seed"] = 1234
cfg["batch_size"] = args.batch_size
cfg["train_filelist_path"] = str(os.path.join(root_path, cfg["train_filelist_path"]))
cfg["valid_filelist_path"] = str(os.path.join(root_path, cfg["valid_filelist_path"]))
cfg["load_durations"] = False
if args.output_folder is not None:
output_folder = Path(args.output_folder)
else:
output_folder = Path(cfg["train_filelist_path"]).parent / "durations"
print(f"Output folder set to: {output_folder}")
if os.path.exists(output_folder) and not args.force:
print("Folder already exists. Use -f to force overwrite")
sys.exit(1)
output_folder.mkdir(parents=True, exist_ok=True)
print(f"Preprocessing: {cfg['name']} from training filelist: {cfg['train_filelist_path']}")
print("Loading model...")
device = get_device(args)
model = MatchaTTS.load_from_checkpoint(args.checkpoint_path, map_location=device)
text_mel_datamodule = TextMelDataModule(**cfg)
text_mel_datamodule.setup()
try:
print("Computing stats for training set if exists...")
train_dataloader = text_mel_datamodule.train_dataloader()
compute_durations(train_dataloader, model, device, output_folder)
except lightning.fabric.utilities.exceptions.MisconfigurationException:
print("No training set found")
try:
print("Computing stats for validation set if exists...")
val_dataloader = text_mel_datamodule.val_dataloader()
compute_durations(val_dataloader, model, device, output_folder)
except lightning.fabric.utilities.exceptions.MisconfigurationException:
print("No validation set found")
try:
print("Computing stats for test set if exists...")
test_dataloader = text_mel_datamodule.test_dataloader()
compute_durations(test_dataloader, model, device, output_folder)
except lightning.fabric.utilities.exceptions.MisconfigurationException:
print("No test set found")
print(f"[+] Done! Data statistics saved to: {output_folder}")
if __name__ == "__main__":
# Helps with generating durations for the dataset to train other architectures
# that cannot learn to align due to limited size of dataset
# Example usage:
# python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c pretrained_model
# This will create a folder in data/processed_data/durations/ljspeech with the durations
main()


@@ -2,7 +2,6 @@ import os
import sys import sys
import warnings import warnings
from importlib.util import find_spec from importlib.util import find_spec
from math import ceil
from pathlib import Path from pathlib import Path
from typing import Any, Callable, Dict, Tuple from typing import Any, Callable, Dict, Tuple
@@ -218,42 +217,3 @@ def assert_model_downloaded(checkpoint_path, url, use_wget=True):
gdown.download(url=url, output=checkpoint_path, quiet=False, fuzzy=True) gdown.download(url=url, output=checkpoint_path, quiet=False, fuzzy=True)
else: else:
wget.download(url=url, out=checkpoint_path) wget.download(url=url, out=checkpoint_path)
def get_phoneme_durations(durations, phones):
prev = durations[0]
merged_durations = []
# Convolve with stride 2
for i in range(1, len(durations), 2):
if i == len(durations) - 2:
# if it is last take full value
next_half = durations[i + 1]
else:
next_half = ceil(durations[i + 1] / 2)
curr = prev + durations[i] + next_half
prev = durations[i + 1] - next_half
merged_durations.append(curr)
assert len(phones) == len(merged_durations)
assert len(merged_durations) == (len(durations) - 1) // 2
merged_durations = torch.cumsum(torch.tensor(merged_durations), 0, dtype=torch.long)
start = torch.tensor(0)
duration_json = []
for i, duration in enumerate(merged_durations):
duration_json.append(
{
phones[i]: {
"starttime": start.item(),
"endtime": duration.item(),
"duration": duration.item() - start.item(),
}
}
)
start = duration
assert list(duration_json[-1].values())[0]["endtime"] == sum(
durations
), f"{list(duration_json[-1].values())[0]['endtime'], sum(durations)}"
return duration_json
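The merging step in the removed `get_phoneme_durations` can be easier to follow on a small example. The sketch below is a simplified, pure-Python rewrite for illustration (the function name `merge_blank_durations` and the frame counts are made up). It assumes the per-token durations follow the layout `[blank, p0, blank, p1, ..., blank]`, and it covers only the merge, not the JSON timing output.

```python
from math import ceil
from typing import List


def merge_blank_durations(durations: List[int]) -> List[int]:
    """Fold blank-token frames into the neighbouring phonemes.

    Assumes the layout [blank, p0, blank, p1, ..., blank] (odd length).
    """
    merged = []
    carry = durations[0]  # the leading blank goes entirely to the first phoneme
    for i in range(1, len(durations), 2):
        blank_after = durations[i + 1]
        if i == len(durations) - 2:
            take = blank_after  # the last phoneme takes the whole trailing blank
        else:
            take = ceil(blank_after / 2)  # otherwise take the larger half of the blank
        merged.append(carry + durations[i] + take)
        carry = blank_after - take  # the remainder is passed to the next phoneme
    return merged


frames = [2, 5, 3, 4, 1]  # made-up frame counts: blank, p0, blank, p1, blank
merged = merge_blank_durations(frames)
print(merged)  # [9, 6]
```

The total number of frames is preserved (here 15 on both sides), which matches the final assertion in the original function that the last end time equals `sum(durations)`.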


@@ -35,10 +35,11 @@ torchaudio
matplotlib matplotlib
pandas pandas
conformer==0.3.2 conformer==0.3.2
diffusers==0.25.0 diffusers==0.27.2
notebook notebook
ipywidgets ipywidgets
gradio==3.43.2 gradio
gdown gdown
wget wget
seaborn seaborn
piper_phonemize


@@ -38,7 +38,6 @@ setup(
"matcha-data-stats=matcha.utils.generate_data_statistics:main", "matcha-data-stats=matcha.utils.generate_data_statistics:main",
"matcha-tts=matcha.cli:cli", "matcha-tts=matcha.cli:cli",
"matcha-tts-app=matcha.app:main", "matcha-tts-app=matcha.app:main",
"matcha-tts-get-durations=matcha.utils.get_durations_from_trained_model:main",
] ]
}, },
ext_modules=cythonize(exts, language_level=3), ext_modules=cythonize(exts, language_level=3),


@@ -19,7 +19,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 1,
"id": "148f4bc0-c28e-4670-9a5e-4c7928ab8992", "id": "148f4bc0-c28e-4670-9a5e-4c7928ab8992",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
@@ -192,7 +192,7 @@
"source": [ "source": [
"@torch.inference_mode()\n", "@torch.inference_mode()\n",
"def process_text(text: str):\n", "def process_text(text: str):\n",
" x = torch.tensor(intersperse(text_to_sequence(text, ['english_cleaners2'])[0], 0),dtype=torch.long, device=device)[None]\n", " x = torch.tensor(intersperse(text_to_sequence(text, ['english_cleaners2']), 0),dtype=torch.long, device=device)[None]\n",
" x_lengths = torch.tensor([x.shape[-1]],dtype=torch.long, device=device)\n", " x_lengths = torch.tensor([x.shape[-1]],dtype=torch.long, device=device)\n",
" x_phones = sequence_to_text(x.squeeze(0).tolist())\n", " x_phones = sequence_to_text(x.squeeze(0).tolist())\n",
" return {\n", " return {\n",