[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
[pre-commit.ci] pre-commit autoupdate
2026-02-04 17:59:19 +08:00 · 2026-01-19 22:11:39 +00:00 · 2026-01-19 22:10:05 +00:00 · 2025-09-17 08:49:57 -07:00 · 2024-12-02 09:02:41 -06:00 · 2024-12-02 08:26:00 -06:00
38 changed files with 776 additions and 86 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -161,3 +161,4 @@ generator_v1
 g_02500000
 gradio_cached_examples/
 synth_output/
 /data
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,9 +1,9 @@
 default_language_version:
-  python: python3.10
+  python: python3.11
 repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.5.0
+    rev: v6.0.0
    hooks:
      # list of supported hooks: https://pre-commit.com/hooks.html
      - id: trailing-whitespace
@@ -17,29 +17,29 @@ repos:
      - id: check-added-large-files
  # python code formatting
-  - repo: https://github.com/psf/black
+  - repo: https://github.com/psf/black-pre-commit-mirror
-    rev: 23.12.1
+    rev: 26.1.0
    hooks:
      - id: black
        args: [--line-length, "120"]
  # python import sorting
  - repo: https://github.com/PyCQA/isort
-    rev: 5.13.2
+    rev: 7.0.0
    hooks:
      - id: isort
        args: ["--profile", "black", "--filter-files"]
  # python upgrading syntax to newer version
  - repo: https://github.com/asottile/pyupgrade
-    rev: v3.15.0
+    rev: v3.21.2
    hooks:
      - id: pyupgrade
        args: [--py38-plus]
  # python check (PEP8), programming errors and code complexity
  - repo: https://github.com/PyCQA/flake8
-    rev: 7.0.0
+    rev: 7.3.0
    hooks:
      - id: flake8
        args:
@@ -54,6 +54,6 @@ repos:
  # pylint
  - repo: https://github.com/pycqa/pylint
-    rev: v3.0.3
+    rev: v4.0.4
    hooks:
    -   id: pylint
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@
 [![hydra](https://img.shields.io/badge/Config-Hydra_1.3-89b8cd)](https://hydra.cc/)
 [![black](https://img.shields.io/badge/Code%20Style-Black-black.svg?labelColor=gray)](https://black.readthedocs.io/en/stable/)
 [![isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
-
+[![PyPI Downloads](https://static.pepy.tech/personalized-badge/matcha-tts?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/matcha-tts)
 <p style="text-align: center;">
  <img src="https://shivammehta25.github.io/Matcha-TTS/images/logo.png" height="128"/>
 </p>
@@ -252,6 +252,43 @@ python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vo
 This will write `.wav` audio files to the output directory.
 ## Extract phoneme alignments from Matcha-TTS
 If the dataset is structured as
 ```bash
 data/
 └── LJSpeech-1.1
    ├── metadata.csv
    ├── README
    ├── test.txt
    ├── train.txt
    ├── val.txt
    └── wavs
 ```
 Then you can extract the phoneme-level alignments from a trained Matcha-TTS model using:
 ```bash
 python matcha/utils/get_durations_from_trained_model.py -i dataset_yaml -c <checkpoint>
 ```
 Example:
 ```bash
 python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt
 ```
 or simply:
 ```bash
 matcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt
 ```
 ---
 ## Train using extracted alignments
 In the dataset config, turn on `load_durations`.
 Example: `ljspeech.yaml`
 ```
 load_durations: True
 ```
 or see the example in `configs/experiment/ljspeech_from_durations.yaml`.
 ## Citation information
 If you use our code or otherwise find this work useful, please cite our paper:
--- a/configs/data/hi-fi_en-US_female.yaml
+++ b/configs/data/hi-fi_en-US_female.yaml
@@ -5,8 +5,8 @@ defaults:
 # Dataset URL: https://ast-astrec.nict.go.jp/en/release/hi-fi-captain/
 _target_: matcha.data.text_mel_datamodule.TextMelDataModule
 name: hi-fi_en-US_female
-train_filelist_path: data/filelists/hi-fi-captain-en-us-female_train.txt
+train_filelist_path: data/hi-fi_en-US_female/train.txt
-valid_filelist_path: data/filelists/hi-fi-captain-en-us-female_val.txt
+valid_filelist_path: data/hi-fi_en-US_female/val.txt
 batch_size: 32
 cleaners: [english_cleaners_piper]
 data_statistics:  # Computed for this dataset
--- a/configs/data/ljspeech.yaml
+++ b/configs/data/ljspeech.yaml
@@ -1,7 +1,7 @@
 _target_: matcha.data.text_mel_datamodule.TextMelDataModule
 name: ljspeech
-train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
+train_filelist_path: data/LJSpeech-1.1/train.txt
-valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
+valid_filelist_path: data/LJSpeech-1.1/val.txt
 batch_size: 32
 num_workers: 20
 pin_memory: True
@@ -19,3 +19,4 @@ data_statistics:  # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101
 seed: ${seed}
 load_durations: false
--- a/configs/experiment/ljspeech_from_durations.yaml
+++ b/configs/experiment/ljspeech_from_durations.yaml
@@ -0,0 +1,19 @@
 # @package _global_
 # to execute this experiment run:
 # python train.py experiment=ljspeech_from_durations
 defaults:
  - override /data: ljspeech.yaml
 # all parameters below will be merged with parameters from default configurations set above
 # this allows you to overwrite only specified parameters
 tags: ["ljspeech"]
 run_name: ljspeech
 data:
  load_durations: True
  batch_size: 64
--- a/configs/model/matcha.yaml
+++ b/configs/model/matcha.yaml
@@ -13,3 +13,4 @@ n_feats: 80
 data_statistics: ${data.data_statistics}
 out_size: null # Must be divisible by 4
 prior_loss: true
 use_precomputed_durations: ${data.load_durations}
--- a/1
+++ b/1
@@ -1 +0,0 @@
 /home/smehta/Projects/Speech-Backbones/Grad-TTS/data
--- a/matcha/VERSION
+++ b/matcha/VERSION
@@ -1 +1 @@
-0.0.5.1
+0.0.7.2
--- a/matcha/cli.py
+++ b/matcha/cli.py
@@ -48,7 +48,7 @@ def plot_spectrogram_to_numpy(spectrogram, filename):
 def process_text(i: int, text: str, device: torch.device):
    print(f"[{i}] - Input text: {text}")
    x = torch.tensor(
-        intersperse(text_to_sequence(text, ["english_cleaners2"]), 0),
+        intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
        dtype=torch.long,
        device=device,
    )[None]
@@ -114,10 +114,10 @@ def load_matcha(model_name, checkpoint_path, device):
    return model
-def to_waveform(mel, vocoder, denoiser=None):
+def to_waveform(mel, vocoder, denoiser=None, denoiser_strength=0.00025):
    audio = vocoder(mel).clamp(-1, 1)
    if denoiser is not None:
-        audio = denoiser(audio.squeeze(), strength=0.00025).cpu().squeeze()
+        audio = denoiser(audio.squeeze(), strength=denoiser_strength).cpu().squeeze()
    return audio.cpu().squeeze()
@@ -326,16 +326,17 @@ def batched_synthesis(args, device, model, vocoder, denoiser, texts, spk):
    for i, batch in enumerate(dataloader):
        i = i + 1
        start_t = dt.datetime.now()
        b = batch["x"].shape[0]
        output = model.synthesise(
            batch["x"].to(device),
            batch["x_lengths"].to(device),
            n_timesteps=args.steps,
            temperature=args.temperature,
-            spks=spk,
+            spks=spk.expand(b) if spk is not None else spk,
            length_scale=args.speaking_rate,
        )
-        output["waveform"] = to_waveform(output["mel"], vocoder, denoiser)
+        output["waveform"] = to_waveform(output["mel"], vocoder, denoiser, args.denoiser_strength)
        t = (dt.datetime.now() - start_t).total_seconds()
        rtf_w = t * 22050 / (output["waveform"].shape[-1])
        print(f"[🍵-Batch: {i}] Matcha-TTS RTF: {output['rtf']:.4f}")
@@ -376,7 +377,7 @@ def unbatched_synthesis(args, device, model, vocoder, denoiser, texts, spk):
            spks=spk,
            length_scale=args.speaking_rate,
        )
-        output["waveform"] = to_waveform(output["mel"], vocoder, denoiser)
+        output["waveform"] = to_waveform(output["mel"], vocoder, denoiser, args.denoiser_strength)
        # RTF with HiFiGAN
        t = (dt.datetime.now() - start_t).total_seconds()
        rtf_w = t * 22050 / (output["waveform"].shape[-1])
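The `rtf_w` expression in the hunks above is a real-time factor that includes the vocoder: elapsed synthesis time divided by the duration of the generated audio. A minimal sketch of that arithmetic, assuming the 22050 Hz sample rate hard-coded in the diff (the function name `real_time_factor` is hypothetical and not part of the repository):

```python
def real_time_factor(elapsed_seconds: float, n_samples: int, sample_rate: int = 22050) -> float:
    """Return elapsed time divided by audio duration.

    Values below 1.0 mean synthesis ran faster than real time.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    audio_seconds = n_samples / sample_rate
    # Algebraically the same as elapsed_seconds * sample_rate / n_samples.
    return elapsed_seconds / audio_seconds
```

For instance, 0.5 s spent producing 44100 samples (2 s of audio at 22050 Hz) gives a factor of 0.25.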
--- a/matcha/data/text_mel_datamodule.py
+++ b/matcha/data/text_mel_datamodule.py
@@ -1,6 +1,8 @@
 import random
 from pathlib import Path
 from typing import Any, Dict, Optional
 import numpy as np
 import torch
 import torchaudio as ta
 from lightning import LightningDataModule
@@ -39,6 +41,7 @@ class TextMelDataModule(LightningDataModule):
        f_max,
        data_statistics,
        seed,
        load_durations,
    ):
        super().__init__()
@@ -68,6 +71,7 @@ class TextMelDataModule(LightningDataModule):
            self.hparams.f_max,
            self.hparams.data_statistics,
            self.hparams.seed,
            self.hparams.load_durations,
        )
        self.validset = TextMelDataset(  # pylint: disable=attribute-defined-outside-init
            self.hparams.valid_filelist_path,
@@ -83,6 +87,7 @@ class TextMelDataModule(LightningDataModule):
            self.hparams.f_max,
            self.hparams.data_statistics,
            self.hparams.seed,
            self.hparams.load_durations,
        )
    def train_dataloader(self):
@@ -109,7 +114,7 @@ class TextMelDataModule(LightningDataModule):
        """Clean up after fit or test."""
        pass  # pylint: disable=unnecessary-pass
-    def state_dict(self):  # pylint: disable=no-self-use
+    def state_dict(self):
        """Extra things to save to checkpoint."""
        return {}
@@ -134,6 +139,7 @@ class TextMelDataset(torch.utils.data.Dataset):
        f_max=8000,
        data_parameters=None,
        seed=None,
        load_durations=False,
    ):
        self.filepaths_and_text = parse_filelist(filelist_path)
        self.n_spks = n_spks
@@ -146,6 +152,8 @@ class TextMelDataset(torch.utils.data.Dataset):
        self.win_length = win_length
        self.f_min = f_min
        self.f_max = f_max
        self.load_durations = load_durations
        if data_parameters is not None:
            self.data_parameters = data_parameters
        else:
@@ -164,10 +172,29 @@ class TextMelDataset(torch.utils.data.Dataset):
            filepath, text = filepath_and_text[0], filepath_and_text[1]
            spk = None
-        text = self.get_text(text, add_blank=self.add_blank)
+        text, cleaned_text = self.get_text(text, add_blank=self.add_blank)
        mel = self.get_mel(filepath)
-        return {"x": text, "y": mel, "spk": spk}
+        durations = self.get_durations(filepath, text) if self.load_durations else None
        return {"x": text, "y": mel, "spk": spk, "filepath": filepath, "x_text": cleaned_text, "durations": durations}
    def get_durations(self, filepath, text):
        filepath = Path(filepath)
        data_dir, name = filepath.parent.parent, filepath.stem
        try:
            dur_loc = data_dir / "durations" / f"{name}.npy"
            durs = torch.from_numpy(np.load(dur_loc).astype(int))
        except FileNotFoundError as e:
             raise FileNotFoundError(
                 f"Tried loading the durations but durations didn't exist at {dur_loc}, make sure you've generated the durations first using: python matcha/utils/get_durations_from_trained_model.py \n"
            ) from e
        assert len(durs) == len(text), f"Length of durations {len(durs)} and text {len(text)} do not match"
        return durs
    def get_mel(self, filepath):
        audio, sr = ta.load(filepath)
@@ -187,11 +214,11 @@ class TextMelDataset(torch.utils.data.Dataset):
        return mel
    def get_text(self, text, add_blank=True):
-        text_norm = text_to_sequence(text, self.cleaners)
+        text_norm, cleaned_text = text_to_sequence(text, self.cleaners)
        if self.add_blank:
            text_norm = intersperse(text_norm, 0)
        text_norm = torch.IntTensor(text_norm)
-        return text_norm
+        return text_norm, cleaned_text
    def __getitem__(self, index):
        datapoint = self.get_datapoint(self.filepaths_and_text[index])
@@ -207,15 +234,18 @@ class TextMelBatchCollate:
    def __call__(self, batch):
        B = len(batch)
-        y_max_length = max([item["y"].shape[-1] for item in batch])
+        y_max_length = max([item["y"].shape[-1] for item in batch])  # pylint: disable=consider-using-generator
        y_max_length = fix_len_compatibility(y_max_length)
-        x_max_length = max([item["x"].shape[-1] for item in batch])
+        x_max_length = max([item["x"].shape[-1] for item in batch])  # pylint: disable=consider-using-generator
        n_feats = batch[0]["y"].shape[-2]
        y = torch.zeros((B, n_feats, y_max_length), dtype=torch.float32)
        x = torch.zeros((B, x_max_length), dtype=torch.long)
        durations = torch.zeros((B, x_max_length), dtype=torch.long)
        y_lengths, x_lengths = [], []
        spks = []
        filepaths, x_texts = [], []
        for i, item in enumerate(batch):
            y_, x_ = item["y"], item["x"]
            y_lengths.append(y_.shape[-1])
@@ -223,9 +253,22 @@ class TextMelBatchCollate:
            y[i, :, : y_.shape[-1]] = y_
            x[i, : x_.shape[-1]] = x_
            spks.append(item["spk"])
            filepaths.append(item["filepath"])
            x_texts.append(item["x_text"])
            if item["durations"] is not None:
                durations[i, : item["durations"].shape[-1]] = item["durations"]
        y_lengths = torch.tensor(y_lengths, dtype=torch.long)
        x_lengths = torch.tensor(x_lengths, dtype=torch.long)
        spks = torch.tensor(spks, dtype=torch.long) if self.n_spks > 1 else None
-        return {"x": x, "x_lengths": x_lengths, "y": y, "y_lengths": y_lengths, "spks": spks}
+        return {
            "x": x,
            "x_lengths": x_lengths,
            "y": y,
            "y_lengths": y_lengths,
            "spks": spks,
            "filepaths": filepaths,
            "x_texts": x_texts,
            "durations": durations if not torch.eq(durations, 0).all() else None,
        }
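`get_durations` above locates the `.npy` file by going two directories up from the audio file and then into a `durations` folder. A small standalone sketch of that path rule, using only `pathlib` (the helper name `durations_path` is hypothetical; the layout follows the README tree in this diff):

```python
from pathlib import Path


def durations_path(wav_path: str) -> Path:
    """Map <data_dir>/wavs/<name>.wav to <data_dir>/durations/<name>.npy."""
    wav = Path(wav_path)
    # parent is the wavs/ folder, parent.parent is the dataset root.
    data_dir = wav.parent.parent
    return data_dir / "durations" / f"{wav.stem}.npy"
```

With the LJSpeech layout, `data/LJSpeech-1.1/wavs/LJ001-0001.wav` maps to `data/LJSpeech-1.1/durations/LJ001-0001.npy`. Filelists whose audio paths are not nested one level below the dataset root would resolve to a different `durations` directory.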
--- a/matcha/hifigan/denoiser.py
+++ b/matcha/hifigan/denoiser.py
@@ -1,9 +1,14 @@
 # Code modified from Rafael Valle's implementation https://github.com/NVIDIA/waveglow/blob/5bc2a53e20b3b533362f974cfa1ea0267ae1c2b1/denoiser.py
 """Waveglow style denoiser can be used to remove the artifacts from the HiFiGAN generated audio."""
 import torch
 class ModeException(Exception):
    pass
 class Denoiser(torch.nn.Module):
    """Removes model bias from audio produced with waveglow"""
@@ -20,7 +25,7 @@ class Denoiser(torch.nn.Module):
        elif mode == "normal":
            mel_input = torch.randn((1, 80, 88), dtype=dtype, device=device)
        else:
-            raise Exception(f"Mode {mode} if not supported")
+            raise ModeException(f"Mode {mode} is not supported")
        def stft_fn(audio, n_fft, hop_length, win_length, window):
            spec = torch.stft(
--- a/matcha/hifigan/env.py
+++ b/matcha/hifigan/env.py
@@ -1,4 +1,4 @@
-""" from https://github.com/jik876/hifi-gan """
+"""from https://github.com/jik876/hifi-gan"""
 import os
 import shutil
--- a/matcha/hifigan/meldataset.py
+++ b/matcha/hifigan/meldataset.py
@@ -1,4 +1,4 @@
-""" from https://github.com/jik876/hifi-gan """
+"""from https://github.com/jik876/hifi-gan"""
 import math
 import os
@@ -55,7 +55,7 @@ def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin,
    if torch.max(y) > 1.0:
        print("max value is ", torch.max(y))
-    global mel_basis, hann_window  # pylint: disable=global-statement
+    global mel_basis, hann_window  # pylint: disable=global-statement,global-variable-not-assigned
    if fmax not in mel_basis:
        mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
        mel_basis[str(fmax) + "_" + str(y.device)] = torch.from_numpy(mel).float().to(y.device)
--- a/matcha/hifigan/models.py
+++ b/matcha/hifigan/models.py
@@ -1,7 +1,7 @@
-""" from https://github.com/jik876/hifi-gan """
+"""from https://github.com/jik876/hifi-gan"""
 import torch
-import torch.nn as nn
+import torch.nn as nn  # pylint: disable=consider-using-from-import
 import torch.nn.functional as F
 from torch.nn import AvgPool1d, Conv1d, Conv2d, ConvTranspose1d
 from torch.nn.utils import remove_weight_norm, spectral_norm, weight_norm
--- a/matcha/hifigan/xutils.py
+++ b/matcha/hifigan/xutils.py
@@ -1,4 +1,4 @@
-""" from https://github.com/jik876/hifi-gan """
+"""from https://github.com/jik876/hifi-gan"""
 import glob
 import os
--- a/matcha/models/baselightningmodule.py
+++ b/matcha/models/baselightningmodule.py
@@ -2,6 +2,7 @@
 This is a base lightning module that can be used to train a model.
 The benefit of this abstraction is that all the logic outside of model definition can be reused for different models.
 """
 import inspect
 from abc import ABC
 from typing import Any, Dict
@@ -58,13 +59,14 @@ class BaseLightningClass(LightningModule, ABC):
        y, y_lengths = batch["y"], batch["y_lengths"]
        spks = batch["spks"]
-        dur_loss, prior_loss, diff_loss = self(
+        dur_loss, prior_loss, diff_loss, *_ = self(
            x=x,
            x_lengths=x_lengths,
            y=y,
            y_lengths=y_lengths,
            spks=spks,
            out_size=self.out_size,
            durations=batch["durations"],
        )
        return {
            "dur_loss": dur_loss,
--- a/matcha/models/components/decoder.py
+++ b/matcha/models/components/decoder.py
@@ -2,7 +2,7 @@ import math
 from typing import Optional
 import torch
-import torch.nn as nn
+import torch.nn as nn  # pylint: disable=consider-using-from-import
 import torch.nn.functional as F
 from conformer import ConformerBlock
 from diffusers.models.activations import get_activation
--- a/matcha/models/components/text_encoder.py
+++ b/matcha/models/components/text_encoder.py
@@ -1,12 +1,12 @@
-""" from https://github.com/jaywalnut310/glow-tts """
+"""from https://github.com/jaywalnut310/glow-tts"""
 import math
 import torch
-import torch.nn as nn
+import torch.nn as nn  # pylint: disable=consider-using-from-import
 from einops import rearrange
-import matcha.utils as utils
+import matcha.utils as utils  # pylint: disable=consider-using-from-import
 from matcha.utils.model import sequence_mask
 log = utils.get_pylogger(__name__)
--- a/matcha/models/components/transformer.py
+++ b/matcha/models/components/transformer.py
@@ -1,7 +1,7 @@
 from typing import Any, Dict, Optional
 import torch
-import torch.nn as nn
+import torch.nn as nn  # pylint: disable=consider-using-from-import
 from diffusers.models.attention import (
    GEGLU,
    GELU,
--- a/matcha/models/matcha_tts.py
+++ b/matcha/models/matcha_tts.py
@@ -4,7 +4,7 @@ import random
 import torch
-import matcha.utils.monotonic_align as monotonic_align
+import matcha.utils.monotonic_align as monotonic_align  # pylint: disable=consider-using-from-import
 from matcha import utils
 from matcha.models.baselightningmodule import BaseLightningClass
 from matcha.models.components.flow_matching import CFM
@@ -35,6 +35,7 @@ class MatchaTTS(BaseLightningClass):  # 🍵
        optimizer=None,
        scheduler=None,
        prior_loss=True,
        use_precomputed_durations=False,
    ):
        super().__init__()
@@ -46,6 +47,7 @@ class MatchaTTS(BaseLightningClass):  # 🍵
        self.n_feats = n_feats
        self.out_size = out_size
        self.prior_loss = prior_loss
        self.use_precomputed_durations = use_precomputed_durations
        if n_spks > 1:
            self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)
@@ -104,6 +106,7 @@ class MatchaTTS(BaseLightningClass):  # 🍵
                # Lengths of mel spectrograms
                "rtf": float,
                # Real-time factor
            }
        """
        # For RTF computation
        t = dt.datetime.now()
@@ -147,10 +150,10 @@ class MatchaTTS(BaseLightningClass):  # 🍵
            "rtf": rtf,
        }
-    def forward(self, x, x_lengths, y, y_lengths, spks=None, out_size=None, cond=None):
+    def forward(self, x, x_lengths, y, y_lengths, spks=None, out_size=None, cond=None, durations=None):
        """
        Computes 3 losses:
-            1. duration loss: loss between predicted token durations and those extracted by Monotinic Alignment Search (MAS).
+            1. duration loss: loss between predicted token durations and those extracted by Monotonic Alignment Search (MAS).
            2. prior loss: loss between mel-spectrogram and encoder outputs.
            3. flow matching loss: loss between mel-spectrogram and decoder outputs.
@@ -179,17 +182,20 @@ class MatchaTTS(BaseLightningClass):  # 🍵
        y_mask = sequence_mask(y_lengths, y_max_length).unsqueeze(1).to(x_mask)
        attn_mask = x_mask.unsqueeze(-1) * y_mask.unsqueeze(2)
-        # Use MAS to find most likely alignment `attn` between text and mel-spectrogram
-        with torch.no_grad():
-            const = -0.5 * math.log(2 * math.pi) * self.n_feats
-            factor = -0.5 * torch.ones(mu_x.shape, dtype=mu_x.dtype, device=mu_x.device)
-            y_square = torch.matmul(factor.transpose(1, 2), y**2)
-            y_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), y)
-            mu_square = torch.sum(factor * (mu_x**2), 1).unsqueeze(-1)
-            log_prior = y_square - y_mu_double + mu_square + const
-            attn = monotonic_align.maximum_path(log_prior, attn_mask.squeeze(1))
-            attn = attn.detach()
+        if self.use_precomputed_durations:
+            attn = generate_path(durations.squeeze(1), attn_mask.squeeze(1))
+        else:
+            # Use MAS to find most likely alignment `attn` between text and mel-spectrogram
+            with torch.no_grad():
+                const = -0.5 * math.log(2 * math.pi) * self.n_feats
+                factor = -0.5 * torch.ones(mu_x.shape, dtype=mu_x.dtype, device=mu_x.device)
+                y_square = torch.matmul(factor.transpose(1, 2), y**2)
+                y_mu_double = torch.matmul(2.0 * (factor * mu_x).transpose(1, 2), y)
+                mu_square = torch.sum(factor * (mu_x**2), 1).unsqueeze(-1)
+                log_prior = y_square - y_mu_double + mu_square + const
+                attn = monotonic_align.maximum_path(log_prior, attn_mask.squeeze(1))
+                attn = attn.detach()  # b, t_text, T_mel
        # Compute loss between predicted log-scaled durations and those obtained from MAS
         # referred to as prior loss in the paper
@@ -236,4 +242,4 @@ class MatchaTTS(BaseLightningClass):  # 🍵
        else:
            prior_loss = 0
-        return dur_loss, prior_loss, diff_loss
+        return dur_loss, prior_loss, diff_loss, attn
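When `use_precomputed_durations` is set, the alignment comes from stored per-token durations instead of MAS. The sketch below illustrates the idea in plain Python for a single utterance. It assumes `generate_path` expands integer durations into a hard monotonic alignment, as in Glow-TTS style code. It is not the repository's implementation and ignores batching and masking; the name `durations_to_alignment` is hypothetical.

```python
from typing import List


def durations_to_alignment(durations: List[int]) -> List[List[int]]:
    """Build a (n_tokens x n_frames) 0/1 matrix where token i owns durations[i] consecutive frames."""
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    total_frames = sum(durations)
    alignment = []
    start = 0
    for d in durations:
        row = [0] * total_frames
        # Token occupies frames [start, start + d).
        for frame in range(start, start + d):
            row[frame] = 1
        alignment.append(row)
        start += d
    return alignment
```

Each frame column contains exactly one `1`, so the durations must sum to the number of mel frames for the alignment to cover the whole spectrogram.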
--- a/matcha/text/__init__.py
+++ b/matcha/text/__init__.py
@@ -1,4 +1,5 @@
-""" from https://github.com/keithito/tacotron """
+"""from https://github.com/keithito/tacotron"""
 from matcha.text import cleaners
 from matcha.text.symbols import symbols
@@ -7,6 +8,10 @@ _symbol_to_id = {s: i for i, s in enumerate(symbols)}
 _id_to_symbol = {i: s for i, s in enumerate(symbols)}  # pylint: disable=unnecessary-comprehension
 class UnknownCleanerException(Exception):
    pass
 def text_to_sequence(text, cleaner_names):
    """Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
    Args:
@@ -21,7 +26,7 @@ def text_to_sequence(text, cleaner_names):
    for symbol in clean_text:
        symbol_id = _symbol_to_id[symbol]
        sequence += [symbol_id]
-    return sequence
+    return sequence, clean_text
 def cleaned_text_to_sequence(cleaned_text):
@@ -48,6 +53,6 @@ def _clean_text(text, cleaner_names):
    for name in cleaner_names:
        cleaner = getattr(cleaners, name)
        if not cleaner:
-            raise Exception("Unknown cleaner: %s" % name)
+            raise UnknownCleanerException(f"Unknown cleaner: {name}")
        text = cleaner(text)
    return text
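Since `text_to_sequence` now returns a `(sequence, clean_text)` tuple, every caller has to unpack it or index `[0]`, as the `cli.py` hunk does. A toy sketch of the calling pattern follows, with a hypothetical four-symbol table and a stand-in cleaner. The `intersperse` helper here is an assumed blank-interleaving function written for illustration, not copied from the repository.

```python
from typing import List, Tuple

# Hypothetical symbol table; index 0 doubles as the blank/pad id.
TOY_SYMBOLS = ["_", "a", "b", "c"]
TOY_SYMBOL_TO_ID = {s: i for i, s in enumerate(TOY_SYMBOLS)}


def toy_text_to_sequence(text: str) -> Tuple[List[int], str]:
    """Return symbol ids together with the cleaned text, mirroring the new return shape."""
    clean_text = text.lower().strip()  # stand-in for the real cleaners
    sequence = [TOY_SYMBOL_TO_ID[ch] for ch in clean_text]
    return sequence, clean_text


def intersperse(lst: List[int], item: int) -> List[int]:
    """Place `item` before, between, and after the elements of `lst`."""
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result


sequence, clean_text = toy_text_to_sequence("Cab")
with_blanks = intersperse(sequence, 0)
```

Code that still treats the return value as a plain list would pass the tuple on to `intersperse` or `torch.tensor`, which is why the call sites in this diff were updated together with the function.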
--- a/matcha/text/cleaners.py
+++ b/matcha/text/cleaners.py
@@ -1,4 +1,4 @@
-""" from https://github.com/keithito/tacotron
+"""from https://github.com/keithito/tacotron
 Cleaners are transformations that run over the input text at both training and eval time.
@@ -15,7 +15,6 @@ import logging
 import re
 import phonemizer
-import piper_phonemize
 from unidecode import unidecode
 # To avoid excessive logging we set the log level of the phonemizer package to Critical
@@ -37,9 +36,12 @@ global_phonemizer = phonemizer.backend.EspeakBackend(
 # Regular expression matching whitespace:
 _whitespace_re = re.compile(r"\s+")
 # Remove brackets
 _brackets_re = re.compile(r"[\[\]\(\)\{\}]")
 # List of (regular expression, replacement) pairs for abbreviations:
 _abbreviations = [
-    (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+    (re.compile(f"\\b{x[0]}\\.", re.IGNORECASE), x[1])
    for x in [
        ("mrs", "misess"),
        ("mr", "mister"),
@@ -73,6 +75,10 @@ def lowercase(text):
    return text.lower()
 def remove_brackets(text):
    return re.sub(_brackets_re, "", text)
 def collapse_whitespace(text):
    return re.sub(_whitespace_re, " ", text)
@@ -102,15 +108,37 @@ def english_cleaners2(text):
    text = lowercase(text)
    text = expand_abbreviations(text)
    phonemes = global_phonemizer.phonemize([text], strip=True, njobs=1)[0]
    # Added in some cases espeak is not removing brackets
    phonemes = remove_brackets(phonemes)
    phonemes = collapse_whitespace(phonemes)
    return phonemes
-def english_cleaners_piper(text):
-    """Pipeline for English text, including abbreviation expansion. + punctuation + stress"""
-    text = convert_to_ascii(text)
-    text = lowercase(text)
-    text = expand_abbreviations(text)
-    phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="en-US")[0])
-    phonemes = collapse_whitespace(phonemes)
+def ipa_simplifier(text):
+    replacements = [
+        ("ɐ", "ə"),
+        ("ˈə", "ə"),
+        ("ʤ", "dʒ"),
+        ("ʧ", "tʃ"),
+        ("ᵻ", "ɪ"),
    ]
    for replacement in replacements:
        text = text.replace(replacement[0], replacement[1])
    phonemes = collapse_whitespace(text)
    return phonemes
 # I am removing this due to incompatibility with several versions of Python.
 # However, if you want to use it, you can uncomment it
 # and install piper-phonemize with the following command:
 # pip install piper-phonemize
 # import piper_phonemize
 # def english_cleaners_piper(text):
 #     """Pipeline for English text, including abbreviation expansion. + punctuation + stress"""
 #     text = convert_to_ascii(text)
 #     text = lowercase(text)
 #     text = expand_abbreviations(text)
 #     phonemes = "".join(piper_phonemize.phonemize_espeak(text=text, voice="en-US")[0])
 #     phonemes = collapse_whitespace(phonemes)
 #     return phonemes
--- a/matcha/text/numbers.py
+++ b/matcha/text/numbers.py
@@ -1,4 +1,4 @@
-""" from https://github.com/keithito/tacotron """
+"""from https://github.com/keithito/tacotron"""
 import re
--- a/matcha/text/symbols.py
+++ b/matcha/text/symbols.py
@@ -1,7 +1,8 @@
-""" from https://github.com/keithito/tacotron
+"""from https://github.com/keithito/tacotron
 Defines the set of symbols used in text input to the model.
 """
 _pad = "_"
 _punctuation = ';:,.!?¡¿—…"«»“” '
 _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
--- a/matcha/utils/audio.py
+++ b/matcha/utils/audio.py
@@ -48,7 +48,7 @@ def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin,
    if torch.max(y) > 1.0:
        print("max value is ", torch.max(y))
-    global mel_basis, hann_window  # pylint: disable=global-statement
+    global mel_basis, hann_window  # pylint: disable=global-statement,global-variable-not-assigned
    if f"{str(fmax)}_{str(y.device)}" not in mel_basis:
        mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
        mel_basis[str(fmax) + "_" + str(y.device)] = torch.from_numpy(mel).float().to(y.device)
--- a/matcha/utils/data/__init__.py
+++ b/matcha/utils/data/__init__.py
--- a/matcha/utils/data/hificaptain.py
+++ b/matcha/utils/data/hificaptain.py
@@ -0,0 +1,148 @@
 #!/usr/bin/env python
 import argparse
 import os
 import sys
 import tempfile
 from pathlib import Path
 import torchaudio
 from torch.hub import download_url_to_file
 from tqdm import tqdm
 from matcha.utils.data.utils import _extract_zip
 URLS = {
    "en-US": {
        "female": "https://ast-astrec.nict.go.jp/release/hi-fi-captain/hfc_en-US_F.zip",
        "male": "https://ast-astrec.nict.go.jp/release/hi-fi-captain/hfc_en-US_M.zip",
    },
    "ja-JP": {
        "female": "https://ast-astrec.nict.go.jp/release/hi-fi-captain/hfc_ja-JP_F.zip",
        "male": "https://ast-astrec.nict.go.jp/release/hi-fi-captain/hfc_ja-JP_M.zip",
    },
 }
 INFO_PAGE = "https://ast-astrec.nict.go.jp/en/release/hi-fi-captain/"
 # On their website they say "We NICT open-sourced Hi-Fi-CAPTAIN",
 # but they use this very-much-not-open-source licence.
 # Dunno if this is open washing or stupidity.
 LICENCE = "CC BY-NC-SA 4.0"
 # I'd normally put the citation here. It's on their website.
 # Boo to non-open-source stuff.
 def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--save-dir", type=str, default=None, help="Place to store the downloaded zip files")
    parser.add_argument(
        "-r",
        "--skip-resampling",
        action="store_true",
        default=False,
         help="Skip resampling the data (from 48 kHz to 22.05 kHz)",
    )
    parser.add_argument(
        "-l", "--language", type=str, choices=["en-US", "ja-JP"], default="en-US", help="The language to download"
    )
    parser.add_argument(
        "-g",
        "--gender",
        type=str,
        choices=["male", "female"],
        default="female",
        help="The gender of the speaker to download",
    )
    parser.add_argument(
        "-o",
        "--output_dir",
        type=str,
        default="data",
        help="Place to store the converted data. Top-level only, the subdirectory will be created",
    )
    return parser.parse_args()
 def process_text(infile, outpath: Path):
    outmode = "w"
    if infile.endswith("dev.txt"):
         outfile = outpath / "val.txt"
    elif infile.endswith("eval.txt"):
        outfile = outpath / "test.txt"
    else:
        outfile = outpath / "train.txt"
        if outfile.exists():
            outmode = "a"
    with (
        open(infile, encoding="utf-8") as inf,
        open(outfile, outmode, encoding="utf-8") as of,
    ):
        for line in inf.readlines():
            line = line.strip()
            fileid, rest = line.split(" ", maxsplit=1)
            outfile = str(outpath / f"{fileid}.wav")
            of.write(f"{outfile}|{rest}\n")
 def process_files(zipfile, outpath, resample=True):
    with tempfile.TemporaryDirectory() as tmpdirname:
        for filename in tqdm(_extract_zip(zipfile, tmpdirname)):
            if not filename.startswith(tmpdirname):
                filename = os.path.join(tmpdirname, filename)
            if filename.endswith(".txt"):
                process_text(filename, outpath)
            elif filename.endswith(".wav"):
                filepart = filename.rsplit("/", maxsplit=1)[-1]
                outfile = str(outpath / filepart)
                arr, sr = torchaudio.load(filename)
                 if resample:
                     arr = torchaudio.functional.resample(arr, orig_freq=sr, new_freq=22050)
                     sr = 22050
                 # Save with the actual sample rate so skipping resampling does not mislabel the audio
                 torchaudio.save(outfile, arr, sr)
            else:
                continue
 def main():
    args = get_args()
    save_dir = None
    if args.save_dir:
        save_dir = Path(args.save_dir)
        if not save_dir.is_dir():
            save_dir.mkdir()
    if not args.output_dir:
        print("output directory not specified, exiting")
        sys.exit(1)
    URL = URLS[args.language][args.gender]
    dirname = f"hi-fi_{args.language}_{args.gender}"
    outbasepath = Path(args.output_dir)
    if not outbasepath.is_dir():
        outbasepath.mkdir()
    outpath = outbasepath / dirname
    if not outpath.is_dir():
        outpath.mkdir()
    resample = True
    if args.skip_resampling:
        resample = False
    if save_dir:
        zipname = URL.rsplit("/", maxsplit=1)[-1]
        zipfile = save_dir / zipname
        if not zipfile.exists():
            download_url_to_file(URL, zipfile, progress=True)
        process_files(zipfile, outpath, resample)
    else:
        with tempfile.NamedTemporaryFile(suffix=".zip", delete=True) as zf:
            download_url_to_file(URL, zf.name, progress=True)
            process_files(zf.name, outpath, resample)
 if __name__ == "__main__":
    main()
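Review note on `process_text`: each transcript line of the form `<fileid> <text>` is rewritten as `<wav path>|<text>`. The conversion can be sketched in isolation as below. The helper name `to_filelist_line`, the file ID, and the directory name are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def to_filelist_line(line: str, outpath: Path) -> str:
    """Turn '<fileid> <transcript>' into '<outpath>/<fileid>.wav|<transcript>'."""
    # Split only on the first space so the transcript keeps its own spaces.
    fileid, text = line.strip().split(" ", maxsplit=1)
    wav = outpath / f"{fileid}.wav"
    return f"{wav}|{text}"


# Hypothetical transcript line and output directory.
example = to_filelist_line("utt_0001 Hello there, world.\n", Path("data") / "hi-fi_demo")
print(example)
```

A line without any space (a file ID with no transcript) would raise `ValueError` at the unpacking step, which is also how the loop in `process_text` behaves.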
--- a/matcha/utils/data/ljspeech.py
+++ b/matcha/utils/data/ljspeech.py
@@ -0,0 +1,97 @@
 #!/usr/bin/env python
 import argparse
 import random
 import tempfile
 from pathlib import Path
 from torch.hub import download_url_to_file
 from matcha.utils.data.utils import _extract_tar
 URL = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"
 INFO_PAGE = "https://keithito.com/LJ-Speech-Dataset/"
 LICENCE = "Public domain (LibriVox copyright disclaimer)"
 CITATION = """
@misc{ljspeech17,
  author       = {Keith Ito and Linda Johnson},
  title        = {The LJ Speech Dataset},
  howpublished = {\\url{https://keithito.com/LJ-Speech-Dataset/}},
  year         = 2017
 }
 """
 def decision():
    return random.random() < 0.98
 def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--save-dir", type=str, default=None, help="Place to store the downloaded tar archive")
    parser.add_argument(
        "output_dir",
        type=str,
        nargs="?",
        default="data",
        help="Place to store the converted data (subdirectory LJSpeech-1.1 will be created)",
    )
    return parser.parse_args()
 def process_csv(ljpath: Path):
    if (ljpath / "metadata.csv").exists():
        basepath = ljpath
    elif (ljpath / "LJSpeech-1.1" / "metadata.csv").exists():
        basepath = ljpath / "LJSpeech-1.1"
    else:
        raise FileNotFoundError(f"metadata.csv not found under {ljpath}")
    csvpath = basepath / "metadata.csv"
    wavpath = basepath / "wavs"
    with (
        open(csvpath, encoding="utf-8") as csvf,
        open(basepath / "train.txt", "w", encoding="utf-8") as tf,
        open(basepath / "val.txt", "w", encoding="utf-8") as vf,
    ):
        for line in csvf.readlines():
            line = line.strip()
            parts = line.split("|")
            wavfile = str(wavpath / f"{parts[0]}.wav")
            if decision():
                tf.write(f"{wavfile}|{parts[1]}\n")
            else:
                vf.write(f"{wavfile}|{parts[1]}\n")
 def main():
    args = get_args()
    save_dir = None
    if args.save_dir:
        save_dir = Path(args.save_dir)
        if not save_dir.is_dir():
            save_dir.mkdir()
    outpath = Path(args.output_dir)
    if not outpath.is_dir():
        outpath.mkdir()
    if save_dir:
        tarname = URL.rsplit("/", maxsplit=1)[-1]
        tarfile = save_dir / tarname
        if not tarfile.exists():
            download_url_to_file(URL, str(tarfile), progress=True)
        _extract_tar(tarfile, outpath)
        process_csv(outpath)
    else:
        with tempfile.NamedTemporaryFile(suffix=".tar.bz2", delete=True) as zf:
            download_url_to_file(URL, zf.name, progress=True)
            _extract_tar(zf.name, outpath)
            process_csv(outpath)
 if __name__ == "__main__":
    main()
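Review note on `decision()`: the 98/2 train/validation split draws from the global `random` state, so two runs over the same `metadata.csv` will generally produce different filelists. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable. The function name `split_filelist`, the seed value, and the sample lines are assumptions for illustration, not part of this change.

```python
import random


def split_filelist(lines, train_fraction=0.98, seed=1234):
    """Split lines into (train, val) using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    train, val = [], []
    for line in lines:
        # Same per-line coin flip as decision(), but repeatable.
        (train if rng.random() < train_fraction else val).append(line)
    return train, val


# Hypothetical filelist entries in the "<wav path>|<text>" format.
lines = [f"wavs/LJ001-{i:04d}.wav|text {i}" for i in range(1000)]
train_a, val_a = split_filelist(lines)
train_b, val_b = split_filelist(lines)
print(train_a == train_b and val_a == val_b)
```

Because the split is still a per-line draw, the validation share is only approximately 2 percent; an exact share would need a shuffle followed by a slice.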
--- a/matcha/utils/data/utils.py
+++ b/matcha/utils/data/utils.py
@@ -0,0 +1,53 @@
 # taken from https://github.com/pytorch/audio/blob/main/src/torchaudio/datasets/utils.py
 # Copyright (c) 2017 Facebook Inc. (Soumith Chintala)
 # Licence: BSD 2-Clause
 # pylint: disable=C0123
 import logging
 import os
 import tarfile
 import zipfile
 from pathlib import Path
 from typing import Any, List, Optional, Union
 _LG = logging.getLogger(__name__)
 def _extract_tar(from_path: Union[str, Path], to_path: Optional[str] = None, overwrite: bool = False) -> List[str]:
    if isinstance(from_path, Path):
        from_path = str(from_path)
    if to_path is None:
        to_path = os.path.dirname(from_path)
    with tarfile.open(from_path, "r") as tar:
        files = []
        for file_ in tar:  # type: Any
            file_path = os.path.join(to_path, file_.name)
            if file_.isfile():
                files.append(file_path)
                if os.path.exists(file_path):
                    _LG.info("%s already extracted.", file_path)
                    if not overwrite:
                        continue
            tar.extract(file_, to_path)
        return files
 def _extract_zip(from_path: Union[str, Path], to_path: Optional[str] = None, overwrite: bool = False) -> List[str]:
    if isinstance(from_path, Path):
        from_path = str(from_path)
    if to_path is None:
        to_path = os.path.dirname(from_path)
    with zipfile.ZipFile(from_path, "r") as zfile:
        files = zfile.namelist()
        for file_ in files:
            file_path = os.path.join(to_path, file_)
            if os.path.exists(file_path):
                _LG.info("%s already extracted.", file_path)
                if not overwrite:
                    continue
            zfile.extract(file_, to_path)
    return files
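Review note on the path handling in `_extract_tar` and `_extract_zip`: an exact-type comparison against `Path` and an `isinstance` check behave differently, because `Path(...)` instantiates a platform-specific subclass (`PosixPath` or `WindowsPath`). The short sketch below shows the difference, along with `os.fspath` as one way to normalise either a `str` or a `Path` to a string. The file name is hypothetical.

```python
import os
from pathlib import Path

p = Path("archive.zip")  # hypothetical file name

# The concrete class is PosixPath or WindowsPath, never Path itself.
exact_match = type(p) is Path
subclass_match = isinstance(p, Path)
print(exact_match, subclass_match)

# os.fspath accepts both str and os.PathLike and returns a plain string.
print(os.fspath(p), os.fspath("archive.zip"))
```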
--- a/matcha/utils/generate_data_statistics.py
+++ b/matcha/utils/generate_data_statistics.py
@@ -4,6 +4,7 @@ when needed.
 Parameters from hparam.py will be used
 """
 import argparse
 import json
 import os
@@ -94,6 +95,7 @@ def main():
        cfg["batch_size"] = args.batch_size
        cfg["train_filelist_path"] = str(os.path.join(root_path, cfg["train_filelist_path"]))
        cfg["valid_filelist_path"] = str(os.path.join(root_path, cfg["valid_filelist_path"]))
        cfg["load_durations"] = False
    text_mel_datamodule = TextMelDataModule(**cfg)
    text_mel_datamodule.setup()
@@ -101,10 +103,8 @@ def main():
    log.info("Dataloader loaded! Now computing stats...")
    params = compute_data_statistics(data_loader, cfg["n_feats"])
    print(params)
-    json.dump(
-        params,
-        open(output_file, "w"),
-    )
+    with open(output_file, "w", encoding="utf-8") as dumpfile:
+        json.dump(params, dumpfile)
 if __name__ == "__main__":
--- a/matcha/utils/get_durations_from_trained_model.py
+++ b/matcha/utils/get_durations_from_trained_model.py
@@ -0,0 +1,196 @@
 r"""
 Extracts per-phoneme durations from a trained Matcha-TTS checkpoint and saves them, for every utterance in the
 dataset, as a .npy array and a .json file in the output folder.
 Parameters are read from the data config under configs/data
 """
 import argparse
 import json
 import os
 import sys
 from pathlib import Path
 import lightning
 import numpy as np
 import rootutils
 import torch
 from hydra import compose, initialize
 from omegaconf import open_dict
 from torch import nn
 from tqdm.auto import tqdm
 from matcha.cli import get_device
 from matcha.data.text_mel_datamodule import TextMelDataModule
 from matcha.models.matcha_tts import MatchaTTS
 from matcha.utils.logging_utils import pylogger
 from matcha.utils.utils import get_phoneme_durations
 log = pylogger.get_pylogger(__name__)
 def save_durations_to_folder(
    attn: torch.Tensor, x_length: int, y_length: int, filepath: str, output_folder: Path, text: str
 ):
    durations = attn.squeeze().sum(1)[:x_length].numpy()
    durations_json = get_phoneme_durations(durations, text)
    output = output_folder / Path(filepath).name.replace(".wav", ".npy")
    with open(output.with_suffix(".json"), "w", encoding="utf-8") as f:
        json.dump(durations_json, f, indent=4, ensure_ascii=False)
    np.save(output, durations)
@torch.inference_mode()
 def compute_durations(data_loader: torch.utils.data.DataLoader, model: nn.Module, device: torch.device, output_folder):
    """Generate durations from the model for each datapoint and save it in a folder
    Args:
        data_loader (torch.utils.data.DataLoader): Dataloader
        model (nn.Module): MatchaTTS model
        device (torch.device): GPU or CPU
        output_folder (Path): Folder where the .npy and .json duration files are written
    """
    for batch in tqdm(data_loader, desc="🍵 Computing durations 🍵:"):
        x, x_lengths = batch["x"], batch["x_lengths"]
        y, y_lengths = batch["y"], batch["y_lengths"]
        spks = batch["spks"]
        x = x.to(device)
        y = y.to(device)
        x_lengths = x_lengths.to(device)
        y_lengths = y_lengths.to(device)
        spks = spks.to(device) if spks is not None else None
        _, _, _, attn = model(
            x=x,
            x_lengths=x_lengths,
            y=y,
            y_lengths=y_lengths,
            spks=spks,
        )
        attn = attn.cpu()
        for i in range(attn.shape[0]):
            save_durations_to_folder(
                attn[i],
                x_lengths[i].item(),
                y_lengths[i].item(),
                batch["filepaths"][i],
                output_folder,
                batch["x_texts"][i],
            )
 def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-i",
        "--input-config",
        type=str,
        default="ljspeech.yaml",
        help="The name of the yaml config file under configs/data",
    )
    parser.add_argument(
        "-b",
        "--batch-size",
        type=int,
        default=32,
        help="Can have increased batch size for faster computation",
    )
    parser.add_argument(
        "-f",
        "--force",
        action="store_true",
        default=False,
        required=False,
        help="force overwrite the file",
    )
    parser.add_argument(
        "-c",
        "--checkpoint_path",
        type=str,
        required=True,
        help="Path to the checkpoint file to load the model from",
    )
    parser.add_argument(
        "-o",
        "--output-folder",
        type=str,
        default=None,
        help="Output folder to save the durations",
    )
    parser.add_argument(
        "--cpu", action="store_true", help="Use CPU for inference, not recommended (default: use GPU if available)"
    )
    args = parser.parse_args()
    with initialize(version_base="1.3", config_path="../../configs/data"):
        cfg = compose(config_name=args.input_config, return_hydra_config=True, overrides=[])
    root_path = rootutils.find_root(search_from=__file__, indicator=".project-root")
    with open_dict(cfg):
        del cfg["hydra"]
        del cfg["_target_"]
        cfg["seed"] = 1234
        cfg["batch_size"] = args.batch_size
        cfg["train_filelist_path"] = str(os.path.join(root_path, cfg["train_filelist_path"]))
        cfg["valid_filelist_path"] = str(os.path.join(root_path, cfg["valid_filelist_path"]))
        cfg["load_durations"] = False
    if args.output_folder is not None:
        output_folder = Path(args.output_folder)
    else:
        output_folder = Path(cfg["train_filelist_path"]).parent / "durations"
    print(f"Output folder set to: {output_folder}")
    if os.path.exists(output_folder) and not args.force:
        print("Folder already exists. Use -f to force overwrite")
        sys.exit(1)
    output_folder.mkdir(parents=True, exist_ok=True)
    print(f"Preprocessing: {cfg['name']} from training filelist: {cfg['train_filelist_path']}")
    print("Loading model...")
    device = get_device(args)
    model = MatchaTTS.load_from_checkpoint(args.checkpoint_path, map_location=device)
    text_mel_datamodule = TextMelDataModule(**cfg)
    text_mel_datamodule.setup()
    try:
        print("Computing durations for training set if exists...")
        train_dataloader = text_mel_datamodule.train_dataloader()
        compute_durations(train_dataloader, model, device, output_folder)
    except lightning.fabric.utilities.exceptions.MisconfigurationException:
        print("No training set found")
    try:
        print("Computing durations for validation set if exists...")
        val_dataloader = text_mel_datamodule.val_dataloader()
        compute_durations(val_dataloader, model, device, output_folder)
    except lightning.fabric.utilities.exceptions.MisconfigurationException:
        print("No validation set found")
    try:
        print("Computing durations for test set if exists...")
        test_dataloader = text_mel_datamodule.test_dataloader()
        compute_durations(test_dataloader, model, device, output_folder)
    except lightning.fabric.utilities.exceptions.MisconfigurationException:
        print("No test set found")
    print(f"[+] Done! Durations saved to: {output_folder}")
 if __name__ == "__main__":
    # Helps with generating durations for the dataset to train other architectures
    # that cannot learn to align due to limited size of dataset
    # Example usage:
    # python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c pretrained_model
    # This will create a folder in data/processed_data/durations/ljspeech with the durations
    main()
--- a/matcha/utils/model.py
+++ b/matcha/utils/model.py
@@ -1,4 +1,4 @@
-""" from https://github.com/jaywalnut310/glow-tts """
+"""from https://github.com/jaywalnut310/glow-tts"""
 import numpy as np
 import torch
--- a/matcha/utils/rich_utils.py
+++ b/matcha/utils/rich_utils.py
@@ -72,7 +72,7 @@ def print_config_tree(
    # save config tree to file
    if save_to_file:
-        with open(Path(cfg.paths.output_dir, "config_tree.log"), "w") as file:
+        with open(Path(cfg.paths.output_dir, "config_tree.log"), "w", encoding="utf-8") as file:
            rich.print(tree, file=file)
@@ -97,5 +97,5 @@ def enforce_tags(cfg: DictConfig, save_to_file: bool = False) -> None:
        log.info(f"Tags: {cfg.tags}")
    if save_to_file:
-        with open(Path(cfg.paths.output_dir, "tags.log"), "w") as file:
+        with open(Path(cfg.paths.output_dir, "tags.log"), "w", encoding="utf-8") as file:
            rich.print(cfg.tags, file=file)
--- a/matcha/utils/utils.py
+++ b/matcha/utils/utils.py
@@ -2,6 +2,7 @@ import os
 import sys
 import warnings
 from importlib.util import find_spec
 from math import ceil
 from pathlib import Path
 from typing import Any, Callable, Dict, Tuple
@@ -217,3 +218,42 @@ def assert_model_downloaded(checkpoint_path, url, use_wget=True):
        gdown.download(url=url, output=checkpoint_path, quiet=False, fuzzy=True)
    else:
        wget.download(url=url, out=checkpoint_path)
 def get_phoneme_durations(durations, phones):
    prev = durations[0]
    merged_durations = []
    # Convolve with stride 2
    for i in range(1, len(durations), 2):
        if i == len(durations) - 2:
            # if it is last take full value
            next_half = durations[i + 1]
        else:
            next_half = ceil(durations[i + 1] / 2)
        curr = prev + durations[i] + next_half
        prev = durations[i + 1] - next_half
        merged_durations.append(curr)
    assert len(phones) == len(merged_durations)
    assert len(merged_durations) == (len(durations) - 1) // 2
    merged_durations = torch.cumsum(torch.tensor(merged_durations), 0, dtype=torch.long)
    start = torch.tensor(0)
    duration_json = []
    for i, duration in enumerate(merged_durations):
        duration_json.append(
            {
                phones[i]: {
                    "starttime": start.item(),
                    "endtime": duration.item(),
                    "duration": duration.item() - start.item(),
                }
            }
        )
        start = duration
    assert list(duration_json[-1].values())[0]["endtime"] == sum(
        durations
    ), f"{list(duration_json[-1].values())[0]['endtime'],  sum(durations)}"
    return duration_json
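Review note on `get_phoneme_durations`: the merging loop folds the frames of each interspersed blank token into its neighbouring phones. The leading blank goes to the first phone, each inner blank is split between its two neighbours (the phone before it gets the rounded-up half), and the trailing blank goes to the last phone. The pure-Python sketch below reproduces only that merging step on a small hand-made input. The name `merge_interspersed` and the numbers are illustrative assumptions, and the input is assumed to have the odd-length layout blank, phone, blank, ..., phone, blank.

```python
from itertools import accumulate
from math import ceil


def merge_interspersed(durations):
    """Fold blank-token frames into the neighbouring phones."""
    merged = []
    prev = durations[0]  # frames of the leading blank
    for i in range(1, len(durations), 2):
        blank = durations[i + 1]
        # The final blank goes entirely to the last phone; inner blanks are halved.
        next_half = blank if i == len(durations) - 2 else ceil(blank / 2)
        merged.append(prev + durations[i] + next_half)
        prev = blank - next_half  # remainder carried over to the next phone
    return merged


durations = [2, 5, 3, 4, 1]  # blank, phone, blank, phone, blank
merged = merge_interspersed(durations)
ends = list(accumulate(merged))  # cumulative end frame of each phone
starts = [0] + ends[:-1]
print(merged, starts, ends)  # [9, 6] [0, 9] [9, 15]
```

The last end frame equals `sum(durations)`, which is the invariant the final assertion in `get_phoneme_durations` checks.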
--- a/requirements.txt
+++ b/requirements.txt
@@ -35,11 +35,10 @@ torchaudio
 matplotlib
 pandas
 conformer==0.3.2
-diffusers==0.25.0
+diffusers # developed using version ==0.25.0
 notebook
 ipywidgets
-gradio
+gradio==3.43.2
 gdown
 wget
 seaborn
 piper_phonemize
--- a/setup.py
+++ b/setup.py
@@ -16,9 +16,16 @@ with open("README.md", encoding="utf-8") as readme_file:
    README = readme_file.read()
 cwd = os.path.dirname(os.path.abspath(__file__))
-with open(os.path.join(cwd, "matcha", "VERSION")) as fin:
+with open(os.path.join(cwd, "matcha", "VERSION"), encoding="utf-8") as fin:
    version = fin.read().strip()
 def get_requires():
    requirements = os.path.join(os.path.dirname(__file__), "requirements.txt")
    with open(requirements, encoding="utf-8") as reqfile:
        return [str(r).strip() for r in reqfile]
 setup(
    name="matcha-tts",
    version=version,
@@ -28,7 +35,7 @@ setup(
    author="Shivam Mehta",
    author_email="shivam.mehta25@gmail.com",
    url="https://shivammehta25.github.io/Matcha-TTS",
-    install_requires=[str(r) for r in open(os.path.join(os.path.dirname(__file__), "requirements.txt"))],
+    install_requires=get_requires(),
    include_dirs=[numpy.get_include()],
    include_package_data=True,
    packages=find_packages(exclude=["tests", "tests/*", "examples", "examples/*"]),
@@ -38,6 +45,7 @@ setup(
            "matcha-data-stats=matcha.utils.generate_data_statistics:main",
            "matcha-tts=matcha.cli:cli",
            "matcha-tts-app=matcha.app:main",
            "matcha-tts-get-durations=matcha.utils.get_durations_from_trained_model:main",
        ]
    },
    ext_modules=cythonize(exts, language_level=3),
--- a/synthesis.ipynb
+++ b/synthesis.ipynb
@@ -19,7 +19,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
   "id": "148f4bc0-c28e-4670-9a5e-4c7928ab8992",
   "metadata": {},
   "outputs": [
@@ -192,7 +192,7 @@
   "source": [
    "@torch.inference_mode()\n",
    "def process_text(text: str):\n",
-    "    x = torch.tensor(intersperse(text_to_sequence(text, ['english_cleaners2']), 0),dtype=torch.long, device=device)[None]\n",
+    "    x = torch.tensor(intersperse(text_to_sequence(text, ['english_cleaners2'])[0], 0),dtype=torch.long, device=device)[None]\n",
    "    x_lengths = torch.tensor([x.shape[-1]],dtype=torch.long, device=device)\n",
    "    x_phones = sequence_to_text(x.squeeze(0).tolist())\n",
    "    return {\n",
Author	SHA1	Message	Date
pre-commit-ci[bot]	66178aea04	[pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci	2026-01-19 22:11:39 +00:00
pre-commit-ci[bot]	7ebef67773	[pre-commit.ci] pre-commit autoupdate updates: - [github.com/pre-commit/pre-commit-hooks: v4.5.0 → v6.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.5.0...v6.0.0) - https://github.com/psf/black → https://github.com/psf/black-pre-commit-mirror - [github.com/psf/black-pre-commit-mirror: 23.12.1 → 26.1.0](https://github.com/psf/black-pre-commit-mirror/compare/23.12.1...26.1.0) - [github.com/PyCQA/isort: 5.13.2 → 7.0.0](https://github.com/PyCQA/isort/compare/5.13.2...7.0.0) - [github.com/asottile/pyupgrade: v3.15.0 → v3.21.2](https://github.com/asottile/pyupgrade/compare/v3.15.0...v3.21.2) - [github.com/PyCQA/flake8: 7.0.0 → 7.3.0](https://github.com/PyCQA/flake8/compare/7.0.0...7.3.0) - [github.com/pycqa/pylint: v3.0.3 → v4.0.4](https://github.com/pycqa/pylint/compare/v3.0.3...v4.0.4)	2026-01-19 22:10:05 +00:00
Shivam Mehta	bd4d90d932	Update README.md	2025-09-17 08:49:57 -07:00
Shivam Mehta	108906c603	Merge pull request #121 from jimregan/english-data ljspeech/hificaptain from #99	2024-12-02 09:02:41 -06:00
Shivam Mehta	354f5dc69f	Merge pull request #123 from jimregan/patch-1 Fix a typo	2024-12-02 08:26:00 -06:00
Jim O’Regan	8e5f98476e	Fix a typo	2024-12-02 15:21:31 +01:00
Jim O'Regan	7e499df0b2	ljspeech/hificaptain from #99	2024-12-02 11:01:04 +00:00
Shivam Mehta	0735e653fc	Merge pull request #103 from jimregan/mmconv-cleaner add a cleaner for IPA data (pre-phonetised)	2024-11-13 22:15:47 -08:00
Shivam Mehta	f9843cfca4	Merge pull request #101 from jimregan/pylint Make pylint happy	2024-11-13 22:13:36 -08:00
Shivam Mehta	289ef51578	Fixing thhe usage of denoiser_strength from the command line.	2024-11-14 06:55:51 +01:00
Shivam Mehta	7a65f83b17	Updating the version	2024-11-14 06:42:06 +01:00
Shivam Mehta	7275764a48	Fixing espeak not removing brackets in some cases	2024-11-14 06:39:58 +01:00
Jim O'Regan	863bfbdd8b	rename method, it's more generic than the previous name suggested	2024-10-03 18:51:47 +00:00
Jim O'Regan	4bc541705a	add a cleaner for the mmconv data Different versions of espeak represent things differently, it seems (also, there are some distinctions none of our speakers make, so normalising those away reduces perplexity a tiny amount).	2024-10-03 17:18:58 +00:00
pre-commit-ci[bot]	a3fea22988	[pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci	2024-10-02 14:31:11 +00:00
Jim O'Regan	d56f40765c	disable consider-using-from-import instead (missed one)	2024-10-02 14:30:18 +00:00
Jim O'Regan	b0ba920dc1	disable consider-using-from-import instead	2024-10-02 14:29:06 +00:00
Jim O'Regan	a220f283e3	disable consider-using-generator	2024-10-02 13:57:12 +00:00
Jim O'Regan	1df73ef43e	disable global-variable-not-assigned	2024-10-02 13:55:44 +00:00
Jim O'Regan	404b045b65	add dummy exception (W0719)	2024-10-02 13:51:17 +00:00
Jim O'Regan	7cfae6bed4	add dummy exception (W0719)	2024-10-02 13:49:47 +00:00
Jim O'Regan	a83fd29829	C0209	2024-10-02 13:45:27 +00:00
pre-commit-ci[bot]	c8178bf2cd	[pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci	2024-10-02 13:32:45 +00:00
Jim O'Regan	8b1284993a	W1514 + R1732	2024-10-02 13:31:57 +00:00
Jim O'Regan	0000f93021	R1732 + W1514	2024-10-02 13:25:02 +00:00
pre-commit-ci[bot]	c2569a1018	[pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci	2024-10-02 13:21:37 +00:00
Jim O'Regan	bd058a68f7	R0402	2024-10-02 13:21:00 +00:00
Jim O'Regan	362ba2dce7	C0209	2024-10-02 08:38:28 +00:00
Shivam Mehta	77804265f8	removing diffuser versioning	2024-08-09 18:28:56 +02:00
Shivam Mehta	d31cd92a61	Merge pull request #75 from shivammehta25/dev Adding alginment information to readme	2024-05-27 13:57:49 +02:00
Shivam Mehta	068d135e20	Adding alginment information to readme	2024-05-27 13:57:10 +02:00
Shivam Mehta	bd37d03b62	Merge pull request #74 from shivammehta25/dev Adding the possibility to use Matcha-TTS as an aligner and train from pretrained extracted alignments.	2024-05-27 13:54:27 +02:00
Shivam Mehta	ac0b258f80	Adding configuration for training from durations	2024-05-27 13:50:21 +02:00
Shivam Mehta	de910380bc	Fixing batched synthesis for multispeaker model	2024-05-27 13:40:02 +02:00
Shivam Mehta	aa496aa13f	Adding the possibility to train with durations	2024-05-27 13:24:21 +02:00
Shivam Mehta	e658aee6a5	Pinning gradio	2024-05-25 20:15:17 +02:00
Shivam Mehta	d816c40e3d	Updating the notebook to adjust to the change	2024-05-24 11:46:03 +02:00
Shivam Mehta	4b39f6cad0	Adding the possibility of get durations out of pretrained model	2024-05-24 11:34:51 +02:00
Shivam Mehta	dd9105b34b	Merge pull request #60 from jimregan/patch-1 Pin gradio to 3.43.2	2024-02-27 13:29:42 +01:00
Jim O’Regan	7d9d4cfd40	Pin gradio to 3.43.2 Fixes #59	2024-02-27 13:25:08 +01:00
		`@@ -1 +0,0 @@`
			`/home/smehta/Projects/Speech-Backbones/Grad-TTS/data`
`@@ -1,4 +1,4 @@`
	`""" from https://github.com/keithito/tacotron """`	`"""from https://github.com/keithito/tacotron"""`

	`import re`	`import re`