Merge pull request #127 from snakers4/adamnsandle

Adamnsandle
This commit is contained in:
Alexander Veysov
2021-12-07 13:47:01 +03:00
committed by GitHub
17 changed files with 897 additions and 2151 deletions

README.md

@@ -1,617 +1,87 @@
[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/silero_speech) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE)
[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad_vad/)
[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/silero_speech) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
![header](https://user-images.githubusercontent.com/12515440/89997349-b3523080-dc94-11ea-9906-ca2e8bc50535.png)
- [Silero VAD](#silero-vad)
- [TLDR](#tldr)
- [Live Demonstration](#live-demonstration)
- [Getting Started](#getting-started)
- [Pre-trained Models](#pre-trained-models)
- [Version History](#version-history)
- [PyTorch](#pytorch)
- [VAD](#vad)
- [Number Detector](#number-detector)
- [Language Classifier](#language-classifier)
- [ONNX](#onnx)
- [VAD](#vad-1)
- [Number Detector](#number-detector-1)
- [Language Classifier](#language-classifier-1)
- [Metrics](#metrics)
- [Performance Metrics](#performance-metrics)
- [Streaming Latency](#streaming-latency)
- [Full Audio Throughput](#full-audio-throughput)
- [VAD Quality Metrics](#vad-quality-metrics)
- [FAQ](#faq)
- [VAD Parameter Fine Tuning](#vad-parameter-fine-tuning)
- [Classic way](#classic-way)
- [Adaptive way](#adaptive-way)
- [How VAD Works](#how-vad-works)
- [VAD Quality Metrics Methodology](#vad-quality-metrics-methodology)
- [How Number Detector Works](#how-number-detector-works)
- [How Language Classifier Works](#how-language-classifier-works)
- [Contact](#contact)
- [Get in Touch](#get-in-touch)
- [Commercial Inquiries](#commercial-inquiries)
- [Further reading](#further-reading)
- [Citations](#citations)
<br/>
<h1 align="center">Silero VAD</h1>
<br/>
**Silero VAD** - pre-trained enterprise-grade [Voice Activity Detector](https://en.wikipedia.org/wiki/Voice_activity_detection) (also see our [STT models](https://github.com/snakers4/silero-models)).
# Silero VAD
![image](https://user-images.githubusercontent.com/36505480/107667211-06cf2680-6c98-11eb-9ee5-37eb4596260f.png)
This repository also includes Number Detector and Language classifier [models](https://github.com/snakers4/silero-vad/wiki/Other-Models)
## TLDR
<br/>
**Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier.**
Enterprise-grade Speech Products made refreshingly simple (also see our [STT](https://github.com/snakers4/silero-models) models).
<p align="center">
<img src="https://user-images.githubusercontent.com/36505480/145007002-8473f909-5985-4942-bbcf-9ac86d156c2f.png" />
</p>
Currently, there are hardly any high quality / modern / free / public voice activity detectors except for the WebRTC Voice Activity Detector ([link](https://github.com/wiseman/py-webrtcvad)). WebRTC, however, is starting to show its age and suffers from many false positives.
<details>
<summary>Real Time Example</summary>
https://user-images.githubusercontent.com/36505480/144874384-95f80f6d-a4f1-42cc-9be7-004c891dd481.mp4
</details>
Also, in some cases it is crucial to be able to anonymize large-scale spoken corpora (i.e. remove personal data). Typically, personal data is considered private / sensitive if it contains (i) a name or (ii) some private ID. Name recognition is a highly subjective matter and depends on locale and business case, but Voice Activity and Number Detection are quite general tasks.
<br/>
<h2 align="center">Key Features</h2>
<br/>
**Key features:**
- **High accuracy**
- Modern, portable;
- Low memory footprint;
- Superior metrics to WebRTC;
- Trained on huge spoken corpora and noise / sound libraries;
- Slower than WebRTC, but fast enough for IOT / edge / mobile applications;
- Unlike WebRTC (which mostly tells silence from voice), our VAD can tell voice from noise / music / silence;
Silero VAD has [excellent results](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#vs-other-available-solutions) on speech detection tasks.
- **Fast**
**Typical use cases:**
One audio chunk (30+ ms) [takes](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics#silero-vad-performance-metrics) around **1ms** to be processed on a single CPU thread. Using batching or GPU can also improve performance considerably.
- Spoken corpora anonymization;
- Can be used together with WebRTC;
- Voice activity detection for IOT / edge / mobile use cases;
- Data cleaning and preparation, number and voice detection in general;
- PyTorch and ONNX can be used with a wide variety of deployment options and backends in mind;
- **Lightweight**
### Live Demonstration
JIT model is less than one megabyte in size.
For more information, please see [examples](https://github.com/snakers4/silero-vad/tree/master/examples).
- **General**
https://user-images.githubusercontent.com/28188499/116685087-182ff100-a9b2-11eb-927d-ed9f621226ee.mp4
Silero VAD was trained on huge corpora that include over **100** languages and it performs well on audios from different domains with various background noise and quality levels.
https://user-images.githubusercontent.com/8079748/117580455-4622dd00-b0f8-11eb-858d-e6368ed4eada.mp4
- **Flexible sampling rate**
## Getting Started
Silero VAD [supports](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#sample-rate-comparison) **8000 Hz** and **16000 Hz** [sampling rates](https://en.wikipedia.org/wiki/Sampling_(signal_processing)#Sampling_rate).
The models are small enough to be included directly into this repository. Newer models will supersede older models directly.
- **Flexible chunk size**
### Pre-trained Models
The model was trained on audio chunks of different lengths. **30 ms**, **60 ms** and **100 ms** long chunks are supported directly; others may work as well.
**Currently we provide the following endpoints:**
<br/>
<h2 align="center">Typical Use Cases</h2>
<br/>
| model= | Params | Model type | Streaming | Languages | PyTorch | ONNX | Colab |
| -------------------------- | ------ | ------------------- | --------- | -------------------------- | ------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `'silero_vad'` | 1.1M | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_micro'` | 10K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_micro_8k'` | 10K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_mini'` | 100K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_mini_8k'` | 100K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_number_detector'` | 1.1M | Number Detector | No | `ru`, `en`, `de`, `es` | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_lang_detector'` | 1.1M | Language Classifier | No | `ru`, `en`, `de`, `es` | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| ~~`'silero_lang_detector_116'`~~ | ~~1.7M~~ | ~~Language Classifier~~ | | | | | |
| `'silero_lang_detector_95'` | 4.7M | Language Classifier | No | [95 languages](https://github.com/snakers4/silero-vad/blob/master/files/lang_dict_95.json) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
- Voice activity detection for IOT / edge / mobile use cases
- Data cleaning and preparation, voice detection in general
- Telephony and call-center automation, voice bots
- Voice interfaces
(*) Though explicitly trained on these languages, VAD should work on any Germanic, Romance or Slavic Languages out of the box.
<br/>
<h2 align="center">Links</h2>
<br/>
What models do:
- VAD - detects speech;
- Number Detector - detects spoken numbers (i.e. thirty five);
- Language Classifier - classifies utterances by language;
- Language Classifier 95 - classifies among 95 languages as well as 58 language groups (mutually intelligible languages -> same group)
- [Examples and Dependencies](https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#dependencies)
- [Quality Metrics](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics)
- [Performance Metrics](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics)
- Number Detector and Language classifier [models](https://github.com/snakers4/silero-vad/wiki/Other-Models)
- [Versions and Available Models](https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models)
### Version History
**Version history:**
| Version | Date | Comment |
| ------- | ---------- | --------------------------------------------------------------------------------------------------------------------------- |
| `v1` | 2020-12-15 | Initial release |
| `v1.1` | 2020-12-24 | Better VAD models compatible with chunks shorter than 250 ms |
| `v1.2` | 2020-12-30 | Number Detector added |
| `v2` | 2021-01-11 | Add Language Classifier heads (en, ru, de, es) |
| `v2.1` | 2021-02-11 | Add micro (10k params) VAD models |
| `v2.2` | 2021-03-22 | Add micro 8000 sample rate VAD models |
| `v2.3` | 2021-04-12 | Add mini (100k params) VAD models (8k and 16k sample rate) + **new** adaptive utils for full audio and single audio stream |
| `v2.4` | 2021-07-09 | Add 116 languages classifier and group classifier |
| `v2.4` | 2021-07-09 | Deleted the 116-language classifier and added a 95-language classifier instead (removed rarely spoken languages to improve quality) |
### PyTorch
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
We are keeping the colab examples up-to-date, but you can manually manage your dependencies:
- `pytorch` >= 1.7.1 (there were breaking changes in `torch.hub` introduced in 1.7);
- `torchaudio` >= 0.7.2 (used only for IO and resampling, can be easily replaced);
- `soundfile` >= 0.10.3 (used as a default backend for torchaudio, can be replaced);
All of the dependencies except for PyTorch are auxiliary and used only by the utils / examples. You can use any libraries / pipelines that read audio files and resample them to 16 kHz.
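For reference, here is a minimal sketch of such an IO pipeline using `torchaudio` (a hedged example; the file path is illustrative and the helper simply mirrors the bundled `read_audio` util):
```python
import torch
import torchaudio

def load_wav_16k(path: str) -> torch.Tensor:
    # Read the file; torchaudio returns a float tensor of shape (channels, samples) and the sample rate
    wav, sr = torchaudio.load(path)
    # Down-mix multi-channel audio to mono
    if wav.size(0) > 1:
        wav = wav.mean(dim=0, keepdim=True)
    # Resample to the 16 kHz expected by the 16k models
    if sr != 16000:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
    return wav.squeeze(0)

# wav = load_wav_16k('my_audio.wav')
```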
#### VAD
[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad_vad/)
```python
import torch
torch.set_num_threads(1)
from pprint import pprint
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_vad',
force_reload=True)
(get_speech_ts,
get_speech_ts_adaptive,
_, read_audio,
_, _, _) = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
wav = read_audio(f'{files_dir}/en.wav')
# full audio
# get speech timestamps from full audio file
# classic way
speech_timestamps = get_speech_ts(wav, model,
num_steps=4)
pprint(speech_timestamps)
# adaptive way
speech_timestamps = get_speech_ts_adaptive(wav, model)
pprint(speech_timestamps)
```
#### Number Detector
[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad_number/)
```python
import torch
torch.set_num_threads(1)
from pprint import pprint
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_number_detector',
force_reload=True)
(get_number_ts,
_, read_audio,
_, _) = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
wav = read_audio(f'{files_dir}/en_num.wav')
# full audio
# get number timestamps from full audio file
number_timestamps = get_number_ts(wav, model)
pprint(number_timestamps)
```
#### Language Classifier
##### 4 languages
[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad_language/)
```python
import torch
torch.set_num_threads(1)
from pprint import pprint
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_lang_detector',
force_reload=True)
get_language, read_audio = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
wav = read_audio(f'{files_dir}/de.wav')
language = get_language(wav, model)
pprint(language)
```
##### 95 languages
[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad_language/)
```python
import torch
torch.set_num_threads(1)
from pprint import pprint
model, lang_dict, lang_group_dict, utils = torch.hub.load(
repo_or_dir='snakers4/silero-vad',
model='silero_lang_detector_95',
force_reload=True)
get_language_and_group, read_audio = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
wav = read_audio(f'{files_dir}/de.wav')
languages, language_groups = get_language_and_group(wav, model, lang_dict, lang_group_dict, top_n=2)
for i in languages:
pprint(f'Language: {i[0]} with prob {i[-1]}')
for i in language_groups:
pprint(f'Language group: {i[0]} with prob {i[-1]}')
```
### ONNX
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
You can run our models anywhere you can import an ONNX model or run the ONNX runtime.
#### VAD
```python
import torch
import onnxruntime
from pprint import pprint
_, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_vad',
force_reload=True)
(get_speech_ts,
get_speech_ts_adaptive,
_, read_audio,
_, _, _) = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
def init_onnx_model(model_path: str):
return onnxruntime.InferenceSession(model_path)
def validate_onnx(model, inputs):
with torch.no_grad():
ort_inputs = {'input': inputs.cpu().numpy()}
outs = model.run(None, ort_inputs)
outs = [torch.Tensor(x) for x in outs]
return outs[0]
model = init_onnx_model(f'{files_dir}/model.onnx')
wav = read_audio(f'{files_dir}/en.wav')
# get speech timestamps from full audio file
# classic way
speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate_onnx)
pprint(speech_timestamps)
# adaptive way
speech_timestamps = get_speech_ts_adaptive(wav, model, run_function=validate_onnx)
pprint(speech_timestamps)
```
#### Number Detector
```python
import torch
import onnxruntime
from pprint import pprint
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_number_detector',
force_reload=True)
(get_number_ts,
_, read_audio,
_, _) = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
def init_onnx_model(model_path: str):
return onnxruntime.InferenceSession(model_path)
def validate_onnx(model, inputs):
with torch.no_grad():
ort_inputs = {'input': inputs.cpu().numpy()}
outs = model.run(None, ort_inputs)
outs = [torch.Tensor(x) for x in outs]
return outs
model = init_onnx_model(f'{files_dir}/number_detector.onnx')
wav = read_audio(f'{files_dir}/en_num.wav')
# get speech timestamps from full audio file
number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)
pprint(number_timestamps)
```
#### Language Classifier
##### 4 languages
```python
import torch
import onnxruntime
from pprint import pprint
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_lang_detector',
force_reload=True)
get_language, read_audio = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
def init_onnx_model(model_path: str):
return onnxruntime.InferenceSession(model_path)
def validate_onnx(model, inputs):
with torch.no_grad():
ort_inputs = {'input': inputs.cpu().numpy()}
outs = model.run(None, ort_inputs)
outs = [torch.Tensor(x) for x in outs]
return outs
model = init_onnx_model(f'{files_dir}/number_detector.onnx')
wav = read_audio(f'{files_dir}/de.wav')
language = get_language(wav, model, run_function=validate_onnx)
print(language)
```
##### 95 languages
```python
import torch
import onnxruntime
from pprint import pprint
model, lang_dict, lang_group_dict, utils = torch.hub.load(
repo_or_dir='snakers4/silero-vad',
model='silero_lang_detector_95',
force_reload=True)
get_language_and_group, read_audio = utils
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
def init_onnx_model(model_path: str):
return onnxruntime.InferenceSession(model_path)
def validate_onnx(model, inputs):
with torch.no_grad():
ort_inputs = {'input': inputs.cpu().numpy()}
outs = model.run(None, ort_inputs)
outs = [torch.Tensor(x) for x in outs]
return outs
model = init_onnx_model(f'{files_dir}/lang_classifier_95.onnx')
wav = read_audio(f'{files_dir}/de.wav')
languages, language_groups = get_language_and_group(wav, model, lang_dict, lang_group_dict, top_n=2, run_function=validate_onnx)
for i in languages:
pprint(f'Language: {i[0]} with prob {i[-1]}')
for i in language_groups:
pprint(f'Language group: {i[0]} with prob {i[-1]}')
```
[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad_language/)
## Metrics
### Performance Metrics
All speed tests were run on an AMD Ryzen Threadripper 3960X using only 1 thread:
```
torch.set_num_threads(1) # pytorch
ort_session.intra_op_num_threads = 1 # onnx
ort_session.inter_op_num_threads = 1 # onnx
```
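For ONNX, these thread counts are applied through `SessionOptions` before the session is created; a minimal sketch (the model path is illustrative):
```python
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = 1  # single thread inside each operator
sess_options.inter_op_num_threads = 1  # no parallelism across operators
ort_session = onnxruntime.InferenceSession('files/model.onnx', sess_options=sess_options)
```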
#### Streaming Latency
Streaming latency depends on 2 variables:
- **num_steps** - number of windows to split each audio chunk into. Our post-processing class keeps the previous chunk in memory (250 ms), so the new chunk (also 250 ms) is appended to it. The resulting big chunk (500 ms) is split into **num_steps** overlapping windows, each 250 ms long.
- **number of audio streams**
So the **batch size** for streaming is **num_steps * number of audio streams** (e.g. 4 windows × 10 parallel streams gives a batch size of 40). The time between receiving new audio chunks and getting results is shown in the table below:
| Batch size | Pytorch model time, ms | Onnx model time, ms |
| :--------: | :--------------------: | :-----------------: |
| **2** | 9 | 2 |
| **4** | 11 | 4 |
| **8** | 14 | 7 |
| **16** | 19 | 12 |
| **40** | 36 | 29 |
| **80** | 64 | 55 |
| **120** | 96 | 85 |
| **200** | 157 | 137 |
#### Full Audio Throughput
**RTS** (seconds of audio processed per second of wall-clock time, i.e. real-time speed, or 1 / RTF) for full audio processing depends on **num_steps** (see the previous paragraph) and **batch size** (bigger is better). For example, RTS = 80 means that 80 seconds of audio are processed in one second.
| Batch size | num_steps | Pytorch model RTS | Onnx model RTS |
| :--------: | :-------: | :---------------: | :------------: |
| **40** | **4** | 68 | 86 |
| **40** | **8** | 34 | 43 |
| **80** | **4** | 78 | 91 |
| **80** | **8** | 39 | 45 |
| **120** | **4** | 78 | 88 |
| **120** | **8** | 39 | 44 |
| **200** | **4** | 80 | 91 |
| **200** | **8** | 40 | 46 |
### VAD Quality Metrics
We use random 250 ms audio chunks for validation. The speech to non-speech ratio among chunks is roughly 50/50 (i.e. balanced). Speech chunks are sampled from real audio in four different languages (English, Russian, Spanish, German), then random background noise is added to some of them (~40%).
Since our VAD (only the VAD, other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - the **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms chunks; anything shorter than 100 - 150 ms cannot be confidently distinguished as speech.
[WebRTC](https://github.com/wiseman/py-webrtcvad) splits audio into frames, and each frame gets a corresponding label (0 **or** 1). We use 30 ms frames for WebRTC, so each 250 ms chunk is split into 8 frames; their **mean** value is used as the threshold for the plot.
[Auditok](https://github.com/amsehili/auditok) - same logic as WebRTC, but with 50 ms frames.
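For illustration, a per-chunk WebRTC score of this kind can be computed with `py-webrtcvad` roughly as follows (a sketch assuming 16 kHz, 16-bit mono PCM input; names are illustrative):
```python
import webrtcvad

def webrtc_chunk_score(pcm_bytes: bytes, sample_rate: int = 16000,
                       frame_ms: int = 30, aggressiveness: int = 1) -> float:
    """Mean of WebRTC's per-frame 0/1 decisions over one audio chunk."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit samples -> 2 bytes each
    decisions = []
    for start in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[start:start + frame_bytes]
        decisions.append(vad.is_speech(frame, sample_rate))
    return sum(decisions) / len(decisions) if decisions else 0.0
```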
![image](https://user-images.githubusercontent.com/36505480/107667211-06cf2680-6c98-11eb-9ee5-37eb4596260f.png)
## FAQ
### VAD Parameter Fine Tuning
#### Classic way
**This is the straightforward classic method `get_speech_ts`, where the thresholds (`trig_sum` and `neg_trig_sum`) are specified by the user.**
- Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with VAD;
- We provide sensible basic hyper-parameters that work for us, but your case can be different;
- `trig_sum` - overlapping windows are used for each audio chunk; this threshold defines the average probability among those windows required for switching into the triggered (speech) state;
- `neg_trig_sum` - same as `trig_sum`, but for switching from the triggered to the non-triggered (non-speech) state;
- `num_steps` - number of overlapping windows to split each audio chunk into (we recommend 4 or 8);
- `num_samples_per_window` - number of samples in each window, our models were trained using `4000` samples (250 ms) per window, so this is preferable value (lesser values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
- `min_speech_samples` - minimum speech chunk duration in samples
- `min_silence_samples` - minimum silence duration in samples between two separate speech chunks
Optimal parameters may vary per domain, but we provide a tiny tool to find the best parameters. You can invoke `get_speech_ts` with `visualize_probs=True` (`pandas` required):
```python
speech_timestamps = get_speech_ts(wav, model,
                                  num_samples_per_window=4000,
                                  num_steps=4,
                                  visualize_probs=True)
```
#### Adaptive way
**The adaptive algorithm (`get_speech_ts_adaptive`) automatically selects the thresholds (`trig_sum` and `neg_trig_sum`) based on the median speech probability over the whole audio. Note that some arguments differ from the classic method's arguments.**
- `batch_size` - batch size to feed to silero VAD (default - `200`)
- `step` - step size in samples (default - `500`); equivalent to `num_samples_per_window / num_steps` from the classic method
- `num_samples_per_window` - number of samples in each window, our models were trained using `4000` samples (250 ms) per window, so this is preferable value (lesser values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
- `min_speech_samples` - minimum speech chunk duration in samples (default - `10000`)
- `min_silence_samples` - minimum silence duration in samples between two separate speech chunks (default - `4000`)
- `speech_pad_samples` - widen speech by this amount of samples each side (default - `2000`)
```python
speech_timestamps = get_speech_ts_adaptive(wav, model,
                                           num_samples_per_window=4000,
                                           step=500,
                                           visualize_probs=True)
```
The chart should look something like this:
![image](https://user-images.githubusercontent.com/12515440/106242896-79142580-6219-11eb-9add-fa7195d6fd26.png)
With this particular example you can try shorter chunks (`num_samples_per_window=1600`), but this results in too much noise:
![image](https://user-images.githubusercontent.com/12515440/106243014-a8c32d80-6219-11eb-8374-969f372807f1.png)
### How VAD Works
- Audio is split into 250 ms chunks (you can choose any chunk size, but quality with chunks shorter than 100ms will suffer and there will be more false positives and "unnatural" pauses);
- VAD keeps record of a previous chunk (or zeros at the beginning of the stream);
- Then this 500 ms audio (250 ms + 250 ms) is split into N (typically 4 or 8) windows and the model is applied to this window batch. Each window is 250 ms long (naturally, windows overlap);
- Then the probability is averaged across these windows (see the sketch after this list);
- Though pauses in speech are typically 300 ms or longer (pauses shorter than 200 - 300 ms are usually not meaningful), it is hard to confidently classify speech vs noise / music on very short chunks (i.e. 30 - 50 ms);
- ~~We are working on lifting this limitation, so that you can use 100 - 125ms windows~~;
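A rough sketch of this windowing logic (illustrative only, assuming a `model` that returns one speech probability per 250 ms window; the real implementation lives in `utils_vad.py`):
```python
import torch

def chunk_speech_prob(model, prev_chunk: torch.Tensor, new_chunk: torch.Tensor,
                      num_windows: int = 4) -> float:
    """Average speech probability over overlapping 250 ms windows of a 500 ms buffer."""
    buffer = torch.cat([prev_chunk, new_chunk])  # previous 250 ms + new 250 ms = 500 ms
    window = len(new_chunk)                      # 250 ms window (4000 samples at 16 kHz)
    step = window // num_windows                 # hop between overlapping windows
    windows = torch.stack([buffer[i:i + window]
                           for i in range(step, len(buffer) - window + 1, step)])
    probs = model(windows)                       # one probability per window (batched call)
    return probs.mean().item()
```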
### VAD Quality Metrics Methodology
Please see [VAD Quality Metrics](#vad-quality-metrics)
### How Number Detector Works
- It is recommended to split long audio into shorter pieces (< 15 s) and apply the model to each of them (see the sketch after this list);
- Number Detector can classify if the whole audio contains a number, or if each audio frame contains a number;
- Audio is split into frames in a certain way, so, having per-frame output, we can restore timing bounds for numbers with an accuracy of about 0.2 s;
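For example, long recordings can be processed piecewise along these lines (a sketch assuming `get_number_ts` returns a list of `{'start', 'end'}` dicts in samples, like the VAD utils; names are illustrative):
```python
import torch

SAMPLE_RATE = 16000
MAX_SAMPLES = 15 * SAMPLE_RATE  # keep each piece under ~15 s, as recommended above

def number_ts_long_audio(wav: torch.Tensor, model, get_number_ts):
    """Apply the number detector piecewise and shift timestamps back to the full audio."""
    all_ts = []
    for offset in range(0, len(wav), MAX_SAMPLES):
        piece = wav[offset: offset + MAX_SAMPLES]
        for ts in get_number_ts(piece, model):
            all_ts.append({'start': ts['start'] + offset, 'end': ts['end'] + offset})
    return all_ts
```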
### How Language Classifier Works
- **99%** validation accuracy
- Language classifier was trained using audio samples in 4 languages: **Russian**, **English**, **Spanish**, **German**
- More languages TBD
- Arbitrary audio length can be used, although network was trained using audio shorter than 15 seconds
### How Language Classifier 95 Works
- **85%** validation accuracy among 95 languages, **90%** validation accuracy among [58 language groups](https://github.com/snakers4/silero-vad/blob/master/files/lang_group_dict_95.json)
- Language classifier 95 was trained using audio samples in [95 languages](https://github.com/snakers4/silero-vad/blob/master/files/lang_dict_95.json)
- Arbitrary audio length can be used, although network was trained using audio shorter than 20 seconds
## Contact
### Get in Touch
<br/>
<h2 align="center">Get In Touch</h2>
<br/>
Try our models, create an [issue](https://github.com/snakers4/silero-vad/issues/new), start a [discussion](https://github.com/snakers4/silero-vad/discussions/new), join our telegram [chat](https://t.me/silero_speech), [email](mailto:hello@silero.ai) us, read our [news](https://t.me/silero_news).
### Commercial Inquiries
Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers](https://github.com/snakers4/silero-models/wiki/Licensing-and-Tiers) for relevant information and [email](mailto:hello@silero.ai) us directly.
## Further reading
### General
- Silero-models - https://github.com/snakers4/silero-models
- Nice [thread](https://github.com/snakers4/silero-vad/discussions/16#discussioncomment-305830) in discussions
### English
- STT:
- Towards an Imagenet Moment For Speech-To-Text - [link](https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/)
- A Speech-To-Text Practitioners Criticisms of Industry and Academia - [link](https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/)
- Modern Google-level STT Models Released - [link](https://habr.com/ru/post/519562/)
- TTS:
- High-Quality Text-to-Speech Made Accessible, Simple and Fast - [link](https://habr.com/ru/post/549482/)
- VAD:
- Modern Portable Voice Activity Detector Released - [link](https://habr.com/ru/post/537276/)
- Text Enhancement:
- We have published a model for text repunctuation and recapitalization for four languages - [link](https://habr.com/ru/post/581960/)
### Chinese
- STT:
- 迈向语音识别领域的 ImageNet 时刻 - [link](https://www.infoq.cn/article/4u58WcFCs0RdpoXev1E2)
- 语音领域学术界和工业界的七宗罪 - [link](https://www.infoq.cn/article/lEe6GCRjF1CNToVITvNw)
### Russian
- STT
- Последние обновления моделей распознавания речи из Silero Models - [link](https://habr.com/ru/post/577630/)
- Сжимаем трансформеры: простые, универсальные и прикладные способы сделать их компактными и быстрыми - [link](https://habr.com/ru/post/563778/)
- Ультимативное сравнение систем распознавания речи: Ashmanov, Google, Sber, Silero, Tinkoff, Yandex - [link](https://habr.com/ru/post/559640/)
- Мы опубликовали современные STT модели сравнимые по качеству с Google - [link](https://habr.com/ru/post/519564/)
- Понижаем барьеры на вход в распознавание речи - [link](https://habr.com/ru/post/494006/)
- Огромный открытый датасет русской речи версия 1.0 - [link](https://habr.com/ru/post/474462/)
- Насколько Быстрой Можно Сделать Систему STT? - [link](https://habr.com/ru/post/531524/)
- Наша система Speech-To-Text - [link](https://www.silero.ai/tag/our-speech-to-text/)
- Speech To Text - [link](https://www.silero.ai/tag/speech-to-text/)
- TTS:
- Мы сделали наш публичный синтез речи еще лучше - [link](https://habr.com/ru/post/563484/)
- Мы Опубликовали Качественный, Простой, Доступный и Быстрый Синтез Речи - [link](https://habr.com/ru/post/549480/)
- VAD:
- Модели для Детекции Речи, Чисел и Распознавания Языков - [link](https://www.silero.ai/vad-lang-classifier-number-detector/)
- Мы опубликовали современный Voice Activity Detector и не только - [link](https://habr.com/ru/post/537274/)
- Text Enhancement:
- Мы опубликовали модель, расставляющую знаки препинания и заглавные буквы в тексте на четырех языках - [link](https://habr.com/ru/post/581946/)
## Citations
**Citations**
```
@misc{Silero VAD,
@@ -624,4 +94,4 @@ Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers
commit = {insert_some_commit_here},
email = {hello@silero.ai}
}
```
```

Binary files not shown.

BIN
files/silero_vad.jit Normal file

Binary file not shown.


@@ -2,15 +2,13 @@ dependencies = ['torch', 'torchaudio']
import torch
import json
from utils_vad import (init_jit_model,
get_speech_ts,
get_speech_ts_adaptive,
get_speech_timestamps,
get_number_ts,
get_language,
get_language_and_group,
save_audio,
read_audio,
state_generator,
single_audio_stream,
VADIterator,
collect_chunks,
drop_chunks)
@@ -21,85 +19,11 @@ def silero_vad(**kwargs):
Please see https://github.com/snakers4/silero-vad for usage examples
"""
hub_dir = torch.hub.get_dir()
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model.jit')
utils = (get_speech_ts,
get_speech_ts_adaptive,
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/silero_vad.jit')
utils = (get_speech_timestamps,
save_audio,
read_audio,
state_generator,
single_audio_stream,
collect_chunks)
return model, utils
def silero_vad_micro(**kwargs):
"""Silero Voice Activity Detector
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
hub_dir = torch.hub.get_dir()
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_micro.jit')
utils = (get_speech_ts,
get_speech_ts_adaptive,
save_audio,
read_audio,
state_generator,
single_audio_stream,
collect_chunks)
return model, utils
def silero_vad_micro_8k(**kwargs):
"""Silero Voice Activity Detector
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
hub_dir = torch.hub.get_dir()
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_micro_8k.jit')
utils = (get_speech_ts,
get_speech_ts_adaptive,
save_audio,
read_audio,
state_generator,
single_audio_stream,
collect_chunks)
return model, utils
def silero_vad_mini(**kwargs):
"""Silero Voice Activity Detector
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
hub_dir = torch.hub.get_dir()
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_mini.jit')
utils = (get_speech_ts,
get_speech_ts_adaptive,
save_audio,
read_audio,
state_generator,
single_audio_stream,
collect_chunks)
return model, utils
def silero_vad_mini_8k(**kwargs):
"""Silero Voice Activity Detector
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
hub_dir = torch.hub.get_dir()
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_mini_8k.jit')
utils = (get_speech_ts,
get_speech_ts_adaptive,
save_audio,
read_audio,
state_generator,
single_audio_stream,
VADIterator,
collect_chunks)
return model, utils
@@ -151,4 +75,4 @@ def silero_lang_detector_95(**kwargs):
utils = (get_language_and_group, read_audio)
return model, lang_dict, lang_group_dict, utils
return model, lang_dict, lang_group_dict, utils

File diff suppressed because it is too large.


@@ -1,50 +1,24 @@
import torch
import torchaudio
from typing import List
from itertools import repeat
from collections import deque
import torch.nn.functional as F
torchaudio.set_audio_backend("soundfile") # switch backend
import warnings
languages = ['ru', 'en', 'de', 'es']
class IterativeMedianMeter():
def __init__(self):
self.reset()
def reset(self):
self.median = 0
self.counts = {}
for i in range(0, 101, 1):
self.counts[i / 100] = 0
self.total_values = 0
def __call__(self, val):
self.total_values += 1
rounded = round(abs(val), 2)
self.counts[rounded] += 1
bin_sum = 0
for j in self.counts:
bin_sum += self.counts[j]
if bin_sum >= self.total_values / 2:
self.median = j
break
return self.median
def validate(model,
inputs: torch.Tensor):
inputs: torch.Tensor,
**kwargs):
with torch.no_grad():
outs = model(inputs)
return outs
outs = model(inputs, **kwargs)
if len(outs.shape) == 1:
return outs[1:]
return outs[:, 1] # 0 for noise, 1 for speech
def read_audio(path: str,
target_sr: int = 16000):
sampling_rate: int = 16000):
assert torchaudio.get_audio_backend() == 'soundfile'
wav, sr = torchaudio.load(path)
@@ -52,20 +26,20 @@ def read_audio(path: str,
if wav.size(0) > 1:
wav = wav.mean(dim=0, keepdim=True)
if sr != target_sr:
if sr != sampling_rate:
transform = torchaudio.transforms.Resample(orig_freq=sr,
new_freq=target_sr)
new_freq=sampling_rate)
wav = transform(wav)
sr = target_sr
sr = sampling_rate
assert sr == target_sr
assert sr == sampling_rate
return wav.squeeze(0)
def save_audio(path: str,
tensor: torch.Tensor,
sr: int = 16000):
torchaudio.save(path, tensor.unsqueeze(0), sr)
sampling_rate: int = 16000):
torchaudio.save(path, tensor.unsqueeze(0), sampling_rate)
def init_jit_model(model_path: str,
@@ -76,192 +50,121 @@ def init_jit_model(model_path: str,
return model
def get_speech_ts(wav: torch.Tensor,
model,
trig_sum: float = 0.25,
neg_trig_sum: float = 0.07,
num_steps: int = 8,
batch_size: int = 200,
num_samples_per_window: int = 4000,
min_speech_samples: int = 10000, #samples
min_silence_samples: int = 500,
run_function=validate,
visualize_probs=False,
smoothed_prob_func='mean',
device='cpu'):
assert smoothed_prob_func in ['mean', 'max'], 'smoothed_prob_func not in ["max", "mean"]'
num_samples = num_samples_per_window
assert num_samples % num_steps == 0
step = int(num_samples / num_steps) # stride / hop
outs = []
to_concat = []
for i in range(0, len(wav), step):
chunk = wav[i: i+num_samples]
if len(chunk) < num_samples:
chunk = F.pad(chunk, (0, num_samples - len(chunk)))
to_concat.append(chunk.unsqueeze(0))
if len(to_concat) >= batch_size:
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
out = run_function(model, chunks)
outs.append(out)
to_concat = []
if to_concat:
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
out = run_function(model, chunks)
outs.append(out)
outs = torch.cat(outs, dim=0)
buffer = deque(maxlen=num_steps) # maxlen reached => first element dropped
triggered = False
speeches = []
current_speech = {}
if visualize_probs:
import pandas as pd
smoothed_probs = []
speech_probs = outs[:, 1] # this is very misleading
temp_end = 0
for i, predict in enumerate(speech_probs): # add name
buffer.append(predict)
if smoothed_prob_func == 'mean':
smoothed_prob = (sum(buffer) / len(buffer))
elif smoothed_prob_func == 'max':
smoothed_prob = max(buffer)
if visualize_probs:
smoothed_probs.append(float(smoothed_prob))
if (smoothed_prob >= trig_sum) and temp_end:
temp_end=0
if (smoothed_prob >= trig_sum) and not triggered:
triggered = True
current_speech['start'] = step * max(0, i-num_steps)
continue
if (smoothed_prob < neg_trig_sum) and triggered:
if not temp_end:
temp_end = step * i
if step * i - temp_end < min_silence_samples:
continue
else:
current_speech['end'] = temp_end
if (current_speech['end'] - current_speech['start']) > min_speech_samples:
speeches.append(current_speech)
temp_end = 0
current_speech = {}
triggered = False
continue
if current_speech:
current_speech['end'] = len(wav)
speeches.append(current_speech)
if visualize_probs:
pd.DataFrame({'probs':smoothed_probs}).plot(figsize=(16,8))
return speeches
def make_visualization(probs, step):
import pandas as pd
pd.DataFrame({'probs': probs},
index=[x * step for x in range(len(probs))]).plot(figsize=(16, 8),
kind='area', ylim=[0, 1.05], xlim=[0, len(probs) * step],
xlabel='seconds',
ylabel='speech probability',
colormap='tab20')
def get_speech_ts_adaptive(wav: torch.Tensor,
model,
batch_size: int = 200,
step: int = 500,
num_samples_per_window: int = 4000, # Number of samples per audio chunk to feed to NN (4000 for 16k SR, 2000 for 8k SR is optimal)
min_speech_samples: int = 10000, # samples
min_silence_samples: int = 4000,
speech_pad_samples: int = 2000,
run_function=validate,
visualize_probs=False,
device='cpu'):
def get_speech_timestamps(audio: torch.Tensor,
model,
threshold: float = 0.5,
sample_rate: int = 16000,
min_speech_duration_ms: int = 250,
min_silence_duration_ms: int = 100,
window_size_samples: int = 1536,
speech_pad_ms: int = 30,
return_seconds: bool = False,
visualize_probs: bool = False):
"""
This function is used for splitting long audios into speech chunks using silero VAD
Attention! All default values are optimal for the 16000 sample rate model; if you are using the 8000 sample rate model, the optimal values are half as much!
This method is used for splitting long audios into speech chunks using silero VAD
Parameters
----------
batch_size: int
batch size to feed to silero VAD (default - 200)
audio: torch.Tensor, one dimensional
One dimensional float torch.Tensor, other types are casted to torch if possible
step: int
step size in samples, (default - 500)
model: preloaded .jit silero VAD model
num_samples_per_window: int
window size in samples (chunk length in samples to feed to NN, default - 4000)
threshold: float (default - 0.5)
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
min_speech_samples: int
if speech duration is shorter than this value, do not consider it speech (default - 10000)
sample_rate: int (default - 16000)
Currently silero VAD models support 8000 and 16000 sample rates
min_silence_samples: int
number of samples to wait before considering as the end of speech (default - 4000)
min_speech_duration_ms: int (default - 250 milliseconds)
Final speech chunks shorter than min_speech_duration_ms are thrown out
speech_pad_samples: int
widen speech by this amount of samples each side (default - 2000)
min_silence_duration_ms: int (default - 100 milliseconds)
At the end of each speech chunk, wait for min_silence_duration_ms before separating it
run_function: function
function to use for the model call
window_size_samples: int (default - 1536 samples)
Audio chunks of window_size_samples size are fed to the silero VAD model.
WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate.
Values other than these may affect model performance!!
visualize_probs: bool
whether draw prob hist or not (default: False)
speech_pad_ms: int (default - 30 milliseconds)
Final speech chunks are padded by speech_pad_ms each side
device: string
torch device to use for the model call (default - "cpu")
return_seconds: bool (default - False)
whether to return timestamps in seconds (default - samples)
visualize_probs: bool (default - False)
whether to draw the probability histogram or not
Returns
----------
speeches: list
list containing ends and beginnings of speech chunks (in samples)
speeches: list of dicts
list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
"""
if visualize_probs:
import pandas as pd
num_samples = num_samples_per_window
num_steps = int(num_samples / step)
assert min_silence_samples >= step
outs = []
to_concat = []
for i in range(0, len(wav), step):
chunk = wav[i: i+num_samples]
if len(chunk) < num_samples:
chunk = F.pad(chunk, (0, num_samples - len(chunk)))
to_concat.append(chunk.unsqueeze(0))
if len(to_concat) >= batch_size:
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
out = run_function(model, chunks)
outs.append(out)
to_concat = []
if not torch.is_tensor(audio):
try:
audio = torch.Tensor(audio)
except:
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
if to_concat:
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
out = run_function(model, chunks)
outs.append(out)
if len(audio.shape) > 1:
for i in range(len(audio.shape)): # trying to squeeze empty dimensions
audio = audio.squeeze(0)
if len(audio.shape) > 1:
raise ValueError("More than one dimension in audio. Are you trying to process audio with 2 channels?")
outs = torch.cat(outs, dim=0).cpu()
if sample_rate == 8000 and window_size_samples > 768:
warnings.warn('window_size_samples is too big for 8000 sample_rate! Better set window_size_samples to 256, 512 or 768 for 8000 sample rate!')
if window_size_samples not in [256, 512, 768, 1024, 1536]:
warnings.warn('Unusual window_size_samples! Supported window_size_samples:\n - [512, 1024, 1536] for 16000 sample_rate\n - [256, 512, 768] for 8000 sample_rate')
model.reset_states()
min_speech_samples = sample_rate * min_speech_duration_ms / 1000
min_silence_samples = sample_rate * min_silence_duration_ms / 1000
speech_pad_samples = sample_rate * speech_pad_ms / 1000
audio_length_samples = len(audio)
speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
chunk = audio[current_start_sample: current_start_sample + window_size_samples]
if len(chunk) < window_size_samples:
chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
speech_prob = model(chunk, sample_rate).item()
speech_probs.append(speech_prob)
buffer = deque(maxlen=num_steps)
triggered = False
speeches = []
smoothed_probs = []
current_speech = {}
speech_probs = outs[:, 1] # 0 index for silence probs, 1 index for speech probs
median_probs = speech_probs.median()
trig_sum = 0.89 * median_probs + 0.08 # 0.08 when median is zero, 0.97 when median is 1
neg_threshold = threshold - 0.15
temp_end = 0
for i, predict in enumerate(speech_probs):
buffer.append(predict)
smoothed_prob = max(buffer)
if visualize_probs:
smoothed_probs.append(float(smoothed_prob))
if (smoothed_prob >= trig_sum) and temp_end:
for i, speech_prob in enumerate(speech_probs):
if (speech_prob >= threshold) and temp_end:
temp_end = 0
if (smoothed_prob >= trig_sum) and not triggered:
if (speech_prob >= threshold) and not triggered:
triggered = True
current_speech['start'] = step * max(0, i-num_steps)
current_speech['start'] = window_size_samples * i
continue
if (smoothed_prob < trig_sum) and triggered:
if (speech_prob < neg_threshold) and triggered:
if not temp_end:
temp_end = step * i
if step * i - temp_end < min_silence_samples:
temp_end = window_size_samples * i
if (window_size_samples * i) - temp_end < min_silence_samples:
continue
else:
current_speech['end'] = temp_end
@@ -271,24 +174,31 @@ def get_speech_ts_adaptive(wav: torch.Tensor,
current_speech = {}
triggered = False
continue
if current_speech:
current_speech['end'] = len(wav)
speeches.append(current_speech)
if visualize_probs:
pd.DataFrame({'probs': smoothed_probs}).plot(figsize=(16, 8))
for i, ts in enumerate(speeches):
if current_speech:
current_speech['end'] = audio_length_samples
speeches.append(current_speech)
for i, speech in enumerate(speeches):
if i == 0:
ts['start'] = max(0, ts['start'] - speech_pad_samples)
speech['start'] = int(max(0, speech['start'] - speech_pad_samples))
if i != len(speeches) - 1:
silence_duration = speeches[i+1]['start'] - ts['end']
silence_duration = speeches[i+1]['start'] - speech['end']
if silence_duration < 2 * speech_pad_samples:
ts['end'] += silence_duration // 2
speeches[i+1]['start'] = max(0, speeches[i+1]['start'] - silence_duration // 2)
speech['end'] += int(silence_duration // 2)
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - silence_duration // 2))
else:
ts['end'] += speech_pad_samples
speech['end'] += int(speech_pad_samples)
else:
ts['end'] = min(len(wav), ts['end'] + speech_pad_samples)
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
if return_seconds:
for speech_dict in speeches:
speech_dict['start'] = round(speech_dict['start'] / sample_rate, 1)
speech_dict['end'] = round(speech_dict['end'] / sample_rate, 1)
if visualize_probs:
make_visualization(speech_probs, window_size_samples / sample_rate)
return speeches
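# Illustrative usage sketch (assumption: `model` and `read_audio` were obtained via
# torch.hub.load('snakers4/silero-vad', 'silero_vad') as in the README; the path is hypothetical):
# wav = read_audio('test.wav', sampling_rate=16000)
# speech_timestamps = get_speech_timestamps(wav, model, threshold=0.5,
#                                           sample_rate=16000, return_seconds=True)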
@@ -344,13 +254,13 @@ def get_language_and_group(wav: torch.Tensor,
run_function=validate):
wav = torch.unsqueeze(wav, dim=0)
lang_logits, lang_group_logits = run_function(model, wav)
softm = torch.softmax(lang_logits, dim=1).squeeze()
softm_group = torch.softmax(lang_group_logits, dim=1).squeeze()
srtd = torch.argsort(softm, descending=True)
srtd_group = torch.argsort(softm_group, descending=True)
outs = []
outs_group = []
for i in range(top_n):
@@ -362,256 +272,83 @@ def get_language_and_group(wav: torch.Tensor,
return outs, outs_group
class VADiterator:
class VADIterator:
def __init__(self,
trig_sum: float = 0.26,
neg_trig_sum: float = 0.07,
num_steps: int = 8,
num_samples_per_window: int = 4000):
self.num_samples = num_samples_per_window
self.num_steps = num_steps
assert self.num_samples % num_steps == 0
self.step = int(self.num_samples / num_steps) # 500 samples is good enough
self.prev = torch.zeros(self.num_samples)
self.last = False
self.triggered = False
self.buffer = deque(maxlen=num_steps)
self.num_frames = 0
self.trig_sum = trig_sum
self.neg_trig_sum = neg_trig_sum
self.current_name = ''
model,
threshold: float = 0.5,
sample_rate: int = 16000,
min_silence_duration_ms: int = 100,
speech_pad_ms: int = 30
):
def refresh(self):
self.prev = torch.zeros(self.num_samples)
self.last = False
self.triggered = False
self.buffer = deque(maxlen=self.num_steps)
self.num_frames = 0
def prepare_batch(self, wav_chunk, name=None):
if (name is not None) and (name != self.current_name):
self.refresh()
self.current_name = name
assert len(wav_chunk) <= self.num_samples
self.num_frames += len(wav_chunk)
if len(wav_chunk) < self.num_samples:
wav_chunk = F.pad(wav_chunk, (0, self.num_samples - len(wav_chunk))) # short chunk => eof audio
self.last = True
stacked = torch.cat([self.prev, wav_chunk])
self.prev = wav_chunk
overlap_chunks = [stacked[i:i+self.num_samples].unsqueeze(0)
for i in range(self.step, self.num_samples+1, self.step)]
return torch.cat(overlap_chunks, dim=0)
def state(self, model_out):
current_speech = {}
speech_probs = model_out[:, 1] # this is very misleading
for i, predict in enumerate(speech_probs):
self.buffer.append(predict)
if ((sum(self.buffer) / len(self.buffer)) >= self.trig_sum) and not self.triggered:
self.triggered = True
current_speech[self.num_frames - (self.num_steps-i) * self.step] = 'start'
if ((sum(self.buffer) / len(self.buffer)) < self.neg_trig_sum) and self.triggered:
current_speech[self.num_frames - (self.num_steps-i) * self.step] = 'end'
self.triggered = False
if self.triggered and self.last:
current_speech[self.num_frames] = 'end'
if self.last:
self.refresh()
return current_speech, self.current_name
class VADiteratorAdaptive:
def __init__(self,
trig_sum: float = 0.26,
neg_trig_sum: float = 0.06,
step: int = 500,
num_samples_per_window: int = 4000,
speech_pad_samples: int = 1000,
accum_period: int = 50):
"""
This class is used for streaming silero VAD usage
Class for stream imitation
Parameters
----------
trig_sum: float
trigger value for speech probability, probs above this value are considered speech, switch to TRIGGERED state (default - 0.26)
model: preloaded .jit silero VAD model
neg_trig_sum: float
in triggered state, probabilities below this value are considered non-speech, switch to NONTRIGGERED state (default - 0.06)
threshold: float (default - 0.5)
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
step: int
step size in samples, (default - 500)
sample_rate: int (default - 16000)
Currently silero VAD models support 8000 and 16000 sample rates
num_samples_per_window: int
window size in samples (chunk length in samples to feed to NN, default - 4000)
speech_pad_samples: int
widen speech by this amount of samples each side (default - 1000)
accum_period: int
number of chunks / iterations to wait before switching from constant (initial) trig and neg_trig coeffs to adaptive median coeffs (default - 50)
min_silence_duration_ms: int (default - 100 milliseconds)
At the end of each speech chunk, wait for min_silence_duration_ms before separating it
speech_pad_ms: int (default - 30 milliseconds)
Final speech chunks are padded by speech_pad_ms each side
"""
self.num_samples = num_samples_per_window
self.num_steps = int(num_samples_per_window / step)
self.step = step
self.prev = torch.zeros(self.num_samples)
self.last = False
self.model = model
self.threshold = threshold
self.sample_rate = sample_rate
self.min_silence_samples = sample_rate * min_silence_duration_ms / 1000
self.speech_pad_samples = sample_rate * speech_pad_ms / 1000
self.reset_states()
def reset_states(self):
self.model.reset_states()
self.triggered = False
self.buffer = deque(maxlen=self.num_steps)
self.num_frames = 0
self.trig_sum = trig_sum
self.neg_trig_sum = neg_trig_sum
self.current_name = ''
self.median_meter = IterativeMedianMeter()
self.median = 0
self.total_steps = 0
self.accum_period = accum_period
self.speech_pad_samples = speech_pad_samples
self.temp_end = 0
self.current_sample = 0
def refresh(self):
self.prev = torch.zeros(self.num_samples)
self.last = False
self.triggered = False
self.buffer = deque(maxlen=self.num_steps)
self.num_frames = 0
self.median_meter.reset()
self.median = 0
self.total_steps = 0
def __call__(self, x, return_seconds=False):
"""
x: torch.Tensor
audio chunk (see examples in repo)
def prepare_batch(self, wav_chunk, name=None):
if (name is not None) and (name != self.current_name):
self.refresh()
self.current_name = name
assert len(wav_chunk) <= self.num_samples
self.num_frames += len(wav_chunk)
if len(wav_chunk) < self.num_samples:
wav_chunk = F.pad(wav_chunk, (0, self.num_samples - len(wav_chunk))) # short chunk => eof audio
self.last = True
return_seconds: bool (default - False)
whether to return timestamps in seconds (default - samples)
"""
window_size_samples = len(x[0]) if x.dim() == 2 else len(x)
self.current_sample += window_size_samples
stacked = torch.cat([self.prev, wav_chunk])
self.prev = wav_chunk
speech_prob = self.model(x, self.sample_rate).item()
overlap_chunks = [stacked[i:i+self.num_samples].unsqueeze(0)
for i in range(self.step, self.num_samples+1, self.step)]
return torch.cat(overlap_chunks, dim=0)
if (speech_prob >= self.threshold) and self.temp_end:
self.temp_end = 0
def state(self, model_out):
current_speech = {}
speech_probs = model_out[:, 1] # 0 index for silence probs, 1 index for speech probs
for i, predict in enumerate(speech_probs):
self.median = self.median_meter(predict.item())
if self.total_steps < self.accum_period:
trig_sum = self.trig_sum
neg_trig_sum = self.neg_trig_sum
if (speech_prob >= self.threshold) and not self.triggered:
self.triggered = True
speech_start = self.current_sample - self.speech_pad_samples
return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sample_rate, 1)}
if (speech_prob < self.threshold - 0.15) and self.triggered:
if not self.temp_end:
self.temp_end = self.current_sample
if self.current_sample - self.temp_end < self.min_silence_samples:
return None
else:
trig_sum = 0.89 * self.median + 0.08 # 0.08 when median is zero, 0.97 when median is 1
neg_trig_sum = 0.6 * self.median
self.total_steps += 1
self.buffer.append(predict)
smoothed_prob = max(self.buffer)
if (smoothed_prob >= trig_sum) and not self.triggered:
self.triggered = True
current_speech[max(0, self.num_frames - (self.num_steps-i) * self.step - self.speech_pad_samples)] = 'start'
if (smoothed_prob < neg_trig_sum) and self.triggered:
current_speech[self.num_frames - (self.num_steps-i) * self.step + self.speech_pad_samples] = 'end'
speech_end = self.temp_end + self.speech_pad_samples
self.temp_end = 0
self.triggered = False
if self.triggered and self.last:
current_speech[self.num_frames] = 'end'
if self.last:
self.refresh()
return current_speech, self.current_name
return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sample_rate, 1)}
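# Illustrative streaming sketch (assumption: `model` and `read_audio` as above; the 1536-sample
# chunk size follows the 16 kHz window sizes mentioned in get_speech_timestamps):
# vad_iterator = VADIterator(model)
# wav = read_audio('test.wav', sampling_rate=16000)
# for i in range(0, len(wav), 1536):
#     speech_dict = vad_iterator(wav[i: i + 1536], return_seconds=True)
#     if speech_dict:
#         print(speech_dict)
# vad_iterator.reset_states()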
def state_generator(model,
audios: List[str],
onnx: bool = False,
trig_sum: float = 0.26,
neg_trig_sum: float = 0.07,
num_steps: int = 8,
num_samples_per_window: int = 4000,
audios_in_stream: int = 2,
run_function=validate):
VADiters = [VADiterator(trig_sum, neg_trig_sum, num_steps, num_samples_per_window) for i in range(audios_in_stream)]
for i, current_pieces in enumerate(stream_imitator(audios, audios_in_stream, num_samples_per_window)):
for_batch = [x.prepare_batch(*y) for x, y in zip(VADiters, current_pieces)]
batch = torch.cat(for_batch)
outs = run_function(model, batch)
vad_outs = torch.split(outs, num_steps)
states = []
for x, y in zip(VADiters, vad_outs):
cur_st = x.state(y)
if cur_st[0]:
states.append(cur_st)
yield states
def stream_imitator(audios: List[str],
audios_in_stream: int,
num_samples_per_window: int = 4000):
audio_iter = iter(audios)
iterators = []
num_samples = num_samples_per_window
# initial wavs
for i in range(audios_in_stream):
next_wav = next(audio_iter)
wav = read_audio(next_wav)
wav_chunks = iter([(wav[i:i+num_samples], next_wav) for i in range(0, len(wav), num_samples)])
iterators.append(wav_chunks)
print('Done initial Loading')
good_iters = audios_in_stream
while True:
values = []
for i, it in enumerate(iterators):
try:
out, wav_name = next(it)
except StopIteration:
try:
next_wav = next(audio_iter)
print('Loading next wav: ', next_wav)
wav = read_audio(next_wav)
iterators[i] = iter([(wav[i:i+num_samples], next_wav) for i in range(0, len(wav), num_samples)])
out, wav_name = next(iterators[i])
except StopIteration:
good_iters -= 1
iterators[i] = repeat((torch.zeros(num_samples), 'junk'))
out, wav_name = next(iterators[i])
if good_iters == 0:
return
values.append((out, wav_name))
yield values
def single_audio_stream(model,
audio: torch.Tensor,
num_samples_per_window:int = 4000,
run_function=validate,
iterator_type='basic',
**kwargs):
num_samples = num_samples_per_window
if iterator_type == 'basic':
VADiter = VADiterator(num_samples_per_window=num_samples_per_window, **kwargs)
elif iterator_type == 'adaptive':
VADiter = VADiteratorAdaptive(num_samples_per_window=num_samples_per_window, **kwargs)
wav = read_audio(audio)
wav_chunks = iter([wav[i:i+num_samples] for i in range(0, len(wav), num_samples)])
for chunk in wav_chunks:
batch = VADiter.prepare_batch(chunk)
outs = run_function(model, batch)
states = []
state = VADiter.state(outs)
if state[0]:
states.append(state[0])
yield states
return None
def collect_chunks(tss: List[dict],


@@ -1,56 +0,0 @@
from utils_vad import *
import sys
import os
from pathlib import Path
sys.path.append('/home/keras/notebook/nvme_raid/adamnsandle/silero_mono/pipelines/align/bin/')
from align_utils import load_audio_norm
import torch
import pandas as pd
import numpy as np
sys.path.append('/home/keras/notebook/nvme_raid/adamnsandle/silero_mono/utils/')
from open_stt import soundfile_opus as sf
def split_save_audio_chunks(audio_path, model_path, save_path=None, device='cpu', absolute=True, max_duration=10, adaptive=False, **kwargs):
if not save_path:
save_path = str(Path(audio_path).with_name('after_vad'))
print(f'No save path specified! Using {save_path} to save audio chunks!')
SAMPLE_RATE = 16000
if type(model_path) == str:
#print('Loading model...')
model = init_jit_model(model_path, device)
else:
#print('Using loaded model')
model = model_path
save_name = Path(audio_path).stem
audio, sr = load_audio_norm(audio_path)
wav = torch.tensor(audio)
if adaptive:
speech_timestamps = get_speech_ts_adaptive(wav, model, device=device, **kwargs)
else:
speech_timestamps = get_speech_ts(wav, model, device=device, **kwargs)
full_save_path = Path(save_path, save_name)
if not os.path.exists(full_save_path):
os.makedirs(full_save_path, exist_ok=True)
chunks = []
if not speech_timestamps:
return pd.DataFrame()
for ts in speech_timestamps:
start_ts = int(ts['start'])
end_ts = int(ts['end'])
for i in range(start_ts, end_ts, max_duration * SAMPLE_RATE):
new_start = i
new_end = min(end_ts, i + max_duration * SAMPLE_RATE)
duration = round((new_end - new_start) / SAMPLE_RATE, 2)
chunk_path = Path(full_save_path, f'{save_name}_{new_start}-{new_end}.opus')
chunk_path = chunk_path.absolute() if absolute else chunk_path
sf.write(str(chunk_path), audio[new_start: new_end], 16000, format='OGG', subtype='OPUS')
chunks.append({'audio_path': chunk_path,
'text': '',
'duration': duration,
'domain': ''})
return pd.DataFrame(chunks)