Update README with video examples

commit 91648f32a8
parent 77f1e1ae81
Author: snakers4
Date: 2021-05-10 04:43:20 +00:00

README.md

@@ -7,14 +7,28 @@
![header](https://user-images.githubusercontent.com/12515440/89997349-b3523080-dc94-11ea-9906-ca2e8bc50535.png)

- [Silero VAD](#silero-vad)
  - [TLDR](#tldr)
  - [Live Demonstration](#live-demonstration)
  - [Getting Started](#getting-started)
    - [Pre-trained Models](#pre-trained-models)
    - [Version History](#version-history)
    - [PyTorch](#pytorch)
      - [VAD](#vad)
      - [Number Detector](#number-detector)
      - [Language Classifier](#language-classifier)
    - [ONNX](#onnx)
      - [VAD](#vad-1)
      - [Number Detector](#number-detector-1)
      - [Language Classifier](#language-classifier-1)
  - [Metrics](#metrics)
    - [Performance Metrics](#performance-metrics)
      - [Streaming Latency](#streaming-latency)
      - [Full Audio Throughput](#full-audio-throughput)
    - [VAD Quality Metrics](#vad-quality-metrics)
  - [FAQ](#faq)
    - [VAD Parameter Fine Tuning](#vad-parameter-fine-tuning)
      - [Classic way](#classic-way)
      - [Adaptive way](#adaptive-way)
    - [How VAD Works](#how-vad-works)
    - [VAD Quality Metrics Methodology](#vad-quality-metrics-methodology)
    - [How Number Detector Works](#how-number-detector-works)
@@ -22,14 +36,17 @@
  - [Contact](#contact)
    - [Get in Touch](#get-in-touch)
    - [Commercial Inquiries](#commercial-inquiries)
  - [References](#references)
  - [Citations](#citations)
# Silero VAD

![image](https://user-images.githubusercontent.com/36505480/107667211-06cf2680-6c98-11eb-9ee5-37eb4596260f.png)
## TLDR
**Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier.**

Enterprise-grade Speech Products made refreshingly simple (also see our [STT](https://github.com/snakers4/silero-models) models).

Currently, there are hardly any high-quality, modern, free, public voice activity detectors except for the WebRTC Voice Activity Detector ([link](https://github.com/wiseman/py-webrtcvad)). WebRTC, however, is starting to show its age and suffers from many false positives.
@@ -52,21 +69,31 @@ Also in some cases it is crucial to be able to anonymize large-scale spoken corp
- Data cleaning and preparation, number and voice detection in general;
- PyTorch and ONNX models were built with a wide variety of deployment options and backends in mind;
### Live Demonstration
For more information, please see [examples](https://github.com/snakers4/silero-vad/tree/master/examples).
https://user-images.githubusercontent.com/28188499/116685087-182ff100-a9b2-11eb-927d-ed9f621226ee.mp4
https://user-images.githubusercontent.com/8079748/117580455-4622dd00-b0f8-11eb-858d-e6368ed4eada.mp4
## Getting Started

The models are small enough to be included directly in this repository. Newer models will supersede older models directly.
### Pre-trained Models
**Currently we provide the following endpoints:**
| model= | Params | Model type | Streaming | Languages | PyTorch | ONNX | Colab |
|--------------------------|--------|---------------------|-----------|----------------------------|--------------------|--------------------|-------|
| `'silero_vad'` | 1.1M | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_micro'` | 10K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_micro_8k'` | 10K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_mini'` | 100K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_mini_8k'` | 100K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_number_detector'` | 1.1M | Number Detector | No | `ru`, `en`, `de`, `es` | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_lang_detector'` | 1.1M | Language Classifier | No | `ru`, `en`, `de`, `es` | :heavy_check_mark: | :heavy_check_mark: | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
(*) Though explicitly trained on these languages, the VAD should work on any Germanic, Romance or Slavic language out of the box.
@@ -76,17 +103,19 @@ What models do:
- Number Detector - detects spoken numbers (e.g. thirty-five);
- Language Classifier - classifies utterances between languages;
### Version History
**Version history:**

| Version | Date | Comment |
|---------|------------|---------|
| `v1` | 2020-12-15 | Initial release |
| `v1.1` | 2020-12-24 | Better VAD models, compatible with chunks shorter than 250 ms |
| `v1.2` | 2020-12-30 | Number Detector added |
| `v2` | 2021-01-11 | Add Language Classifier heads (`en`, `ru`, `de`, `es`) |
| `v2.1` | 2021-02-11 | Add micro (10k params) VAD models |
| `v2.2` | 2021-03-22 | Add micro 8 kHz sample rate VAD models |
| `v2.3` | 2021-04-12 | Add mini (100k params) VAD models (8 kHz and 16 kHz sample rates) + **new** adaptive utils for full audio and single audio stream |
### PyTorch
@@ -320,30 +349,30 @@ Streaming latency depends on 2 variables:
So the **batch size** for streaming is **num_steps × number of audio streams**. The time between receiving new audio chunks and getting results is shown in the table below:
| Batch size | PyTorch model time, ms | ONNX model time, ms |
|:----------:|:----------------------:|:-------------------:|
| **2** | 9 | 2 |
| **4** | 11 | 4 |
| **8** | 14 | 7 |
| **16** | 19 | 12 |
| **40** | 36 | 29 |
| **80** | 64 | 55 |
| **120** | 96 | 85 |
| **200** | 157 | 137 |
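The batch-size relation above can be sketched as a one-liner; the helper name below is illustrative and not part of the Silero utils:

```python
def streaming_batch_size(num_steps: int, num_streams: int) -> int:
    # each audio stream contributes num_steps overlapping windows per chunk,
    # so the model processes num_steps * num_streams windows per forward pass
    return num_steps * num_streams

# e.g. 25 parallel streams with num_steps=8 land on the last row of the table above
print(streaming_batch_size(8, 25))  # -> 200
```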
#### Full Audio Throughput

**RTS** (seconds of audio processed per second of wall-clock time; real-time speed, or 1 / RTF) for full audio processing depends on **num_steps** (see the previous paragraph) and the **batch size** (bigger is better).
| Batch size | num_steps | PyTorch model RTS | ONNX model RTS |
|:----------:|:---------:|:-----------------:|:--------------:|
| **40** | **4** | 68 | 86 |
| **40** | **8** | 34 | 43 |
| **80** | **4** | 78 | 91 |
| **80** | **8** | 39 | 45 |
| **120** | **4** | 78 | 88 |
| **120** | **8** | 39 | 44 |
| **200** | **4** | 80 | 91 |
| **200** | **8** | 40 | 46 |
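To put the RTS numbers in perspective (the helper below is only an illustration, not a library function): at an RTS of 91, an hour of audio takes roughly 40 seconds to process.

```python
def processing_time_s(audio_seconds: float, rts: float) -> float:
    # RTS = seconds of audio per second of compute, i.e. 1 / RTF,
    # so wall-clock time is the audio duration divided by RTS
    return audio_seconds / rts

# one hour of audio at the best ONNX configuration above (batch 200, num_steps=4)
print(round(processing_time_s(3600, 91), 1))  # -> 39.6
```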
### VAD Quality Metrics
@@ -361,7 +390,7 @@ Since our VAD (only VAD, other networks are more flexible) was trained on chunks
### VAD Parameter Fine Tuning

#### Classic way

**This is the straightforward classic method `get_speech_ts`, where the thresholds (`trig_sum` and `neg_trig_sum`) are specified by the user.**

- Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with VAD;
@@ -382,7 +411,7 @@ speech_timestamps = get_speech_ts(wav, model,
                                  visualize_probs=True)
```
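Conceptually, the two thresholds act as a hysteresis on the per-chunk speech probabilities: speech is triggered when the probability rises above `trig_sum` and released only when it falls below `neg_trig_sum`. A minimal self-contained sketch of that idea (not the actual `get_speech_ts` implementation; the threshold values here are made up for illustration):

```python
def speech_regions(probs, trig=0.4, neg_trig=0.1):
    # hysteresis over chunk-level speech probabilities:
    # enter a speech region above `trig`, leave it only below `neg_trig`
    regions, start, in_speech = [], None, False
    for i, p in enumerate(probs):
        if not in_speech and p >= trig:
            in_speech, start = True, i
        elif in_speech and p < neg_trig:
            in_speech = False
            regions.append((start, i))
    if in_speech:  # audio ended mid-speech
        regions.append((start, len(probs)))
    return regions

print(speech_regions([0.0, 0.8, 0.3, 0.05, 0.9, 0.9]))  # -> [(1, 3), (4, 6)]
```

The gap between the two thresholds is what prevents rapid on/off flickering around a single cutoff value.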
#### Adaptive way

**The adaptive algorithm (`get_speech_ts_adaptive`) selects the thresholds (`trig_sum` and `neg_trig_sum`) automatically, based on the median speech probabilities over the whole audio. Note that some of its arguments differ from those of the classic function.**
- `batch_size` - batch size to feed to silero VAD (default - `200`) - `batch_size` - batch size to feed to silero VAD (default - `200`)