From 422677ea8ec765340667a202d2b11a77e12b7c03 Mon Sep 17 00:00:00 2001
From: snakers41
Date: Tue, 15 Dec 2020 16:01:54 +0000
Subject: [PATCH] Polish readme

---
 README.md | 45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index 2aea2cb..67a49b4 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/joinchat/Bv9tjhpdXTI22OUgpOIIDg) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE)
-[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad/)
+[![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad/) (coming soon)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)

@@ -12,7 +12,7 @@
- [ONNX](#onnx)
- [Metrics](#metrics)
  - [Performance Metrics](#performance-metrics)
-  - [Quality Metrics](#quality-metrics)
+  - [VAD Quality Metrics](#vad-quality-metrics)
- [FAQ](#faq)
  - [How VAD Works](#how-vad-works)
  - [VAD Quality Metrics Methodology](#vad-quality-metrics-methodology)

@@ -25,28 +25,31 @@
# Silero VAD

-`Single Image Why our VAD is better than WebRTC`
+![image](https://user-images.githubusercontent.com/36505480/102233150-9f476580-3ef8-11eb-87fb-ae6f1edfe10f.png)

-Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier.
+**Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier.**

Enterprise-grade Speech Products made refreshingly simple (see our [STT](https://github.com/snakers4/silero-models) models).

-Currently, there are hardly any high quality / modern / free / public voice activity detectors except for WebRTC Voice Activity Detector ([link](https://github.com/wiseman/py-webrtcvad)).
+Currently, there are hardly any high quality / modern / free / public voice activity detectors except for the WebRTC Voice Activity Detector ([link](https://github.com/wiseman/py-webrtcvad)). WebRTC, however, is starting to show its age and suffers from many false positives.

-Also in enterprise it is crucial to be able to anonymize large-scale spoken corpora (i.e. remove personal data). Typically personal data is considered to be private / sensitive if it contains (i) a name (ii) some private ID. Name recognition is highly subjective and would depend on locale and business case, but Voice Activity and Number detections are quite general tasks.
+Also, in some cases it is crucial to be able to anonymize large-scale spoken corpora (i.e. remove personal data). Typically, personal data is considered private / sensitive if it contains either (i) a name or (ii) some private ID. Name recognition is highly subjective and depends on locale and business case, but Voice Activity Detection and Number Detection are quite general tasks.

**Key features:**
- Modern, portable;
-- Lowe memory footprint;
+- Low memory footprint;
- Superior metrics to WebRTC;
- Trained on huge spoken corpora and noise / sound libraries;
- Slower than WebRTC, but fast enough for IOT / edge / mobile applications;
+- Unlike WebRTC (which mostly distinguishes silence from voice), our VAD can also distinguish voice from noise / music / silence;

**Typical use cases:**
- Spoken corpora anonymization;
+- Can be used together with WebRTC;
- Voice activity detection for IOT / edge / mobile use cases;
-- Data cleaning and preparation, number and voice detection in general;
+- Data cleaning and preparation, number and voice detection in general;
+- PyTorch and ONNX models are provided, with a wide variety of deployment options and backends in mind;

## Getting Started

@@ -62,8 +65,8 @@ Currently we provide the following functionality:

| Version | Date | Comment |
|---------|-------------|---------------------------------------------------|
-| `v1` | 2020-12-15 | Initial release |
-| `v2` | coming soon | Add Number Detector or Language Classifier heads |
+| `v1` | 2020-12-15 | Initial release |
+| `v2` | coming soon | Add Number Detector or Language Classifier heads; lift the 250 ms chunk VAD limitation |

### PyTorch

@@ -129,20 +132,19 @@ speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate)
pprint(speech_timestamps)
```
-
## Metrics

### Performance Metrics

Speed metrics here.

-### Quality Metrics
+### VAD Quality Metrics

-We use random 0.25 second audio chunks to validate on. Speech to Non-speech ratio among chunks ~50/50, speech chunks are carved from real audios in four different languages (English, Russian, Spanish, German), then random random background noise is applied to some of them.
+We use random 250 ms audio chunks for validation. The speech to non-speech ratio among chunks is approximately 50/50 (i.e. balanced). Speech chunks are sampled from real audio in four different languages (English, Russian, Spanish, German); random background noise is then added to some of them (~40%).

-Since our models were trained on chunks of the same length, model's output is just one float number from 0 to 1 - **speech probability**. We use speech probabilities as tresholds for precision-recall curve.
+Since our VAD (only the VAD; the other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. Support for 100 - 150 ms chunks is coming soon; audio shorter than 100 - 150 ms cannot be confidently distinguished as speech.

-[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, each frame has corresponding number (0 **or** 1). We use 30ms frames for webrtc predicts, so each 0.25 second chunk is splitted into 8 frames, their **mean** value is used as a treshold for plot.
+[WebRTC](https://github.com/wiseman/py-webrtcvad) splits audio into frames; each frame gets a corresponding label (0 **or** 1). We use 30 ms frames for WebRTC, so each 250 ms chunk is split into 8 frames; their **mean** value is used as the threshold for the plot.

![image](https://user-images.githubusercontent.com/36505480/102233150-9f476580-3ef8-11eb-87fb-ae6f1edfe10f.png)
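+
+For illustration, here is a minimal, runnable sketch of this evaluation protocol. The `chunks` / `labels` data below are synthetic stand-ins and the scoring helper is a hypothetical name; this is not our actual benchmarking code:
+
+```python
+# Illustrative only: mirrors the methodology above, not the exact benchmark code.
+# Assumes 16 kHz, 16-bit mono PCM audio in 250 ms chunks.
+import numpy as np
+import webrtcvad
+from sklearn.metrics import precision_recall_curve
+
+SAMPLE_RATE = 16000
+FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2  # 30 ms of 16-bit PCM
+
+def webrtc_chunk_score(chunk: np.ndarray, aggressiveness: int = 3) -> float:
+    """Mean of the 8 binary frame decisions WebRTC makes per 250 ms chunk."""
+    vad = webrtcvad.Vad(aggressiveness)
+    pcm = (chunk * 32767).astype(np.int16).tobytes()
+    frames = [pcm[i:i + FRAME_BYTES] for i in range(0, len(pcm), FRAME_BYTES)]
+    flags = [vad.is_speech(f, SAMPLE_RATE) for f in frames if len(f) == FRAME_BYTES]
+    return float(np.mean(flags))
+
+# Synthetic stand-in validation set: 20 random 250 ms chunks with random labels.
+rng = np.random.default_rng(0)
+chunks = [rng.uniform(-1, 1, SAMPLE_RATE // 4).astype(np.float32) for _ in range(20)]
+labels = rng.integers(0, 2, size=len(chunks))  # ground-truth speech flags
+
+scores = [webrtc_chunk_score(c) for c in chunks]
+# For our VAD, `scores` would instead hold one speech probability per chunk.
+precision, recall, thresholds = precision_recall_curve(labels, scores)
+```
+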
## FAQ

### How VAD Works

-Bla-bla, 300ms, 15ms latency on 1 thread, see examples (naive, streaming).
+- Audio is split into 250 ms chunks;
+- The VAD keeps a record of the previous chunk (or zeros at the beginning of the stream);
+- Then this 500 ms of audio (250 ms + 250 ms) is split into N (typically 4 or 8) windows and the model is applied to this window batch. Each window is 250 ms long, so the windows naturally overlap;
+- The probability is then averaged across these windows;
+- Though pauses in speech are typically 300 ms or longer (pauses shorter than 200 - 300 ms are usually not meaningful), it is hard to confidently classify speech vs noise / music on very short chunks (i.e. 30 - 50 ms);
+- We are working on lifting this limitation, so that you can use 100 - 125 ms windows;
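+
+A rough, runnable sketch of this loop is below. The `speech_probability` helper is a hypothetical placeholder for the real model call, and the window spacing is an assumption for illustration:
+
+```python
+# Illustrative sketch of the windowing described above; not the actual model code.
+import numpy as np
+
+SAMPLE_RATE = 16000
+CHUNK = SAMPLE_RATE // 4  # 250 ms = 4000 samples at 16 kHz
+NUM_WINDOWS = 4           # typically 4 or 8
+
+def speech_probability(window: np.ndarray) -> float:
+    """Placeholder: the real model returns a speech probability per 250 ms window."""
+    return float(np.random.uniform())
+
+def chunk_probability(prev_chunk: np.ndarray, cur_chunk: np.ndarray) -> float:
+    buffer = np.concatenate([prev_chunk, cur_chunk])  # 500 ms of audio
+    # N overlapping 250 ms windows evenly spread over the 500 ms buffer
+    starts = np.linspace(0, len(buffer) - CHUNK, NUM_WINDOWS).astype(int)
+    probs = [speech_probability(buffer[s:s + CHUNK]) for s in starts]
+    return float(np.mean(probs))  # average across the window batch
+
+# Streaming loop: keep the previous chunk, zeros at the beginning of the stream.
+prev = np.zeros(CHUNK, dtype=np.float32)
+for cur in (np.random.uniform(-1, 1, CHUNK).astype(np.float32) for _ in range(8)):
+    print(round(chunk_probability(prev, cur), 3))
+    prev = cur
+```
+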
### VAD Quality Metrics Methodology

-TBD
+Please see [VAD Quality Metrics](#vad-quality-metrics).

### How Number Detector Works

-TBD
+TBD, but there is no explicit limitation on the way audio is split into chunks.

### How Language Classifier Works

-TBD
+TBD, but there is no explicit limitation on the way audio is split into chunks.

## Contact