mirror of https://github.com/snakers4/silero-vad.git (synced 2026-02-04 17:39:22 +08:00)
Update README with video examples
README.md
@@ -7,14 +7,28 @@

- [Silero VAD](#silero-vad)
  - [TLDR](#tldr)
    - [Live Demonstration](#live-demonstration)
  - [Getting Started](#getting-started)
    - [Pre-trained Models](#pre-trained-models)
    - [Version History](#version-history)
    - [PyTorch](#pytorch)
      - [VAD](#vad)
      - [Number Detector](#number-detector)
      - [Language Classifier](#language-classifier)
    - [ONNX](#onnx)
      - [VAD](#vad-1)
      - [Number Detector](#number-detector-1)
      - [Language Classifier](#language-classifier-1)
  - [Metrics](#metrics)
    - [Performance Metrics](#performance-metrics)
      - [Streaming Latency](#streaming-latency)
      - [Full Audio Throughput](#full-audio-throughput)
    - [VAD Quality Metrics](#vad-quality-metrics)
  - [FAQ](#faq)
    - [VAD Parameter Fine Tuning](#vad-parameter-fine-tuning)
      - [Classic way](#classic-way)
      - [Adaptive way](#adaptive-way)
    - [How VAD Works](#how-vad-works)
    - [VAD Quality Metrics Methodology](#vad-quality-metrics-methodology)
    - [How Number Detector Works](#how-number-detector-works)
@@ -22,14 +36,17 @@

  - [Contact](#contact)
    - [Get in Touch](#get-in-touch)
    - [Commercial Inquiries](#commercial-inquiries)
  - [References](#references)
    - [Citations](#citations)

# Silero VAD

## TLDR

**Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier.**
Enterprise-grade Speech Products made refreshingly simple (also see our [STT](https://github.com/snakers4/silero-models) models).

Currently, there are hardly any high-quality / modern / free / public voice activity detectors except for the WebRTC Voice Activity Detector ([link](https://github.com/wiseman/py-webrtcvad)). WebRTC, however, is starting to show its age and suffers from many false positives.

@@ -52,21 +69,31 @@ Also in some cases it is crucial to be able to anonymize large-scale spoken corp

- Data cleaning and preparation, number and voice detection in general;
- PyTorch and ONNX models can be used with a wide variety of deployment options and backends;

### Live Demonstration

For more information, please see the [examples](https://github.com/snakers4/silero-vad/tree/master/examples).

https://user-images.githubusercontent.com/28188499/116685087-182ff100-a9b2-11eb-927d-ed9f621226ee.mp4

https://user-images.githubusercontent.com/8079748/117580455-4622dd00-b0f8-11eb-858d-e6368ed4eada.mp4

## Getting Started

The models are small enough to be included directly in this repository. Newer models will supersede older models directly.

### Pre-trained Models

**Currently we provide the following endpoints:**

| model=                     | Params | Model type          | Streaming | Languages                  | PyTorch            | ONNX               | Colab |
| -------------------------- | ------ | ------------------- | --------- | -------------------------- | ------------------ | ------------------ | ----- |
| `'silero_vad'`             | 1.1M   | VAD                 | Yes       | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_micro'`       | 10K    | VAD                 | Yes       | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_micro_8k'`    | 10K    | VAD                 | Yes       | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_mini'`        | 100K   | VAD                 | Yes       | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_vad_mini_8k'`     | 100K   | VAD                 | Yes       | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_number_detector'` | 1.1M   | Number Detector     | No        | `ru`, `en`, `de`, `es`     | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| `'silero_lang_detector'`   | 1.1M   | Language Classifier | No        | `ru`, `en`, `de`, `es`     | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |

(*) Though explicitly trained on these languages, VAD should work on any Germanic, Romance or Slavic language out of the box.

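The PyTorch models above can be pulled through `torch.hub`; here is a minimal loading sketch (what exactly the returned `utils` tuple contains is version-dependent, so treat it as an assumption and defer to the PyTorch and ONNX sections below for the repo's own examples):

```python
import torch

torch.set_num_threads(1)

# Minimal sketch: pick any endpoint name from the `model=` column above.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

# `utils` bundles helper functions (audio reading, the get_speech_ts* helpers
# shown further below, etc.); its exact composition may vary between versions.
print(type(model), len(utils))
```
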
@@ -76,17 +103,19 @@ What models do:

- Number Detector - detects spoken numbers (e.g. thirty-five);
- Language Classifier - classifies utterances by language;

### Version History

**Version history:**

| Version | Date       | Comment                                                                                                                            |
| ------- | ---------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `v1`    | 2020-12-15 | Initial release                                                                                                                    |
| `v1.1`  | 2020-12-24 | Better VAD models, compatible with chunks shorter than 250 ms                                                                      |
| `v1.2`  | 2020-12-30 | Number Detector added                                                                                                              |
| `v2`    | 2021-01-11 | Add Language Classifier heads (en, ru, de, es)                                                                                     |
| `v2.1`  | 2021-02-11 | Add micro (10k params) VAD models                                                                                                  |
| `v2.2`  | 2021-03-22 | Add micro 8 kHz sample rate VAD models                                                                                             |
| `v2.3`  | 2021-04-12 | Add mini (100k params) VAD models (8 kHz and 16 kHz sample rates) + **new** adaptive utils for full audio and single audio stream |

### PyTorch
@@ -320,30 +349,30 @@ Streaming latency depends on 2 variables:

So the **batch size** for streaming is **num_steps * number of audio streams**. The time between receiving new audio chunks and getting results is shown in the table below:

| Batch size | PyTorch model time, ms | ONNX model time, ms |
| :--------: | :--------------------: | :-----------------: |
| **2**      | 9                      | 2                   |
| **4**      | 11                     | 4                   |
| **8**      | 14                     | 7                   |
| **16**     | 19                     | 12                  |
| **40**     | 36                     | 29                  |
| **80**     | 64                     | 55                  |
| **120**    | 96                     | 85                  |
| **200**    | 157                    | 137                 |

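For example, serving ten concurrent audio streams with `num_steps = 8` yields a streaming batch size of 80, which the table above puts at roughly 64 ms per batch of incoming chunks for the PyTorch model and 55 ms for the ONNX model.
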
#### Full Audio Throughput

**RTS** (seconds of audio processed per second, real-time speed, or 1 / RTF) for full audio processing depends on **num_steps** (see the previous paragraph) and the **batch size** (bigger is better).

| Batch size | num_steps | PyTorch model RTS | ONNX model RTS |
| :--------: | :-------: | :---------------: | :------------: |
| **40**     | **4**     | 68                | 86             |
| **40**     | **8**     | 34                | 43             |
| **80**     | **4**     | 78                | 91             |
| **80**     | **8**     | 39                | 45             |
| **120**    | **4**     | 78                | 88             |
| **120**    | **8**     | 39                | 44             |
| **200**    | **4**     | 80                | 91             |
| **200**    | **8**     | 40                | 46             |

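To put these numbers in perspective, an RTS of 91 (ONNX model, batch size 200, `num_steps = 4`) corresponds to processing roughly one hour of audio in about 40 seconds of wall-clock time.
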
### VAD Quality Metrics
@@ -361,7 +390,7 @@ Since our VAD (only VAD, other networks are more flexible) was trained on chunks

### VAD Parameter Fine Tuning

#### Classic way

**This is the straightforward classic method `get_speech_ts`, where the thresholds (`trig_sum` and `neg_trig_sum`) are specified by the user.**

- Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with VAD;
@@ -382,7 +411,7 @@ speech_timestamps = get_speech_ts(wav, model,
                                  visualize_probs=True)
```

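The hunk above shows only the tail of the snippet; below is a fuller, hedged sketch of the classic call with hand-picked thresholds. The loading step and `utils` unpacking order are assumptions, and the threshold values are illustrative rather than tuned recommendations:

```python
from pprint import pprint

import torch

# Assumed loading step; the order of the helpers inside `utils`
# is an assumption and may differ between versions.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)
get_speech_ts, _, _, read_audio = utils[:4]

wav = read_audio('test.wav')

# Classic way: `trig_sum` is the (higher) threshold for entering a speech
# segment, `neg_trig_sum` the (lower) one for leaving it; both are set by hand.
speech_timestamps = get_speech_ts(wav, model,
                                  trig_sum=0.25,
                                  neg_trig_sum=0.07,
                                  num_steps=4,
                                  visualize_probs=True)
pprint(speech_timestamps)
```
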
#### Adaptive way

**The adaptive algorithm (`get_speech_ts_adaptive`) automatically selects the thresholds (`trig_sum` and `neg_trig_sum`) based on the median speech probabilities over the whole audio. Note that some of its arguments differ from those of the classic function (a usage sketch follows the argument list below).**

- `batch_size` - batch size to feed to Silero VAD (default - `200`)

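A hedged end-to-end sketch of the adaptive call, under the same loading assumptions as above; only `batch_size` is documented here, so everything else is left at its defaults:

```python
from pprint import pprint

import torch

# Assumed loading step; the order of the helpers inside `utils`
# is an assumption and may differ between versions.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)
_, get_speech_ts_adaptive, _, read_audio = utils[:4]

wav = read_audio('test.wav')

# Adaptive way: the speech / non-speech thresholds are derived from the median
# speech probability over the whole audio, so only batching is chosen here.
speech_timestamps = get_speech_ts_adaptive(wav, model, batch_size=200)
pprint(speech_timestamps)
```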