mirror of https://github.com/snakers4/silero-vad.git

Update README.md

README.md
The models are small enough to be included directly into this repository.

Currently we provide the following functionality:

| PyTorch | ONNX | VAD | Number Detector | Language Clf | Languages | Colab |
|-------------------|--------------------|---------------------|--------------------|--------------------|------------------------|-------|
| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | `ru`, `en`, `de`, `es` | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
**Version history:**

| Version | Date | Comment |
|---------|-------------|---------------------------------------------------|
| `v1` | 2020-12-15 | Initial release |
| `v1.1` | 2020-12-24 | Better VAD models compatible with chunks shorter than 250 ms |
| `v1.2` | 2020-12-30 | Number Detector added |
| `v2` | 2021-01-11 | Language Classifier heads added |
### PyTorch

[](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)

[](https://pytorch.org/hub/snakers4_silero-vad/) (coming soon)

#### VAD
```python
import torch
torch.set_num_threads(1)
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

(get_speech_ts,
 _, read_audio,
 _, _) = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

wav = read_audio(f'{files_dir}/en.wav')
# get speech timestamps from the full audio file
speech_timestamps = get_speech_ts(wav, model,
                                  num_steps=4)
pprint(speech_timestamps)
```
#### Number Detector

```python
import torch
torch.set_num_threads(1)
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_number_detector',
                              force_reload=True)

(get_number_ts,
 _, read_audio,
 _, _) = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

wav = read_audio(f'{files_dir}/en_num.wav')  # full audio
# get number timestamps from the full audio file
number_timestamps = get_number_ts(wav, model)

pprint(number_timestamps)
```
#### Language Classifier

```python
import torch
torch.set_num_threads(1)
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_lang_detector',
                              force_reload=True)

get_language, read_audio = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

wav = read_audio(f'{files_dir}/de.wav')
language = get_language(wav, model)

pprint(language)
```
### ONNX

[](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)

You can run our models anywhere you can import an ONNX model or run the ONNX runtime.

#### VAD
```python
import torch
import onnxruntime
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

(get_speech_ts,
 _, read_audio,
 _, _) = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

def init_onnx_model(model_path: str):
    return onnxruntime.InferenceSession(model_path)

def validate_onnx(model, inputs):
    with torch.no_grad():
        ort_inputs = {'input': inputs.cpu().numpy()}
        outs = model.run(None, ort_inputs)
        outs = [torch.Tensor(x) for x in outs]
    return outs

model = init_onnx_model(f'{files_dir}/model.onnx')  # VAD ONNX file shipped with the repo
wav = read_audio(f'{files_dir}/en.wav')

# get speech timestamps from the full audio file
speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate_onnx)
pprint(speech_timestamps)
```
#### Number Detector

```python
import torch
import onnxruntime
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_number_detector',
                              force_reload=True)

(get_number_ts,
 _, read_audio,
 _, _) = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

def init_onnx_model(model_path: str):
    return onnxruntime.InferenceSession(model_path)

def validate_onnx(model, inputs):
    with torch.no_grad():
        ort_inputs = {'input': inputs.cpu().numpy()}
        outs = model.run(None, ort_inputs)
        outs = [torch.Tensor(x) for x in outs]
    return outs

model = init_onnx_model(f'{files_dir}/number_detector.onnx')
wav = read_audio(f'{files_dir}/en_num.wav')

# get number timestamps from the full audio file
number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)
pprint(number_timestamps)
```
#### Language Classifier

```python
import torch
import onnxruntime
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_lang_detector',
                              force_reload=True)

get_language, read_audio = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

def init_onnx_model(model_path: str):
    return onnxruntime.InferenceSession(model_path)

def validate_onnx(model, inputs):
    with torch.no_grad():
        ort_inputs = {'input': inputs.cpu().numpy()}
        outs = model.run(None, ort_inputs)
        outs = [torch.Tensor(x) for x in outs]
    return outs

model = init_onnx_model(f'{files_dir}/number_detector.onnx')
wav = read_audio(f'{files_dir}/de.wav')

language = get_language(wav, model, run_function=validate_onnx)
pprint(language)
```
## Metrics

### Performance Metrics

The **batch size** for streaming is **num_steps * number of audio streams**: for example, five audio streams with `num_steps=4` produce batches of 20 windows.

### Quality Metrics

We use random 250 ms audio chunks for validation. The speech to non-speech ratio among chunks is roughly balanced, i.e. about 50/50. Speech chunks are sampled from real audio in four different languages (English, Russian, Spanish, German), and random background noise is added to some of them (~40%).

Since our VAD (only the VAD; the other networks are more flexible) was trained on chunks of the same length, the model's output is a single float from 0 to 1 - the **speech probability**. We use speech probabilities as thresholds for the precision-recall curve, as sketched below. This can be extended to 100 - 150 ms. Audio shorter than 100 - 150 ms cannot be confidently distinguished as speech.
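A minimal sketch of this thresholding, assuming per-chunk speech probabilities and ground-truth labels are available; scikit-learn and the toy values are our own illustration, not part of the original README:

```python
# sketch: treat each chunk's speech probability as the score for a
# precision-recall curve; labels/probs below are illustrative only
from sklearn.metrics import precision_recall_curve

labels = [1, 0, 1, 1, 0, 0, 1, 0]                           # ground truth per chunk
probs = [0.93, 0.12, 0.85, 0.60, 0.33, 0.05, 0.71, 0.48]    # model speech probabilities

precision, recall, thresholds = precision_recall_curve(labels, probs)
for p, r, t in zip(precision, recall, thresholds):
    print(f'threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}')
```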
[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, and each frame gets a corresponding number (0 **or** 1). We use 30 ms frames for webrtc, so each 250 ms chunk is split into 8 frames; their **mean** value is used as a threshold for the plot. A sketch of this baseline follows.
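A minimal sketch of that frame-mean scheme with py-webrtcvad; the 16 kHz, 16-bit mono PCM assumption and byte arithmetic are ours, not from the README:

```python
# mean of per-frame 0/1 webrtc decisions over one 250 ms chunk (8 frames of 30 ms)
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 30 ms of 16-bit samples = 960 bytes

def webrtc_chunk_score(pcm_chunk: bytes, aggressiveness: int = 1) -> float:
    """Average the 0/1 webrtc decisions over the full 30 ms frames in a chunk."""
    vad = webrtcvad.Vad(aggressiveness)
    frames = [pcm_chunk[i:i + FRAME_BYTES]
              for i in range(0, len(pcm_chunk) - FRAME_BYTES + 1, FRAME_BYTES)]
    decisions = [vad.is_speech(frame, SAMPLE_RATE) for frame in frames]
    return sum(decisions) / len(decisions)
```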
## FAQ

### VAD Parameter Fine Tuning

- Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with the VAD;
- We provide sensible default hyper-parameters that work for us, but your case can be different - see the tuning sketch after this list;
- `trig_sum` - overlapping windows are used for each audio chunk; `trig_sum` defines the average probability among those windows required to switch into the triggered (speech) state;
- `neg_trig_sum` - same as `trig_sum`, but for switching from the triggered to the non-triggered (non-speech) state;
- `num_steps` - number of overlapping windows to split each audio chunk into (we recommend 4 or 8);
- `num_samples_per_window` - number of samples in each window; our models were trained using `4000` samples (250 ms) per window, so this is the preferable value (lower values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
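A minimal tuning sketch, assuming the keyword names above match the `get_speech_ts` signature in `utils.py`; the threshold values and the input file name are illustrative, not recommendations:

```python
import torch
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad')
(get_speech_ts,
 _, read_audio,
 _, _) = utils

wav = read_audio('my_audio.wav')  # hypothetical input file

speech_timestamps = get_speech_ts(wav, model,
                                  trig_sum=0.3,                 # stricter switch into speech
                                  neg_trig_sum=0.05,            # stickier speech state
                                  num_steps=8,                  # finer window overlap
                                  num_samples_per_window=4000)  # 250 ms, the trained value
pprint(speech_timestamps)
```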
### How VAD Works

- Audio is split into 250 ms chunks (you can choose any chunk size, but quality suffers with chunks shorter than 100 ms: there are more false positives and "unnatural" pauses);
- The VAD keeps a record of the previous chunk (or zeros at the beginning of the stream);
- This 500 ms of audio (250 ms + 250 ms) is then split into N (typically 4 or 8) windows, and the model is applied to this window batch; each window is 250 ms long, so naturally the windows overlap - see the sketch after this list;
- The probability is then averaged across these windows;
- Though pauses in speech are typically 300 ms or longer (pauses shorter than 200 - 300 ms are usually not meaningful), it is hard to confidently classify speech vs noise / music on very short chunks (i.e. 30 - 50 ms);
- ~~We are working on lifting this limitation, so that you can use 100 - 125 ms windows~~;
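An illustrative sketch of this sliding-window scheme; it is our own simplification, and the batched `model(windows)` call returning per-window speech probabilities is an assumption, not the repo's exact interface:

```python
import torch

CHUNK = 4000   # 250 ms at 16 kHz
NUM_STEPS = 4  # number of overlapping windows

def chunk_speech_prob(model, prev_chunk: torch.Tensor, cur_chunk: torch.Tensor) -> float:
    """Average speech probability for cur_chunk, given the preceding chunk."""
    audio = torch.cat([prev_chunk, cur_chunk])  # 500 ms of context
    # NUM_STEPS overlapping 250 ms windows spread evenly over the 500 ms buffer
    starts = torch.linspace(0, len(audio) - CHUNK, NUM_STEPS).long()
    windows = torch.stack([audio[s:s + CHUNK] for s in starts])
    with torch.no_grad():
        probs = model(windows)  # assumed shape: (NUM_STEPS,) speech probabilities
    return probs.mean().item()
```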
### VAD Quality Metrics Methodology

Please see [Quality Metrics](#quality-metrics)
### How Number Detector Works

- It is recommended to split long audio into short fragments (< 15 s) and apply the model to each of them;
- The Number Detector can classify whether the whole audio contains a number, or whether each audio frame contains a number;
- Audio is split into frames in a certain way, so, given the per-frame output, we can restore the timing bounds of the numbers with an accuracy of about 0.2 s (see the sketch after this list);
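A hypothetical post-processing sketch of that last step: the fixed 0.2 s frame step and the helper below are our own illustration of how per-frame flags can be collapsed into time bounds, not the repo's actual code:

```python
from typing import Dict, List

FRAME_STEP_S = 0.2  # assumed frame granularity, matching the ~0.2 s accuracy above

def frames_to_bounds(frame_flags: List[int]) -> List[Dict[str, float]]:
    """Collapse consecutive positive frames into {'start': s, 'end': s} spans."""
    spans, start = [], None
    for i, flag in enumerate(frame_flags):
        if flag and start is None:
            start = i  # a run of number frames begins
        elif not flag and start is not None:
            spans.append({'start': start * FRAME_STEP_S, 'end': i * FRAME_STEP_S})
            start = None
    if start is not None:  # run extends to the end of the audio
        spans.append({'start': start * FRAME_STEP_S,
                      'end': len(frame_flags) * FRAME_STEP_S})
    return spans

print(frames_to_bounds([0, 1, 1, 1, 0, 0, 1, 1]))
# [{'start': 0.2, 'end': 0.8}, {'start': 1.2, 'end': 1.6}]
```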
### How Language Classifier Works

- More languages TBD;
- Arbitrary audio lengths can be used, although the network was trained on audio shorter than 15 seconds;
## Contact

### Get in Touch