Mirror of https://github.com/snakers4/silero-vad.git
Merge branch 'adamnsandle' of github.com:snakers4/silero-vad into adamnsandle
Changed file: README.md
@@ -57,9 +57,9 @@ The models are small enough to be included directly into this repository. Newer
 Currently we provide the following functionality:
 
 | PyTorch | ONNX | VAD | Number Detector | Language Clf | Languages | Colab |
-|-------------------|--------------------|---------------------|--------------------|--------------|------------------------|-------|
-| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |              | `ru`, `en`, `de`, `es` | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
+|-------------------|--------------------|---------------------|--------------------|--------------------|------------------------|-------|
+| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | `ru`, `en`, `de`, `es` | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
 
 **Version history:**
 
@@ -68,7 +68,7 @@ Currently we provide the following functionality:
 | `v1` | 2020-12-15 | Initial release |
 | `v1.1` | 2020-12-24 | Better VAD models, compatible with chunks shorter than 250 ms |
 | `v1.2` | 2020-12-30 | Number Detector added |
-| `v2` | coming soon | Add Language Classifier heads |
+| `v2` | 2021-01-11 | Add Language Classifier heads |
 
 ### PyTorch
 
@@ -126,6 +126,27 @@ number_timestamps = get_number_ts(wav, model)
 pprint(number_timestamps)
 ```
 
+### Language Classifier
+
+```python
+import torch
+torch.set_num_threads(1)
+from pprint import pprint
+
+# Load the language detector and its helper utilities from torch.hub
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_lang_detector',
+                              force_reload=True)
+
+get_language, read_audio = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+wav = read_audio(f'{files_dir}/de.wav')
+language = get_language(wav, model)
+
+pprint(language)
+```
+
 ### ONNX
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
@@ -201,6 +222,38 @@ wav = read_audio(f'{files_dir}/en_num.wav')
 number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)
 pprint(number_timestamps)
 ```
 
+### Language Classifier
+
+```python
+import torch
+import onnxruntime
+from pprint import pprint
+
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_lang_detector',
+                              force_reload=True)
+
+get_language, read_audio = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+def init_onnx_model(model_path: str):
+    return onnxruntime.InferenceSession(model_path)
+
+# Adapter that lets the torch.hub utils run inference through onnxruntime:
+# it feeds the inputs to the ONNX session and converts outputs back to tensors.
+def validate_onnx(model, inputs):
+    with torch.no_grad():
+        ort_inputs = {'input': inputs.cpu().numpy()}
+        outs = model.run(None, ort_inputs)
+        outs = [torch.Tensor(x) for x in outs]
+    return outs
+
+model = init_onnx_model(f'{files_dir}/number_detector.onnx')
+wav = read_audio(f'{files_dir}/de.wav')
+
+language = get_language(wav, model, run_function=validate_onnx)
+pprint(language)
+```
+
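The session factory above uses onnxruntime defaults. As an optional tweak (an assumption on our part, not from the README), onnxruntime can be pinned to a single thread, mirroring the `torch.set_num_threads(1)` call in the PyTorch examples, so both backends are exercised under comparable conditions:

```python
import onnxruntime

# Sketch: single-threaded session, mirroring torch.set_num_threads(1);
# SessionOptions and the thread-count fields are standard onnxruntime options.
def init_onnx_model(model_path: str):
    opts = onnxruntime.SessionOptions()
    opts.intra_op_num_threads = 1
    opts.inter_op_num_threads = 1
    return onnxruntime.InferenceSession(model_path, sess_options=opts)
```
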
 ## Metrics
 
 ### Performance Metrics
@@ -252,7 +305,7 @@ So **batch size** for streaming is **num_steps * number of audio streams**. Time
 
 We use random 250 ms audio chunks for validation. The speech to non-speech ratio among chunks is roughly 50/50 (i.e. balanced). Speech chunks are sampled from real audio in four different languages (English, Russian, Spanish, German), then random background noise is added to some of them (~40%).
 
-Since our VAD (only VAD, other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms (coming soon). Less than 100 - 150 ms cannot be distinguished as speech with confidence.
+Since our VAD (only VAD, other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms. Less than 100 - 150 ms cannot be distinguished as speech with confidence.
 
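To make the threshold sweep concrete, here is a minimal sketch (ours, not from the repository) that turns per-chunk speech probabilities and ground-truth labels into a precision-recall curve; the `labels` and `probs` arrays are hypothetical and scikit-learn is assumed to be installed:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical per-chunk ground truth (1 = speech) and model probabilities.
labels = np.array([1, 0, 1, 1, 0, 0])
probs = np.array([0.90, 0.20, 0.70, 0.60, 0.40, 0.10])

# Each distinct probability acts as a threshold on the curve.
precision, recall, thresholds = precision_recall_curve(labels, probs)
for p, r, t in zip(precision, recall, thresholds):
    print(f'threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}')
```
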
 [Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames; each frame has a corresponding number (0 **or** 1). We use 30 ms frames for webrtc, so each 250 ms chunk is split into 8 frames, and their **mean** value is used as a threshold for the plot.
 
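For illustration, a minimal sketch (ours, not repository code) of scoring one 250 ms chunk with py-webrtcvad this way; it assumes 16 kHz, 16-bit mono PCM bytes, so a 30 ms frame is 960 bytes and a 250 ms chunk yields 8 full frames:

```python
import webrtcvad

SAMPLE_RATE = 16000
FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2   # 30 ms of 16-bit mono PCM

vad = webrtcvad.Vad(3)  # aggressiveness mode 0-3

def chunk_score(pcm: bytes) -> float:
    """Mean of the 0/1 webrtcvad decisions over the chunk's 30 ms frames."""
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    flags = [vad.is_speech(frame, SAMPLE_RATE) for frame in frames]
    return sum(flags) / len(flags)
```
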
@@ -290,7 +343,10 @@ Please see [Quality Metrics](#quality-metrics)
 
 ### How Language Classifier Works
 
-TBD, but there is no explicit limitation on the way audio is split into chunks.
+- **99%** validation accuracy
+- The language classifier was trained using audio samples in 4 languages: **Russian**, **English**, **Spanish**, **German**
+- More languages TBD
+- Arbitrary audio lengths can be used, although the network was trained on audio shorter than 15 seconds (see the sketch below)
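As a minimal sketch of that last point (ours, not repository code): a recording longer than the ~15 s training length can be split into pieces and each piece classified separately. `read_audio`, `get_language`, and `model` are the objects from the Language Classifier example above; the 16 kHz rate and the file name are assumptions:

```python
SAMPLE_RATE = 16000                     # assumed input rate
MAX_SAMPLES = 15 * SAMPLE_RATE          # ~15 s, the training-time upper bound

wav = read_audio('long_recording.wav')  # hypothetical long file
pieces = [wav[i:i + MAX_SAMPLES] for i in range(0, len(wav), MAX_SAMPLES)]
print([get_language(piece, model) for piece in pieces])
```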
 
 ## Contact
 