mirror of https://github.com/snakers4/silero-vad.git
synced 2026-02-04 17:39:22 +08:00
Merge branch 'adamnsandle' of github.com:snakers4/silero-vad into adamnsandle
README.md
@@ -57,9 +57,9 @@ The models are small enough to be included directly into this repository. Newer
Currently we provide the following functionality:
| PyTorch | ONNX | VAD | Number Detector | Language Clf | Languages | Colab |
|-------------------|--------------------|---------------------|--------------------|--------------|------------------------|-------|
| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | `ru`, `en`, `de`, `es` | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
| PyTorch | ONNX | VAD | Number Detector | Language Clf | Languages | Colab |
|-------------------|--------------------|---------------------|--------------------|--------------------|------------------------|-------|
| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | `ru`, `en`, `de`, `es` | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
**Version history:**
@@ -68,7 +68,7 @@ Currently we provide the following functionality:
| `v1` | 2020-12-15 | Initial release |
| `v1.1` | 2020-12-24 | Better VAD models compatible with chunks shorter than 250 ms |
| `v1.2` | 2020-12-30 | Number Detector added |
| `v2` | coming soon | Add Language Classifier heads |
| `v2` | 2021-01-11 | Add Language Classifier heads |
### PyTorch
@@ -126,6 +126,27 @@ number_timestamps = get_number_ts(wav, model)
pprint(number_timestamps)
```
### Language Classifier

```python
import torch
torch.set_num_threads(1)
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_lang_detector',
                              force_reload=True)

get_language, read_audio = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

wav = read_audio(f'{files_dir}/de.wav')
language = get_language(wav, model)

pprint(language)
```
### ONNX
[](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
@@ -201,6 +222,38 @@ wav = read_audio(f'{files_dir}/en_num.wav')
number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)
pprint(number_timestamps)
```
### Language Classifier

```python
import torch
import onnxruntime
from pprint import pprint

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_lang_detector',
                              force_reload=True)

get_language, read_audio = utils

files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'

def init_onnx_model(model_path: str):
    return onnxruntime.InferenceSession(model_path)

def validate_onnx(model, inputs):
    with torch.no_grad():
        ort_inputs = {'input': inputs.cpu().numpy()}
        outs = model.run(None, ort_inputs)
        outs = [torch.Tensor(x) for x in outs]
    return outs

model = init_onnx_model(f'{files_dir}/number_detector.onnx')
wav = read_audio(f'{files_dir}/de.wav')

language = get_language(wav, model, run_function=validate_onnx)
print(language)
```
## Metrics
### Performance Metrics
@@ -252,7 +305,7 @@ So **batch size** for streaming is **num_steps * number of audio streams**. Time
We use random 250 ms audio chunks for validation. The speech to non-speech ratio among chunks is roughly 50/50 (i.e. balanced). Speech chunks are sampled from real audio recordings in four different languages (English, Russian, Spanish, German), then random background noise is added to some of them (~40%).

Since our VAD (only the VAD, other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms (coming soon). Less than 100 - 150 ms cannot be distinguished as speech with confidence.
Since our VAD (only the VAD, other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms. Less than 100 - 150 ms cannot be distinguished as speech with confidence.

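For illustration, the precision-recall evaluation described above can be sketched with scikit-learn; the per-chunk labels and probabilities below are made-up placeholder values, not measurements from the actual validation set:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical per-chunk ground truth (1 = speech, 0 = non-speech) and
# VAD speech probabilities; real values would come from the validation chunks.
labels = np.array([0, 1, 0, 1, 1, 0, 1, 0])
probs = np.array([0.03, 0.92, 0.41, 0.87, 0.78, 0.12, 0.66, 0.30])

# Each distinct probability is swept as a decision threshold,
# yielding one precision/recall point per threshold.
precision, recall, thresholds = precision_recall_curve(labels, probs)
print(list(zip(thresholds, precision, recall)))
```
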
[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames; each frame gets a corresponding number (0 **or** 1). We use 30 ms frames for webrtc, so each 250 ms chunk is split into 8 frames, and their **mean** value is used as a threshold for the plot.

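A rough sketch of that baseline scoring, assuming 16 kHz float audio in [-1, 1] and the standard py-webrtcvad API; this is an illustration of the idea, not the exact benchmarking code:

```python
import numpy as np
import webrtcvad

SAMPLE_RATE = 16000                                   # assumed sampling rate
FRAME_SAMPLES = SAMPLE_RATE * 30 // 1000              # 30 ms frames -> 480 samples

def webrtc_chunk_score(chunk: np.ndarray, aggressiveness: int = 3) -> float:
    """Mean of per-frame 0/1 webrtc decisions over a ~250 ms chunk."""
    vad = webrtcvad.Vad(aggressiveness)
    pcm = (chunk * 32767).astype(np.int16).tobytes()  # webrtcvad expects 16-bit PCM bytes
    frame_bytes = FRAME_SAMPLES * 2                   # 2 bytes per int16 sample
    n_frames = len(pcm) // frame_bytes                # 8 full frames for a 250 ms chunk
    decisions = [
        vad.is_speech(pcm[i * frame_bytes:(i + 1) * frame_bytes], SAMPLE_RATE)
        for i in range(n_frames)
    ]
    return float(np.mean(decisions))
```
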
@@ -290,7 +343,10 @@ Please see [Quality Metrics](#quality-metrics)
### How Language Classifier Works
TBD, but there is no explicit limitation on the way audio is split into chunks.
- **99%** validation accuracy
- Language classifier was trained using audio samples in 4 languages: **Russian**, **English**, **Spanish**, **German**
- More languages TBD
- Arbitrary audio length can be used, although the network was trained using audio shorter than 15 seconds (see the sketch below)
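As an illustration of the last point, a minimal sketch that classifies a long recording in roughly 15-second pieces; it assumes 16 kHz audio, a hypothetical file name, and the `get_language` / `read_audio` helpers loaded via `torch.hub` as in the examples above:

```python
import torch

SAMPLE_RATE = 16000                        # assumed sampling rate of read_audio output
MAX_SAMPLES = 15 * SAMPLE_RATE             # ~15 s pieces, matching the training length

wav = read_audio('long_recording.wav')     # hypothetical file
pieces = torch.split(wav, MAX_SAMPLES)     # split the 1-D waveform into pieces
languages = [get_language(piece, model) for piece in pieces]
print(languages)
```
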
## Contact