Mirror of https://github.com/snakers4/silero-vad.git
Merge branch 'adamnsandle' of github.com:snakers4/silero-vad into adamnsandle
Changed file: README.md
@@ -57,9 +57,9 @@ The models are small enough to be included directly into this repository. Newer
 Currently we provide the following functionality:
 
 | PyTorch | ONNX | VAD | Number Detector | Language Clf | Languages | Colab |
-|-------------------|--------------------|---------------------|--------------------|--------------|------------------------|-------|
-| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |              | `ru`, `en`, `de`, `es` | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
+|-------------------|--------------------|---------------------|--------------------|--------------------|------------------------|-------|
+| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | `ru`, `en`, `de`, `es` | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
 
 **Version history:**
 
@@ -68,7 +68,7 @@ Currently we provide the following functionality:
 | `v1` | 2020-12-15 | Initial release |
 | `v1.1` | 2020-12-24 | Better VAD models, compatible with chunks shorter than 250 ms |
 | `v1.2` | 2020-12-30 | Number Detector added |
-| `v2` | coming soon | Add Language Classifier heads |
+| `v2` | 2021-01-11 | Add Language Classifier heads |
 
 ### PyTorch
 
@@ -126,6 +126,27 @@ number_timestamps = get_number_ts(wav, model)
 pprint(number_timestamps)
 ```
 
+### Language Classifier
+
+```python
+import torch
+torch.set_num_threads(1)
+from pprint import pprint
+
+# Load the language detector and its helper utilities from torch.hub
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_lang_detector',
+                              force_reload=True)
+
+get_language, read_audio = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+wav = read_audio(f'{files_dir}/de.wav')
+language = get_language(wav, model)
+
+pprint(language)
+```
+
 ### ONNX
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
@@ -201,6 +222,38 @@ wav = read_audio(f'{files_dir}/en_num.wav')
 number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)
 pprint(number_timestamps)
 ```
 
+### Language Classifier
+
+```python
+import torch
+import onnxruntime
+from pprint import pprint
+
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_lang_detector',
+                              force_reload=True)
+
+get_language, read_audio = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+def init_onnx_model(model_path: str):
+    return onnxruntime.InferenceSession(model_path)
+
+# Adapter that lets the torch.hub utils run inference through onnxruntime:
+# it feeds the inputs to the ONNX session and converts outputs back to tensors.
+def validate_onnx(model, inputs):
+    with torch.no_grad():
+        ort_inputs = {'input': inputs.cpu().numpy()}
+        outs = model.run(None, ort_inputs)
+        outs = [torch.Tensor(x) for x in outs]
+    return outs
+
+model = init_onnx_model(f'{files_dir}/number_detector.onnx')
+wav = read_audio(f'{files_dir}/de.wav')
+
+language = get_language(wav, model, run_function=validate_onnx)
+pprint(language)
+```
+
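The session factory above uses onnxruntime defaults. As an optional tweak (an assumption on our part, not from the README), onnxruntime can be pinned to a single thread, mirroring the `torch.set_num_threads(1)` call in the PyTorch examples, so both backends are exercised under comparable conditions:

```python
import onnxruntime

# Sketch: single-threaded session, mirroring torch.set_num_threads(1);
# SessionOptions and the thread-count fields are standard onnxruntime options.
def init_onnx_model(model_path: str):
    opts = onnxruntime.SessionOptions()
    opts.intra_op_num_threads = 1
    opts.inter_op_num_threads = 1
    return onnxruntime.InferenceSession(model_path, sess_options=opts)
```
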
 ## Metrics
 
 ### Performance Metrics
@@ -252,7 +305,7 @@ So **batch size** for streaming is **num_steps * number of audio streams**. Time
 
 We use random 250 ms audio chunks for validation. The speech to non-speech ratio among chunks is roughly 50/50 (i.e. balanced). Speech chunks are sampled from real audio in four different languages (English, Russian, Spanish, German), then random background noise is added to some of them (~40%).
 
-Since our VAD (only VAD, other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms (coming soon). Less than 100 - 150 ms cannot be distinguished as speech with confidence.
+Since our VAD (only VAD, other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms. Less than 100 - 150 ms cannot be distinguished as speech with confidence.
 
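To make the threshold sweep concrete, here is a minimal sketch (ours, not from the repository) that turns per-chunk speech probabilities and ground-truth labels into a precision-recall curve; the `labels` and `probs` arrays are hypothetical and scikit-learn is assumed to be installed:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical per-chunk ground truth (1 = speech) and model probabilities.
labels = np.array([1, 0, 1, 1, 0, 0])
probs = np.array([0.90, 0.20, 0.70, 0.60, 0.40, 0.10])

# Each distinct probability acts as a threshold on the curve.
precision, recall, thresholds = precision_recall_curve(labels, probs)
for p, r, t in zip(precision, recall, thresholds):
    print(f'threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}')
```
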
 [Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames; each frame has a corresponding number (0 **or** 1). We use 30 ms frames for webrtc, so each 250 ms chunk is split into 8 frames, and their **mean** value is used as a threshold for the plot.
 
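For illustration, a minimal sketch (ours, not repository code) of scoring one 250 ms chunk with py-webrtcvad this way; it assumes 16 kHz, 16-bit mono PCM bytes, so a 30 ms frame is 960 bytes and a 250 ms chunk yields 8 full frames:

```python
import webrtcvad

SAMPLE_RATE = 16000
FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2   # 30 ms of 16-bit mono PCM

vad = webrtcvad.Vad(3)  # aggressiveness mode 0-3

def chunk_score(pcm: bytes) -> float:
    """Mean of the 0/1 webrtcvad decisions over the chunk's 30 ms frames."""
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    flags = [vad.is_speech(frame, SAMPLE_RATE) for frame in frames]
    return sum(flags) / len(flags)
```
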
@@ -290,7 +343,10 @@ Please see [Quality Metrics](#quality-metrics)
 
 ### How Language Classifier Works
 
-TBD, but there is no explicit limitation on the way audio is split into chunks.
+- **99%** validation accuracy
+- The language classifier was trained using audio samples in 4 languages: **Russian**, **English**, **Spanish**, **German**
+- More languages TBD
+- Arbitrary audio lengths can be used, although the network was trained on audio shorter than 15 seconds (see the sketch below)
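As a minimal sketch of that last point (ours, not repository code): a recording longer than the ~15 s training length can be split into pieces and each piece classified separately. `read_audio`, `get_language`, and `model` are the objects from the Language Classifier example above; the 16 kHz rate and the file name are assumptions:

```python
SAMPLE_RATE = 16000                     # assumed input rate
MAX_SAMPLES = 15 * SAMPLE_RATE          # ~15 s, the training-time upper bound

wav = read_audio('long_recording.wav')  # hypothetical long file
pieces = [wav[i:i + MAX_SAMPLES] for i in range(0, len(wav), MAX_SAMPLES)]
print([get_language(piece, model) for piece in pieces])
```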
 
 ## Contact
 