From 8996d5e519f7d4e0c936091a010bb332dc41bdab Mon Sep 17 00:00:00 2001
From: snakers41
Date: Tue, 15 Dec 2020 14:51:16 +0000
Subject: [PATCH 1/4] Update readme, add skeleton for FAQ

---
 README.md | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f073463..392c277 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,11 @@
   - [Metrics](#metrics)
     - [Performance Metrics](#performance-metrics)
     - [Quality Metrics](#quality-metrics)
+  - [FAQ](#faq)
+    - [How VAD Works](#how-vad-works)
+    - [VAD Quality Metrics Methodology](#vad-quality-metrics-methodology)
+    - [How Number Detector Works](#how-number-detector-works)
+    - [How Language Classifier Works](#how-language-classifier-works)
   - [Contact](#contact)
     - [Get in Touch](#get-in-touch)
     - [Commercial Inquiries](#commercial-inquiries)
@@ -57,7 +62,7 @@ Currently we provide the following functionality:
 
 | Version | Date        | Comment                                           |
 |---------|-------------|---------------------------------------------------|
-| `v1`    | 2020-12-15  | initial release                                   |
+| `v1`    | 2020-12-15  | Initial release                                   |
 | `v2`    | coming soon | Add Number Detector or Language Classifier heads  |
 
 ### PyTorch
@@ -90,6 +95,24 @@ Speed metrics here.
 
 ### Quality Metrics
 
 Quality metrics here.
 
+## FAQ
+
+### How VAD Works
+
+In short: 300 ms chunks, ~15 ms latency on a single CPU thread; see the examples (naive and streaming).
+
+### VAD Quality Metrics Methodology
+
+TBD
+
+### How Number Detector Works
+
+TBD
+
+### How Language Classifier Works
+
+TBD
+
 ## Contact
 
 ### Get in Touch

From dd2d7ff70e03f79f72016afb37fcbd13d9338d0f Mon Sep 17 00:00:00 2001
From: Dimitrii Voronin <36505480+adamnsandle@users.noreply.github.com>
Date: Tue, 15 Dec 2020 17:12:15 +0200
Subject: [PATCH 2/4] Update README.md

---
 README.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 392c277..af50d00 100644
--- a/README.md
+++ b/README.md
@@ -93,7 +93,13 @@
 
 ### Quality Metrics
 
-Quality metrics here.
+We validate on random 0.25-second audio chunks. The speech to non-speech ratio among chunks is roughly 50/50; speech chunks are carved from real audio in four different languages (English, Russian, Spanish, German), and random background noise is then applied to some of them.
+
+Since our models were trained on chunks of the same length, the model's output is a single float from 0 to 1 - the **speech probability**. We use the speech probabilities as thresholds for the precision-recall curve.
+
+Webrtc splits audio into frames, and each frame gets a corresponding number (0 **or** 1). We use 30 ms frames for the Webrtc predictions, so each 0.25-second chunk is split into 8 frames; their **mean** value is used as the threshold for the plot.
+
+![image](https://user-images.githubusercontent.com/36505480/102233150-9f476580-3ef8-11eb-87fb-ae6f1edfe10f.png)
 
 ## FAQ

From 97fc53a8395e70cb9a591f5831c51a3ccba6d341 Mon Sep 17 00:00:00 2001
From: Dimitrii Voronin <36505480+adamnsandle@users.noreply.github.com>
Date: Tue, 15 Dec 2020 17:13:37 +0200
Subject: [PATCH 3/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index af50d00..02474d9 100644
--- a/README.md
+++ b/README.md
@@ -97,7 +97,7 @@ We validate on random 0.25-second audio chunks. The speech to non-speech ratio
 
 Since our models were trained on chunks of the same length, the model's output is a single float from 0 to 1 - the **speech probability**. We use the speech probabilities as thresholds for the precision-recall curve.
 
-Webrtc splits audio into frames, and each frame gets a corresponding number (0 **or** 1). We use 30 ms frames for the Webrtc predictions, so each 0.25-second chunk is split into 8 frames; their **mean** value is used as the threshold for the plot.
+[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, and each frame gets a corresponding number (0 **or** 1). We use 30 ms frames for the Webrtc predictions, so each 0.25-second chunk is split into 8 frames; their **mean** value is used as the threshold for the plot.
 
 ![image](https://user-images.githubusercontent.com/36505480/102233150-9f476580-3ef8-11eb-87fb-ae6f1edfe10f.png)

From 14d7cbc3b114e486a74d680fd41ed5b17b438235 Mon Sep 17 00:00:00 2001
From: Dimitrii Voronin <36505480+adamnsandle@users.noreply.github.com>
Date: Tue, 15 Dec 2020 17:28:00 +0200
Subject: [PATCH 4/4] Update README.md

---
 README.md | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 02474d9..9eade58 100644
--- a/README.md
+++ b/README.md
@@ -70,21 +70,66 @@
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
 [![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad/) (coming soon)
 
-```python
-TBD
-```
+```python
+import torch
+torch.set_num_threads(1)
+from pprint import pprint
+
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_vad',
+                              force_reload=True)
+
+(get_speech_ts,
+ _, read_audio,
+ _, _, _) = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+wav = read_audio(f'{files_dir}/en.wav')
+
+# get speech timestamps from the full audio file
+speech_timestamps = get_speech_ts(wav, model,
+                                  num_steps=4)
+pprint(speech_timestamps)
+```
 
 ### ONNX
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
 
 You can run our model anywhere you can import the ONNX model or run the ONNX runtime.
 
-```python
-TBD
-```
+```python
+import torch
+import onnxruntime
+from pprint import pprint
+
+_, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                          model='silero_vad',
+                          force_reload=True)
+
+(get_speech_ts,
+ _, read_audio,
+ _, _, _) = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+def init_onnx_model(model_path: str):
+    return onnxruntime.InferenceSession(model_path)
+
+def validate_onnx(model, inputs):
+    with torch.no_grad():
+        ort_inputs = {'input': inputs.cpu().numpy()}
+        outs = model.run(None, ort_inputs)
+        outs = [torch.Tensor(x) for x in outs]
+    return outs
+
+model = init_onnx_model(f'{files_dir}/model.onnx')
+wav = read_audio(f'{files_dir}/en.wav')
+
+# get speech timestamps from the full audio file
+speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate_onnx)
+pprint(speech_timestamps)
+```
 
 ## Metrics
 
 ### Performance Metrics
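
The validation methodology described in PATCH 2/4 and PATCH 3/4 - one speech probability per 0.25-second chunk for our model versus the mean of the binary 30 ms [Webrtc](https://github.com/wiseman/py-webrtcvad) frame decisions per chunk - can be illustrated with a minimal sketch. This is not the evaluation code from the patches: it assumes 16-bit mono PCM chunks at 16 kHz, the `py-webrtcvad` package, and `scikit-learn` for the precision-recall curve, and the `chunks` and `labels` variables are hypothetical placeholders for the validation set.

```python
# Minimal sketch of the per-chunk scoring described above (not the authors' code).
# Assumes 16 kHz, 16-bit mono PCM chunks of 0.25 s.
import numpy as np
import webrtcvad
from sklearn.metrics import precision_recall_curve

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 480 samples * 2 bytes per sample

def webrtc_chunk_score(chunk_pcm16: bytes, aggressiveness: int = 3) -> float:
    """Mean of the binary 30 ms frame decisions inside one chunk (8 full frames per 0.25 s)."""
    vad = webrtcvad.Vad(aggressiveness)
    decisions = [vad.is_speech(chunk_pcm16[i:i + FRAME_BYTES], SAMPLE_RATE)
                 for i in range(0, len(chunk_pcm16) - FRAME_BYTES + 1, FRAME_BYTES)]
    return float(np.mean(decisions)) if decisions else 0.0

# `chunks` (raw PCM byte strings) and `labels` (0/1 ground truth per chunk) are
# hypothetical placeholders for the validation data described in the README:
# webrtc_scores = [webrtc_chunk_score(c) for c in chunks]
# precision, recall, thresholds = precision_recall_curve(labels, webrtc_scores)
# For our model, the score is simply its speech probability for the chunk,
# passed to precision_recall_curve the same way.
```

Averaging the frame decisions turns Webrtc's hard 0/1 outputs into a graded score, so both systems can be compared on the same precision-recall plot.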
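
As a follow-up to the PyTorch and ONNX examples added in PATCH 4/4, a small usage sketch may help show what to do with the returned timestamps. It is illustrative only: it assumes `get_speech_ts` returns a list of dicts with `start`/`end` sample indices and that the bundled `en.wav` is sampled at 16 kHz - neither detail is stated in the patches above, so check the repo utilities if they differ.

```python
# Hedged usage sketch: convert the `speech_timestamps` from the examples above
# into (start, end) pairs in seconds. Assumes dicts with 'start'/'end' sample
# indices and a 16 kHz sampling rate (assumptions, not stated in the patches).
SAMPLE_RATE = 16000

segments_seconds = [(ts['start'] / SAMPLE_RATE, ts['end'] / SAMPLE_RATE)
                    for ts in speech_timestamps]
for start_s, end_s in segments_seconds:
    print(f'speech from {start_s:.2f} s to {end_s:.2f} s')
```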