mirror of
https://github.com/snakers4/silero-vad.git
synced 2026-02-04 17:39:22 +08:00
Compare commits
78 Commits
v2.0-legacy
...
v4.0stable
Commit SHA1s:

915dd3d639, ac128b3c55, 82d199ff22, 5ba388d894, 790844ba0f, 51b5245410, 888970e77d, cb6d308335,
1b212c6e95, 452060ad65, c7eab751b5, d1714a9ff7, 94c79d899d, 1baf307b35, e324285cdc, 13dce2d067,
081e6b9886, 572134fdf1, a799dea837, 17209e6c4f, 6661cc9691, 7c671a75c2, 622016e672, 8eba346bc9,
900c71a109, bf0127e016, ea7af70fe9, 8cdc8d36c9, 6e9fd77500, 6cc08b1077, 0e8e080894, af6931d1de,
76687cbe25, b2329fa5f2, 005886e7eb, f6b1294cb2, 2392ea33f4, 45d72863b6, f40cc128a4, 0d61e4cee1,
011268e492, 8ebaf139c6, 0a90316625, 35d8969322, 7c3eb8bfb5, 74f759c8f8, 5816eb08c4, 0feae6cbbe,
fc0a70f42e, 13fd927b84, 124d6564a0, 56fa93a1c9, 1a93276208, 9fbd0c4c2d, 7b05a183a3, f67e68efc3,
51b1365bb0, 79fdb55f1c, b17da75dac, 184e384697, adf5d6d020, 41ee0f6b9f, 236d250a11, 8794d6f835,
8f16c14066, f638c47595, 1fad5f4ffb, 7160ce99d3, 8af246df49, b1142bcba4, a243bd5dc8, d4d2af5833,
469ca8a2f6, 8c1ae73ee7, aba7862d58, b648546a21, 2e852d7d41, 044278aa12
640
README.md
640
README.md
@@ -1,617 +1,97 @@
|
||||
[](mailto:hello@silero.ai) [](https://t.me/silero_speech) [](https://github.com/snakers4/silero-vad/blob/master/LICENSE)
|
||||
|
||||
[](https://pytorch.org/hub/snakers4_silero-vad_vad/)
|
||||
[](mailto:hello@silero.ai) [](https://t.me/silero_speech) [](https://github.com/snakers4/silero-vad/blob/master/LICENSE)
|
||||
|
||||
[](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
|
||||
|
||||

|
||||
|
||||
- [Silero VAD](#silero-vad)
|
||||
- [TLDR](#tldr)
|
||||
- [Live Demonstration](#live-demonstration)
|
||||
- [Getting Started](#getting-started)
|
||||
- [Pre-trained Models](#pre-trained-models)
|
||||
- [Version History](#version-history)
|
||||
- [PyTorch](#pytorch)
|
||||
- [VAD](#vad)
|
||||
- [Number Detector](#number-detector)
|
||||
- [Language Classifier](#language-classifier)
|
||||
- [ONNX](#onnx)
|
||||
- [VAD](#vad-1)
|
||||
- [Number Detector](#number-detector-1)
|
||||
- [Language Classifier](#language-classifier-1)
|
||||
- [Metrics](#metrics)
|
||||
- [Performance Metrics](#performance-metrics)
|
||||
- [Streaming Latency](#streaming-latency)
|
||||
- [Full Audio Throughput](#full-audio-throughput)
|
||||
- [VAD Quality Metrics](#vad-quality-metrics)
|
||||
- [FAQ](#faq)
|
||||
- [VAD Parameter Fine Tuning](#vad-parameter-fine-tuning)
|
||||
- [Classic way](#classic-way)
|
||||
- [Adaptive way](#adaptive-way)
|
||||
- [How VAD Works](#how-vad-works)
|
||||
- [VAD Quality Metrics Methodology](#vad-quality-metrics-methodology)
|
||||
- [How Number Detector Works](#how-number-detector-works)
|
||||
- [How Language Classifier Works](#how-language-classifier-works)
|
||||
- [Contact](#contact)
|
||||
- [Get in Touch](#get-in-touch)
|
||||
- [Commercial Inquiries](#commercial-inquiries)
|
||||
- [Further reading](#further-reading)
|
||||
- [Citations](#citations)
|
||||
<br/>
|
||||
<h1 align="center">Silero VAD</h1>
|
||||
<br/>
|
||||
|
||||
**Silero VAD** - pre-trained enterprise-grade [Voice Activity Detector](https://en.wikipedia.org/wiki/Voice_activity_detection) (also see our [STT models](https://github.com/snakers4/silero-models)).
|
||||
|
||||
# Silero VAD
|
||||

|
||||
This repository also includes Number Detector and Language classifier [models](https://github.com/snakers4/silero-vad/wiki/Other-Models)
|
||||
|
||||
## TLDR
|
||||
<br/>
|
||||
|
||||
**Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier.**
|
||||
Enterprise-grade Speech Products made refreshingly simple (also see our [STT](https://github.com/snakers4/silero-models) models).
|
||||
<p align="center">
|
||||
<img src="https://user-images.githubusercontent.com/36505480/198026365-8da383e0-5398-4a12-b7f8-22c2c0059512.png" />
|
||||
</p>
|
||||
|
||||
Currently, there are hardly any high-quality, modern, free, public voice activity detectors except for the WebRTC Voice Activity Detector ([link](https://github.com/wiseman/py-webrtcvad)). WebRTC, however, is starting to show its age and suffers from many false positives.
|
||||
<details>
|
||||
<summary>Real Time Example</summary>
|
||||
|
||||
https://user-images.githubusercontent.com/36505480/144874384-95f80f6d-a4f1-42cc-9be7-004c891dd481.mp4
|
||||
|
||||
</details>
|
||||
|
||||
Also, in some cases it is crucial to be able to anonymize large-scale spoken corpora (i.e. remove personal data). Typically, personal data is considered private / sensitive if it contains (i) a name or (ii) some private ID. Name recognition is a highly subjective matter and depends on locale and business case, but voice activity and number detection are quite general tasks.
|
||||
<br/>
|
||||
<h2 align="center">Key Features</h2>
|
||||
<br/>
|
||||
|
||||
**Key features:**
|
||||
- **Stellar accuracy**
|
||||
|
||||
- Modern, portable;
|
||||
- Low memory footprint;
|
||||
- Superior metrics to WebRTC;
|
||||
- Trained on huge spoken corpora and noise / sound libraries;
|
||||
- Slower than WebRTC, but fast enough for IoT / edge / mobile applications;
|
||||
- Unlike WebRTC (which mostly tells silence from voice), our VAD can tell voice from noise / music / silence;
|
||||
Silero VAD has [excellent results](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#vs-other-available-solutions) on speech detection tasks.
|
||||
|
||||
- **Fast**
|
||||
|
||||
**Typical use cases:**
|
||||
One audio chunk (30+ ms) [takes](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics#silero-vad-performance-metrics) less than **1ms** to be processed on a single CPU thread. Using batching or GPU can also improve performance considerably. Under certain conditions ONNX may even run up to 4-5x faster.
|
||||
|
||||
- Spoken corpora anonymization;
|
||||
- Can be used together with WebRTC;
|
||||
- Voice activity detection for IoT / edge / mobile use cases;
|
||||
- Data cleaning and preparation, number and voice detection in general;
|
||||
- PyTorch and ONNX can be used with a wide variety of deployment options and backends in mind;
|
||||
- **Lightweight**
|
||||
|
||||
### Live Demonstration
|
||||
The JIT model is around one megabyte in size.
|
||||
|
||||
For more information, please see [examples](https://github.com/snakers4/silero-vad/tree/master/examples).
|
||||
- **General**
|
||||
|
||||
https://user-images.githubusercontent.com/28188499/116685087-182ff100-a9b2-11eb-927d-ed9f621226ee.mp4
|
||||
Silero VAD was trained on huge corpora that include over **100** languages, and it performs well on audio from different domains with various levels of background noise and quality.
|
||||
|
||||
https://user-images.githubusercontent.com/8079748/117580455-4622dd00-b0f8-11eb-858d-e6368ed4eada.mp4
|
||||
- **Flexible sampling rate**
|
||||
|
||||
## Getting Started
|
||||
Silero VAD [supports](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#sample-rate-comparison) **8000 Hz** and **16000 Hz** [sampling rates](https://en.wikipedia.org/wiki/Sampling_(signal_processing)#Sampling_rate).
|
||||
|
||||
The models are small enough to be included directly in this repository. Newer models will supersede older models directly.
|
||||
- **Flexible chunk size**
|
||||
|
||||
### Pre-trained Models
|
||||
The model was trained on **30 ms** chunks (480 samples at 16 kHz). Longer chunks are supported directly; other sizes may work as well.
|
||||
|
||||
**Currently we provide the following endpoints:**
|
||||
- **Highly Portable**
|
||||
|
||||
| model= | Params | Model type | Streaming | Languages | PyTorch | ONNX | Colab |
|
||||
| -------------------------- | ------ | ------------------- | --------- | -------------------------- | ------------------ | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `'silero_vad'` | 1.1M | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
| `'silero_vad_micro'` | 10K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
| `'silero_vad_micro_8k'` | 10K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
| `'silero_vad_mini'` | 100K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
| `'silero_vad_mini_8k'` | 100K | VAD | Yes | `ru`, `en`, `de`, `es` (*) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
| `'silero_number_detector'` | 1.1M | Number Detector | No | `ru`, `en`, `de`, `es` | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
| `'silero_lang_detector'` | 1.1M | Language Classifier | No | `ru`, `en`, `de`, `es` | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
| ~~`'silero_lang_detector_116'`~~ | ~~1.7M~~ | ~~Language Classifier~~ ||| | ||
|
||||
| `'silero_lang_detector_95'` | 4.7M | Language Classifier | No | [95 languages](https://github.com/snakers4/silero-vad/blob/master/files/lang_dict_95.json) | :heavy_check_mark: | :heavy_check_mark: | [](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
|
||||
Silero VAD benefits from the rich ecosystems built around **PyTorch** and **ONNX**, running everywhere these runtimes are available.
|
||||
|
||||
(*) Though explicitly trained on these languages, VAD should work on any Germanic, Romance or Slavic Languages out of the box.
|
||||
- **No Strings Attached**
|
||||
|
||||
What models do:
|
||||
Published under a permissive license (MIT), Silero VAD has zero strings attached: no telemetry, no keys, no registration, no built-in expiration, no vendor lock-in.
|
||||
|
||||
- VAD - detects speech;
- Number Detector - detects spoken numbers (e.g. thirty-five);
- Language Classifier - classifies utterances among languages;
- Language Classifier 95 - classifies among 95 languages as well as 58 language groups (mutually intelligible languages are mapped to the same group).
|
||||
<br/>
|
||||
<h2 align="center">Typical Use Cases</h2>
|
||||
<br/>
|
||||
|
||||
### Version History
|
||||
- Voice activity detection for IoT / edge / mobile use cases
|
||||
- Data cleaning and preparation, voice detection in general
|
||||
- Telephony and call-center automation, voice bots
|
||||
- Voice interfaces
|
||||
|
||||
**Version history:**
|
||||
<br/>
|
||||
<h2 align="center">Links</h2>
|
||||
<br/>
|
||||
|
||||
| Version | Date | Comment |
|
||||
| ------- | ---------- | --------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `v1` | 2020-12-15 | Initial release |
|
||||
| `v1.1` | 2020-12-24 | Better VAD models compatible with chunks shorter than 250 ms |
|
||||
| `v1.2` | 2020-12-30 | Number Detector added |
|
||||
| `v2` | 2021-01-11 | Add Language Classifier heads (en, ru, de, es) |
|
||||
| `v2.1` | 2021-02-11 | Add micro (10k params) VAD models |
|
||||
| `v2.2` | 2021-03-22 | Add micro 8000 sample rate VAD models |
|
||||
| `v2.3` | 2021-04-12 | Add mini (100k params) VAD models (8k and 16k sample rate) + **new** adaptive utils for full audio and single audio stream |
|
||||
| `v2.4` | 2021-07-09 | Add 116 languages classifier and group classifier |
|
||||
| `v2.4` | 2021-07-09 | Deleted the 116-language classifier and added a 95-language classifier instead (dropped low-spoken languages to improve quality) |
|
||||
|
|
||||
|
||||
### PyTorch
|
||||
- [Examples and Dependencies](https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#dependencies)
|
||||
- [Quality Metrics](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics)
|
||||
- [Performance Metrics](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics)
|
||||
- [Number Detector and Language classifier models](https://github.com/snakers4/silero-vad/wiki/Other-Models)
|
||||
- [Versions and Available Models](https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models)
|
||||
- [Further reading](https://github.com/snakers4/silero-models#further-reading)
|
||||
- [FAQ](https://github.com/snakers4/silero-vad/wiki/FAQ)
|
||||
|
||||
[](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
|
||||
|
||||
We are keeping the colab examples up-to-date, but you can manually manage your dependencies:
|
||||
|
||||
- `pytorch` >= 1.7.1 (there were breaking changes in `torch.hub` introduced in 1.7);
|
||||
- `torchaudio` >= 0.7.2 (used only for IO and resampling, can be easily replaced);
|
||||
- `soundfile` >= 0.10.3 (used as a default backend for torchaudio, can be replaced);
|
||||
|
||||
All of the dependencies except for PyTorch are only needed for the utils / examples. You can use any libraries / pipelines that read audio files and resample them to 16 kHz.
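For illustration only (not part of the official utils), reading a file and resampling it to 16 kHz mono with `torchaudio` could look roughly like the sketch below; the file name `example.wav` is a placeholder:

```python
import torch
import torchaudio

def load_as_16k_mono(path: str) -> torch.Tensor:
    # read any audio file supported by the torchaudio backend
    wav, sr = torchaudio.load(path)
    # mix down to mono if the file has several channels
    if wav.size(0) > 1:
        wav = wav.mean(dim=0, keepdim=True)
    # resample to the 16 kHz expected by the examples below
    if sr != 16000:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
    return wav.squeeze(0)

wav = load_as_16k_mono('example.wav')  # placeholder path
```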
|
||||
|
||||
#### VAD
|
||||
|
||||
[](https://pytorch.org/hub/snakers4_silero-vad_vad/)
|
||||
|
||||
```python
|
||||
import torch
|
||||
torch.set_num_threads(1)
|
||||
from pprint import pprint
|
||||
|
||||
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_vad',
|
||||
force_reload=True)
|
||||
|
||||
(get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
_, read_audio,
|
||||
_, _, _) = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
wav = read_audio(f'{files_dir}/en.wav')
|
||||
# full audio
|
||||
# get speech timestamps from full audio file
|
||||
|
||||
# classic way
|
||||
speech_timestamps = get_speech_ts(wav, model,
|
||||
num_steps=4)
|
||||
pprint(speech_timestamps)
|
||||
|
||||
# adaptive way
|
||||
speech_timestamps = get_speech_ts_adaptive(wav, model)
|
||||
pprint(speech_timestamps)
|
||||
```
|
||||
|
||||
#### Number Detector
|
||||
|
||||
[](https://pytorch.org/hub/snakers4_silero-vad_number/)
|
||||
|
||||
```python
|
||||
import torch
|
||||
torch.set_num_threads(1)
|
||||
from pprint import pprint
|
||||
|
||||
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_number_detector',
|
||||
force_reload=True)
|
||||
|
||||
(get_number_ts,
|
||||
_, read_audio,
|
||||
_, _) = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
wav = read_audio(f'{files_dir}/en_num.wav')
|
||||
# full audio
|
||||
# get number timestamps from full audio file
|
||||
number_timestamps = get_number_ts(wav, model)
|
||||
|
||||
pprint(number_timestamps)
|
||||
```
|
||||
|
||||
#### Language Classifier
|
||||
##### 4 languages
|
||||
|
||||
[](https://pytorch.org/hub/snakers4_silero-vad_language/)
|
||||
|
||||
```python
|
||||
import torch
|
||||
torch.set_num_threads(1)
|
||||
from pprint import pprint
|
||||
|
||||
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_lang_detector',
|
||||
force_reload=True)
|
||||
|
||||
get_language, read_audio = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
wav = read_audio(f'{files_dir}/de.wav')
|
||||
language = get_language(wav, model)
|
||||
|
||||
pprint(language)
|
||||
```
|
||||
|
||||
##### 95 languages
|
||||
|
||||
[](https://pytorch.org/hub/snakers4_silero-vad_language/)
|
||||
|
||||
```python
|
||||
import torch
|
||||
torch.set_num_threads(1)
|
||||
from pprint import pprint
|
||||
|
||||
model, lang_dict, lang_group_dict, utils = torch.hub.load(
|
||||
repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_lang_detector_95',
|
||||
force_reload=True)
|
||||
|
||||
get_language_and_group, read_audio = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
wav = read_audio(f'{files_dir}/de.wav')
|
||||
languages, language_groups = get_language_and_group(wav, model, lang_dict, lang_group_dict, top_n=2)
|
||||
|
||||
for i in languages:
|
||||
pprint(f'Language: {i[0]} with prob {i[-1]}')
|
||||
|
||||
for i in language_groups:
|
||||
pprint(f'Language group: {i[0]} with prob {i[-1]}')
|
||||
```
|
||||
|
||||
### ONNX
|
||||
|
||||
[](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
|
||||
|
||||
You can run our models anywhere you can import an ONNX model or run the ONNX runtime.
|
||||
|
||||
#### VAD
|
||||
|
||||
```python
|
||||
import torch
|
||||
import onnxruntime
|
||||
from pprint import pprint
|
||||
|
||||
_, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_vad',
|
||||
force_reload=True)
|
||||
|
||||
(get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
_, read_audio,
|
||||
_, _, _) = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
def init_onnx_model(model_path: str):
|
||||
return onnxruntime.InferenceSession(model_path)
|
||||
|
||||
def validate_onnx(model, inputs):
|
||||
with torch.no_grad():
|
||||
ort_inputs = {'input': inputs.cpu().numpy()}
|
||||
outs = model.run(None, ort_inputs)
|
||||
outs = [torch.Tensor(x) for x in outs]
|
||||
return outs[0]
|
||||
|
||||
model = init_onnx_model(f'{files_dir}/model.onnx')
|
||||
wav = read_audio(f'{files_dir}/en.wav')
|
||||
|
||||
# get speech timestamps from full audio file
|
||||
|
||||
# classic way
|
||||
speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate_onnx)
|
||||
pprint(speech_timestamps)
|
||||
|
||||
# adaptive way
|
||||
speech_timestamps = get_speech_ts_adaptive(wav, model, run_function=validate_onnx)
|
||||
pprint(speech_timestamps)
|
||||
```
|
||||
|
||||
#### Number Detector
|
||||
|
||||
```python
|
||||
import torch
|
||||
import onnxruntime
|
||||
from pprint import pprint
|
||||
|
||||
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_number_detector',
|
||||
force_reload=True)
|
||||
|
||||
(get_number_ts,
|
||||
_, read_audio,
|
||||
_, _) = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
def init_onnx_model(model_path: str):
|
||||
return onnxruntime.InferenceSession(model_path)
|
||||
|
||||
def validate_onnx(model, inputs):
|
||||
with torch.no_grad():
|
||||
ort_inputs = {'input': inputs.cpu().numpy()}
|
||||
outs = model.run(None, ort_inputs)
|
||||
outs = [torch.Tensor(x) for x in outs]
|
||||
return outs
|
||||
|
||||
model = init_onnx_model(f'{files_dir}/number_detector.onnx')
|
||||
wav = read_audio(f'{files_dir}/en_num.wav')
|
||||
|
||||
# get speech timestamps from full audio file
|
||||
number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)
|
||||
pprint(number_timestamps)
|
||||
```
|
||||
|
||||
#### Language Classifier
|
||||
##### 4 languages
|
||||
|
||||
```python
|
||||
import torch
|
||||
import onnxruntime
|
||||
from pprint import pprint
|
||||
|
||||
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_lang_detector',
|
||||
force_reload=True)
|
||||
|
||||
get_language, read_audio = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
def init_onnx_model(model_path: str):
|
||||
return onnxruntime.InferenceSession(model_path)
|
||||
|
||||
def validate_onnx(model, inputs):
|
||||
with torch.no_grad():
|
||||
ort_inputs = {'input': inputs.cpu().numpy()}
|
||||
outs = model.run(None, ort_inputs)
|
||||
outs = [torch.Tensor(x) for x in outs]
|
||||
return outs
|
||||
|
||||
model = init_onnx_model(f'{files_dir}/number_detector.onnx')
|
||||
wav = read_audio(f'{files_dir}/de.wav')
|
||||
|
||||
language = get_language(wav, model, run_function=validate_onnx)
|
||||
print(language)
|
||||
```
|
||||
|
||||
##### 95 languages
|
||||
|
||||
```python
|
||||
import torch
|
||||
import onnxruntime
|
||||
from pprint import pprint
|
||||
|
||||
model, lang_dict, lang_group_dict, utils = torch.hub.load(
|
||||
repo_or_dir='snakers4/silero-vad',
|
||||
model='silero_lang_detector_95',
|
||||
force_reload=True)
|
||||
|
||||
get_language_and_group, read_audio = utils
|
||||
|
||||
files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
|
||||
|
||||
def init_onnx_model(model_path: str):
|
||||
return onnxruntime.InferenceSession(model_path)
|
||||
|
||||
def validate_onnx(model, inputs):
|
||||
with torch.no_grad():
|
||||
ort_inputs = {'input': inputs.cpu().numpy()}
|
||||
outs = model.run(None, ort_inputs)
|
||||
outs = [torch.Tensor(x) for x in outs]
|
||||
return outs
|
||||
|
||||
model = init_onnx_model(f'{files_dir}/lang_classifier_95.onnx')
|
||||
wav = read_audio(f'{files_dir}/de.wav')
|
||||
|
||||
languages, language_groups = get_language_and_group(wav, model, lang_dict, lang_group_dict, top_n=2, run_function=validate_onnx)
|
||||
|
||||
for i in languages:
|
||||
pprint(f'Language: {i[0]} with prob {i[-1]}')
|
||||
|
||||
for i in language_groups:
|
||||
pprint(f'Language group: {i[0]} with prob {i[-1]}')
|
||||
|
||||
```
|
||||
[](https://pytorch.org/hub/snakers4_silero-vad_language/)
|
||||
|
||||
## Metrics
|
||||
|
||||
### Performance Metrics
|
||||
|
||||
All speed tests were run on an AMD Ryzen Threadripper 3960X using only 1 thread:
|
||||
```
|
||||
torch.set_num_threads(1) # pytorch
|
||||
ort_session.intra_op_num_threads = 1 # onnx
|
||||
ort_session.inter_op_num_threads = 1 # onnx
|
||||
```
|
||||
|
||||
#### Streaming Latency
|
||||
|
||||
Streaming latency depends on 2 variables:
|
||||
|
||||
- **num_steps** - the number of windows to split each audio chunk into. Our post-processing class keeps the previous chunk (250 ms) in memory, so the new chunk (also 250 ms) is appended to it. The resulting big chunk (500 ms) is split into **num_steps** overlapping windows, each 250 ms long.
|
||||
|
||||
- **number of audio streams**
|
||||
|
||||
So the **batch size** for streaming is **num_steps * number of audio streams** (e.g. 4 steps and 10 streams give a batch size of 40). The time between receiving new audio chunks and getting results is shown in the table below:
|
||||
|
||||
| Batch size | Pytorch model time, ms | Onnx model time, ms |
|
||||
| :--------: | :--------------------: | :-----------------: |
|
||||
| **2** | 9 | 2 |
|
||||
| **4** | 11 | 4 |
|
||||
| **8** | 14 | 7 |
|
||||
| **16** | 19 | 12 |
|
||||
| **40** | 36 | 29 |
|
||||
| **80** | 64 | 55 |
|
||||
| **120** | 96 | 85 |
|
||||
| **200** | 157 | 137 |
|
||||
|
||||
#### Full Audio Throughput
|
||||
|
||||
**RTS** (seconds of audio processed per second, real time speed, or 1 / RTF) for full audio processing depends on **num_steps** (see previous paragraph) and **batch size** (bigger is better).
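For example, an RTS of 80 means that 80 seconds of audio are processed in one second of wall-clock time, i.e. RTF = 1 / 80 = 0.0125.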
|
||||
|
||||
| Batch size | num_steps | Pytorch model RTS | Onnx model RTS |
|
||||
| :--------: | :-------: | :---------------: | :------------: |
|
||||
| **40** | **4** | 68 | 86 |
|
||||
| **40** | **8** | 34 | 43 |
|
||||
| **80** | **4** | 78 | 91 |
|
||||
| **80** | **8** | 39 | 45 |
|
||||
| **120** | **4** | 78 | 88 |
|
||||
| **120** | **8** | 39 | 44 |
|
||||
| **200** | **4** | 80 | 91 |
|
||||
| **200** | **8** | 40 | 46 |
|
||||
|
||||
### VAD Quality Metrics
|
||||
|
||||
We use random 250 ms audio chunks for validation. The speech to non-speech ratio among chunks is about 50/50 (i.e. balanced). Speech chunks are sampled from real audio in four different languages (English, Russian, Spanish, German), and random background noise is added to some of them (~40%).
|
||||
|
||||
Since our VAD (only the VAD; the other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - the **speech probability**. We use the speech probabilities as thresholds for the precision-recall curve. This can be extended down to 100 - 150 ms; chunks shorter than 100 - 150 ms cannot be confidently distinguished as speech.
|
||||
|
||||
[WebRTC](https://github.com/wiseman/py-webrtcvad) splits audio into frames, and each frame gets a corresponding label (0 **or** 1). We use 30 ms frames for WebRTC, so each 250 ms chunk is split into 8 frames; their **mean** value is used as the threshold for the plot.

[Auditok](https://github.com/amsehili/auditok) - same logic as WebRTC, but we use 50 ms frames.
|
||||
|
||||

|
||||
|
||||
## FAQ
|
||||
|
||||
### VAD Parameter Fine Tuning
|
||||
|
||||
#### Classic way
|
||||
|
||||
**This is the straightforward classic method `get_speech_ts`, where the thresholds (`trig_sum` and `neg_trig_sum`) are specified by the user**
|
||||
- Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with the VAD;
- We provide sensible basic hyper-parameters that work for us, but your case may be different;
- `trig_sum` - overlapping windows are used for each audio chunk; `trig_sum` defines the average probability among those windows required to switch into the triggered (speech) state;
- `neg_trig_sum` - same as `trig_sum`, but for switching from the triggered to the non-triggered (non-speech) state (see the sketch after this list);
- `num_steps` - number of overlapping windows to split each audio chunk into (we recommend 4 or 8);
- `num_samples_per_window` - number of samples in each window; our models were trained using `4000` samples (250 ms) per window, so this is the preferable value (lower values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
- `min_speech_samples` - minimum speech chunk duration in samples;
- `min_silence_samples` - minimum silence duration in samples between two separate speech chunks.
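A minimal sketch of how the `trig_sum` / `neg_trig_sum` hysteresis behaves, assuming a list of averaged window probabilities per chunk (the threshold values shown are illustrative, not the library defaults):

```python
def triggered_states(mean_probs, trig_sum=0.25, neg_trig_sum=0.07):
    """Enter the speech state when the averaged window probability exceeds
    trig_sum; leave it only when the probability drops below neg_trig_sum."""
    triggered = False
    states = []
    for p in mean_probs:
        if not triggered and p >= trig_sum:
            triggered = True           # switch into the triggered (speech) state
        elif triggered and p < neg_trig_sum:
            triggered = False          # switch back to the non-triggered state
        states.append(triggered)
    return states

print(triggered_states([0.05, 0.4, 0.2, 0.1, 0.03]))  # [False, True, True, True, False]
```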
|
||||
|
||||
Optimal parameters may vary per domain, but we provide a tiny tool to find the best parameters. You can invoke `get_speech_ts` with `visualize_probs=True` (`pandas` required):
|
||||
|
||||
```python
|
||||
speech_timestamps = get_speech_ts(wav, model,
|
||||
num_samples_per_window=4000,
|
||||
num_steps=4,
|
||||
visualize_probs=True)
|
||||
```
|
||||
|
||||
#### Adaptive way
|
||||
|
||||
**The adaptive algorithm (`get_speech_ts_adaptive`) automatically selects the thresholds (`trig_sum` and `neg_trig_sum`) based on median speech probabilities over the whole audio. Note that some arguments differ from the classic method's arguments.**
|
||||
- `batch_size` - batch size to feed to Silero VAD (default - `200`);
- `step` - step size in samples (default - `500`), equal to `num_samples_per_window` / `num_steps` from the classic method;
- `num_samples_per_window` - number of samples in each window; our models were trained using `4000` samples (250 ms) per window, so this is the preferable value (lower values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
- `min_speech_samples` - minimum speech chunk duration in samples (default - `10000`);
- `min_silence_samples` - minimum silence duration in samples between two separate speech chunks (default - `4000`);
- `speech_pad_samples` - widen each speech chunk by this number of samples on each side (default - `2000`).
|
||||
|
||||
```python
|
||||
speech_timestamps = get_speech_ts_adaptive(wav, model,
|
||||
num_samples_per_window=4000,
|
||||
step=500,
|
||||
visualize_probs=True)
|
||||
```
|
||||
|
||||
|
||||
The chart should look something like this:
|
||||
|
||||

|
||||
|
||||
With this particular example you can try shorter chunks (`num_samples_per_window=1600`), but this results in too much noise:
|
||||
|
||||

|
||||
|
||||
|
||||
### How VAD Works
|
||||
|
||||
- Audio is split into 250 ms chunks (you can choose any chunk size, but quality will suffer with chunks shorter than 100 ms, with more false positives and "unnatural" pauses);
- The VAD keeps a record of the previous chunk (or zeros at the beginning of the stream);
- Then this 500 ms of audio (250 ms + 250 ms) is split into N (typically 4 or 8) windows and the model is applied to this window batch. Each window is 250 ms long (naturally, the windows overlap);
- Then the probability is averaged across these windows;
- Though pauses in speech are typically 300 ms or longer (pauses shorter than 200-300 ms are usually not meaningful), it is hard to confidently classify speech vs noise / music on very short chunks (i.e. 30-50 ms);
- ~~We are working on lifting this limitation, so that you can use 100-125 ms windows~~;
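A minimal sketch of the chunking and windowing steps listed above, assuming 16 kHz audio and a generic `model` that returns one speech probability per window (all names here are illustrative, not the library API):

```python
import torch

SR = 16000
CHUNK = SR // 4    # 250 ms = 4000 samples
NUM_STEPS = 4      # number of overlapping windows per 500 ms buffer

def stream_probs(model, chunks):
    """Yield one averaged speech probability per incoming 250 ms chunk."""
    previous = torch.zeros(CHUNK)                   # zeros at the start of the stream
    for chunk in chunks:
        buffer = torch.cat([previous, chunk])       # 500 ms = previous + new chunk
        offsets = torch.linspace(0, CHUNK, NUM_STEPS).long().tolist()
        windows = torch.stack([buffer[o: o + CHUNK] for o in offsets])  # overlapping 250 ms windows
        probs = model(windows)                      # one probability per window
        yield probs.mean().item()                   # average across the windows
        previous = chunk
```

With `NUM_STEPS = 4` and ten parallel streams this corresponds to the batch size of 40 used in the tables above.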
|
||||
|
||||
### VAD Quality Metrics Methodology
|
||||
|
||||
Please see [Quality Metrics](#quality-metrics)
|
||||
|
||||
### How Number Detector Works
|
||||
|
||||
- It is recommended to split long audio into short pieces (< 15 s) and apply the model to each of them (see the sketch after this list);
- The Number Detector can classify whether the whole audio contains a number, or whether each audio frame contains a number;
- Audio is split into frames in a certain way, so, given the per-frame output, we can restore the timing bounds of the numbers with an accuracy of about 0.2 s;
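A minimal sketch of the recommended pre-splitting step, assuming a 16 kHz waveform tensor; the 15-second limit comes from the recommendation above, and `get_number_ts` / `model` are the objects from the Number Detector example:

```python
SAMPLE_RATE = 16000
MAX_SECONDS = 15

def split_into_short_pieces(wav, max_seconds=MAX_SECONDS, sample_rate=SAMPLE_RATE):
    """Split a long waveform into consecutive pieces shorter than max_seconds."""
    piece_len = max_seconds * sample_rate
    return [wav[i: i + piece_len] for i in range(0, len(wav), piece_len)]

# pieces = split_into_short_pieces(wav)
# number_timestamps = [get_number_ts(piece, model) for piece in pieces]
```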
|
||||
|
||||
### How Language Classifier Works
|
||||
|
||||
- **99%** validation accuracy
|
||||
- Language classifier was trained using audio samples in 4 languages: **Russian**, **English**, **Spanish**, **German**
|
||||
- More languages TBD
|
||||
- Arbitrary audio length can be used, although network was trained using audio shorter than 15 seconds
|
||||
|
||||
### How Language Classifier 95 Works
|
||||
|
||||
- **85%** validation accuracy among 95 languages, **90%** validation accuracy among [58 language groups](https://github.com/snakers4/silero-vad/blob/master/files/lang_group_dict_95.json)
|
||||
- Language classifier 95 was trained using audio samples in [95 languages](https://github.com/snakers4/silero-vad/blob/master/files/lang_dict_95.json)
|
||||
- Arbitrary audio length can be used, although network was trained using audio shorter than 20 seconds
|
||||
|
||||
## Contact
|
||||
|
||||
### Get in Touch
|
||||
<br/>
|
||||
<h2 align="center">Get In Touch</h2>
|
||||
<br/>
|
||||
|
||||
Try our models, create an [issue](https://github.com/snakers4/silero-vad/issues/new), start a [discussion](https://github.com/snakers4/silero-vad/discussions/new), join our telegram [chat](https://t.me/silero_speech), [email](mailto:hello@silero.ai) us, read our [news](https://t.me/silero_news).
|
||||
|
||||
### Commercial Inquiries
|
||||
|
||||
Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers](https://github.com/snakers4/silero-models/wiki/Licensing-and-Tiers) for relevant information and [email](mailto:hello@silero.ai) us directly.
|
||||
|
||||
## Further reading
|
||||
|
||||
### General
|
||||
|
||||
- Silero-models - https://github.com/snakers4/silero-models
|
||||
- Nice [thread](https://github.com/snakers4/silero-vad/discussions/16#discussioncomment-305830) in discussions
|
||||
|
||||
### English
|
||||
|
||||
- STT:
|
||||
- Towards an Imagenet Moment For Speech-To-Text - [link](https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/)
|
||||
- A Speech-To-Text Practitioners Criticisms of Industry and Academia - [link](https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/)
|
||||
- Modern Google-level STT Models Released - [link](https://habr.com/ru/post/519562/)
|
||||
|
||||
- TTS:
|
||||
- High-Quality Text-to-Speech Made Accessible, Simple and Fast - [link](https://habr.com/ru/post/549482/)
|
||||
|
||||
- VAD:
|
||||
- Modern Portable Voice Activity Detector Released - [link](https://habr.com/ru/post/537276/)
|
||||
|
||||
- Text Enhancement:
|
||||
- We have published a model for text repunctuation and recapitalization for four languages - [link](https://habr.com/ru/post/581960/)
|
||||
|
||||
### Chinese
|
||||
|
||||
- STT:
|
||||
- 迈向语音识别领域的 ImageNet 时刻 - [link](https://www.infoq.cn/article/4u58WcFCs0RdpoXev1E2)
|
||||
- 语音领域学术界和工业界的七宗罪 - [link](https://www.infoq.cn/article/lEe6GCRjF1CNToVITvNw)
|
||||
|
||||
### Russian
|
||||
|
||||
- STT
|
||||
- Последние обновления моделей распознавания речи из Silero Models - [link](https://habr.com/ru/post/577630/)
|
||||
- Сжимаем трансформеры: простые, универсальные и прикладные способы cделать их компактными и быстрыми - [link](https://habr.com/ru/post/563778/)
|
||||
- Ультимативное сравнение систем распознавания речи: Ashmanov, Google, Sber, Silero, Tinkoff, Yandex - [link](https://habr.com/ru/post/559640/)
|
||||
- Мы опубликовали современные STT модели сравнимые по качеству с Google - [link](https://habr.com/ru/post/519564/)
|
||||
- Понижаем барьеры на вход в распознавание речи - [link](https://habr.com/ru/post/494006/)
|
||||
- Огромный открытый датасет русской речи версия 1.0 - [link](https://habr.com/ru/post/474462/)
|
||||
- Насколько Быстрой Можно Сделать Систему STT? - [link](https://habr.com/ru/post/531524/)
|
||||
- Наша система Speech-To-Text - [link](https://www.silero.ai/tag/our-speech-to-text/)
|
||||
- Speech To Text - [link](https://www.silero.ai/tag/speech-to-text/)
|
||||
|
||||
- TTS:
|
||||
- Мы сделали наш публичный синтез речи еще лучше - [link](https://habr.com/ru/post/563484/)
|
||||
- Мы Опубликовали Качественный, Простой, Доступный и Быстрый Синтез Речи - [link](https://habr.com/ru/post/549480/)
|
||||
|
||||
- VAD:
|
||||
- Модели для Детекции Речи, Чисел и Распознавания Языков - [link](https://www.silero.ai/vad-lang-classifier-number-detector/)
|
||||
- Мы опубликовали современный Voice Activity Detector и не только -[link](https://habr.com/ru/post/537274/)
|
||||
|
||||
- Text Enhancement:
|
||||
- Мы опубликовали модель, расставляющую знаки препинания и заглавные буквы в тексте на четырех языках - [link](https://habr.com/ru/post/581946/)
|
||||
|
||||
|
||||
## Citations
|
||||
**Citations**
|
||||
|
||||
```
|
||||
@misc{Silero VAD,
|
||||
@@ -625,3 +105,9 @@ Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers
|
||||
email = {hello@silero.ai}
|
||||
}
|
||||
```
|
||||
|
||||
<br/>
|
||||
<h2 align="center">VAD-based Community Apps</h2>
|
||||
<br/>
|
||||
|
||||
- Voice activity detection for the [browser](https://github.com/ricky0123/vad) using ONNX Runtime Web
|
||||
|
||||
241
examples/colab_record_example.ipynb
Normal file
241
examples/colab_record_example.ipynb
Normal file
@@ -0,0 +1,241 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "bccAucKjnPHm"
|
||||
},
|
||||
"source": [
|
||||
"### Dependencies and inputs"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "cSih95WFmwgi"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip -q install pydub\n",
|
||||
"from google.colab import output\n",
|
||||
"from base64 import b64decode, b64encode\n",
|
||||
"from io import BytesIO\n",
|
||||
"import numpy as np\n",
|
||||
"from pydub import AudioSegment\n",
|
||||
"from IPython.display import HTML, display\n",
|
||||
"import torch\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"import moviepy.editor as mpe\n",
|
||||
"from matplotlib.animation import FuncAnimation, FFMpegWriter\n",
|
||||
"import matplotlib\n",
|
||||
"matplotlib.use('Agg')\n",
|
||||
"\n",
|
||||
"torch.set_num_threads(1)\n",
|
||||
"\n",
|
||||
"model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
||||
" model='silero_vad',\n",
|
||||
" force_reload=True)\n",
|
||||
"\n",
|
||||
"def int2float(sound):\n",
|
||||
" abs_max = np.abs(sound).max()\n",
|
||||
" sound = sound.astype('float32')\n",
|
||||
" if abs_max > 0:\n",
|
||||
" sound *= 1/abs_max\n",
|
||||
" sound = sound.squeeze()\n",
|
||||
" return sound\n",
|
||||
"\n",
|
||||
"AUDIO_HTML = \"\"\"\n",
|
||||
"<script>\n",
|
||||
"var my_div = document.createElement(\"DIV\");\n",
|
||||
"var my_p = document.createElement(\"P\");\n",
|
||||
"var my_btn = document.createElement(\"BUTTON\");\n",
|
||||
"var t = document.createTextNode(\"Press to start recording\");\n",
|
||||
"\n",
|
||||
"my_btn.appendChild(t);\n",
|
||||
"//my_p.appendChild(my_btn);\n",
|
||||
"my_div.appendChild(my_btn);\n",
|
||||
"document.body.appendChild(my_div);\n",
|
||||
"\n",
|
||||
"var base64data = 0;\n",
|
||||
"var reader;\n",
|
||||
"var recorder, gumStream;\n",
|
||||
"var recordButton = my_btn;\n",
|
||||
"\n",
|
||||
"var handleSuccess = function(stream) {\n",
|
||||
" gumStream = stream;\n",
|
||||
" var options = {\n",
|
||||
" //bitsPerSecond: 8000, //chrome seems to ignore, always 48k\n",
|
||||
" mimeType : 'audio/webm;codecs=opus'\n",
|
||||
" //mimeType : 'audio/webm;codecs=pcm'\n",
|
||||
" }; \n",
|
||||
" //recorder = new MediaRecorder(stream, options);\n",
|
||||
" recorder = new MediaRecorder(stream);\n",
|
||||
" recorder.ondataavailable = function(e) { \n",
|
||||
" var url = URL.createObjectURL(e.data);\n",
|
||||
" // var preview = document.createElement('audio');\n",
|
||||
" // preview.controls = true;\n",
|
||||
" // preview.src = url;\n",
|
||||
" // document.body.appendChild(preview);\n",
|
||||
"\n",
|
||||
" reader = new FileReader();\n",
|
||||
" reader.readAsDataURL(e.data); \n",
|
||||
" reader.onloadend = function() {\n",
|
||||
" base64data = reader.result;\n",
|
||||
" //console.log(\"Inside FileReader:\" + base64data);\n",
|
||||
" }\n",
|
||||
" };\n",
|
||||
" recorder.start();\n",
|
||||
" };\n",
|
||||
"\n",
|
||||
"recordButton.innerText = \"Recording... press to stop\";\n",
|
||||
"\n",
|
||||
"navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"function toggleRecording() {\n",
|
||||
" if (recorder && recorder.state == \"recording\") {\n",
|
||||
" recorder.stop();\n",
|
||||
" gumStream.getAudioTracks()[0].stop();\n",
|
||||
" recordButton.innerText = \"Saving recording...\"\n",
|
||||
" }\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"// https://stackoverflow.com/a/951057\n",
|
||||
"function sleep(ms) {\n",
|
||||
" return new Promise(resolve => setTimeout(resolve, ms));\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"var data = new Promise(resolve=>{\n",
|
||||
"//recordButton.addEventListener(\"click\", toggleRecording);\n",
|
||||
"recordButton.onclick = ()=>{\n",
|
||||
"toggleRecording()\n",
|
||||
"\n",
|
||||
"sleep(2000).then(() => {\n",
|
||||
" // wait 2000ms for the data to be available...\n",
|
||||
" // ideally this should use something like await...\n",
|
||||
" //console.log(\"Inside data:\" + base64data)\n",
|
||||
" resolve(base64data.toString())\n",
|
||||
"\n",
|
||||
"});\n",
|
||||
"\n",
|
||||
"}\n",
|
||||
"});\n",
|
||||
" \n",
|
||||
"</script>\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
"def record(sec=10):\n",
|
||||
" display(HTML(AUDIO_HTML))\n",
|
||||
" s = output.eval_js(\"data\")\n",
|
||||
" b = b64decode(s.split(',')[1])\n",
|
||||
" audio = AudioSegment.from_file(BytesIO(b))\n",
|
||||
" audio.export('test.mp3', format='mp3')\n",
|
||||
" audio = audio.set_channels(1)\n",
|
||||
" audio = audio.set_frame_rate(16000)\n",
|
||||
" audio_float = int2float(np.array(audio.get_array_of_samples()))\n",
|
||||
" audio_tens = torch.tensor(audio_float )\n",
|
||||
" return audio_tens\n",
|
||||
"\n",
|
||||
"def make_animation(probs, audio_duration, interval=40):\n",
|
||||
" fig = plt.figure(figsize=(16, 9))\n",
|
||||
" ax = plt.axes(xlim=(0, audio_duration), ylim=(0, 1.02))\n",
|
||||
" line, = ax.plot([], [], lw=2)\n",
|
||||
" x = [i / 16000 * 512 for i in range(len(probs))]\n",
|
||||
" plt.xlabel('Time, seconds', fontsize=16)\n",
|
||||
" plt.ylabel('Speech Probability', fontsize=16)\n",
|
||||
"\n",
|
||||
" def init():\n",
|
||||
" plt.fill_between(x, probs, color='#064273')\n",
|
||||
" line.set_data([], [])\n",
|
||||
" line.set_color('#990000')\n",
|
||||
" return line,\n",
|
||||
"\n",
|
||||
" def animate(i):\n",
|
||||
" x = i * interval / 1000 - 0.04\n",
|
||||
" y = np.linspace(0, 1.02, 2)\n",
|
||||
" \n",
|
||||
" line.set_data(x, y)\n",
|
||||
" line.set_color('#990000')\n",
|
||||
" return line,\n",
|
||||
"\n",
|
||||
" anim = FuncAnimation(fig, animate, init_func=init, interval=interval, save_count=audio_duration / (interval / 1000))\n",
|
||||
"\n",
|
||||
" f = r\"animation.mp4\" \n",
|
||||
" writervideo = FFMpegWriter(fps=1000/interval) \n",
|
||||
" anim.save(f, writer=writervideo)\n",
|
||||
" plt.close('all')\n",
|
||||
"\n",
|
||||
"def combine_audio(vidname, audname, outname, fps=25): \n",
|
||||
" my_clip = mpe.VideoFileClip(vidname, verbose=False)\n",
|
||||
" audio_background = mpe.AudioFileClip(audname)\n",
|
||||
" final_clip = my_clip.set_audio(audio_background)\n",
|
||||
" final_clip.write_videofile(outname,fps=fps,verbose=False)\n",
|
||||
"\n",
|
||||
"def record_make_animation():\n",
|
||||
" tensor = record()\n",
|
||||
"\n",
|
||||
" print('Calculating probabilities...')\n",
|
||||
" speech_probs = []\n",
|
||||
" window_size_samples = 512\n",
|
||||
" for i in range(0, len(tensor), window_size_samples):\n",
|
||||
" if len(tensor[i: i+ window_size_samples]) < window_size_samples:\n",
|
||||
" break\n",
|
||||
" speech_prob = model(tensor[i: i+ window_size_samples], 16000).item()\n",
|
||||
" speech_probs.append(speech_prob)\n",
|
||||
" model.reset_states()\n",
|
||||
" print('Making animation...')\n",
|
||||
" make_animation(speech_probs, len(tensor) / 16000)\n",
|
||||
"\n",
|
||||
" print('Merging your voice with animation...')\n",
|
||||
" combine_audio('animation.mp4', 'test.mp3', 'merged.mp4')\n",
|
||||
" print('Done!')\n",
|
||||
" mp4 = open('merged.mp4','rb').read()\n",
|
||||
" data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
|
||||
" display(HTML(\"\"\"\n",
|
||||
" <video width=800 controls>\n",
|
||||
" <source src=\"%s\" type=\"video/mp4\">\n",
|
||||
" </video>\n",
|
||||
" \"\"\" % data_url))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "IFVs3GvTnpB1"
|
||||
},
|
||||
"source": [
|
||||
"## Record example"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "5EBjrTwiqAaQ"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"record_make_animation()"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"collapsed_sections": [
|
||||
"bccAucKjnPHm"
|
||||
],
|
||||
"name": "Untitled2.ipynb",
|
||||
"provenance": []
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
BIN files/de.wav (binary file not shown)
BIN files/en.wav (binary file not shown)
BIN files/en_num.wav (binary file not shown)
BIN files/es.wav (binary file not shown)
Binary file not shown.
Binary file not shown.
BIN files/model.jit (binary file not shown)
BIN files/model.onnx (binary file not shown)
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
BIN files/ru.wav (binary file not shown)
BIN files/ru_num.wav (binary file not shown)
BIN files/silero_vad.jit (new file; binary file not shown)
BIN files/silero_vad.onnx (new file; binary file not shown)
143
hubconf.py
143
hubconf.py
@@ -1,117 +1,61 @@
|
||||
dependencies = ['torch', 'torchaudio']
|
||||
import torch
|
||||
import os
|
||||
import json
|
||||
from utils_vad import (init_jit_model,
|
||||
get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
get_speech_timestamps,
|
||||
get_number_ts,
|
||||
get_language,
|
||||
get_language_and_group,
|
||||
save_audio,
|
||||
read_audio,
|
||||
state_generator,
|
||||
single_audio_stream,
|
||||
VADIterator,
|
||||
collect_chunks,
|
||||
drop_chunks)
|
||||
drop_chunks,
|
||||
Validator,
|
||||
OnnxWrapper)
|
||||
|
||||
|
||||
def silero_vad(**kwargs):
|
||||
def versiontuple(v):
|
||||
return tuple(map(int, (v.split('+')[0].split("."))))
|
||||
|
||||
|
||||
def silero_vad(onnx=False, force_onnx_cpu=False):
|
||||
"""Silero Voice Activity Detector
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model.jit')
|
||||
utils = (get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
|
||||
if not onnx:
|
||||
installed_version = torch.__version__
|
||||
supported_version = '1.12.0'
|
||||
if versiontuple(installed_version) < versiontuple(supported_version):
|
||||
raise Exception(f'Please install torch {supported_version} or greater ({installed_version} installed)')
|
||||
|
||||
model_dir = os.path.join(os.path.dirname(__file__), 'files')
|
||||
if onnx:
|
||||
model = OnnxWrapper(os.path.join(model_dir, 'silero_vad.onnx'), force_onnx_cpu)
|
||||
else:
|
||||
model = init_jit_model(os.path.join(model_dir, 'silero_vad.jit'))
|
||||
utils = (get_speech_timestamps,
|
||||
save_audio,
|
||||
read_audio,
|
||||
state_generator,
|
||||
single_audio_stream,
|
||||
VADIterator,
|
||||
collect_chunks)
|
||||
|
||||
return model, utils
|
||||
|
||||
|
||||
def silero_vad_micro(**kwargs):
|
||||
"""Silero Voice Activity Detector
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_micro.jit')
|
||||
utils = (get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
save_audio,
|
||||
read_audio,
|
||||
state_generator,
|
||||
single_audio_stream,
|
||||
collect_chunks)
|
||||
|
||||
return model, utils
|
||||
|
||||
|
||||
def silero_vad_micro_8k(**kwargs):
|
||||
"""Silero Voice Activity Detector
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_micro_8k.jit')
|
||||
utils = (get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
save_audio,
|
||||
read_audio,
|
||||
state_generator,
|
||||
single_audio_stream,
|
||||
collect_chunks)
|
||||
|
||||
return model, utils
|
||||
|
||||
|
||||
def silero_vad_mini(**kwargs):
|
||||
"""Silero Voice Activity Detector
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_mini.jit')
|
||||
utils = (get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
save_audio,
|
||||
read_audio,
|
||||
state_generator,
|
||||
single_audio_stream,
|
||||
collect_chunks)
|
||||
|
||||
return model, utils
|
||||
|
||||
|
||||
def silero_vad_mini_8k(**kwargs):
|
||||
"""Silero Voice Activity Detector
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/model_mini_8k.jit')
|
||||
utils = (get_speech_ts,
|
||||
get_speech_ts_adaptive,
|
||||
save_audio,
|
||||
read_audio,
|
||||
state_generator,
|
||||
single_audio_stream,
|
||||
collect_chunks)
|
||||
|
||||
return model, utils
|
||||
|
||||
|
||||
def silero_number_detector(**kwargs):
|
||||
def silero_number_detector(onnx=False, force_onnx_cpu=False):
|
||||
"""Silero Number Detector
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/number_detector.jit')
|
||||
if onnx:
|
||||
url = 'https://models.silero.ai/vad_models/number_detector.onnx'
|
||||
else:
|
||||
url = 'https://models.silero.ai/vad_models/number_detector.jit'
|
||||
model = Validator(url, force_onnx_cpu)
|
||||
utils = (get_number_ts,
|
||||
save_audio,
|
||||
read_audio,
|
||||
@@ -121,34 +65,41 @@ def silero_number_detector(**kwargs):
|
||||
return model, utils
|
||||
|
||||
|
||||
def silero_lang_detector(**kwargs):
|
||||
def silero_lang_detector(onnx=False, force_onnx_cpu=False):
|
||||
"""Silero Language Classifier
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/number_detector.jit')
|
||||
if onnx:
|
||||
url = 'https://models.silero.ai/vad_models/number_detector.onnx'
|
||||
else:
|
||||
url = 'https://models.silero.ai/vad_models/number_detector.jit'
|
||||
model = Validator(url, force_onnx_cpu)
|
||||
utils = (get_language,
|
||||
read_audio)
|
||||
|
||||
return model, utils
|
||||
|
||||
|
||||
def silero_lang_detector_95(**kwargs):
|
||||
def silero_lang_detector_95(onnx=False, force_onnx_cpu=False):
|
||||
"""Silero Language Classifier (95 languages)
|
||||
Returns a model with a set of utils
|
||||
Please see https://github.com/snakers4/silero-vad for usage examples
|
||||
"""
|
||||
|
||||
hub_dir = torch.hub.get_dir()
|
||||
model = init_jit_model(model_path=f'{hub_dir}/snakers4_silero-vad_master/files/lang_classifier_95.jit')
|
||||
if onnx:
|
||||
url = 'https://models.silero.ai/vad_models/lang_classifier_95.onnx'
|
||||
else:
|
||||
url = 'https://models.silero.ai/vad_models/lang_classifier_95.jit'
|
||||
model = Validator(url, force_onnx_cpu)
|
||||
|
||||
with open(f'{hub_dir}/snakers4_silero-vad_master/files/lang_dict_95.json', 'r') as f:
|
||||
model_dir = os.path.join(os.path.dirname(__file__), 'files')
|
||||
with open(os.path.join(model_dir, 'lang_dict_95.json'), 'r') as f:
|
||||
lang_dict = json.load(f)
|
||||
|
||||
with open(f'{hub_dir}/snakers4_silero-vad_master/files/lang_group_dict_95.json', 'r') as f:
|
||||
with open(os.path.join(model_dir, 'lang_group_dict_95.json'), 'r') as f:
|
||||
lang_group_dict = json.load(f)
|
||||
|
||||
utils = (get_language_and_group, read_audio)
|
||||
|
||||
return model, lang_dict, lang_group_dict, utils
|
||||
return model, lang_dict, lang_group_dict, utils
|
||||
|
||||
776
silero-vad.ipynb
776
silero-vad.ipynb
@@ -1,21 +1,12 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "sVNOuHQQjsrp"
|
||||
},
|
||||
"source": [
|
||||
"# PyTorch Examples"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "FpMplOCA2Fwp"
|
||||
},
|
||||
"source": [
|
||||
"## VAD"
|
||||
"#VAD"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -25,7 +16,7 @@
|
||||
"id": "62A6F_072Fwq"
|
||||
},
|
||||
"source": [
|
||||
"### Install Dependencies"
|
||||
"## Install Dependencies"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -40,28 +31,41 @@
|
||||
"#@title Install and Import Dependencies\n",
|
||||
"\n",
|
||||
"# this assumes that you have a relevant version of PyTorch installed\n",
|
||||
"!pip install -q torchaudio soundfile\n",
|
||||
"!pip install -q torchaudio\n",
|
||||
"\n",
|
||||
"SAMPLING_RATE = 16000\n",
|
||||
"\n",
|
||||
"import glob\n",
|
||||
"import torch\n",
|
||||
"torch.set_num_threads(1)\n",
|
||||
"\n",
|
||||
"from IPython.display import Audio\n",
|
||||
"from pprint import pprint\n",
|
||||
"\n",
|
||||
"# download example\n",
|
||||
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', 'en_example.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "pSifus5IilRp"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"USE_ONNX = False # change this to True if you want to test onnx model\n",
|
||||
"if USE_ONNX:\n",
|
||||
" !pip install -q onnxruntime\n",
|
||||
" \n",
|
||||
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
||||
" model='silero_vad',\n",
|
||||
" force_reload=True)\n",
|
||||
" force_reload=True,\n",
|
||||
" onnx=USE_ONNX)\n",
|
||||
"\n",
|
||||
"(get_speech_ts,\n",
|
||||
" get_speech_ts_adaptive,\n",
|
||||
"(get_speech_timestamps,\n",
|
||||
" save_audio,\n",
|
||||
" read_audio,\n",
|
||||
" state_generator,\n",
|
||||
" single_audio_stream,\n",
|
||||
" collect_chunks) = utils\n",
|
||||
"\n",
|
||||
"files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'"
|
||||
" VADIterator,\n",
|
||||
" collect_chunks) = utils"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -70,16 +74,16 @@
|
||||
"id": "fXbbaUO3jsrw"
|
||||
},
|
||||
"source": [
|
||||
"### Full Audio"
|
||||
"## Full Audio"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "dY2Us3_Q2Fws"
|
||||
"id": "RAfJPb_a-Auj"
|
||||
},
|
||||
"source": [
|
||||
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
|
||||
"**Speech timestapms from full audio**"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -90,10 +94,9 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"wav = read_audio(f'{files_dir}/en.wav')\n",
|
||||
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
||||
"# get speech timestamps from full audio file\n",
|
||||
"speech_timestamps = get_speech_ts(wav, model,\n",
|
||||
" num_steps=4)\n",
|
||||
"speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)\n",
|
||||
"pprint(speech_timestamps)"
|
||||
]
|
||||
},
|
||||
@@ -107,44 +110,7 @@
|
||||
"source": [
|
||||
"# merge all speech chunks to one audio\n",
|
||||
"save_audio('only_speech.wav',\n",
|
||||
" collect_chunks(speech_timestamps, wav), 16000) \n",
|
||||
"Audio('only_speech.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "n8plzbJU2Fws"
|
||||
},
|
||||
"source": [
|
||||
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "SQOtu2Vl2Fwt"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"wav = read_audio(f'{files_dir}/en.wav')\n",
|
||||
"# get speech timestamps from full audio file\n",
|
||||
"speech_timestamps = get_speech_ts_adaptive(wav, model, step=500, num_samples_per_window=4000)\n",
|
||||
"pprint(speech_timestamps)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "Lr6zCGXh2Fwt"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# merge all speech chunks to one audio\n",
|
||||
"save_audio('only_speech.wav',\n",
|
||||
" collect_chunks(speech_timestamps, wav), 16000) \n",
|
||||
" collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE) \n",
|
||||
"Audio('only_speech.wav')"
|
||||
]
|
||||
},
|
||||
@@ -154,16 +120,7 @@
|
||||
"id": "iDKQbVr8jsry"
|
||||
},
|
||||
"source": [
|
||||
"### Single Audio Stream"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "xCM-HrUR2Fwu"
|
||||
},
|
||||
"source": [
|
||||
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
|
||||
"## Stream imitation example"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -174,20 +131,20 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"wav = f'{files_dir}/en.wav'\n",
|
||||
"## using VADIterator class\n",
|
||||
"\n",
|
||||
"for batch in single_audio_stream(model, wav):\n",
|
||||
" if batch:\n",
|
||||
" print(batch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "t8TXtnvk2Fwv"
|
||||
},
|
||||
"source": [
|
||||
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
|
||||
"vad_iterator = VADIterator(model)\n",
|
||||
"wav = read_audio(f'en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
||||
"\n",
|
||||
"window_size_samples = 1536 # number of samples in a single audio chunk\n",
|
||||
"for i in range(0, len(wav), window_size_samples):\n",
|
||||
" chunk = wav[i: i+ window_size_samples]\n",
|
||||
" if len(chunk) < window_size_samples:\n",
|
||||
" break\n",
|
||||
" speech_dict = vad_iterator(chunk, return_seconds=True)\n",
|
||||
" if speech_dict:\n",
|
||||
" print(speech_dict, end=' ')\n",
|
||||
"vad_iterator.reset_states() # reset model states after each audio"
|
||||
]
|
||||
},
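The added streaming cell above, taken on its own, works like this (1536 samples per chunk matches the chunk sizes the 16 kHz model was trained with):

```python
vad_iterator = VADIterator(model)
wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)

window_size_samples = 1536  # number of samples in a single audio chunk
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:            # a dict with either a 'start' or an 'end' key
        print(speech_dict, end=' ')
vad_iterator.reset_states()    # reset model states after each audio
```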
|
||||
{
|
||||
@@ -198,48 +155,20 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"wav = f'{files_dir}/en.wav'\n",
|
||||
"## just probabilities\n",
|
||||
"\n",
|
||||
"for batch in single_audio_stream(model, wav, iterator_type='adaptive'):\n",
|
||||
" if batch:\n",
|
||||
" print(batch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"id": "KBDVybJCjsrz"
|
||||
},
|
||||
"source": [
|
||||
"### Multiple Audio Streams"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "BK4tGfWgjsrz"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"audios_for_stream = glob.glob(f'{files_dir}/*.wav')\n",
|
||||
"len(audios_for_stream) # total 4 audios"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "v1l8sam1jsrz"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for batch in state_generator(model, audios_for_stream, audios_in_stream=2): # 2 audio stream\n",
|
||||
" if batch:\n",
|
||||
" pprint(batch)"
|
||||
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
||||
"speech_probs = []\n",
|
||||
"window_size_samples = 1536\n",
|
||||
"for i in range(0, len(wav), window_size_samples):\n",
|
||||
" chunk = wav[i: i+ window_size_samples]\n",
|
||||
" if len(chunk) < window_size_samples:\n",
|
||||
" break\n",
|
||||
" speech_prob = model(chunk, SAMPLING_RATE).item()\n",
|
||||
" speech_probs.append(speech_prob)\n",
|
||||
"vad_iterator.reset_states() # reset model states after each audio\n",
|
||||
"\n",
|
||||
"print(speech_probs[:10]) # first 10 chunks predicts"
|
||||
]
|
||||
},
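For raw per-chunk probabilities (the second added cell above), a minimal sketch; it assumes the v4 JIT model exposes reset_states(), as shown later in the utils_vad.py diff:

```python
wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)

speech_probs = []
window_size_samples = 1536
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_probs.append(model(chunk, SAMPLING_RATE).item())
model.reset_states()  # reset model states after each audio

print(speech_probs[:10])  # speech probabilities for the first 10 chunks
```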
|
||||
{
|
||||
@@ -249,7 +178,7 @@
|
||||
"id": "36jY0niD2Fww"
|
||||
},
|
||||
"source": [
|
||||
"## Number detector"
|
||||
"# Number detector"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -260,7 +189,7 @@
|
||||
"id": "scd1DlS42Fwx"
|
||||
},
|
||||
"source": [
|
||||
"### Install Dependencies"
|
||||
"## Install Dependencies"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -275,26 +204,41 @@
|
||||
"#@title Install and Import Dependencies\n",
|
||||
"\n",
|
||||
"# this assumes that you have a relevant version of PyTorch installed\n",
|
||||
"!pip install -q torchaudio soundfile\n",
|
||||
"!pip install -q torchaudio\n",
|
||||
"\n",
|
||||
"SAMPLING_RATE = 16000\n",
|
||||
"\n",
|
||||
"import glob\n",
|
||||
"import torch\n",
|
||||
"torch.set_num_threads(1)\n",
|
||||
"\n",
|
||||
"from IPython.display import Audio\n",
|
||||
"from pprint import pprint\n",
|
||||
"\n",
|
||||
"# download example\n",
|
||||
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en_num.wav', 'en_number_example.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "dPwCFHmFycUF"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"USE_ONNX = False # change this to True if you want to test onnx model\n",
|
||||
"if USE_ONNX:\n",
|
||||
" !pip install -q onnxruntime\n",
|
||||
" \n",
|
||||
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
||||
" model='silero_number_detector',\n",
|
||||
" force_reload=True)\n",
|
||||
" force_reload=True,\n",
|
||||
" onnx=USE_ONNX)\n",
|
||||
"\n",
|
||||
"(get_number_ts,\n",
|
||||
" save_audio,\n",
|
||||
" read_audio,\n",
|
||||
" collect_chunks,\n",
|
||||
" drop_chunks) = utils\n",
|
||||
"\n",
|
||||
"files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'"
|
||||
" drop_chunks) = utils\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -305,7 +249,7 @@
|
||||
"id": "qhPa30ij2Fwy"
|
||||
},
|
||||
"source": [
|
||||
"### Full audio"
|
||||
"## Full audio"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -317,7 +261,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"wav = read_audio(f'{files_dir}/en_num.wav')\n",
|
||||
"wav = read_audio('en_number_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
||||
"# get number timestamps from full audio file\n",
|
||||
"number_timestamps = get_number_ts(wav, model)\n",
|
||||
"pprint(number_timestamps)"
|
||||
@@ -332,11 +276,10 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sample_rate = 16000\n",
|
||||
"# convert ms in timestamps to samples\n",
|
||||
"for timestamp in number_timestamps:\n",
|
||||
" timestamp['start'] = int(timestamp['start'] * sample_rate / 1000)\n",
|
||||
" timestamp['end'] = int(timestamp['end'] * sample_rate / 1000)"
|
||||
" timestamp['start'] = int(timestamp['start'] * SAMPLING_RATE / 1000)\n",
|
||||
" timestamp['end'] = int(timestamp['end'] * SAMPLING_RATE / 1000)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -350,7 +293,7 @@
|
||||
"source": [
|
||||
"# merge all number chunks to one audio\n",
|
||||
"save_audio('only_numbers.wav',\n",
|
||||
" collect_chunks(number_timestamps, wav), sample_rate) \n",
|
||||
" collect_chunks(number_timestamps, wav), SAMPLING_RATE) \n",
|
||||
"Audio('only_numbers.wav')"
|
||||
]
|
||||
},
|
||||
@@ -365,7 +308,7 @@
|
||||
"source": [
|
||||
"# drop all number chunks from audio\n",
|
||||
"save_audio('no_numbers.wav',\n",
|
||||
" drop_chunks(number_timestamps, wav), sample_rate) \n",
|
||||
" drop_chunks(number_timestamps, wav), SAMPLING_RATE) \n",
|
||||
"Audio('no_numbers.wav')"
|
||||
]
|
||||
},
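Taken together, the number-detector cells above boil down to the following sketch (get_number_ts returns timestamps in milliseconds, which are converted to samples before cutting):

```python
wav = read_audio('en_number_example.wav', sampling_rate=SAMPLING_RATE)

number_timestamps = get_number_ts(wav, model)   # timestamps in milliseconds

for ts in number_timestamps:                     # convert ms to samples
    ts['start'] = int(ts['start'] * SAMPLING_RATE / 1000)
    ts['end'] = int(ts['end'] * SAMPLING_RATE / 1000)

save_audio('only_numbers.wav', collect_chunks(number_timestamps, wav), SAMPLING_RATE)
save_audio('no_numbers.wav', drop_chunks(number_timestamps, wav), SAMPLING_RATE)
```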
|
||||
@@ -376,7 +319,7 @@
|
||||
"id": "PnKtJKbq2Fwz"
|
||||
},
|
||||
"source": [
|
||||
"## Language detector"
|
||||
"# Language detector"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -387,7 +330,7 @@
|
||||
"id": "F5cAmMbP2Fwz"
|
||||
},
|
||||
"source": [
|
||||
"### Install Dependencies"
|
||||
"## Install Dependencies"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -402,23 +345,37 @@
|
||||
"#@title Install and Import Dependencies\n",
|
||||
"\n",
|
||||
"# this assumes that you have a relevant version of PyTorch installed\n",
|
||||
"!pip install -q torchaudio soundfile\n",
|
||||
"!pip install -q torchaudio\n",
|
||||
"\n",
|
||||
"SAMPLING_RATE = 16000\n",
|
||||
"\n",
|
||||
"import glob\n",
|
||||
"import torch\n",
|
||||
"torch.set_num_threads(1)\n",
|
||||
"\n",
|
||||
"from IPython.display import Audio\n",
|
||||
"from pprint import pprint\n",
|
||||
"\n",
|
||||
"# download example\n",
|
||||
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', 'en_example.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "JfRKDZiRztFe"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"USE_ONNX = False # change this to True if you want to test onnx model\n",
|
||||
"if USE_ONNX:\n",
|
||||
" !pip install -q onnxruntime\n",
|
||||
" \n",
|
||||
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
||||
" model='silero_lang_detector',\n",
|
||||
" force_reload=True)\n",
|
||||
" force_reload=True,\n",
|
||||
" onnx=USE_ONNX)\n",
|
||||
"\n",
|
||||
"(get_language,\n",
|
||||
" read_audio) = utils\n",
|
||||
"\n",
|
||||
"files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'"
|
||||
"get_language, read_audio = utils"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -429,7 +386,7 @@
|
||||
"id": "iC696eMX2Fwz"
|
||||
},
|
||||
"source": [
|
||||
"### Full audio"
|
||||
"## Full audio"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -441,513 +398,10 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"wav = read_audio(f'{files_dir}/en.wav')\n",
|
||||
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
||||
"lang = get_language(wav, model)\n",
|
||||
"print(lang)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "57avIBd6jsrz"
|
||||
},
|
||||
"source": [
|
||||
"# ONNX Example"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "hEhnfORV2Fw0"
|
||||
},
|
||||
"source": [
|
||||
"## VAD"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"id": "bL4kn4KJrlyL"
|
||||
},
|
||||
"source": [
|
||||
"### Install Dependencies"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"cellView": "form",
|
||||
"hidden": true,
|
||||
"id": "Q4QIfSpprnkI"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#@title Install and Import Dependencies\n",
|
||||
"\n",
|
||||
"# this assumes that you have a relevant version of PyTorch installed\n",
|
||||
"!pip install -q torchaudio soundfile onnxruntime\n",
|
||||
"\n",
|
||||
"import glob\n",
|
||||
"import onnxruntime\n",
|
||||
"from pprint import pprint\n",
|
||||
"\n",
|
||||
"from IPython.display import Audio\n",
|
||||
"\n",
|
||||
"_, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
||||
" model='silero_vad',\n",
|
||||
" force_reload=True)\n",
|
||||
"\n",
|
||||
"(get_speech_ts,\n",
|
||||
" get_speech_ts_adaptive,\n",
|
||||
" save_audio,\n",
|
||||
" read_audio,\n",
|
||||
" state_generator,\n",
|
||||
" single_audio_stream,\n",
|
||||
" collect_speeches) = utils\n",
|
||||
"\n",
|
||||
"files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'\n",
|
||||
"\n",
|
||||
"def init_onnx_model(model_path: str):\n",
|
||||
" return onnxruntime.InferenceSession(model_path)\n",
|
||||
"\n",
|
||||
"def validate_onnx(model, inputs):\n",
|
||||
" with torch.no_grad():\n",
|
||||
" ort_inputs = {'input': inputs.cpu().numpy()}\n",
|
||||
" outs = model.run(None, ort_inputs)\n",
|
||||
" outs = [torch.Tensor(x) for x in outs]\n",
|
||||
" return outs[0]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "5JHErdB7jsr0"
|
||||
},
|
||||
"source": [
|
||||
"### Full Audio"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "TNEtK5zi2Fw2"
|
||||
},
|
||||
"source": [
|
||||
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "krnGoA6Kjsr0"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = init_onnx_model(f'{files_dir}/model.onnx')\n",
|
||||
"wav = read_audio(f'{files_dir}/en.wav')\n",
|
||||
"\n",
|
||||
"# get speech timestamps from full audio file\n",
|
||||
"speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate_onnx) \n",
|
||||
"pprint(speech_timestamps)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "B176Lzfnjsr1"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# merge all speech chunks to one audio\n",
|
||||
"save_audio('only_speech.wav', collect_chunks(speech_timestamps, wav), 16000)\n",
|
||||
"Audio('only_speech.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "21RE8KEC2Fw2"
|
||||
},
|
||||
"source": [
|
||||
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "uIVs56rb2Fw2"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = init_onnx_model(f'{files_dir}/model.onnx')\n",
|
||||
"wav = read_audio(f'{files_dir}/en.wav')\n",
|
||||
"\n",
|
||||
"# get speech timestamps from full audio file\n",
|
||||
"speech_timestamps = get_speech_ts_adaptive(wav, model, run_function=validate_onnx) \n",
|
||||
"pprint(speech_timestamps)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "cox6oumC2Fw3"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# merge all speech chunks to one audio\n",
|
||||
"save_audio('only_speech.wav', collect_chunks(speech_timestamps, wav), 16000)\n",
|
||||
"Audio('only_speech.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "Rio9W50gjsr1"
|
||||
},
|
||||
"source": [
|
||||
"### Single Audio Stream"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "i8EZwtaA2Fw3"
|
||||
},
|
||||
"source": [
|
||||
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "IPkl8Yy1jsr1"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = init_onnx_model(f'{files_dir}/model.onnx')\n",
|
||||
"wav = f'{files_dir}/en.wav'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "NC6Jim0hjsr1"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for batch in single_audio_stream(model, wav, run_function=validate_onnx):\n",
|
||||
" if batch:\n",
|
||||
" pprint(batch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "0pSKslpz2Fw3"
|
||||
},
|
||||
"source": [
|
||||
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "RZwc-Khk2Fw4"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = init_onnx_model(f'{files_dir}/model.onnx')\n",
|
||||
"wav = f'{files_dir}/en.wav'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "Z4lzFPs02Fw4"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for batch in single_audio_stream(model, wav, iterator_type='adaptive', run_function=validate_onnx):\n",
|
||||
" if batch:\n",
|
||||
" pprint(batch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"id": "WNZ42u0ajsr1"
|
||||
},
|
||||
"source": [
|
||||
"### Multiple Audio Streams"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "XjhGQGppjsr1"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = init_onnx_model(f'{files_dir}/model.onnx')\n",
|
||||
"audios_for_stream = glob.glob(f'{files_dir}/*.wav')\n",
|
||||
"pprint(len(audios_for_stream)) # total 4 audios"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "QI7-arlqjsr2"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for batch in state_generator(model, audios_for_stream, audios_in_stream=2, run_function=validate_onnx): # 2 audio stream\n",
|
||||
" if batch:\n",
|
||||
" pprint(batch)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"id": "7QMvUvpg2Fw4"
|
||||
},
|
||||
"source": [
|
||||
"## Number detector"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"hidden": true,
|
||||
"id": "tBPDkpHr2Fw4"
|
||||
},
|
||||
"source": [
|
||||
"### Install Dependencies"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"cellView": "form",
|
||||
"hidden": true,
|
||||
"id": "PdjGd56R2Fw5"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#@title Install and Import Dependencies\n",
|
||||
"\n",
|
||||
"# this assumes that you have a relevant version of PyTorch installed\n",
|
||||
"!pip install -q torchaudio soundfile onnxruntime\n",
|
||||
"\n",
|
||||
"import glob\n",
|
||||
"import torch\n",
|
||||
"import onnxruntime\n",
|
||||
"from pprint import pprint\n",
|
||||
"\n",
|
||||
"from IPython.display import Audio\n",
|
||||
"\n",
|
||||
"_, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
||||
" model='silero_number_detector',\n",
|
||||
" force_reload=True)\n",
|
||||
"\n",
|
||||
"(get_number_ts,\n",
|
||||
" save_audio,\n",
|
||||
" read_audio,\n",
|
||||
" collect_chunks,\n",
|
||||
" drop_chunks) = utils\n",
|
||||
"\n",
|
||||
"files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'\n",
|
||||
"\n",
|
||||
"def init_onnx_model(model_path: str):\n",
|
||||
" return onnxruntime.InferenceSession(model_path)\n",
|
||||
"\n",
|
||||
"def validate_onnx(model, inputs):\n",
|
||||
" with torch.no_grad():\n",
|
||||
" ort_inputs = {'input': inputs.cpu().numpy()}\n",
|
||||
" outs = model.run(None, ort_inputs)\n",
|
||||
" outs = [torch.Tensor(x) for x in outs]\n",
|
||||
" return outs"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"hidden": true,
|
||||
"id": "I9QWSFZh2Fw5"
|
||||
},
|
||||
"source": [
|
||||
"### Full Audio"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "_r6QZiwu2Fw5"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = init_onnx_model(f'{files_dir}/number_detector.onnx')\n",
|
||||
"wav = read_audio(f'{files_dir}/en_num.wav')\n",
|
||||
"\n",
|
||||
"# get number timestamps from full audio file\n",
|
||||
"number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)\n",
|
||||
"pprint(number_timestamps)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "FN4aDwLV2Fw5"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sample_rate = 16000\n",
|
||||
"# convert ms in timestamps to samples\n",
|
||||
"for timestamp in number_timestamps:\n",
|
||||
" timestamp['start'] = int(timestamp['start'] * sample_rate / 1000)\n",
|
||||
" timestamp['end'] = int(timestamp['end'] * sample_rate / 1000)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "JnvS6WTK2Fw5"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# merge all number chunks to one audio\n",
|
||||
"save_audio('only_numbers.wav',\n",
|
||||
" collect_chunks(number_timestamps, wav), 16000) \n",
|
||||
"Audio('only_numbers.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "yUxOcOFG2Fw6"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# drop all number chunks from audio\n",
|
||||
"save_audio('no_numbers.wav',\n",
|
||||
" drop_chunks(number_timestamps, wav), 16000) \n",
|
||||
"Audio('no_numbers.wav')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"id": "SR8Bgcd52Fw6"
|
||||
},
|
||||
"source": [
|
||||
"## Language detector"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"heading_collapsed": true,
|
||||
"hidden": true,
|
||||
"id": "PBnXPtKo2Fw6"
|
||||
},
|
||||
"source": [
|
||||
"### Install Dependencies"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"cellView": "form",
|
||||
"hidden": true,
|
||||
"id": "iNkDWJ3H2Fw6"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#@title Install and Import Dependencies\n",
|
||||
"\n",
|
||||
"# this assumes that you have a relevant version of PyTorch installed\n",
|
||||
"!pip install -q torchaudio soundfile onnxruntime\n",
|
||||
"\n",
|
||||
"import glob\n",
|
||||
"import torch\n",
|
||||
"import onnxruntime\n",
|
||||
"from pprint import pprint\n",
|
||||
"\n",
|
||||
"from IPython.display import Audio\n",
|
||||
"\n",
|
||||
"_, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
||||
" model='silero_lang_detector',\n",
|
||||
" force_reload=True)\n",
|
||||
"\n",
|
||||
"(get_language,\n",
|
||||
" read_audio) = utils\n",
|
||||
"\n",
|
||||
"files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'\n",
|
||||
"\n",
|
||||
"def init_onnx_model(model_path: str):\n",
|
||||
" return onnxruntime.InferenceSession(model_path)\n",
|
||||
"\n",
|
||||
"def validate_onnx(model, inputs):\n",
|
||||
" with torch.no_grad():\n",
|
||||
" ort_inputs = {'input': inputs.cpu().numpy()}\n",
|
||||
" outs = model.run(None, ort_inputs)\n",
|
||||
" outs = [torch.Tensor(x) for x in outs]\n",
|
||||
" return outs"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "G8N8oP4q2Fw6"
|
||||
},
|
||||
"source": [
|
||||
"### Full Audio"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"hidden": true,
|
||||
"id": "WHXnh9IV2Fw6"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = init_onnx_model(f'{files_dir}/number_detector.onnx')\n",
|
||||
"wav = read_audio(f'{files_dir}/en.wav')\n",
|
||||
"\n",
|
||||
"lang = get_language(wav, model, run_function=validate_onnx)\n",
|
||||
"print(lang)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
||||
utils_vad.py (754)
@@ -1,71 +1,143 @@
|
||||
import torch
|
||||
import torchaudio
|
||||
from typing import List
|
||||
from itertools import repeat
|
||||
from collections import deque
|
||||
import torch.nn.functional as F
|
||||
|
||||
|
||||
torchaudio.set_audio_backend("soundfile") # switch backend
|
||||
|
||||
import warnings
|
||||
|
||||
languages = ['ru', 'en', 'de', 'es']
|
||||
|
||||
|
||||
class IterativeMedianMeter():
|
||||
def __init__(self):
|
||||
self.reset()
|
||||
class OnnxWrapper():
|
||||
|
||||
def reset(self):
|
||||
self.median = 0
|
||||
self.counts = {}
|
||||
for i in range(0, 101, 1):
|
||||
self.counts[i / 100] = 0
|
||||
self.total_values = 0
|
||||
def __init__(self, path, force_onnx_cpu=False):
|
||||
import numpy as np
|
||||
global np
|
||||
import onnxruntime
|
||||
if force_onnx_cpu and 'CPUExecutionProvider' in onnxruntime.get_available_providers():
|
||||
self.session = onnxruntime.InferenceSession(path, providers=['CPUExecutionProvider'])
|
||||
else:
|
||||
self.session = onnxruntime.InferenceSession(path)
|
||||
self.session.intra_op_num_threads = 1
|
||||
self.session.inter_op_num_threads = 1
|
||||
|
||||
def __call__(self, val):
|
||||
self.total_values += 1
|
||||
rounded = round(abs(val), 2)
|
||||
self.counts[rounded] += 1
|
||||
bin_sum = 0
|
||||
for j in self.counts:
|
||||
bin_sum += self.counts[j]
|
||||
if bin_sum >= self.total_values / 2:
|
||||
self.median = j
|
||||
break
|
||||
return self.median
|
||||
self.reset_states()
|
||||
self.sample_rates = [8000, 16000]
|
||||
|
||||
def _validate_input(self, x, sr: int):
|
||||
if x.dim() == 1:
|
||||
x = x.unsqueeze(0)
|
||||
if x.dim() > 2:
|
||||
raise ValueError(f"Too many dimensions for input audio chunk {x.dim()}")
|
||||
|
||||
if sr != 16000 and (sr % 16000 == 0):
|
||||
step = sr // 16000
|
||||
x = x[::step]
|
||||
sr = 16000
|
||||
|
||||
if sr not in self.sample_rates:
|
||||
raise ValueError(f"Supported sampling rates: {self.sample_rates} (or multiply of 16000)")
|
||||
|
||||
if sr / x.shape[1] > 31.25:
|
||||
raise ValueError("Input audio chunk is too short")
|
||||
|
||||
return x, sr
|
||||
|
||||
def reset_states(self, batch_size=1):
|
||||
self._h = np.zeros((2, batch_size, 64)).astype('float32')
|
||||
self._c = np.zeros((2, batch_size, 64)).astype('float32')
|
||||
self._last_sr = 0
|
||||
self._last_batch_size = 0
|
||||
|
||||
def __call__(self, x, sr: int):
|
||||
|
||||
x, sr = self._validate_input(x, sr)
|
||||
batch_size = x.shape[0]
|
||||
|
||||
if not self._last_batch_size:
|
||||
self.reset_states(batch_size)
|
||||
if (self._last_sr) and (self._last_sr != sr):
|
||||
self.reset_states(batch_size)
|
||||
if (self._last_batch_size) and (self._last_batch_size != batch_size):
|
||||
self.reset_states(batch_size)
|
||||
|
||||
if sr in [8000, 16000]:
|
||||
ort_inputs = {'input': x.numpy(), 'h': self._h, 'c': self._c, 'sr': np.array(sr)}
|
||||
ort_outs = self.session.run(None, ort_inputs)
|
||||
out, self._h, self._c = ort_outs
|
||||
else:
|
||||
raise ValueError()
|
||||
|
||||
self._last_sr = sr
|
||||
self._last_batch_size = batch_size
|
||||
|
||||
out = torch.tensor(out)
|
||||
return out
|
||||
|
||||
def audio_forward(self, x, sr: int, num_samples: int = 512):
|
||||
outs = []
|
||||
x, sr = self._validate_input(x, sr)
|
||||
|
||||
if x.shape[1] % num_samples:
|
||||
pad_num = num_samples - (x.shape[1] % num_samples)
|
||||
x = torch.nn.functional.pad(x, (0, pad_num), 'constant', value=0.0)
|
||||
|
||||
self.reset_states(x.shape[0])
|
||||
for i in range(0, x.shape[1], num_samples):
|
||||
wavs_batch = x[:, i:i+num_samples]
|
||||
out_chunk = self.__call__(wavs_batch, sr)
|
||||
outs.append(out_chunk)
|
||||
|
||||
stacked = torch.cat(outs, dim=1)
|
||||
return stacked.cpu()
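A minimal sketch of driving the new OnnxWrapper directly, assuming a local copy of the exported VAD model (the file name 'silero_vad.onnx' here is illustrative, not taken from the diff):

```python
# 'silero_vad.onnx' is a hypothetical local path to the exported model
vad = OnnxWrapper('silero_vad.onnx', force_onnx_cpu=True)

wav = read_audio('en_example.wav', sampling_rate=16000)

# per-chunk speech probabilities over the whole file, 512 samples per chunk
probs = vad.audio_forward(wav, sr=16000, num_samples=512)
print(probs.shape)  # (1, number_of_chunks)
```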
|
||||
|
||||
|
||||
def validate(model,
|
||||
inputs: torch.Tensor):
|
||||
with torch.no_grad():
|
||||
outs = model(inputs)
|
||||
return outs
|
||||
class Validator():
|
||||
def __init__(self, url, force_onnx_cpu):
|
||||
self.onnx = True if url.endswith('.onnx') else False
|
||||
torch.hub.download_url_to_file(url, 'inf.model')
|
||||
if self.onnx:
|
||||
import onnxruntime
|
||||
if force_onnx_cpu and 'CPUExecutionProvider' in onnxruntime.get_available_providers():
|
||||
self.model = onnxruntime.InferenceSession('inf.model', providers=['CPUExecutionProvider'])
|
||||
else:
|
||||
self.model = onnxruntime.InferenceSession('inf.model')
|
||||
else:
|
||||
self.model = init_jit_model(model_path='inf.model')
|
||||
|
||||
def __call__(self, inputs: torch.Tensor):
|
||||
with torch.no_grad():
|
||||
if self.onnx:
|
||||
ort_inputs = {'input': inputs.cpu().numpy()}
|
||||
outs = self.model.run(None, ort_inputs)
|
||||
outs = [torch.Tensor(x) for x in outs]
|
||||
else:
|
||||
outs = self.model(inputs)
|
||||
|
||||
return outs
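The Validator wrapper downloads a model by URL and dispatches to ONNX or JIT automatically; a hedged usage sketch (the URL below is illustrative only):

```python
# hypothetical model URL; Validator saves the file locally as 'inf.model'
validator = Validator('https://models.silero.ai/vad_models/number_detector.onnx',
                      force_onnx_cpu=True)

wav = read_audio('en_number_example.wav', sampling_rate=16000)
outs = validator(wav.unsqueeze(0))  # same call style for ONNX and JIT models
```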
|
||||
|
||||
|
||||
def read_audio(path: str,
|
||||
target_sr: int = 16000):
|
||||
sampling_rate: int = 16000):
|
||||
|
||||
assert torchaudio.get_audio_backend() == 'soundfile'
|
||||
wav, sr = torchaudio.load(path)
|
||||
|
||||
if wav.size(0) > 1:
|
||||
wav = wav.mean(dim=0, keepdim=True)
|
||||
|
||||
if sr != target_sr:
|
||||
if sr != sampling_rate:
|
||||
transform = torchaudio.transforms.Resample(orig_freq=sr,
|
||||
new_freq=target_sr)
|
||||
new_freq=sampling_rate)
|
||||
wav = transform(wav)
|
||||
sr = target_sr
|
||||
sr = sampling_rate
|
||||
|
||||
assert sr == target_sr
|
||||
assert sr == sampling_rate
|
||||
return wav.squeeze(0)
|
||||
|
||||
|
||||
def save_audio(path: str,
|
||||
tensor: torch.Tensor,
|
||||
sr: int = 16000):
|
||||
torchaudio.save(path, tensor.unsqueeze(0), sr)
|
||||
sampling_rate: int = 16000):
|
||||
torchaudio.save(path, tensor.unsqueeze(0), sampling_rate)
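With the renamed sampling_rate arguments, read_audio / save_audio round-trip like this (file names are illustrative):

```python
wav = read_audio('en_example.wav', sampling_rate=16000)   # mono, resampled to 16 kHz if needed
save_audio('en_example_16k.wav', wav, sampling_rate=16000)
```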
|
||||
|
||||
|
||||
def init_jit_model(model_path: str,
|
||||
@@ -76,192 +148,129 @@ def init_jit_model(model_path: str,
|
||||
return model
|
||||
|
||||
|
||||
def get_speech_ts(wav: torch.Tensor,
|
||||
model,
|
||||
trig_sum: float = 0.25,
|
||||
neg_trig_sum: float = 0.07,
|
||||
num_steps: int = 8,
|
||||
batch_size: int = 200,
|
||||
num_samples_per_window: int = 4000,
|
||||
min_speech_samples: int = 10000, #samples
|
||||
min_silence_samples: int = 500,
|
||||
run_function=validate,
|
||||
visualize_probs=False,
|
||||
smoothed_prob_func='mean',
|
||||
device='cpu'):
|
||||
|
||||
assert smoothed_prob_func in ['mean', 'max'], 'smoothed_prob_func not in ["max", "mean"]'
|
||||
num_samples = num_samples_per_window
|
||||
assert num_samples % num_steps == 0
|
||||
step = int(num_samples / num_steps) # stride / hop
|
||||
outs = []
|
||||
to_concat = []
|
||||
for i in range(0, len(wav), step):
|
||||
chunk = wav[i: i+num_samples]
|
||||
if len(chunk) < num_samples:
|
||||
chunk = F.pad(chunk, (0, num_samples - len(chunk)))
|
||||
to_concat.append(chunk.unsqueeze(0))
|
||||
if len(to_concat) >= batch_size:
|
||||
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
|
||||
out = run_function(model, chunks)
|
||||
outs.append(out)
|
||||
to_concat = []
|
||||
|
||||
if to_concat:
|
||||
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
|
||||
out = run_function(model, chunks)
|
||||
outs.append(out)
|
||||
|
||||
outs = torch.cat(outs, dim=0)
|
||||
|
||||
buffer = deque(maxlen=num_steps) # maxlen reached => first element dropped
|
||||
triggered = False
|
||||
speeches = []
|
||||
current_speech = {}
|
||||
if visualize_probs:
|
||||
import pandas as pd
|
||||
smoothed_probs = []
|
||||
|
||||
speech_probs = outs[:, 1] # this is very misleading
|
||||
temp_end = 0
|
||||
for i, predict in enumerate(speech_probs): # add name
|
||||
buffer.append(predict)
|
||||
if smoothed_prob_func == 'mean':
|
||||
smoothed_prob = (sum(buffer) / len(buffer))
|
||||
elif smoothed_prob_func == 'max':
|
||||
smoothed_prob = max(buffer)
|
||||
|
||||
if visualize_probs:
|
||||
smoothed_probs.append(float(smoothed_prob))
|
||||
if (smoothed_prob >= trig_sum) and temp_end:
|
||||
temp_end=0
|
||||
if (smoothed_prob >= trig_sum) and not triggered:
|
||||
triggered = True
|
||||
current_speech['start'] = step * max(0, i-num_steps)
|
||||
continue
|
||||
if (smoothed_prob < neg_trig_sum) and triggered:
|
||||
if not temp_end:
|
||||
temp_end = step * i
|
||||
if step * i - temp_end < min_silence_samples:
|
||||
continue
|
||||
else:
|
||||
current_speech['end'] = temp_end
|
||||
if (current_speech['end'] - current_speech['start']) > min_speech_samples:
|
||||
speeches.append(current_speech)
|
||||
temp_end = 0
|
||||
current_speech = {}
|
||||
triggered = False
|
||||
continue
|
||||
if current_speech:
|
||||
current_speech['end'] = len(wav)
|
||||
speeches.append(current_speech)
|
||||
|
||||
if visualize_probs:
|
||||
pd.DataFrame({'probs':smoothed_probs}).plot(figsize=(16,8))
|
||||
return speeches
|
||||
def make_visualization(probs, step):
|
||||
import pandas as pd
|
||||
pd.DataFrame({'probs': probs},
|
||||
index=[x * step for x in range(len(probs))]).plot(figsize=(16, 8),
|
||||
kind='area', ylim=[0, 1.05], xlim=[0, len(probs) * step],
|
||||
xlabel='seconds',
|
||||
ylabel='speech probability',
|
||||
colormap='tab20')
|
||||
|
||||
|
||||
def get_speech_ts_adaptive(wav: torch.Tensor,
|
||||
model,
|
||||
batch_size: int = 200,
|
||||
step: int = 500,
|
||||
num_samples_per_window: int = 4000, # Number of samples per audio chunk to feed to NN (4000 for 16k SR, 2000 for 8k SR is optimal)
|
||||
min_speech_samples: int = 10000, # samples
|
||||
min_silence_samples: int = 4000,
|
||||
speech_pad_samples: int = 2000,
|
||||
run_function=validate,
|
||||
visualize_probs=False,
|
||||
device='cpu'):
|
||||
def get_speech_timestamps(audio: torch.Tensor,
|
||||
model,
|
||||
threshold: float = 0.5,
|
||||
sampling_rate: int = 16000,
|
||||
min_speech_duration_ms: int = 250,
|
||||
min_silence_duration_ms: int = 100,
|
||||
window_size_samples: int = 512,
|
||||
speech_pad_ms: int = 30,
|
||||
return_seconds: bool = False,
|
||||
visualize_probs: bool = False):
|
||||
|
||||
"""
|
||||
This function is used for splitting long audios into speech chunks using silero VAD
|
||||
Attention! All default sample rate values are optimal for the 16000 sample rate model; if you are using the 8000 sample rate model, the optimal values are half as large!
|
||||
This method is used for splitting long audios into speech chunks using silero VAD
|
||||
|
||||
Parameters
|
||||
----------
|
||||
batch_size: int
|
||||
batch size to feed to silero VAD (default - 200)
|
||||
audio: torch.Tensor, one dimensional
|
||||
One dimensional float torch.Tensor, other types are cast to torch.Tensor if possible
|
||||
|
||||
step: int
|
||||
step size in samples, (default - 500)
|
||||
model: preloaded .jit silero VAD model
|
||||
|
||||
num_samples_per_window: int
|
||||
window size in samples (chunk length in samples to feed to NN, default - 4000)
|
||||
threshold: float (default - 0.5)
|
||||
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
|
||||
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
|
||||
|
||||
min_speech_samples: int
|
||||
if speech duration is shorter than this value, do not consider it speech (default - 10000)
|
||||
sampling_rate: int (default - 16000)
|
||||
Currently silero VAD models support 8000 and 16000 sample rates
|
||||
|
||||
min_silence_samples: int
|
||||
number of samples to wait before considering as the end of speech (default - 4000)
|
||||
min_speech_duration_ms: int (default - 250 milliseconds)
|
||||
Final speech chunks shorter than min_speech_duration_ms are thrown out
|
||||
|
||||
speech_pad_samples: int
|
||||
widen speech by this amount of samples each side (default - 2000)
|
||||
min_silence_duration_ms: int (default - 100 milliseconds)
|
||||
At the end of each speech chunk, wait for min_silence_duration_ms before separating it
|
||||
|
||||
run_function: function
|
||||
function to use for the model call
|
||||
window_size_samples: int (default - 1536 samples)
|
||||
Audio chunks of window_size_samples size are fed to the silero VAD model.
|
||||
WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate.
|
||||
Values other than these may affect model performance!
|
||||
|
||||
visualize_probs: bool
|
||||
whether draw prob hist or not (default: False)
|
||||
speech_pad_ms: int (default - 30 milliseconds)
|
||||
Final speech chunks are padded by speech_pad_ms on each side
|
||||
|
||||
device: string
|
||||
torch device to use for the model call (default - "cpu")
|
||||
return_seconds: bool (default - False)
|
||||
whether to return timestamps in seconds (default - samples)
|
||||
|
||||
visualize_probs: bool (default - False)
|
||||
whether to draw the probability histogram or not
|
||||
|
||||
Returns
|
||||
----------
|
||||
speeches: list
|
||||
list containing ends and beginnings of speech chunks (in samples)
|
||||
speeches: list of dicts
|
||||
list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
|
||||
"""
|
||||
if visualize_probs:
|
||||
import pandas as pd
|
||||
|
||||
num_samples = num_samples_per_window
|
||||
num_steps = int(num_samples / step)
|
||||
assert min_silence_samples >= step
|
||||
outs = []
|
||||
to_concat = []
|
||||
for i in range(0, len(wav), step):
|
||||
chunk = wav[i: i+num_samples]
|
||||
if len(chunk) < num_samples:
|
||||
chunk = F.pad(chunk, (0, num_samples - len(chunk)))
|
||||
to_concat.append(chunk.unsqueeze(0))
|
||||
if len(to_concat) >= batch_size:
|
||||
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
|
||||
out = run_function(model, chunks)
|
||||
outs.append(out)
|
||||
to_concat = []
|
||||
if not torch.is_tensor(audio):
|
||||
try:
|
||||
audio = torch.Tensor(audio)
|
||||
except:
|
||||
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
|
||||
|
||||
if to_concat:
|
||||
chunks = torch.Tensor(torch.cat(to_concat, dim=0)).to(device)
|
||||
out = run_function(model, chunks)
|
||||
outs.append(out)
|
||||
if len(audio.shape) > 1:
|
||||
for i in range(len(audio.shape)): # trying to squeeze empty dimensions
|
||||
audio = audio.squeeze(0)
|
||||
if len(audio.shape) > 1:
|
||||
raise ValueError("More than one dimension in audio. Are you trying to process audio with 2 channels?")
|
||||
|
||||
outs = torch.cat(outs, dim=0).cpu()
|
||||
if sampling_rate > 16000 and (sampling_rate % 16000 == 0):
|
||||
step = sampling_rate // 16000
|
||||
sampling_rate = 16000
|
||||
audio = audio[::step]
|
||||
warnings.warn('Sampling rate is a multiple of 16000, casting to 16000 manually!')
|
||||
else:
|
||||
step = 1
|
||||
|
||||
if sampling_rate == 8000 and window_size_samples > 768:
|
||||
warnings.warn('window_size_samples is too big for 8000 sampling_rate! Better set window_size_samples to 256, 512 or 768 for 8000 sample rate!')
|
||||
if window_size_samples not in [256, 512, 768, 1024, 1536]:
|
||||
warnings.warn('Unusual window_size_samples! Supported window_size_samples:\n - [512, 1024, 1536] for 16000 sampling_rate\n - [256, 512, 768] for 8000 sampling_rate')
|
||||
|
||||
model.reset_states()
|
||||
min_speech_samples = sampling_rate * min_speech_duration_ms / 1000
|
||||
min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
|
||||
speech_pad_samples = sampling_rate * speech_pad_ms / 1000
|
||||
|
||||
audio_length_samples = len(audio)
|
||||
|
||||
speech_probs = []
|
||||
for current_start_sample in range(0, audio_length_samples, window_size_samples):
|
||||
chunk = audio[current_start_sample: current_start_sample + window_size_samples]
|
||||
if len(chunk) < window_size_samples:
|
||||
chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
|
||||
speech_prob = model(chunk, sampling_rate).item()
|
||||
speech_probs.append(speech_prob)
|
||||
|
||||
buffer = deque(maxlen=num_steps)
|
||||
triggered = False
|
||||
speeches = []
|
||||
smoothed_probs = []
|
||||
current_speech = {}
|
||||
speech_probs = outs[:, 1] # 0 index for silence probs, 1 index for speech probs
|
||||
median_probs = speech_probs.median()
|
||||
|
||||
trig_sum = 0.89 * median_probs + 0.08 # 0.08 when median is zero, 0.97 when median is 1
|
||||
|
||||
neg_threshold = threshold - 0.15
|
||||
temp_end = 0
|
||||
for i, predict in enumerate(speech_probs):
|
||||
buffer.append(predict)
|
||||
smoothed_prob = max(buffer)
|
||||
if visualize_probs:
|
||||
smoothed_probs.append(float(smoothed_prob))
|
||||
if (smoothed_prob >= trig_sum) and temp_end:
|
||||
|
||||
for i, speech_prob in enumerate(speech_probs):
|
||||
if (speech_prob >= threshold) and temp_end:
|
||||
temp_end = 0
|
||||
if (smoothed_prob >= trig_sum) and not triggered:
|
||||
|
||||
if (speech_prob >= threshold) and not triggered:
|
||||
triggered = True
|
||||
current_speech['start'] = step * max(0, i-num_steps)
|
||||
current_speech['start'] = window_size_samples * i
|
||||
continue
|
||||
if (smoothed_prob < trig_sum) and triggered:
|
||||
|
||||
if (speech_prob < neg_threshold) and triggered:
|
||||
if not temp_end:
|
||||
temp_end = step * i
|
||||
if step * i - temp_end < min_silence_samples:
|
||||
temp_end = window_size_samples * i
|
||||
if (window_size_samples * i) - temp_end < min_silence_samples:
|
||||
continue
|
||||
else:
|
||||
current_speech['end'] = temp_end
|
||||
@@ -271,24 +280,36 @@ def get_speech_ts_adaptive(wav: torch.Tensor,
|
||||
current_speech = {}
|
||||
triggered = False
|
||||
continue
|
||||
if current_speech:
|
||||
current_speech['end'] = len(wav)
|
||||
speeches.append(current_speech)
|
||||
if visualize_probs:
|
||||
pd.DataFrame({'probs': smoothed_probs}).plot(figsize=(16, 8))
|
||||
|
||||
for i, ts in enumerate(speeches):
|
||||
if current_speech and (audio_length_samples - current_speech['start']) > min_speech_samples:
|
||||
current_speech['end'] = audio_length_samples
|
||||
speeches.append(current_speech)
|
||||
|
||||
for i, speech in enumerate(speeches):
|
||||
if i == 0:
|
||||
ts['start'] = max(0, ts['start'] - speech_pad_samples)
|
||||
speech['start'] = int(max(0, speech['start'] - speech_pad_samples))
|
||||
if i != len(speeches) - 1:
|
||||
silence_duration = speeches[i+1]['start'] - ts['end']
|
||||
silence_duration = speeches[i+1]['start'] - speech['end']
|
||||
if silence_duration < 2 * speech_pad_samples:
|
||||
ts['end'] += silence_duration // 2
|
||||
speeches[i+1]['start'] = max(0, speeches[i+1]['start'] - silence_duration // 2)
|
||||
speech['end'] += int(silence_duration // 2)
|
||||
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - silence_duration // 2))
|
||||
else:
|
||||
ts['end'] += speech_pad_samples
|
||||
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
|
||||
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - speech_pad_samples))
|
||||
else:
|
||||
ts['end'] = min(len(wav), ts['end'] + speech_pad_samples)
|
||||
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
|
||||
|
||||
if return_seconds:
|
||||
for speech_dict in speeches:
|
||||
speech_dict['start'] = round(speech_dict['start'] / sampling_rate, 1)
|
||||
speech_dict['end'] = round(speech_dict['end'] / sampling_rate, 1)
|
||||
elif step > 1:
|
||||
for speech_dict in speeches:
|
||||
speech_dict['start'] *= step
|
||||
speech_dict['end'] *= step
|
||||
|
||||
if visualize_probs:
|
||||
make_visualization(speech_probs, window_size_samples / sampling_rate)
|
||||
|
||||
return speeches
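Putting the documented parameters together, a hedged tuning sketch for the new get_speech_timestamps (values mirror the documented defaults; window sizes follow the sizes the model was trained with):

```python
speech_timestamps = get_speech_timestamps(
    wav, model,
    threshold=0.5,                # probabilities above this are treated as speech
    sampling_rate=16000,
    min_speech_duration_ms=250,   # final chunks shorter than this are thrown out
    min_silence_duration_ms=100,  # silence to wait before closing a chunk
    window_size_samples=512,      # 512 / 1024 / 1536 for 16 kHz, 256 / 512 / 768 for 8 kHz
    speech_pad_ms=30,             # padding added to each side of a chunk
    return_seconds=True,          # report seconds instead of samples
)
```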
|
||||
|
||||
@@ -297,10 +318,9 @@ def get_number_ts(wav: torch.Tensor,
|
||||
model,
|
||||
model_stride=8,
|
||||
hop_length=160,
|
||||
sample_rate=16000,
|
||||
run_function=validate):
|
||||
sample_rate=16000):
|
||||
wav = torch.unsqueeze(wav, dim=0)
|
||||
perframe_logits = run_function(model, wav)[0]
|
||||
perframe_logits = model(wav)[0]
|
||||
perframe_preds = torch.argmax(torch.softmax(perframe_logits, dim=1), dim=1).squeeze() # (1, num_frames_strided)
|
||||
extended_preds = []
|
||||
for i in perframe_preds:
|
||||
@@ -327,10 +347,9 @@ def get_number_ts(wav: torch.Tensor,
|
||||
|
||||
|
||||
def get_language(wav: torch.Tensor,
|
||||
model,
|
||||
run_function=validate):
|
||||
model):
|
||||
wav = torch.unsqueeze(wav, dim=0)
|
||||
lang_logits = run_function(model, wav)[2]
|
||||
lang_logits = model(wav)[2]
|
||||
lang_pred = torch.argmax(torch.softmax(lang_logits, dim=1), dim=1).item() # from 0 to len(languages) - 1
|
||||
assert lang_pred < len(languages)
|
||||
return languages[lang_pred]
|
||||
@@ -340,17 +359,16 @@ def get_language_and_group(wav: torch.Tensor,
|
||||
model,
|
||||
lang_dict: dict,
|
||||
lang_group_dict: dict,
|
||||
top_n=1,
|
||||
run_function=validate):
|
||||
top_n=1):
|
||||
wav = torch.unsqueeze(wav, dim=0)
|
||||
lang_logits, lang_group_logits = run_function(model, wav)
|
||||
|
||||
lang_logits, lang_group_logits = model(wav)
|
||||
|
||||
softm = torch.softmax(lang_logits, dim=1).squeeze()
|
||||
softm_group = torch.softmax(lang_group_logits, dim=1).squeeze()
|
||||
|
||||
|
||||
srtd = torch.argsort(softm, descending=True)
|
||||
srtd_group = torch.argsort(softm_group, descending=True)
|
||||
|
||||
|
||||
outs = []
|
||||
outs_group = []
|
||||
for i in range(top_n):
|
||||
@@ -362,256 +380,94 @@ def get_language_and_group(wav: torch.Tensor,
|
||||
return outs, outs_group
|
||||
|
||||
|
||||
class VADiterator:
|
||||
class VADIterator:
|
||||
def __init__(self,
|
||||
trig_sum: float = 0.26,
|
||||
neg_trig_sum: float = 0.07,
|
||||
num_steps: int = 8,
|
||||
num_samples_per_window: int = 4000):
|
||||
self.num_samples = num_samples_per_window
|
||||
self.num_steps = num_steps
|
||||
assert self.num_samples % num_steps == 0
|
||||
self.step = int(self.num_samples / num_steps) # 500 samples is good enough
|
||||
self.prev = torch.zeros(self.num_samples)
|
||||
self.last = False
|
||||
self.triggered = False
|
||||
self.buffer = deque(maxlen=num_steps)
|
||||
self.num_frames = 0
|
||||
self.trig_sum = trig_sum
|
||||
self.neg_trig_sum = neg_trig_sum
|
||||
self.current_name = ''
|
||||
model,
|
||||
threshold: float = 0.5,
|
||||
sampling_rate: int = 16000,
|
||||
min_silence_duration_ms: int = 100,
|
||||
speech_pad_ms: int = 30
|
||||
):
|
||||
|
||||
def refresh(self):
|
||||
self.prev = torch.zeros(self.num_samples)
|
||||
self.last = False
|
||||
self.triggered = False
|
||||
self.buffer = deque(maxlen=self.num_steps)
|
||||
self.num_frames = 0
|
||||
|
||||
def prepare_batch(self, wav_chunk, name=None):
|
||||
if (name is not None) and (name != self.current_name):
|
||||
self.refresh()
|
||||
self.current_name = name
|
||||
assert len(wav_chunk) <= self.num_samples
|
||||
self.num_frames += len(wav_chunk)
|
||||
if len(wav_chunk) < self.num_samples:
|
||||
wav_chunk = F.pad(wav_chunk, (0, self.num_samples - len(wav_chunk))) # short chunk => eof audio
|
||||
self.last = True
|
||||
|
||||
stacked = torch.cat([self.prev, wav_chunk])
|
||||
self.prev = wav_chunk
|
||||
|
||||
overlap_chunks = [stacked[i:i+self.num_samples].unsqueeze(0)
|
||||
for i in range(self.step, self.num_samples+1, self.step)]
|
||||
return torch.cat(overlap_chunks, dim=0)
|
||||
|
||||
def state(self, model_out):
|
||||
current_speech = {}
|
||||
speech_probs = model_out[:, 1] # this is very misleading
|
||||
for i, predict in enumerate(speech_probs):
|
||||
self.buffer.append(predict)
|
||||
if ((sum(self.buffer) / len(self.buffer)) >= self.trig_sum) and not self.triggered:
|
||||
self.triggered = True
|
||||
current_speech[self.num_frames - (self.num_steps-i) * self.step] = 'start'
|
||||
if ((sum(self.buffer) / len(self.buffer)) < self.neg_trig_sum) and self.triggered:
|
||||
current_speech[self.num_frames - (self.num_steps-i) * self.step] = 'end'
|
||||
self.triggered = False
|
||||
if self.triggered and self.last:
|
||||
current_speech[self.num_frames] = 'end'
|
||||
if self.last:
|
||||
self.refresh()
|
||||
return current_speech, self.current_name
|
||||
|
||||
|
||||
class VADiteratorAdaptive:
|
||||
def __init__(self,
|
||||
trig_sum: float = 0.26,
|
||||
neg_trig_sum: float = 0.06,
|
||||
step: int = 500,
|
||||
num_samples_per_window: int = 4000,
|
||||
speech_pad_samples: int = 1000,
|
||||
accum_period: int = 50):
|
||||
"""
|
||||
This class is used for streaming silero VAD usage
|
||||
Class for stream imitation
|
||||
|
||||
Parameters
|
||||
----------
|
||||
trig_sum: float
|
||||
trigger value for speech probability, probs above this value are considered speech, switch to TRIGGERED state (default - 0.26)
|
||||
model: preloaded .jit silero VAD model
|
||||
|
||||
neg_trig_sum: float
|
||||
in triggered state, probabilities below this value are considered non-speech, switch to NONTRIGGERED state (default - 0.06)
|
||||
threshold: float (default - 0.5)
|
||||
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
|
||||
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
|
||||
|
||||
step: int
|
||||
step size in samples, (default - 500)
|
||||
sampling_rate: int (default - 16000)
|
||||
Currently silero VAD models support 8000 and 16000 sample rates
|
||||
|
||||
num_samples_per_window: int
|
||||
window size in samples (chunk length in samples to feed to NN, default - 4000)
|
||||
|
||||
speech_pad_samples: int
|
||||
widen speech by this amount of samples each side (default - 1000)
|
||||
|
||||
accum_period: int
|
||||
number of chunks / iterations to wait before switching from constant (initial) trig and neg_trig coeffs to adaptive median coeffs (default - 50)
|
||||
min_silence_duration_ms: int (default - 100 milliseconds)
|
||||
At the end of each speech chunk, wait for min_silence_duration_ms before separating it
|
||||
|
||||
speech_pad_ms: int (default - 30 milliseconds)
|
||||
Final speech chunks are padded by speech_pad_ms on each side
|
||||
"""
|
||||
self.num_samples = num_samples_per_window
|
||||
self.num_steps = int(num_samples_per_window / step)
|
||||
self.step = step
|
||||
self.prev = torch.zeros(self.num_samples)
|
||||
self.last = False
|
||||
|
||||
self.model = model
|
||||
self.threshold = threshold
|
||||
self.sampling_rate = sampling_rate
|
||||
|
||||
if sampling_rate not in [8000, 16000]:
|
||||
raise ValueError('VADIterator does not support sampling rates other than [8000, 16000]')
|
||||
|
||||
self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
|
||||
self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
|
||||
self.reset_states()
|
||||
|
||||
def reset_states(self):
|
||||
|
||||
self.model.reset_states()
|
||||
self.triggered = False
|
||||
self.buffer = deque(maxlen=self.num_steps)
|
||||
self.num_frames = 0
|
||||
self.trig_sum = trig_sum
|
||||
self.neg_trig_sum = neg_trig_sum
|
||||
self.current_name = ''
|
||||
self.median_meter = IterativeMedianMeter()
|
||||
self.median = 0
|
||||
self.total_steps = 0
|
||||
self.accum_period = accum_period
|
||||
self.speech_pad_samples = speech_pad_samples
|
||||
self.temp_end = 0
|
||||
self.current_sample = 0
|
||||
|
||||
def refresh(self):
|
||||
self.prev = torch.zeros(self.num_samples)
|
||||
self.last = False
|
||||
self.triggered = False
|
||||
self.buffer = deque(maxlen=self.num_steps)
|
||||
self.num_frames = 0
|
||||
self.median_meter.reset()
|
||||
self.median = 0
|
||||
self.total_steps = 0
|
||||
def __call__(self, x, return_seconds=False):
|
||||
"""
|
||||
x: torch.Tensor
|
||||
audio chunk (see examples in repo)
|
||||
|
||||
def prepare_batch(self, wav_chunk, name=None):
|
||||
if (name is not None) and (name != self.current_name):
|
||||
self.refresh()
|
||||
self.current_name = name
|
||||
assert len(wav_chunk) <= self.num_samples
|
||||
self.num_frames += len(wav_chunk)
|
||||
if len(wav_chunk) < self.num_samples:
|
||||
wav_chunk = F.pad(wav_chunk, (0, self.num_samples - len(wav_chunk))) # short chunk => eof audio
|
||||
self.last = True
|
||||
return_seconds: bool (default - False)
|
||||
whether return timestamps in seconds (default - samples)
|
||||
"""
|
||||
|
||||
stacked = torch.cat([self.prev, wav_chunk])
|
||||
self.prev = wav_chunk
|
||||
|
||||
overlap_chunks = [stacked[i:i+self.num_samples].unsqueeze(0)
|
||||
for i in range(self.step, self.num_samples+1, self.step)]
|
||||
return torch.cat(overlap_chunks, dim=0)
|
||||
|
||||
def state(self, model_out):
|
||||
current_speech = {}
|
||||
speech_probs = model_out[:, 1] # 0 index for silence probs, 1 index for speech probs
|
||||
for i, predict in enumerate(speech_probs):
|
||||
self.median = self.median_meter(predict.item())
|
||||
if self.total_steps < self.accum_period:
|
||||
trig_sum = self.trig_sum
|
||||
neg_trig_sum = self.neg_trig_sum
|
||||
else:
|
||||
trig_sum = 0.89 * self.median + 0.08 # 0.08 when median is zero, 0.97 when median is 1
|
||||
neg_trig_sum = 0.6 * self.median
|
||||
self.total_steps += 1
|
||||
self.buffer.append(predict)
|
||||
smoothed_prob = max(self.buffer)
|
||||
if (smoothed_prob >= trig_sum) and not self.triggered:
|
||||
self.triggered = True
|
||||
current_speech[max(0, self.num_frames - (self.num_steps-i) * self.step - self.speech_pad_samples)] = 'start'
|
||||
if (smoothed_prob < neg_trig_sum) and self.triggered:
|
||||
current_speech[self.num_frames - (self.num_steps-i) * self.step + self.speech_pad_samples] = 'end'
|
||||
self.triggered = False
|
||||
if self.triggered and self.last:
|
||||
current_speech[self.num_frames] = 'end'
|
||||
if self.last:
|
||||
self.refresh()
|
||||
return current_speech, self.current_name
|
||||
|
||||
|
||||
def state_generator(model,
|
||||
audios: List[str],
|
||||
onnx: bool = False,
|
||||
trig_sum: float = 0.26,
|
||||
neg_trig_sum: float = 0.07,
|
||||
num_steps: int = 8,
|
||||
num_samples_per_window: int = 4000,
|
||||
audios_in_stream: int = 2,
|
||||
run_function=validate):
|
||||
VADiters = [VADiterator(trig_sum, neg_trig_sum, num_steps, num_samples_per_window) for i in range(audios_in_stream)]
|
||||
for i, current_pieces in enumerate(stream_imitator(audios, audios_in_stream, num_samples_per_window)):
|
||||
for_batch = [x.prepare_batch(*y) for x, y in zip(VADiters, current_pieces)]
|
||||
batch = torch.cat(for_batch)
|
||||
|
||||
outs = run_function(model, batch)
|
||||
vad_outs = torch.split(outs, num_steps)
|
||||
|
||||
states = []
|
||||
for x, y in zip(VADiters, vad_outs):
|
||||
cur_st = x.state(y)
|
||||
if cur_st[0]:
|
||||
states.append(cur_st)
|
||||
yield states
|
||||
|
||||
|
||||
def stream_imitator(audios: List[str],
|
||||
audios_in_stream: int,
|
||||
num_samples_per_window: int = 4000):
|
||||
audio_iter = iter(audios)
|
||||
iterators = []
|
||||
num_samples = num_samples_per_window
|
||||
# initial wavs
|
||||
for i in range(audios_in_stream):
|
||||
next_wav = next(audio_iter)
|
||||
wav = read_audio(next_wav)
|
||||
wav_chunks = iter([(wav[i:i+num_samples], next_wav) for i in range(0, len(wav), num_samples)])
|
||||
iterators.append(wav_chunks)
|
||||
print('Done initial Loading')
|
||||
good_iters = audios_in_stream
|
||||
while True:
|
||||
values = []
|
||||
for i, it in enumerate(iterators):
|
||||
if not torch.is_tensor(x):
|
||||
try:
|
||||
out, wav_name = next(it)
|
||||
except StopIteration:
|
||||
try:
|
||||
next_wav = next(audio_iter)
|
||||
print('Loading next wav: ', next_wav)
|
||||
wav = read_audio(next_wav)
|
||||
iterators[i] = iter([(wav[i:i+num_samples], next_wav) for i in range(0, len(wav), num_samples)])
|
||||
out, wav_name = next(iterators[i])
|
||||
except StopIteration:
|
||||
good_iters -= 1
|
||||
iterators[i] = repeat((torch.zeros(num_samples), 'junk'))
|
||||
out, wav_name = next(iterators[i])
|
||||
if good_iters == 0:
|
||||
return
|
||||
values.append((out, wav_name))
|
||||
yield values
|
||||
x = torch.Tensor(x)
|
||||
except:
|
||||
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
|
||||
|
||||
window_size_samples = len(x[0]) if x.dim() == 2 else len(x)
|
||||
self.current_sample += window_size_samples
|
||||
|
||||
def single_audio_stream(model,
|
||||
audio: torch.Tensor,
|
||||
num_samples_per_window:int = 4000,
|
||||
run_function=validate,
|
||||
iterator_type='basic',
|
||||
**kwargs):
|
||||
|
||||
num_samples = num_samples_per_window
|
||||
if iterator_type == 'basic':
|
||||
VADiter = VADiterator(num_samples_per_window=num_samples_per_window, **kwargs)
|
||||
elif iterator_type == 'adaptive':
|
||||
VADiter = VADiteratorAdaptive(num_samples_per_window=num_samples_per_window, **kwargs)
|
||||
|
||||
wav = read_audio(audio)
|
||||
wav_chunks = iter([wav[i:i+num_samples] for i in range(0, len(wav), num_samples)])
|
||||
for chunk in wav_chunks:
|
||||
batch = VADiter.prepare_batch(chunk)
|
||||
speech_prob = self.model(x, self.sampling_rate).item()
|
||||
|
||||
outs = run_function(model, batch)
|
||||
if (speech_prob >= self.threshold) and self.temp_end:
|
||||
self.temp_end = 0
|
||||
|
||||
states = []
|
||||
state = VADiter.state(outs)
|
||||
if state[0]:
|
||||
states.append(state[0])
|
||||
yield states
|
||||
if (speech_prob >= self.threshold) and not self.triggered:
|
||||
self.triggered = True
|
||||
speech_start = self.current_sample - self.speech_pad_samples
|
||||
return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, 1)}
|
||||
|
||||
if (speech_prob < self.threshold - 0.15) and self.triggered:
|
||||
if not self.temp_end:
|
||||
self.temp_end = self.current_sample
|
||||
if self.current_sample - self.temp_end < self.min_silence_samples:
|
||||
return None
|
||||
else:
|
||||
speech_end = self.temp_end + self.speech_pad_samples
|
||||
self.temp_end = 0
|
||||
self.triggered = False
|
||||
return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sampling_rate, 1)}
|
||||
|
||||
return None
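A hedged sketch of consuming VADIterator output outside the notebook: each call returns None, a {'start': ...} dict, or an {'end': ...} dict, so start and end events arrive separately:

```python
vad_iterator = VADIterator(model, threshold=0.5, sampling_rate=16000)

events = []
for i in range(0, len(wav), 512):
    chunk = wav[i: i + 512]
    if len(chunk) < 512:
        break
    event = vad_iterator(chunk, return_seconds=True)
    if event:                    # {'start': seconds} or {'end': seconds}
        events.append(event)
vad_iterator.reset_states()      # reuse the iterator for the next audio

print(events)
```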
|
||||
|
||||
|
||||
def collect_chunks(tss: List[dict],
|
||||
|
||||
@@ -1,56 +0,0 @@
|
||||
from utils_vad import *
|
||||
import sys
|
||||
import os
|
||||
from pathlib import Path
|
||||
sys.path.append('/home/keras/notebook/nvme_raid/adamnsandle/silero_mono/pipelines/align/bin/')
|
||||
from align_utils import load_audio_norm
|
||||
import torch
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
sys.path.append('/home/keras/notebook/nvme_raid/adamnsandle/silero_mono/utils/')
|
||||
from open_stt import soundfile_opus as sf
|
||||
|
||||
def split_save_audio_chunks(audio_path, model_path, save_path=None, device='cpu', absolute=True, max_duration=10, adaptive=False, **kwargs):
|
||||
|
||||
if not save_path:
|
||||
save_path = str(Path(audio_path).with_name('after_vad'))
|
||||
print(f'No save path specified! Using {save_path} to save audio chunks!')
|
||||
|
||||
SAMPLE_RATE = 16000
|
||||
if type(model_path) == str:
|
||||
#print('Loading model...')
|
||||
model = init_jit_model(model_path, device)
|
||||
else:
|
||||
#print('Using loaded model')
|
||||
model = model_path
|
||||
save_name = Path(audio_path).stem
|
||||
audio, sr = load_audio_norm(audio_path)
|
||||
wav = torch.tensor(audio)
|
||||
if adaptive:
|
||||
speech_timestamps = get_speech_ts_adaptive(wav, model, device=device, **kwargs)
|
||||
else:
|
||||
speech_timestamps = get_speech_ts(wav, model, device=device, **kwargs)
|
||||
|
||||
full_save_path = Path(save_path, save_name)
|
||||
if not os.path.exists(full_save_path):
|
||||
os.makedirs(full_save_path, exist_ok=True)
|
||||
|
||||
chunks = []
|
||||
if not speech_timestamps:
|
||||
return pd.DataFrame()
|
||||
for ts in speech_timestamps:
|
||||
start_ts = int(ts['start'])
|
||||
end_ts = int(ts['end'])
|
||||
|
||||
for i in range(start_ts, end_ts, max_duration * SAMPLE_RATE):
|
||||
new_start = i
|
||||
new_end = min(end_ts, i + max_duration * SAMPLE_RATE)
|
||||
duration = round((new_end - new_start) / SAMPLE_RATE, 2)
|
||||
chunk_path = Path(full_save_path, f'{save_name}_{new_start}-{new_end}.opus')
|
||||
chunk_path = chunk_path.absolute() if absolute else chunk_path
|
||||
sf.write(str(chunk_path), audio[new_start: new_end], 16000, format='OGG', subtype='OPUS')
|
||||
chunks.append({'audio_path': chunk_path,
|
||||
'text': '',
|
||||
'duration': duration,
|
||||
'domain': ''})
|
||||
return pd.DataFrame(chunks)
|
||||