mirror of
https://github.com/snakers4/silero-vad.git
synced 2026-02-04 17:39:22 +08:00
@@ -351,7 +351,7 @@ We use random 250 ms audio chunks for validation. Speech to non-speech ratio amo

Since our VAD (only VAD, other networks are more flexible) was trained on chunks of the same length, model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for precision-recall curve. This can be extended to 100 - 150 ms. Less than 100 - 150 ms cannot be distinguished as speech with confidence.

[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, each frame has corresponding number (0 **or** 1). We use 30ms frames for webrtc, so each 250 ms chunk is split into 8 frames, their **mean** value is used as a treshold for plot.
[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, each frame has corresponding number (0 **or** 1). We use 30ms frames for webrtc, so each 250 ms chunk is split into 8 frames, their **mean** value is used as a threshold for plot.

[Auditok](https://github.com/amsehili/auditok) - logic same as Webrtc, but we use 50ms frames.
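Note on the hunk above: the frame-averaging it describes for Webrtc and Auditok can be sketched as follows. This is a minimal illustration, assuming the per-frame 0/1 decisions are already available as a plain sequence. `frames_per_chunk` and `chunk_score` are hypothetical helper names and are not part of silero-vad, py-webrtcvad or auditok.

```python
from typing import Sequence


def frames_per_chunk(chunk_ms: int, frame_ms: int) -> int:
    """Number of whole frames that fit into one validation chunk."""
    return chunk_ms // frame_ms


def chunk_score(frame_decisions: Sequence[int]) -> float:
    """Mean of binary per-frame decisions, used as the chunk-level score."""
    if not frame_decisions:
        raise ValueError("need at least one frame decision")
    return sum(frame_decisions) / len(frame_decisions)


if __name__ == "__main__":
    # 30 ms frames in a 250 ms chunk, as in the README text
    print(frames_per_chunk(250, 30))
    # illustrative decisions for one chunk (made-up values)
    print(chunk_score([1, 1, 0, 1, 0, 0, 1, 1]))
```

Sweeping a cut-off over this mean is what lets a binary per-frame detector be placed on the same precision-recall plot as a model that outputs a probability.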
@@ -363,7 +363,7 @@ Since our VAD (only VAD, other networks are more flexible) was trained on chunks

#### **Classic way**

**This is straightforward classic method `get_speech_ts` where tresholds (`trig_sum` and `neg_trig_sum`) are specified by users**
**This is straightforward classic method `get_speech_ts` where thresholds (`trig_sum` and `neg_trig_sum`) are specified by users**
- Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with VAD;
- We provide sensible basic hyper-parameters that work for us, but your case can be different;
- `trig_sum` - overlapping windows are used for each audio chunk, trig sum defines average probability among those windows for switching into triggered state (speech state);
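Note on the hunk above: the two-threshold switching it describes can be sketched as a small state machine. This is not the code of `get_speech_ts`. It assumes the averaged window probabilities are already computed, and it assumes `neg_trig_sum` plays the mirror role of `trig_sum` for leaving the speech state, which this hunk does not show. `segment_speech` is a hypothetical name and the threshold values in the example are made up.

```python
from typing import List, Sequence, Tuple


def segment_speech(mean_probs: Sequence[float],
                   trig_sum: float,
                   neg_trig_sum: float) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of spans spent in the triggered state.

    Enter the speech state when the averaged probability reaches trig_sum,
    leave it when the averaged probability drops below neg_trig_sum.
    """
    segments = []
    triggered = False
    start = 0
    for i, p in enumerate(mean_probs):
        if not triggered and p >= trig_sum:
            triggered = True
            start = i
        elif triggered and p < neg_trig_sum:
            triggered = False
            segments.append((start, i))
    if triggered:
        # audio ended while still in the speech state
        segments.append((start, len(mean_probs)))
    return segments


if __name__ == "__main__":
    probs = [0.1, 0.6, 0.7, 0.3, 0.05, 0.02, 0.8]
    print(segment_speech(probs, trig_sum=0.5, neg_trig_sum=0.1))
```

Using a lower exit threshold than entry threshold keeps a segment open through short dips in probability instead of splitting it.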
@@ -384,7 +384,7 @@ speech_timestamps = get_speech_ts(wav, model,

#### **Adaptive way**

**Adaptive algorythm (`get_speech_ts_adaptive`) automatically selects tresholds (`trig_sum` and `neg_trig_sum`) based on median speech probabilities over whole audio, SOME ARGUMENTS VARY FROM CLASSIC WAY FUNCTION ARGUMENTS**
**Adaptive algorithm (`get_speech_ts_adaptive`) automatically selects thresholds (`trig_sum` and `neg_trig_sum`) based on median speech probabilities over the whole audio, SOME ARGUMENTS VARY FROM THE CLASSIC WAY FUNCTION ARGUMENTS**
- `batch_size` - batch size to feed to silero VAD (default - `200`)
- `step` - step size in samples, (default - `500`) (`num_samples_per_window` / `num_steps` from classic method)
- `num_samples_per_window` - number of samples in each window, our models were trained using `4000` samples (250 ms) per window, so this is preferable value (lesser values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
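Note on the hunk above: the window arithmetic behind these defaults can be checked with a few lines. The 16 kHz sample rate is inferred from the stated pairing of `4000` samples with 250 ms, and `num_steps = 8` is inferred from the stated defaults (`4000` / `500`); neither value is quoted in this hunk. `window_ms` and `step_size` are hypothetical helper names.

```python
SAMPLE_RATE = 16_000  # inferred: 4000 samples per 250 ms window


def window_ms(num_samples_per_window: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Window duration in milliseconds."""
    return 1000 * num_samples_per_window / sample_rate


def step_size(num_samples_per_window: int, num_steps: int) -> int:
    """Step in samples between overlapping windows."""
    return num_samples_per_window // num_steps


if __name__ == "__main__":
    print(window_ms(4000))     # duration of the preferred window
    print(step_size(4000, 8))  # step implied by 8 overlapping windows
```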
@@ -425,7 +425,7 @@ Please see [Quality Metrics](#quality-metrics)
### How Number Detector Works

- It is recommended to split long audio into short ones (< 15s) and apply model on each of them;
- Number Detector can classify if whole audio contains a number, or if each audio frame contains a number;
- Number Detector can classify if the whole audio contains a number, or if each audio frame contains a number;
- Audio is splitted into frames in a certain way, so, having a per-frame output, we can restore timing bounds for a numbers with an accuracy of about 0.2s;

### How Language Classifier Works
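Note on the hunk above: the recommendation to split long audio before running the Number Detector can be sketched as plain fixed-length chunking. This is an assumption about how one might do it, not the project's own utility; `split_audio` is a hypothetical name, and the caller is expected to pick `max_seconds` below 15.

```python
from typing import List, Sequence


def split_audio(samples: Sequence[float],
                sample_rate: int,
                max_seconds: float) -> List[Sequence[float]]:
    """Cut a 1-D sample sequence into consecutive pieces of at most max_seconds."""
    max_len = int(sample_rate * max_seconds)
    if max_len <= 0:
        raise ValueError("max_seconds too small for this sample rate")
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]


if __name__ == "__main__":
    # toy example: 10 samples at 2 Hz, pieces of at most 2 seconds
    pieces = split_audio(list(range(10)), sample_rate=2, max_seconds=2.0)
    print([len(p) for p in pieces])
```

Cutting at fixed offsets can split a spoken number across two pieces; cutting at pauses found by the VAD would avoid that, at the cost of more code.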
140
silero-vad.ipynb
@@ -3,7 +3,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "sVNOuHQQjsrp"
},
"source": [
@@ -13,8 +12,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "FpMplOCA2Fwp"
},
"source": [
@@ -25,7 +22,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "62A6F_072Fwq"
},
"source": [
@@ -36,10 +32,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:35:43.397137Z",
"start_time": "2020-12-30T17:33:10.962078Z"
},
"hidden": true,
"id": "5w5AkskZ2Fwr"
},
@@ -75,8 +67,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "fXbbaUO3jsrw"
},
"source": [
@@ -86,22 +76,16 @@
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "dY2Us3_Q2Fws"
},
"source": [
"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:35:44.362860Z",
"start_time": "2020-12-30T17:35:43.398441Z"
},
"hidden": true,
"id": "aI_eydBPjsrx"
},
"outputs": [],
@@ -117,11 +101,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:35:44.419280Z",
"start_time": "2020-12-30T17:35:44.364175Z"
},
"hidden": true,
"id": "OuEobLchjsry"
},
"outputs": [],
@@ -135,18 +114,16 @@
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "n8plzbJU2Fws"
},
"source": [
"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "SQOtu2Vl2Fwt"
},
"outputs": [],
@@ -161,7 +138,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "Lr6zCGXh2Fwt"
},
"outputs": [],
@@ -175,8 +151,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "iDKQbVr8jsry"
},
"source": [
@@ -186,26 +160,16 @@
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:29:04.224833Z",
"start_time": "2021-04-15T13:29:04.220588Z"
},
"hidden": true,
"id": "xCM-HrUR2Fwu"
},
"source": [
"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:59.199321Z",
"start_time": "2020-12-15T13:09:59.196823Z"
},
"hidden": true,
"id": "q-lql_2Wjsry"
},
"outputs": [],
@@ -220,18 +184,16 @@
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "t8TXtnvk2Fwv"
},
"source": [
"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "BX3UgwwB2Fwv"
},
"outputs": [],
@@ -247,7 +209,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "KBDVybJCjsrz"
},
"source": [
@@ -258,10 +219,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:10:03.590358Z",
"start_time": "2020-12-15T13:10:03.587071Z"
},
"hidden": true,
"id": "BK4tGfWgjsrz"
},
@@ -275,10 +232,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:10:15.762491Z",
"start_time": "2020-12-15T13:10:03.591388Z"
},
"hidden": true,
"id": "v1l8sam1jsrz"
},
@@ -293,7 +246,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "36jY0niD2Fww"
},
"source": [
@@ -421,7 +373,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "PnKtJKbq2Fwz"
},
"source": [
@@ -498,7 +449,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "57avIBd6jsrz"
},
"source": [
@@ -508,8 +458,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "hEhnfORV2Fw0"
},
"source": [
@@ -520,7 +468,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "bL4kn4KJrlyL"
},
"source": [
@@ -531,10 +478,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:30:22.938755Z",
"start_time": "2021-04-15T13:30:20.970574Z"
},
"cellView": "form",
"hidden": true,
"id": "Q4QIfSpprnkI"
@@ -580,8 +523,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "5JHErdB7jsr0"
},
"source": [
@@ -591,26 +532,16 @@
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:34:22.554010Z",
"start_time": "2021-04-15T13:34:22.550308Z"
},
"hidden": true,
"id": "TNEtK5zi2Fw2"
},
"source": [
"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:30:14.475412Z",
"start_time": "2021-04-15T13:30:14.427933Z"
},
"hidden": true,
"id": "krnGoA6Kjsr0"
},
"outputs": [],
@@ -627,11 +558,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:08.862421Z",
"start_time": "2020-12-15T13:09:08.820014Z"
},
"hidden": true,
"id": "B176Lzfnjsr1"
},
"outputs": [],
@@ -644,18 +570,16 @@
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "21RE8KEC2Fw2"
},
"source": [
"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "uIVs56rb2Fw2"
},
"outputs": [],
@@ -672,11 +596,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:34:41.375446Z",
"start_time": "2021-04-15T13:34:41.368055Z"
},
"hidden": true,
"id": "cox6oumC2Fw3"
},
"outputs": [],
@@ -689,8 +608,6 @@
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "Rio9W50gjsr1"
},
"source": [
@@ -700,22 +617,16 @@
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "i8EZwtaA2Fw3"
},
"source": [
"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:09.606031Z",
"start_time": "2020-12-15T13:09:09.504239Z"
},
"hidden": true,
"id": "IPkl8Yy1jsr1"
},
"outputs": [],
@@ -728,11 +639,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:11.453171Z",
"start_time": "2020-12-15T13:09:09.633435Z"
},
"hidden": true,
"id": "NC6Jim0hjsr1"
},
"outputs": [],
@@ -745,18 +651,16 @@
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "0pSKslpz2Fw3"
},
"source": [
"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "RZwc-Khk2Fw4"
},
"outputs": [],
@@ -769,7 +673,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "Z4lzFPs02Fw4"
},
"outputs": [],
@@ -783,7 +686,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "WNZ42u0ajsr1"
},
"source": [
@@ -794,10 +696,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:11.540423Z",
"start_time": "2020-12-15T13:09:11.455706Z"
},
"hidden": true,
"id": "XjhGQGppjsr1"
},
@@ -812,10 +710,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:19.565434Z",
"start_time": "2020-12-15T13:09:11.552097Z"
},
"hidden": true,
"id": "QI7-arlqjsr2"
},
@@ -830,7 +724,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "7QMvUvpg2Fw4"
},
"source": [
@@ -852,10 +745,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:25:19.107534Z",
"start_time": "2020-12-30T17:24:51.853293Z"
},
"cellView": "form",
"hidden": true,
"id": "PdjGd56R2Fw5"
@@ -912,10 +801,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:06.643812Z",
"start_time": "2020-12-15T13:09:06.473386Z"
},
"hidden": true,
"id": "_r6QZiwu2Fw5"
},
@@ -949,10 +834,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:08.862421Z",
"start_time": "2020-12-15T13:09:08.820014Z"
},
"hidden": true,
"id": "JnvS6WTK2Fw5"
},
@@ -983,7 +864,6 @@
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "SR8Bgcd52Fw6"
},
"source": [
@@ -1005,10 +885,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:25:19.107534Z",
"start_time": "2020-12-30T17:24:51.853293Z"
},
"cellView": "form",
"hidden": true,
"id": "iNkDWJ3H2Fw6"