Merge pull request #61 from snakers4/adamnsandle

Adamnsandle
Alexander Veysov
2021-04-15 17:32:40 +03:00
committed by GitHub
2 changed files with 12 additions and 136 deletions


@@ -351,7 +351,7 @@ We use random 250 ms audio chunks for validation. Speech to non-speech ratio amo
 Since our VAD (only VAD, other networks are more flexible) was trained on chunks of the same length, model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for precision-recall curve. This can be extended to 100 - 150 ms. Less than 100 - 150 ms cannot be distinguished as speech with confidence.
-[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, each frame has corresponding number (0 **or** 1). We use 30ms frames for webrtc, so each 250 ms chunk is split into 8 frames, their **mean** value is used as a treshold for plot.
+[Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, each frame has corresponding number (0 **or** 1). We use 30ms frames for webrtc, so each 250 ms chunk is split into 8 frames, their **mean** value is used as a threshold for plot.
 [Auditok](https://github.com/amsehili/auditok) - logic same as Webrtc, but we use 50ms frames.
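The frame averaging described in this hunk can be sketched as follows. This is only an illustration, not the benchmark code from the repository: the helper names `frames_per_chunk` and `chunk_score` are hypothetical, the per-frame decisions are passed in as plain 0/1 integers instead of being produced by py-webrtcvad or auditok, and dropping a trailing partial frame is an assumption that happens to match the "8 frames" figure quoted for 30 ms frames.

```python
from typing import Sequence

CHUNK_MS = 250  # validation chunk length stated in the README text


def frames_per_chunk(frame_ms: int, chunk_ms: int = CHUNK_MS) -> int:
    """Whole frames that fit in one chunk (assumption: a partial tail frame is dropped)."""
    return chunk_ms // frame_ms


def chunk_score(frame_decisions: Sequence[int]) -> float:
    """Average binary per-frame decisions (0 or 1) into one score in [0, 1]."""
    if not frame_decisions:
        raise ValueError("need at least one frame decision")
    return sum(frame_decisions) / len(frame_decisions)


# 30 ms frames (the webrtc setting) and 50 ms frames (the auditok setting)
webrtc_frames = frames_per_chunk(30)
auditok_frames = frames_per_chunk(50)

# Hypothetical decisions for one chunk: speech detected in half of the frames
example_score = chunk_score([1, 1, 1, 1, 0, 0, 0, 0])
```

Sweeping a cut-off over such per-chunk scores gives the points of a precision-recall curve in the same way as sweeping over the model's speech probability.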
@@ -363,7 +363,7 @@ Since our VAD (only VAD, other networks are more flexible) was trained on chunks
 #### **Classic way**
-**This is straightforward classic method `get_speech_ts` where tresholds (`trig_sum` and `neg_trig_sum`) are specified by users**
+**This is straightforward classic method `get_speech_ts` where thresholds (`trig_sum` and `neg_trig_sum`) are specified by users**
 - Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with VAD;
 - We provide sensible basic hyper-parameters that work for us, but your case can be different;
 - `trig_sum` - overlapping windows are used for each audio chunk, trig sum defines average probability among those windows for switching into triggered state (speech state);
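A rough sketch of the two-threshold idea behind `trig_sum` and `neg_trig_sum` is given below. It is not the code of `get_speech_ts`: the function name `speech_segments` is hypothetical, the input is assumed to be a list of already averaged window probabilities, and using `neg_trig_sum` as the level for leaving the triggered state is an assumption, since this hunk only describes `trig_sum`.

```python
from typing import List, Sequence, Tuple


def speech_segments(
    avg_probs: Sequence[float], trig_sum: float, neg_trig_sum: float
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, using two thresholds.

    Enter the triggered (speech) state when the averaged probability reaches
    trig_sum; leave it when the probability drops below neg_trig_sum.
    """
    segments = []
    start = None  # index where the current speech segment began, if any
    for i, prob in enumerate(avg_probs):
        if start is None and prob >= trig_sum:
            start = i
        elif start is not None and prob < neg_trig_sum:
            segments.append((start, i))
            start = None
    if start is not None:  # audio ended while still in the speech state
        segments.append((start, len(avg_probs)))
    return segments


# Illustrative values only, not the library defaults
example = speech_segments([0.1, 0.6, 0.4, 0.2, 0.05, 0.7], trig_sum=0.5, neg_trig_sum=0.1)
```

Keeping the exit threshold lower than the entry threshold stops a segment from being cut every time the probability dips slightly in the middle of speech.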
@@ -384,7 +384,7 @@ speech_timestamps = get_speech_ts(wav, model,
 #### **Adaptive way**
-**Adaptive algorythm (`get_speech_ts_adaptive`) automatically selects tresholds (`trig_sum` and `neg_trig_sum`) based on median speech probabilities over whole audio, SOME ARGUMENTS VARY FROM CLASSIC WAY FUNCTION ARGUMENTS**
+**Adaptive algorithm (`get_speech_ts_adaptive`) automatically selects thresholds (`trig_sum` and `neg_trig_sum`) based on median speech probabilities over the whole audio, SOME ARGUMENTS VARY FROM THE CLASSIC WAY FUNCTION ARGUMENTS**
 - `batch_size` - batch size to feed to silero VAD (default - `200`)
 - `step` - step size in samples, (default - `500`) (`num_samples_per_window` / `num_steps` from classic method)
 - `num_samples_per_window` - number of samples in each window, our models were trained using `4000` samples (250 ms) per window, so this is preferable value (lesser values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
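The relation between `step`, `num_samples_per_window` and the classic `num_steps` can be checked with a few lines of arithmetic. The helpers below are hypothetical and only illustrate the numbers quoted in this hunk. How `get_speech_ts_adaptive` treats a trailing partial window is not stated here, so skipping it is an assumption.

```python
from typing import List


def window_starts(num_samples: int, num_samples_per_window: int = 4000, step: int = 500) -> List[int]:
    """Start offsets of every full window (assumption: a partial tail window is skipped)."""
    if num_samples < num_samples_per_window:
        return []
    return list(range(0, num_samples - num_samples_per_window + 1, step))


def implied_sample_rate(num_samples_per_window: int = 4000, window_ms: int = 250) -> int:
    """Sample rate implied by '4000 samples (250 ms) per window'."""
    return num_samples_per_window * 1000 // window_ms


# step = num_samples_per_window / num_steps, so the defaults correspond to num_steps = 8
equivalent_num_steps = 4000 // 500

starts = window_starts(5000)        # three overlapping windows fit into 5000 samples
sample_rate = implied_sample_rate()
```

With the default values each sample in the middle of the audio is covered by eight overlapping windows, which is what makes an average over windows meaningful.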
@@ -425,7 +425,7 @@ Please see [Quality Metrics](#quality-metrics)
 ### How Number Detector Works
 - It is recommended to split long audio into short ones (< 15s) and apply model on each of them;
-- Number Detector can classify if whole audio contains a number, or if each audio frame contains a number;
+- Number Detector can classify if the whole audio contains a number, or if each audio frame contains a number;
 - Audio is splitted into frames in a certain way, so, having a per-frame output, we can restore timing bounds for a numbers with an accuracy of about 0.2s;
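The recommendation to split long audio into pieces shorter than 15 s could look roughly like this. The helper `split_audio` is hypothetical and is not part of the repository utils. It cuts at fixed positions, which is an assumption made for brevity; a fixed cut can land in the middle of a spoken number, so cutting at pauses may work better in practice.

```python
from typing import List, Sequence


def split_audio(samples: Sequence[float], sample_rate: int, chunk_seconds: float = 10.0) -> List[Sequence[float]]:
    """Cut a sample sequence into consecutive chunks shorter than 15 s."""
    if not 0 < chunk_seconds < 15:
        raise ValueError("chunk_seconds must be positive and below 15")
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * chunk_seconds must be at least one sample")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# Toy input: 25 "samples" at 1 sample per second, cut into 10 s pieces
chunks = split_audio(list(range(25)), sample_rate=1, chunk_seconds=10)
chunk_lengths = [len(c) for c in chunks]
```

Each chunk would then be passed to the model on its own, and per-frame results shifted by the chunk's start offset to recover positions in the original audio.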
 ### How Language Classifier Works


@@ -3,7 +3,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"id": "sVNOuHQQjsrp" "id": "sVNOuHQQjsrp"
}, },
"source": [ "source": [
@@ -13,8 +12,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "FpMplOCA2Fwp" "id": "FpMplOCA2Fwp"
}, },
"source": [ "source": [
@@ -25,7 +22,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "62A6F_072Fwq" "id": "62A6F_072Fwq"
}, },
"source": [ "source": [
@@ -36,10 +32,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:35:43.397137Z",
"start_time": "2020-12-30T17:33:10.962078Z"
},
"hidden": true, "hidden": true,
"id": "5w5AkskZ2Fwr" "id": "5w5AkskZ2Fwr"
}, },
@@ -75,8 +67,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "fXbbaUO3jsrw" "id": "fXbbaUO3jsrw"
}, },
"source": [ "source": [
@@ -86,22 +76,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"hidden": true,
"id": "dY2Us3_Q2Fws" "id": "dY2Us3_Q2Fws"
}, },
"source": [ "source": [
-"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
+"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:35:44.362860Z",
"start_time": "2020-12-30T17:35:43.398441Z"
},
"hidden": true,
"id": "aI_eydBPjsrx" "id": "aI_eydBPjsrx"
}, },
"outputs": [], "outputs": [],
@@ -117,11 +101,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:35:44.419280Z",
"start_time": "2020-12-30T17:35:44.364175Z"
},
"hidden": true,
"id": "OuEobLchjsry" "id": "OuEobLchjsry"
}, },
"outputs": [], "outputs": [],
@@ -135,18 +114,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"hidden": true,
"id": "n8plzbJU2Fws" "id": "n8plzbJU2Fws"
}, },
"source": [ "source": [
-"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
+"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"hidden": true,
"id": "SQOtu2Vl2Fwt" "id": "SQOtu2Vl2Fwt"
}, },
"outputs": [], "outputs": [],
@@ -161,7 +138,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"hidden": true,
"id": "Lr6zCGXh2Fwt" "id": "Lr6zCGXh2Fwt"
}, },
"outputs": [], "outputs": [],
@@ -175,8 +151,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "iDKQbVr8jsry" "id": "iDKQbVr8jsry"
}, },
"source": [ "source": [
@@ -186,26 +160,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:29:04.224833Z",
"start_time": "2021-04-15T13:29:04.220588Z"
},
"hidden": true,
"id": "xCM-HrUR2Fwu" "id": "xCM-HrUR2Fwu"
}, },
"source": [ "source": [
-"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
+"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:59.199321Z",
"start_time": "2020-12-15T13:09:59.196823Z"
},
"hidden": true,
"id": "q-lql_2Wjsry" "id": "q-lql_2Wjsry"
}, },
"outputs": [], "outputs": [],
@@ -220,18 +184,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"hidden": true,
"id": "t8TXtnvk2Fwv" "id": "t8TXtnvk2Fwv"
}, },
"source": [ "source": [
-"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
+"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"hidden": true,
"id": "BX3UgwwB2Fwv" "id": "BX3UgwwB2Fwv"
}, },
"outputs": [], "outputs": [],
@@ -247,7 +209,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "KBDVybJCjsrz" "id": "KBDVybJCjsrz"
}, },
"source": [ "source": [
@@ -258,10 +219,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:10:03.590358Z",
"start_time": "2020-12-15T13:10:03.587071Z"
},
"hidden": true, "hidden": true,
"id": "BK4tGfWgjsrz" "id": "BK4tGfWgjsrz"
}, },
@@ -275,10 +232,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:10:15.762491Z",
"start_time": "2020-12-15T13:10:03.591388Z"
},
"hidden": true, "hidden": true,
"id": "v1l8sam1jsrz" "id": "v1l8sam1jsrz"
}, },
@@ -293,7 +246,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "36jY0niD2Fww" "id": "36jY0niD2Fww"
}, },
"source": [ "source": [
@@ -421,7 +373,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "PnKtJKbq2Fwz" "id": "PnKtJKbq2Fwz"
}, },
"source": [ "source": [
@@ -498,7 +449,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"id": "57avIBd6jsrz" "id": "57avIBd6jsrz"
}, },
"source": [ "source": [
@@ -508,8 +458,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "hEhnfORV2Fw0" "id": "hEhnfORV2Fw0"
}, },
"source": [ "source": [
@@ -520,7 +468,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "bL4kn4KJrlyL" "id": "bL4kn4KJrlyL"
}, },
"source": [ "source": [
@@ -531,10 +478,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:30:22.938755Z",
"start_time": "2021-04-15T13:30:20.970574Z"
},
"cellView": "form", "cellView": "form",
"hidden": true, "hidden": true,
"id": "Q4QIfSpprnkI" "id": "Q4QIfSpprnkI"
@@ -580,8 +523,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "5JHErdB7jsr0" "id": "5JHErdB7jsr0"
}, },
"source": [ "source": [
@@ -591,26 +532,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:34:22.554010Z",
"start_time": "2021-04-15T13:34:22.550308Z"
},
"hidden": true,
"id": "TNEtK5zi2Fw2" "id": "TNEtK5zi2Fw2"
}, },
"source": [ "source": [
-"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
+"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:30:14.475412Z",
"start_time": "2021-04-15T13:30:14.427933Z"
},
"hidden": true,
"id": "krnGoA6Kjsr0" "id": "krnGoA6Kjsr0"
}, },
"outputs": [], "outputs": [],
@@ -627,11 +558,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:08.862421Z",
"start_time": "2020-12-15T13:09:08.820014Z"
},
"hidden": true,
"id": "B176Lzfnjsr1" "id": "B176Lzfnjsr1"
}, },
"outputs": [], "outputs": [],
@@ -644,18 +570,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"hidden": true,
"id": "21RE8KEC2Fw2" "id": "21RE8KEC2Fw2"
}, },
"source": [ "source": [
-"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
+"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"hidden": true,
"id": "uIVs56rb2Fw2" "id": "uIVs56rb2Fw2"
}, },
"outputs": [], "outputs": [],
@@ -672,11 +596,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2021-04-15T13:34:41.375446Z",
"start_time": "2021-04-15T13:34:41.368055Z"
},
"hidden": true,
"id": "cox6oumC2Fw3" "id": "cox6oumC2Fw3"
}, },
"outputs": [], "outputs": [],
@@ -689,8 +608,6 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "Rio9W50gjsr1" "id": "Rio9W50gjsr1"
}, },
"source": [ "source": [
@@ -700,22 +617,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"hidden": true,
"id": "i8EZwtaA2Fw3" "id": "i8EZwtaA2Fw3"
}, },
"source": [ "source": [
-"**Classic way of getting speech chunks, you may need to select the tresholds yourself**"
+"**Classic way of getting speech chunks, you may need to select the thresholds yourself**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:09.606031Z",
"start_time": "2020-12-15T13:09:09.504239Z"
},
"hidden": true,
"id": "IPkl8Yy1jsr1" "id": "IPkl8Yy1jsr1"
}, },
"outputs": [], "outputs": [],
@@ -728,11 +639,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:11.453171Z",
"start_time": "2020-12-15T13:09:09.633435Z"
},
"hidden": true,
"id": "NC6Jim0hjsr1" "id": "NC6Jim0hjsr1"
}, },
"outputs": [], "outputs": [],
@@ -745,18 +651,16 @@
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"hidden": true,
"id": "0pSKslpz2Fw3" "id": "0pSKslpz2Fw3"
}, },
"source": [ "source": [
-"**Experimental Adaptive method, algorythm selects tresholds itself (see readme for more information)**"
+"**Experimental Adaptive method, algorithm selects thresholds itself (see readme for more information)**"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"hidden": true,
"id": "RZwc-Khk2Fw4" "id": "RZwc-Khk2Fw4"
}, },
"outputs": [], "outputs": [],
@@ -769,7 +673,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"hidden": true,
"id": "Z4lzFPs02Fw4" "id": "Z4lzFPs02Fw4"
}, },
"outputs": [], "outputs": [],
@@ -783,7 +686,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "WNZ42u0ajsr1" "id": "WNZ42u0ajsr1"
}, },
"source": [ "source": [
@@ -794,10 +696,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:11.540423Z",
"start_time": "2020-12-15T13:09:11.455706Z"
},
"hidden": true, "hidden": true,
"id": "XjhGQGppjsr1" "id": "XjhGQGppjsr1"
}, },
@@ -812,10 +710,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:19.565434Z",
"start_time": "2020-12-15T13:09:11.552097Z"
},
"hidden": true, "hidden": true,
"id": "QI7-arlqjsr2" "id": "QI7-arlqjsr2"
}, },
@@ -830,7 +724,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "7QMvUvpg2Fw4" "id": "7QMvUvpg2Fw4"
}, },
"source": [ "source": [
@@ -852,10 +745,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:25:19.107534Z",
"start_time": "2020-12-30T17:24:51.853293Z"
},
"cellView": "form", "cellView": "form",
"hidden": true, "hidden": true,
"id": "PdjGd56R2Fw5" "id": "PdjGd56R2Fw5"
@@ -912,10 +801,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:06.643812Z",
"start_time": "2020-12-15T13:09:06.473386Z"
},
"hidden": true, "hidden": true,
"id": "_r6QZiwu2Fw5" "id": "_r6QZiwu2Fw5"
}, },
@@ -949,10 +834,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-15T13:09:08.862421Z",
"start_time": "2020-12-15T13:09:08.820014Z"
},
"hidden": true, "hidden": true,
"id": "JnvS6WTK2Fw5" "id": "JnvS6WTK2Fw5"
}, },
@@ -983,7 +864,6 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"heading_collapsed": true, "heading_collapsed": true,
"hidden": true,
"id": "SR8Bgcd52Fw6" "id": "SR8Bgcd52Fw6"
}, },
"source": [ "source": [
@@ -1005,10 +885,6 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"metadata": { "metadata": {
"ExecuteTime": {
"end_time": "2020-12-30T17:25:19.107534Z",
"start_time": "2020-12-30T17:24:51.853293Z"
},
"cellView": "form", "cellView": "form",
"hidden": true, "hidden": true,
"id": "iNkDWJ3H2Fw6" "id": "iNkDWJ3H2Fw6"