From 3efee3b272ad70f8ef2d1ec61d0b7fd6028d70fe Mon Sep 17 00:00:00 2001
From: Dimitrii Voronin <36505480+adamnsandle@users.noreply.github.com>
Date: Mon, 11 Jan 2021 14:11:56 +0200
Subject: [PATCH 1/4] Update README.md

---
 README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 26b5fa9..7d2521e 100644
--- a/README.md
+++ b/README.md
@@ -217,7 +217,11 @@ TBD, but there is no explicit limiation on the way audio is split into chunks.
 
 ### How Language Classifier Works
 
-TBD, but there is no explicit limiation on the way audio is split into chunks.
+- **99%** validation accuracy
+- The language classifier was trained on audio samples in 4 languages: **Russian**, **English**, **Spanish**, **German**
+- More languages TBD
+- Arbitrary audio length can be used, although the network was trained on audio shorter than 15 seconds
+
 
 ## Contact

From 07687e33d801deee83c744606d9136cec6917c11 Mon Sep 17 00:00:00 2001
From: adamnsandle
Date: Mon, 11 Jan 2021 12:13:46 +0000
Subject: [PATCH 2/4] fix models dimension bug

---
 files/model.jit  | Bin 2870157 -> 2870221 bytes
 files/model.onnx | Bin 4451292 -> 4451716 bytes
 utils.py         | 1 -
 3 files changed, 1 deletion(-)

diff --git a/files/model.jit b/files/model.jit
index b2d21a86ab335ac80d30bf85284c49a9a0517ee6..f52a626a60aa41d941ec1277355adf81b7bcebca 100644
GIT binary patch
delta 3949
[base85-encoded binary deltas omitted]
diff --git a/files/model.onnx b/files/model.onnx
index ec07633500f68b65172397b9b578c6d32f01f9b3..f2c293933a3bef0f910b8cad58e5cc7f66eba4e6 100644
GIT binary patch
delta 2000
[base85-encoded binary delta omitted]

delta 1701
[base85-encoded binary delta omitted]

From a345a4850f25e4b8a8d100db9d0a13b34acbdbef Mon Sep 17 00:00:00 2001
From: Alexander Veysov <aveysov@gmail.com>
Date: Mon, 11 Jan 2021 14:35:22 +0200
Subject: [PATCH 3/4] Update README.md

---
 README.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 140 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 7d2521e..79b345f 100644
--- a/README.md
+++ b/README.md
@@ -57,9 +57,9 @@ The models are small enough to be included directly into this repository. Newer
 Currently we provide the following functionality:
 
-| PyTorch | ONNX | VAD | Number Detector | Language Clf | Languages | Colab |
-|-------------------|--------------------|---------------------|-----------------|--------------|------------------------|-------|
-| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | | | `ru`, `en`, `de`, `es` | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
+| PyTorch | ONNX | VAD | Number Detector | Language Clf | Languages | Colab |
+|-------------------|--------------------|---------------------|--------------------|--------------------|------------------------|-------|
+| :heavy_check_mark:| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | `ru`, `en`, `de`, `es` | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) |
 
 **Version history:**
 
@@ -67,13 +67,17 @@ Currently we provide the following functionality:
 |---------|-------------|---------------------------------------------------|
 | `v1` | 2020-12-15 | Initial release |
 | `v1.1` | 2020-12-24 | better vad models compatible with chunks shorter than 250 ms
-| `v2` | coming soon | Add Number Detector and Language Classifier heads |
+| `v1.2` | 2020-12-30 | Number Detector added
+| `v2` | 2021-01-11 | Add Language Classifier heads |
 
 ### PyTorch
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
 
 [![Open on Torch Hub](https://img.shields.io/badge/Torch-Hub-red?logo=pytorch&style=for-the-badge)](https://pytorch.org/hub/snakers4_silero-vad/) (coming soon)
 
+
+#### VAD
+
 ```python
 import torch
 torch.set_num_threads(1)
 from pprint import pprint
@@ -96,12 +100,63 @@ speech_timestamps = get_speech_ts(wav, model, num_steps=4)
 pprint(speech_timestamps)
 ```
 
+
+#### Number Detector
+
+```python
+import torch
+torch.set_num_threads(1)
+from pprint import pprint
+
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_number_detector',
+                              force_reload=True)
+
+(get_number_ts,
+ _, read_audio,
+ _, _) = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+wav = read_audio(f'{files_dir}/en_num.wav')
+
+# get number timestamps from the full audio file
+number_timestamps = get_number_ts(wav, model)
+
+pprint(number_timestamps)
+```
+
+#### Language Classifier
+
+```python
+import torch
+torch.set_num_threads(1)
+from pprint import pprint
+
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_lang_detector',
+                              force_reload=True)
+
+get_language, read_audio = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+wav = read_audio(f'{files_dir}/de.wav')
+language = get_language(wav, model)
+
+pprint(language)
+```
+
 ### ONNX
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
 
-You can run our model everywhere, where you can import the ONNX model or run ONNX runtime.
+You can run our models anywhere you can import the ONNX model or run the ONNX runtime.
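Before wiring a model into the `run_function` helpers shown below, it can be useful to check that the exported file loads and runs on its own. A minimal standalone sketch, not code from this repository: the `'input'` tensor name matches the `validate_onnx` helpers below, while the dummy shape (one 250 ms window at 16 kHz) and the local path are assumptions for illustration.

```python
# Hypothetical smoke test for a downloaded ONNX model
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession('files/model.onnx')  # assumed local path
dummy = np.zeros((1, 4000), dtype=np.float32)  # one 250 ms window at 16 kHz (assumption)
outs = session.run(None, {'input': dummy})     # 'input' matches validate_onnx below
print([o.shape for o in outs])
```

The examples below wrap exactly this session-plus-feed pattern into a `run_function` callable.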
+
+#### VAD
+
 ```python
+import torch
 import onnxruntime
 from pprint import pprint
 
@@ -133,6 +188,72 @@ speech_timestamps = get_speech_ts(wav, model, num_steps=4, run_function=validate
 pprint(speech_timestamps)
 ```
 
+#### Number Detector
+
+```python
+import torch
+import onnxruntime
+from pprint import pprint
+
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_number_detector',
+                              force_reload=True)
+
+(get_number_ts,
+ _, read_audio,
+ _, _) = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+def init_onnx_model(model_path: str):
+    return onnxruntime.InferenceSession(model_path)
+
+def validate_onnx(model, inputs):
+    with torch.no_grad():
+        ort_inputs = {'input': inputs.cpu().numpy()}
+        outs = model.run(None, ort_inputs)
+        outs = [torch.Tensor(x) for x in outs]
+    return outs
+
+model = init_onnx_model(f'{files_dir}/number_detector.onnx')
+wav = read_audio(f'{files_dir}/en_num.wav')
+
+# get number timestamps from the full audio file
+number_timestamps = get_number_ts(wav, model, run_function=validate_onnx)
+pprint(number_timestamps)
+```
+
+#### Language Classifier
+
+```python
+import torch
+import onnxruntime
+from pprint import pprint
+
+model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                              model='silero_lang_detector',
+                              force_reload=True)
+
+get_language, read_audio = utils
+
+files_dir = torch.hub.get_dir() + '/snakers4_silero-vad_master/files'
+
+def init_onnx_model(model_path: str):
+    return onnxruntime.InferenceSession(model_path)
+
+def validate_onnx(model, inputs):
+    with torch.no_grad():
+        ort_inputs = {'input': inputs.cpu().numpy()}
+        outs = model.run(None, ort_inputs)
+        outs = [torch.Tensor(x) for x in outs]
+    return outs
+
+model = init_onnx_model(f'{files_dir}/number_detector.onnx')
+wav = read_audio(f'{files_dir}/de.wav')
+
+language = get_language(wav, model, run_function=validate_onnx)
+print(language)
+```
 
 ## Metrics
 
 ### Performance Metrics
@@ -184,7 +305,7 @@ So **batch size** for streaming is **num_steps * number of audio streams**. Time
 
 We use random 250 ms audio chunks for validation. Speech to non-speech ratio among chunks is about ~50/50 (i.e. balanced). Speech chunks are sampled from real audios in four different languages (English, Russian, Spanish, German), then random background noise is added to some of them (~40%).
 
-Since our VAD (only VAD, other networks are more flexible) was trained on chunks of the same length, model's output is just one float from 0 to 1 - **speech probability**. We use speech probabilities as thresholds for precision-recall curve. This can be extended to 100 - 150 ms (coming soon). Less than 100 - 150 ms cannot be distinguished as speech with confidence.
+Since our VAD (only the VAD; the other networks are more flexible) was trained on chunks of the same length, the model's output is just one float from 0 to 1 - the **speech probability**. We use speech probabilities as thresholds for the precision-recall curve. This can be extended to 100 - 150 ms. Audio shorter than 100 - 150 ms cannot be confidently distinguished as speech.
 
 [Webrtc](https://github.com/wiseman/py-webrtcvad) splits audio into frames, each frame has corresponding number (0 **or** 1). We use 30ms frames for webrtc, so each 250 ms chunk is split into 8 frames, their **mean** value is used as a treshold for plot.
 
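As a rough illustration of the WebRTC methodology above, a per-chunk score could be computed as follows. This is a sketch, not the actual evaluation code; the aggressiveness mode and the 16 kHz / 16-bit mono framing are assumptions.

```python
import webrtcvad  # pip install webrtcvad

def webrtc_chunk_score(pcm16: bytes, sample_rate: int = 16000,
                       frame_ms: int = 30, mode: int = 3) -> float:
    """Mean of per-frame 0/1 WebRTC decisions over one audio chunk."""
    vad = webrtcvad.Vad(mode)                         # aggressiveness is an assumption
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 16-bit mono PCM
    frames = [pcm16[i:i + frame_bytes]
              for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes)]
    flags = [vad.is_speech(f, sample_rate) for f in frames]
    return sum(flags) / len(flags)                    # 8 full frames per 250 ms chunk
```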
@@ -192,20 +313,23 @@ Since our VAD (only VAD, other networks are more flexible) was trained on chunks
 
 ## FAQ
 
-### Method' argument to use for VAD quality/speed tuning
-- `trig_sum` - overlapping windows are used for each audio chunk, trig sum defines average probability among those windows for switching into triggered state (speech state)
-- `neg_trig_sum` - same as `trig_sum`, but for switching from triggered to non-triggered state (no speech)
-- `num_steps` - nubmer of overlapping windows to split audio chunk by (we recommend 4 or 8)
-- `num_samples_per_window` - number of samples in each window, our models were trained using `4000` samples (250 ms) per window, so this is preferable value (lesser reduces quality)
+### VAD Parameter Fine-Tuning
+
+- Among others, we provide several [utils](https://github.com/snakers4/silero-vad/blob/8b28767292b424e3e505c55f15cd3c4b91e4804b/utils.py#L52-L59) to simplify working with the VAD;
+- We provide sensible default hyper-parameters that work for us, but your use case may differ;
+- `trig_sum` - overlapping windows are used for each audio chunk; `trig_sum` defines the average probability among those windows required to switch into the triggered state (speech state);
+- `neg_trig_sum` - same as `trig_sum`, but for switching from the triggered to the non-triggered state (non-speech);
+- `num_steps` - number of overlapping windows to split the audio chunk into (we recommend 4 or 8);
+- `num_samples_per_window` - number of samples in each window; our models were trained using `4000` samples (250 ms) per window, so this is the preferable value (lower values reduce [quality](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434));
 
 ### How VAD Works
 
-- Audio is split into 250 ms chunks;
+- Audio is split into 250 ms chunks (you can choose any chunk size, but quality with chunks shorter than 100 ms will suffer, with more false positives and "unnatural" pauses);
 - VAD keeps record of a previous chunk (or zeros at the beginning of the stream);
 - Then this 500 ms audio (250 ms + 250 ms) is split into N (typically 4 or 8) windows and the model is applied to this window batch. Each window is 250 ms long (naturally, windows overlap);
 - Then probability is averaged across these windows;
 - Though typically pauses in speech are 300 ms+ or longer (pauses less than 200-300ms are typically not meaninful), it is hard to confidently classify speech vs noise / music on very short chunks (i.e. 30 - 50ms);
-- We are working on lifting this limitation, so that you can use 100 - 125ms windows;
+- ~~We are working on lifting this limitation, so that you can use 100 - 125ms windows~~;
 
 ### VAD Quality Metrics Methodology
 
@@ -213,7 +337,9 @@ Please see [Quality Metrics](#quality-metrics)
 
 ### How Number Detector Works
 
-TBD, but there is no explicit limiation on the way audio is split into chunks.
+- It is recommended to split long audio into shorter chunks (< 15 s) and apply the model to each of them;
+- The Number Detector can classify whether the whole audio contains a number, or whether each audio frame contains a number;
+- Audio is split into frames in a certain way, so, given the per-frame output, we can restore the timing bounds of numbers with an accuracy of about 0.2 s (see the sketch below);
 
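To illustrate the last bullet: merging consecutive number-positive frames back into time spans could look like the sketch below. The frame size, threshold and output format here are hypothetical assumptions for illustration, not the actual `utils.py` logic.

```python
def frames_to_spans(frame_probs, frame_s=0.2, threshold=0.5):
    """Merge consecutive number-positive frames into (start_s, end_s) spans."""
    spans, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i * frame_s                    # a span opens on the first hit
        elif p < threshold and start is not None:
            spans.append((start, i * frame_s))     # and closes on the first miss
            start = None
    if start is not None:                          # close a span running to the end
        spans.append((start, len(frame_probs) * frame_s))
    return spans

print(frames_to_spans([0.1, 0.9, 0.8, 0.2, 0.7]))  # [(0.2, 0.6), (0.8, 1.0)]
```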
 ### How Language Classifier Works
 
@@ -222,7 +348,6 @@
 - More languages TBD
 - Arbitrary audio length can be used, although the network was trained on audio shorter than 15 seconds
 
-
 ## Contact
 
 ### Get in Touch

From 9a60b3a31865bb19837c0589a91fcbd03e9345f6 Mon Sep 17 00:00:00 2001
From: Dimitrii Voronin <36505480+adamnsandle@users.noreply.github.com>
Date: Mon, 11 Jan 2021 14:46:19 +0200
Subject: [PATCH 4/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 79b345f..ea42fe0 100644
--- a/README.md
+++ b/README.md
@@ -68,7 +68,7 @@ Currently we provide the following functionality:
 | `v1` | 2020-12-15 | Initial release |
 | `v1.1` | 2020-12-24 | better vad models compatible with chunks shorter than 250 ms
 | `v1.2` | 2020-12-30 | Number Detector added
-| `v2` | 2021-01-11 | Add Language Classifier heads |
+| `v2` | 2021-01-11 | Add Language Classifier heads (en, ru, de, es) |
 
 ### PyTorch
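To make the chunk-windowing scheme described in the FAQ above concrete, here is a minimal sketch. It assumes a 16 kHz mono `torch.Tensor` input and a model that returns one speech probability per window; the actual implementation lives in the repository's `utils.py`.

```python
import torch

def chunk_speech_prob(model, prev_chunk, cur_chunk, num_steps=4):
    """Sketch: N overlapping 250 ms windows over the previous + current
    chunk, with the per-window speech probabilities averaged."""
    audio = torch.cat([prev_chunk, cur_chunk])  # 500 ms of audio (250 ms + 250 ms)
    win = cur_chunk.shape[0]                    # 4000 samples ~ 250 ms at 16 kHz
    span = audio.shape[0] - win                 # room to slide the window over
    offsets = [round(i * span / (num_steps - 1)) for i in range(num_steps)]
    windows = torch.stack([audio[o:o + win] for o in offsets])
    with torch.no_grad():
        probs = model(windows)                  # assumed: one probability per window
    return probs.mean().item()
```

This also shows why the streaming batch size is **num_steps * number of audio streams**: each incoming chunk costs one batched forward pass over its windows.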