327 Commits
v3.1 ... master

Author SHA1 Message Date
Alexander Veysov
2688a6e352 Merge pull request #747 from d-e-s-o/topic/ort-rc.10
Update ort dependency to 2.0.0-rc.10
2025-12-30 07:05:45 +03:00
Daniel Müller
c5542cd4a8 Update ort dependency to 2.0.0-rc.10
Update the ort dependency from 2.0.0-rc.2 to 2.0.0-rc.10 and adapt the code
to work with the new API. This includes:
- Updating ndarray to 0.16 to match ort's requirements
- Using Session and Value from their new module locations
- Adapting to the new Value::from_array() and try_extract_tensor() APIs
- Converting SessionInputs from Value references

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-29 19:29:22 -08:00
Alexander Veysov
4725c40105 Merge pull request #746 from d-e-s-o/topic/fix-rust
Fix `rust-example`
2025-12-29 09:34:47 +03:00
Daniel Müller
cfe63384f0 Update model plumbing for Rust example
The v6.2 models broke the Rust example. Update the logic for driving
them to reflect what the reference Python code does.

Fixes: #745
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-28 07:15:01 -08:00
Daniel Müller
2a08f0b90d Remove 'load-dynamic' feature of 'ort' dependency
It's unclear why we'd want this feature. It seems to make things even
less isolated and self-contained than it already is, which certainly
isn't a boon for an example.
2025-12-27 06:36:07 -08:00
Daniel Müller
21ffe8576e Fix model path in Rust example 2025-12-25 18:25:33 -08:00
Dimitrii Voronin
d5b52843f7 Merge pull request #736 from snakers4/adamnsandle
add tinygrad model
2025-12-10 16:35:36 +03:00
adamnsandle
fb7d7c7f5d add tinygrad model 2025-12-10 13:31:25 +00:00
Dimitrii Voronin
e7c3d6f2bd Merge pull request #734 from snakers4/adamnsandle
Adamnsandle
2025-12-08 10:27:37 +03:00
adamnsandle
390614894d Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2025-12-08 07:26:37 +00:00
adamnsandle
33eb4c7f84 fx ifless model 2025-12-08 07:26:18 +00:00
Dimitrii Voronin
c913b0c4b3 Merge pull request #732 from snakers4/adamnsandle
add ifless model
2025-12-05 16:58:29 +03:00
adamnsandle
4dd2e8f6f9 add ifless model 2025-12-05 13:57:43 +00:00
Alexander Veysov
63fe03add7 Merge pull request #727 from dfengpo/master
delete debug code
2025-11-25 13:31:10 +03:00
dongfp
29a582ba37 fix 2025-11-25 16:46:03 +08:00
Alexander Veysov
3ca476e4fb Merge pull request #722 from dfengpo/master
修复C# CalculateProb方法计算句子EndOffset的bug
2025-11-10 11:04:45 +03:00
Alexander Veysov
7de462944a Update README.md 2025-11-10 10:59:13 +03:00
Alexander Veysov
12b0121993 Merge pull request #721 from NathanJHLee/feature/onnx-libtorch-cpp-examples
Add C++ examples supporting ONNX & LibTorch; rename legacy folder
2025-11-10 10:58:27 +03:00
dongfp
7b0aaa1c4c 修复CalculateProb方法计算句子EndOffset的bug
修改语法提示
2025-11-10 15:58:20 +08:00
NathanLee
540eff3e24 Rename cpp_libtorch to cpp_libtorch_deprecated 2025-11-10 07:32:10 +00:00
NathanLee
dfeba4fc0f Add C++ folder for supporting ONNX & LibTorch 2025-11-10 07:31:58 +00:00
Dimitrii Voronin
be95df9152 Merge pull request #719 from snakers4/adamnsandle
Adamnsandle
2025-11-06 11:25:49 +03:00
adamnsandle
ec56fe50a5 fx workflow 2025-11-06 08:18:46 +00:00
adamnsandle
dea5980320 fx workflow 2025-11-06 08:04:02 +00:00
adamnsandle
90d9ce7695 fx workflow 2025-11-06 07:49:44 +00:00
adamnsandle
c56dbb11ac Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2025-11-06 07:36:38 +00:00
adamnsandle
9b686893ad fx test workflow 2025-11-06 07:36:23 +00:00
Dimitrii Voronin
6979fbd535 Merge pull request #717 from snakers4/adamnsandle
v6.2.0 release
2025-11-06 10:28:00 +03:00
adamnsandle
1cff663de5 fix version to 6.2.0 2025-11-06 07:27:07 +00:00
adamnsandle
bfdc019302 add v6.2 model 2025-11-06 07:23:43 +00:00
Alexander Veysov
c0c0ffa0c5 Merge pull request #714 from Purfview/patch-4
Fix type hint for min_silence_at_max_speech (float -> int)
2025-11-05 08:44:00 +03:00
Alexander Veysov
3f0c9ead54 Update pyproject.toml 2025-11-05 08:38:07 +03:00
Purfview
556a442942 Fix type hint for min_silence_at_max_speech (float -> int) 2025-11-04 08:30:01 +00:00
Dimitrii Voronin
9623ce72da Merge pull request #710 from Purfview/patch-3
Fixes and refines - use_max_poss_sil_at_max_speech arg
2025-10-29 12:36:58 +03:00
Dimitrii Voronin
b6dd0599fc Merge pull request #712 from snakers4/adamnsandle
drop_chunks fix
2025-10-29 12:16:10 +03:00
adamnsandle
d8f88c9157 drop_chunks fix 2025-10-29 09:14:45 +00:00
Purfview
b15a216b47 Reword a comment 2025-10-24 10:30:34 +01:00
Purfview
2389039408 Fixes and refines - use_max_poss_sil_at_max_speech arg
Removed redundant "if temp_end != 0:" check.
Multiple "window_size_samples * i" - assigned to a variable.
Restored the previous functionality (which was broken) when use_max_poss_sil_at_max_speech=False.

@shashank14k was your https://github.com/snakers4/silero-vad/pull/664 PR still WIP when it was merged?
Anyway, please test if use_max_poss_sil_at_max_speech=True behaviour is same, and "False" is same as before your PR.
2025-10-24 07:46:41 +01:00
Alexander Veysov
df22fcaec8 Merge pull request #708 from Purfview/patch-2
Removes redundant hop_size_samples variable
2025-10-23 15:58:00 +03:00
Purfview
81e8a48e25 Removes redundant hop_size_samples variable
Remove redundant hop_size_samples variable
2025-10-23 05:23:18 +01:00
Alexander Veysov
a14a23faa7 Merge pull request #707 from Purfview/patch-1
Fixes few typos
2025-10-23 06:35:58 +03:00
Purfview
a30b5843c1 Fixes various typos 2025-10-23 04:02:13 +01:00
Dimitrii Voronin
a66c890188 Merge pull request #704 from snakers4/adamnsandle
resolve torchaudio 2.9 utils
2025-10-17 15:50:20 +03:00
adamnsandle
77c91a91fa resolve torchaudio 2.9 utils 2025-10-17 12:35:40 +00:00
Alexander Veysov
33093c6f1b Update utils.py 2025-10-14 14:51:23 +03:00
Alexander Veysov
dc0b62e1e4 Merge pull request #699 from JiJiJiang/master
fix bug in tuning/utils.py: add optimizer.zero_grad() before loss.bac…
2025-10-14 14:50:58 +03:00
Hongji Wang
64fb49e1c8 fix bug in tuning/utils.py: add optimizer.zero_grad() before loss.backward() 2025-10-13 20:50:29 +08:00
Alexander Veysov
55ba6e2825 Merge pull request #697 from VvvvvGH/java-example-v6
Update java example for v6
2025-10-11 11:41:15 +03:00
GH
b90f8c012f Update SlieroVadOnnxModel.java 2025-10-11 16:21:57 +08:00
GH
25a778c798 Update SlieroVadDetector.java 2025-10-11 16:21:45 +08:00
GH
3d860e6ace Update App.java 2025-10-11 16:21:32 +08:00
GH
f5ea01bfda Update pom.xml 2025-10-11 16:21:03 +08:00
Alexander Veysov
dd651a54a5 Merge pull request #695 from mpariente/master
Remove ipdb and raise error directly in get_speech_timestamps
2025-10-11 08:07:18 +03:00
Manuel Pariente
f1175c902f Remove ipdb and raise error directly 2025-10-10 10:46:44 +02:00
Alexander Veysov
7819fd911b Update README.md 2025-10-09 17:34:33 +03:00
Dimitrii Voronin
fba061dc55 Merge pull request #677 from snakers4/adamnsandle
get rid of hop_size_ratio
2025-08-26 09:54:35 +03:00
adamnsandle
11631356a2 get rid of hop_size_ratio 2025-08-26 06:53:53 +00:00
Dimitrii Voronin
34dea51680 Merge pull request #664 from shashank14k/master
Adding additional params to get_speech_timestamps
2025-08-26 09:50:44 +03:00
Dimitrii Voronin
51fd43130a Update README.md 2025-08-25 19:30:20 +03:00
Dimitrii Voronin
3080062489 Update README.md 2025-08-25 18:07:06 +03:00
Dimitrii Voronin
f974f2d6bc Merge pull request #676 from snakers4/adamnsandle
Adamnsandle
2025-08-25 17:59:19 +03:00
adamnsandle
f1886d9088 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2025-08-25 14:57:11 +00:00
adamnsandle
4c00cd14be add v6 models 2025-08-25 14:56:50 +00:00
Dimitrii Voronin
5d70880844 Merge pull request #675 from snakers4/adamnsandle
Adamnsandle
2025-08-25 17:28:38 +03:00
adamnsandle
a16f3ed079 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2025-08-25 14:27:26 +00:00
adamnsandle
b0fbf4bec6 fx 2025-08-25 14:27:15 +00:00
Dimitrii Voronin
ab02267584 Merge pull request #674 from snakers4/adamnsandle
Adamnsandle
2025-08-25 17:09:07 +03:00
adamnsandle
485a7d91b0 git push Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2025-08-25 14:08:15 +00:00
adamnsandle
1da76acfc3 fx 2025-08-25 14:07:32 +00:00
Dimitrii Voronin
3c70b587e8 Merge pull request #673 from snakers4/adamnsandle
Adamnsandle
2025-08-25 16:56:19 +03:00
adamnsandle
7aff370d68 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2025-08-25 13:55:30 +00:00
adamnsandle
931eddfdab fx 2025-08-25 13:55:24 +00:00
Dimitrii Voronin
6143b9a5d9 Merge pull request #672 from snakers4/adamnsandle
fx
2025-08-25 16:46:24 +03:00
adamnsandle
8ca8cf7d9b fx 2025-08-25 13:45:36 +00:00
Dimitrii Voronin
ad0fdbe4ac Merge pull request #671 from snakers4/adamnsandle
Adamnsandle
2025-08-25 16:40:10 +03:00
adamnsandle
06806eb70b Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2025-08-25 13:39:32 +00:00
adamnsandle
c90e1603c5 fx 2025-08-25 13:39:15 +00:00
Dimitrii Voronin
023d3a36f0 Merge pull request #670 from snakers4/adamnsandle
fx
2025-08-25 16:25:39 +03:00
adamnsandle
aa2a66cf46 fx 2025-08-25 13:24:43 +00:00
Dimitrii Voronin
b1cd34aae2 Merge pull request #669 from snakers4/adamnsandle
Adamnsandle
2025-08-25 16:17:17 +03:00
adamnsandle
50be3744fe fix 2025-08-25 13:08:02 +00:00
adamnsandle
fce776f872 fix workflow 2025-08-25 12:59:58 +00:00
adamnsandle
fbddc91a5d initial autotest commit 2025-08-25 12:54:47 +00:00
shashank14k
bbf22a0064 Added params for hop_size, and min_silence_at_max speech to cut at a possible silence when max_dur reached to avoid abrupt cuts 2025-07-25 20:51:40 +05:30
Alexander Veysov
94811cbe12 Merge pull request #656 from davidrs/patch-1
Surface drop_chunks in init
2025-06-11 07:45:36 +03:00
David Rust-Smith
22a2362b4c Surface drop_chunks in init 2025-06-10 11:36:10 -07:00
Dimitrii Voronin
0dd45f0bcd Merge pull request #626 from b3by/feature/process_chunks_in_seconds
Use second coordinates for audio concatenation in collect_chunks and drop_chunks
2025-03-24 19:02:56 +03:00
Dimitrii Voronin
feba8cd5c4 Merge pull request #627 from b3by/feature/time_coordinates_resolution
Specify time resolution when returning speech coordinates in seconds
2025-03-24 18:59:25 +03:00
Antonio Bevilacqua
6622e562e4 time resolution can be specified when coordinates are returned in seconds 2025-03-24 08:53:28 +01:00
Antonio Bevilacqua
d5625d5c38 added audio concatenation for collect_chunks and drop_chunks based on second coordinates 2025-03-21 13:06:59 +01:00
Alexander Veysov
cd92290a15 Merge pull request #605 from OJRYK/fix/cpp-vad-context
Fix/cpp vad context
2025-02-17 11:01:04 +03:00
Ojuro Yokoyama
33a9d190fe Update wav.h 2025-02-17 16:03:42 +09:00
Ojuro Yokoyama
7440bc4689 Update silero-vad-onnx.cpp
I fixed bug of silero-vad-onnx.cpp
2025-02-17 16:02:24 +09:00
Alexander Veysov
10e7e8a8bc Merge pull request #601 from kiwamizamurai/master
Add CITATION.cff file for proper citation
2025-02-11 08:42:10 +03:00
きわみざむらい
5a5b662496 Create CITATION.cff 2025-02-11 08:54:16 +09:00
Alexander Veysov
9060f664f2 Merge pull request #591 from qwbarch/master
Add haskell example
2024-12-26 19:05:13 +03:00
qwbarch
94271e9096 Add haskell example 2024-12-26 11:18:10 -05:00
Dimitrii Voronin
3f9fffc261 Merge pull request #581 from snakers4/adamnsandle
fx negative ths bug
2024-11-25 16:55:38 +03:00
adamnsandle
eaf633ec9d fx negative ths bug 2024-11-25 13:54:46 +00:00
Alexander Veysov
cff5eb2980 Merge pull request #578 from NathanJHLee/add-torch-cpp
Add cpp source based on libtorch
2024-11-22 11:26:49 +03:00
Dimitrii Voronin
f356a8081a Merge pull request #579 from snakers4/adamnsandle
fx https://github.com/snakers4/silero-vad/issues/576
2024-11-22 11:18:26 +03:00
adamnsandle
782e30d28f fx https://github.com/snakers4/silero-vad/issues/576 2024-11-22 08:17:25 +00:00
Nathan Lee
caee535cf6 ReadMe v4 2024-11-22 06:48:27 +00:00
Nathan Lee
8ab5be005f ReadMe v3 2024-11-22 06:46:28 +00:00
Nathan Lee
9f67a54e87 ReadMe v2 2024-11-22 06:42:20 +00:00
Nathan Lee
c8df1dee3f modified Readme 2024-11-22 06:35:16 +00:00
Nathan Lee
0189ebd8af Changed some source. 2024-11-22 06:21:49 +00:00
Nathan Lee
05e380c1de add c++ inference based on libtorch 2024-11-22 00:10:13 +00:00
Alexander Veysov
93b9782f28 Merge pull request #573 from snakers4/adamnsandle
Adamnsandle
2024-11-13 12:32:55 +03:00
adamnsandle
d2ab7c254e add just 16k model 2024-11-13 08:53:27 +00:00
adamnsandle
6217b08bbb add other opsets 2024-11-12 08:25:06 +00:00
adamnsandle
d53ba1ea11 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2024-11-12 08:19:54 +00:00
Alexander Veysov
102e6d0962 Add downloads shield 2024-11-07 14:40:33 +03:00
Alexander Veysov
e531cd3462 Update README.md 2024-10-21 10:22:02 +03:00
Alexander Veysov
fd41da0b15 Merge pull request #553 from EarningsCall/master
Improve documentation.
2024-10-12 18:25:46 +03:00
EarningsCall
9db72c35bd Update README.md
update again
2024-10-12 09:23:29 -05:00
EarningsCall
867a067bee Update README.md
I assume most people want seconds, so it's useful to show example to return seconds in README file.
2024-10-12 09:22:39 -05:00
Alexander Veysov
2c43391b17 Update README.md 2024-10-09 12:56:22 +03:00
Alexander Veysov
6478567951 Update pyproject.toml 2024-10-09 12:49:27 +03:00
adamnsandle
add6e3028e Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2024-10-09 09:48:51 +00:00
adamnsandle
e7025ed8c5 5.1.1 tag 2024-10-09 09:48:37 +00:00
Alexander Veysov
35d601adc6 Update pyproject.toml 2024-10-09 12:47:08 +03:00
Dimitrii Voronin
032ca21a70 Merge pull request #549 from snakers4/adamnsandle
Adamnsandle
2024-10-09 12:32:09 +03:00
adamnsandle
001d57d6ff fx dependencies 2024-10-09 09:26:39 +00:00
adamnsandle
6e6da04e7a fix pyaudio streaming example 2024-10-09 08:49:39 +00:00
Alexander Veysov
9c1eff9169 Delete files/real_time_example.mp4 2024-10-09 10:10:03 +03:00
Alexander Veysov
36b759d053 Add files via upload 2024-10-09 10:02:04 +03:00
Dimitrii Voronin
1a7499607a Merge pull request #543 from snakers4/adamnsandle
Adamnsandle
2024-09-24 15:19:30 +03:00
Alexander Veysov
87451b059f Update README.md 2024-09-24 15:16:18 +03:00
Alexander Veysov
becc7770c7 Update README.md 2024-09-24 15:15:10 +03:00
Alexander Veysov
3f2eff0303 Merge pull request #542 from snakers4/snakers4-patch-1
Update README.md
2024-09-24 15:14:18 +03:00
Alexander Veysov
3a25110cf9 Update README.md 2024-09-24 15:13:34 +03:00
adamnsandle
d23867da10 fx parallel example 2024-09-24 12:03:07 +00:00
adamnsandle
2043282182 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2024-09-24 12:02:00 +00:00
adamnsandle
fa8036ae1c fx old examples 2024-09-24 12:01:47 +00:00
Dimitrii Voronin
2fff4b8ce8 Merge pull request #541 from snakers4/adamnsandle-1
Update README.md
2024-09-24 14:48:51 +03:00
Dimitrii Voronin
64b863d2ff Update README.md 2024-09-24 14:48:35 +03:00
Dimitrii Voronin
8a3600665b Merge pull request #540 from snakers4/adamnsandle-patch-2
Update README.md
2024-09-24 13:45:31 +03:00
Dimitrii Voronin
9c2c90aa1c Update README.md 2024-09-24 13:45:16 +03:00
Dimitrii Voronin
1d48167271 Merge pull request #539 from gengyuchao/update/python_pyaudio_example
Fixed the pyaudio example can not run issue.
2024-09-11 12:27:15 +03:00
GengYuchao
d0139d94d9 Fixed the pyaudio example can not run issue.
Update the related packages.
2024-09-11 00:45:49 +08:00
Dimitrii Voronin
46f94b7d60 Merge pull request #529 from snakers4/adamnsandle
Adamnsandle
2024-08-22 17:31:42 +03:00
adamnsandle
3de3ee3abe Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2024-08-22 14:30:27 +00:00
adamnsandle
e680ea6633 add half onnx model 2024-08-22 14:30:13 +00:00
Dimitrii Voronin
199de226e5 Merge pull request #528 from snakers4/adamnsandle
add neg_threshold parameter explicitly
2024-08-22 16:39:33 +03:00
adamnsandle
4109b107c1 add neg_threshold parameter explicitly 2024-08-20 08:53:15 +00:00
Alexander Veysov
36854a90db Merge pull request #526 from snakers4/adamnsandle
код для тюнинга
2024-08-19 20:01:21 +03:00
adamnsandle
827e86e685 добавлен поиск порогов 2024-08-19 16:53:28 +00:00
Dimitrii Voronin
e706ec6fee Update README.md 2024-08-19 18:31:11 +03:00
adamnsandle
88df0ce1dd код для тюнинга 2024-08-19 14:36:45 +00:00
Dimitrii Voronin
d18b91e037 Merge pull request #521 from snakers4/adamnsandle
downgrade onnxruntime dependency
2024-08-09 14:23:16 +03:00
adamnsandle
1e3f343767 downgrade onnxruntime dependency 2024-08-09 11:15:22 +00:00
Alexander Veysov
6a8ee81ee0 Merge pull request #507 from nganju98/master
add csharp example
2024-07-21 09:03:38 +03:00
nick.ganju
cb25c0c047 add csharp example 2024-07-20 22:59:18 -04:00
Alexander Veysov
7af8628a27 Merge pull request #506 from yuguanqin/master
Add java example for wav file & support V5 model
2024-07-18 07:34:40 +03:00
yuguanqin
3682cb189c java example for whole wav file & compatible with V5 model 2024-07-18 10:34:02 +08:00
Dimitrii Voronin
57c0b51f9b Merge pull request #505 from snakers4/adamnsandle
VadIterator first chunk bag fx
2024-07-15 13:42:36 +03:00
adamnsandle
dd0b143803 VadIterator first chunk bag fx 2024-07-15 10:37:46 +00:00
Alexander Veysov
181cdf92b6 Merge pull request #497 from rumbleFTW/fix/rust-example-v5
fix: rust example for v5 checkpoint
2024-07-11 17:48:58 +03:00
rumbleFTW
a7bd2dd38f fix: rust example 2024-07-11 20:06:54 +05:30
Alexander Veysov
df7de797a5 Merge pull request #496 from streamer45/update-golang-example
Fix Golang example
2024-07-10 21:31:15 +03:00
streamer45
87ed11b508 Fix Golang example 2024-07-10 20:26:41 +02:00
Alexander Veysov
84768cefdf Merge pull request #493 from snakers4/adamnsandle
Adamnsandle
2024-07-09 16:16:40 +03:00
adamnsandle
6de3660f25 fx version 2024-07-09 10:27:00 +00:00
adamnsandle
d9a6941852 add pip examples to collab 2024-07-09 10:20:50 +00:00
adamnsandle
dfdc9a484e Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2024-07-09 09:51:42 +00:00
adamnsandle
f2e3a23d96 fx version 2024-07-09 09:45:10 +00:00
Dimitrii Voronin
2b97f61160 Merge pull request #492 from snakers4/adamnsandle-patch-1
Create python-publish.yml
2024-07-09 12:42:23 +03:00
Dimitrii Voronin
e8850d2b9b Create python-publish.yml 2024-07-09 12:41:49 +03:00
adamnsandle
657dac8736 add pyproject.toml 2024-07-09 09:31:18 +00:00
Dimitrii Voronin
412a478e29 Update README.md 2024-07-09 12:25:06 +03:00
adamnsandle
9adf6d2192 add abs import path 2024-07-09 09:06:05 +00:00
adamnsandle
8a2a73c14f fx package import 2024-07-09 09:02:33 +00:00
adamnsandle
3e0305559d fx hubconf 2024-07-09 08:32:18 +00:00
adamnsandle
f0d880d79c make package structure 2024-07-09 08:26:17 +00:00
Dimitrii Voronin
3888946c0c Merge pull request #489 from streamer45/update-golang-example
Update Golang example to support model v5
2024-07-08 09:03:12 +03:00
streamer45
24f51645d0 Update to support model v5 2024-07-08 07:43:42 +02:00
Dimitrii Voronin
fdbb0a3a81 Merge pull request #482 from filtercodes/v5_cpp_support
cpp example
2024-07-01 19:17:44 +03:00
Stefan Miletic
60ae7abfb7 v5 model cpp example 2024-07-01 15:32:40 +01:00
Stefan Miletic
0b3d43d432 cpp example v5 model 2024-07-01 15:04:48 +01:00
Dimitrii Voronin
a395853982 Merge pull request #475 from eltociear/patch-1
Update microphone_and_webRTC_integration.py
2024-07-01 12:09:08 +03:00
Dimitrii Voronin
78958b6fb6 Merge pull request #481 from snakers4/adamnsandle
Adamnsandle
2024-07-01 12:02:50 +03:00
adamnsandle
902cfc9248 fx dtype bug 2024-07-01 09:00:59 +00:00
adamnsandle
89e66a3474 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2024-07-01 08:54:27 +00:00
Alexander Veysov
a3bdebed16 Update README.md 2024-07-01 10:21:20 +03:00
Ikko Eltociear Ashimine
4bdcf31d17 Update microphone_and_webRTC_integration.py
nubmer -> number
2024-06-30 02:10:59 +09:00
adamnsandle
136cdcdf5b tst 2024-06-28 14:13:18 +00:00
Alexander Veysov
5cd2ba54db Update hubconf.py 2024-06-27 22:58:52 +03:00
Alexander Veysov
06b9e17c1e Update hubconf.py 2024-06-27 22:56:37 +03:00
Alexander Veysov
aace7e25b1 Update README.md 2024-06-27 22:25:19 +03:00
Alexander Veysov
74c3f7f3fb Remove old unused utils 2024-06-27 22:08:15 +03:00
Alexander Veysov
d77d0fd42c Merge pull request #468 from snakers4/adamnsandle
Adamnsandle
2024-06-27 21:03:38 +03:00
Dimitrii Voronin
49b421a9cd Update README.md 2024-06-27 19:12:27 +03:00
adamnsandle
fd1f1a62b7 v5 initial push 2024-06-27 15:41:20 +00:00
adamnsandle
8145ed9a91 Merge branch 'master' of github.com:snakers4/silero-vad 2024-06-25 08:13:21 +00:00
Dimitrii Voronin
4392725328 Merge pull request #467 from gau-nernst/fix_grad
Replace `torch.set_grad_enabled(False)` with `torch.no_grad()`
2024-06-21 16:31:27 +03:00
Thien Tran
8b0566682b use torch.no_grad() 2024-06-21 21:10:48 +08:00
Alexander Veysov
82342b8a4c Merge pull request #457 from akmitrich/rust-example
Rust example
2024-05-30 13:25:18 +03:00
Alexander Kalashnikov
4b8ce743a8 Rust example 2024-05-30 12:08:47 +03:00
Alexander Veysov
b4b6f2ab3e Update README.md 2024-04-02 07:18:09 +03:00
Alexander Veysov
5b02d84a4a Update README.md 2024-03-26 22:28:38 +03:00
Alexander Veysov
60465a7e61 Update README.md 2024-03-26 22:21:43 +03:00
Dimitrii Voronin
fcef5b3955 Merge pull request #435 from snakers4/adamnsandle-patch-1
Update README.md
2024-03-26 21:37:22 +03:00
Dimitrii Voronin
156436762f Update README.md 2024-03-26 21:37:06 +03:00
Dimitrii Voronin
a27b176e45 Update README.md 2024-03-26 21:24:06 +03:00
Alexander Veysov
8125483ef7 Merge pull request #433 from snakers4/snakers4-patch-3
Update README.md
2024-03-26 21:23:27 +03:00
Alexander Veysov
6969dcc2dc Update README.md 2024-03-26 21:23:16 +03:00
Alexander Veysov
6c816a05f0 Merge pull request #432 from snakers4/snakers4-patch-2
Update README.md
2024-03-26 21:20:03 +03:00
Alexander Veysov
9dc344df7f Update README.md 2024-03-26 21:19:40 +03:00
Dimitrii Voronin
41c5172dd9 Update README.md 2024-03-26 21:17:35 +03:00
Alexander Veysov
894ea259f9 Merge pull request #431 from snakers4/adamnsandle
add open datasets' annotation
2024-03-26 21:13:58 +03:00
adamnsandle
f56f56ffaa add open datasets' annotation 2024-03-26 18:06:11 +00:00
Alexander Veysov
6c8d844710 Merge pull request #424 from yairl/master
Support both sox and sox_io backends for in-place audio resampling.
2024-02-16 14:08:40 +03:00
Yair Lifshitz
d8cc947c73 Support both sox and sox_io backends for in-place audio resampling. 2024-02-16 05:58:09 -05:00
Alexander Veysov
797a88a386 Update utils_vad.py 2024-02-16 10:55:47 +03:00
Alexander Veysov
48b7c742dd Merge pull request #423 from snakers4/snakers4-patch-1
Update utils_vad.py
2024-02-16 10:53:21 +03:00
Alexander Veysov
c5ec6bae3d Update utils_vad.py 2024-02-16 10:53:12 +03:00
Alexander Veysov
af152c18f6 Merge pull request #421 from yairl/master
Perform in-place resampling during read_audio.
2024-02-16 09:49:28 +03:00
Yair Lifshitz
d391f4c302 Use SoX when possible for loading a file with in-place resampling, ffmpeg otherwise. 2024-02-15 12:25:22 -05:00
Yair Lifshitz
bf18ea6b56 Perform in-place resampling during read_audio. 2024-02-14 17:11:57 -05:00
Alexander Veysov
a65732a393 Merge pull request #417 from abinthomasonline/offset-bug-fix
fix window size offset bug
2024-01-22 09:15:44 +03:00
Abin Thomas
aae1e4f40d fix window size offset bug 2024-01-22 11:08:24 +05:30
Alexander Veysov
94504ece54 Merge pull request #407 from bygreencn/master
Fix a bug at c sample code and some bugs at wav.h.
2023-12-17 20:26:40 +03:00
bygreencn
0b7da6e74b Fix wav functions:
1. fix data_size is not correct and be 0.
2. detect data format of IEEE-float.
3. add PCMS8bit, PCMS16bit and PCMS32 convert to float 32bit at class WavReader.
2023-12-17 23:00:12 +08:00
bygreencn
efb5effc8f Fix a bug in c code sample when only one timestamp and start from 0. 2023-12-17 23:00:12 +08:00
Alexander Veysov
03dc3fae5c Merge pull request #406 from bygreencn/master
Make the c code sample have the same function as the python code
2023-12-15 16:46:58 +03:00
bygreencn
4a6d1701a4 Make the c code sample have the same function as the python code 2023-12-15 21:32:22 +08:00
Alexander Veysov
5e7ee10ee0 Merge pull request #392 from xiaoqiang306/fix_example_cpp
fix int16_t bytes normalized to float
2023-11-13 15:20:08 +03:00
jiqiang.fu
03fb810fab fix int16_t bytes normalized to float 2023-11-13 17:14:34 +08:00
Alexander Veysov
e30a7e32a9 Merge pull request #391 from streamer45/golang-example
Implement Golang example
2023-11-09 08:33:20 +03:00
streamer45
bbbc657dad Golang example 2023-11-08 17:48:36 -06:00
Alexander Veysov
cb92cdd1e3 Merge pull request #386 from VvvvvGH/master
add java onnx inference example
2023-10-18 09:29:04 +03:00
VvvvvGH
3780baf49f add java onnx example 2023-10-18 13:57:18 +08:00
Alexander Veysov
563106ef8c Merge pull request #350 from archive-r/archive-r-fix-typo
Fix typo
2023-06-16 09:45:49 +03:00
kh
f795bc479b Fix typo
https://github.com/snakers4/silero-vad/discussions/319#discussion-5081706
2023-06-16 11:47:59 +09:00
Alexander Veysov
7e9680bc83 Merge pull request #342 from AlexRainHao/master
fix #341 issue of cpp example coding mistake
2023-05-18 15:17:36 +03:00
AlexRainHao
3b4c02dfe3 fix #341 issue of cpp example coding mistake 2023-05-18 20:12:03 +08:00
Alexander Veysov
bc5a0a2dbf Merge pull request #340 from chenqianhe/master
fix speech and silence state transition
2023-05-12 12:03:50 +03:00
Qianhe Chen
b03fcb2ebe fix speech and silence state transition 2023-05-12 16:59:51 +08:00
Dimitrii Voronin
026bc3d292 Merge pull request #329 from mhThomsen/master
(Bug fix) Slices in correct dimension (audio dim), so batch size is not reduced
2023-04-28 15:00:12 +03:00
Dimitrii Voronin
e755baa3c2 Merge branch 'master' into master 2023-04-28 15:00:02 +03:00
Dimitrii Voronin
b88084c7ed Merge pull request #332 from snakers4/adamnsandle
fx https://github.com/snakers4/silero-vad/pull/329 bug
2023-04-28 14:54:44 +03:00
adamnsandle
a9d2b591de fx https://github.com/snakers4/silero-vad/pull/329 bug 2023-04-28 11:48:01 +00:00
Dimitrii Voronin
c3c67cdcb8 Merge pull request #330 from snakers4/adamnsandle
del deprecated models examples
2023-04-27 15:05:45 +03:00
adamnsandle
874c66ccbc del deprecated models examples 2023-04-27 12:00:01 +00:00
Alexander Veysov
51fbbcb32e Update README.md 2023-04-27 13:33:03 +03:00
Alexander Veysov
14a0715955 Deprecate lang detector and number detector models 2023-04-27 13:22:00 +03:00
Alexander Veysov
a0d26769e0 Update README.md 2023-04-27 13:19:39 +03:00
mhThomsen
e0c2015193 slices in correct dimension (audio dim), so batch size is not reduced 2023-04-27 07:37:54 +02:00
Alexander Veysov
5872cffd78 Update README.md 2023-03-29 21:56:23 +03:00
Dimitrii Voronin
86400b9a12 Merge pull request #313 from snakers4/adamnsandle
change int2float
2023-03-28 17:31:27 +03:00
adamnsandle
6ef43d1c5d change int2float 2023-03-28 14:30:35 +00:00
Alexander Veysov
540e092276 Merge pull request #308 from zzzacwork/example-parallelization
add a simple parallelization example
2023-03-13 08:18:23 +03:00
Ziyuan Wang
55c41abf46 use a process specific copy of model 2023-03-10 16:09:07 +00:00
Ziyuan Wang
17903cb41d add a initializer 2023-03-09 16:27:36 +00:00
Ziyuan Wang
c39dccc1fd add a initializer 2023-03-09 16:27:18 +00:00
Ziyuan Wang
a6a067de44 add a initializer 2023-03-09 16:16:51 +00:00
Ziyuan Wang
9865b3cb93 fix typos 2023-03-09 04:26:15 +00:00
Ziyuan Wang
3d10c2d950 add parallel example 2023-03-09 04:17:34 +00:00
Dimitrii Voronin
4f57fae3fa Merge pull request #289 from Tomiinek/patch-1
Fixing ONNX init
2023-02-28 15:43:18 +02:00
Tomáš Nekvinda
085d76f08e Update utils_vad.py 2023-02-09 11:13:02 +01:00
Dimitrii Voronin
262bcb4b40 Merge pull request #285 from snakers4/adamnsandle
fx versiontuple bug
2023-01-09 11:09:32 +02:00
adamnsandle
e84eca68d7 fx versiontuple bug 2023-01-09 09:07:59 +00:00
Dimitrii Voronin
e7c4539106 Merge pull request #280 from bclark-videra/test/hubconf-with-ref
Use the path to hubconf.py to find models
2022-12-29 14:02:09 +02:00
Dimitrii Voronin
a480e85aec Merge pull request #282 from saenyakorn/master
Add `progress_tracking_callback` argument to `get_speech_timestamps` function
2022-12-29 13:31:55 +02:00
Saenyakorn Siangsanoh
c69cb6c9c0 fix progress logic 2022-12-28 14:40:55 +07:00
Saenyakorn Siangsanoh
11da69d88b add progress_tracking callback to get_speech_timestamps 2022-12-28 14:18:42 +07:00
Byron Clark
df1d52042d Use the path to hubconf.py to find models
While `torch.hub.load` uses predictable names when downloading from github,
the name referenced only worked when using the `master` branch of the
repo. Using a ref like `snakers4/silero_vad:v4.0` results in a
different directory name and the model files not being found.

Since the model files are stored in a path relative to hubconf.py, use
the path of hubconf.py to find the models instead of assuming the
directory `torch.hub.load` will extract the files to.
2022-12-22 14:47:14 -07:00
Alexander Veysov
d5a944b9f1 Merge pull request #279 from pengzhendong/patch-1
fix sample rate of onnx input
2022-12-21 06:00:10 +03:00
彭震东
d90416e63e fix sample rate of onnx input 2022-12-21 08:36:52 +08:00
Alexander Veysov
91f0aaecef Merge pull request #277 from yuGAN6/yugan6
Yugan6
2022-12-11 16:27:48 +03:00
yuGAN6
015bfc8b21 Update README.md 2022-12-11 21:14:48 +08:00
yuGAN6
5d56b1ea40 Update README.md 2022-12-11 21:13:46 +08:00
yuGAN6
ff3c596cab Update README.md 2022-12-11 21:09:57 +08:00
yuGAN6
63e1be5a22 Update README.md 2022-12-11 21:08:31 +08:00
yuGAN6
1d8f8f38db Move to example 2022-12-11 21:05:11 +08:00
yuGAN6
7198087152 Move into examples 2022-12-11 13:06:21 +08:00
Alexander Veysov
ad57d17f5f Update README.md 2022-12-11 05:57:19 +03:00
yuGAN6
04e87c208a Move directory 2022-12-10 22:50:14 +08:00
yuGAN6
c583fd1e52 Add c++ onnxruntime example 2022-12-10 22:29:34 +08:00
Dimitrii Voronin
5814e548db Merge pull request #270 from kafan1986/master
Resolves Windows inference issue INVALID_ARGUMENT : Unexpected input data type.
2022-12-06 11:38:41 +02:00
Dimitrii Voronin
42565d5baa Merge pull request #271 from b-med/max_speech_duration_v4
Max speech duration + bits per sample
2022-12-06 11:29:15 +02:00
Mohamed Bouaziz
ab7af9745b delete commented lines 2022-11-17 17:52:34 +01:00
Mohamed Bouaziz
83e68c56ea Merge branch 'master' of https://github.com/snakers4/silero-vad into max_speech_duration_v4 2022-11-17 17:32:22 +01:00
kafan1986
d3882c9ebf Merge pull request #1 from kafan1986/kafan1986-patch-1
Solves data type mismatch issue on windows
2022-11-11 23:50:20 +05:30
kafan1986
25f04dda35 Solves data type mismatch issue on windows
Solves windows 11 error regarding mismatched data type. Expected data type is int64 and actual data type is int32
2022-11-11 23:46:51 +05:30
Mohamed Bouaziz
94b4c21874 utils_vad save_audio force bits_per_sample=16 2022-11-04 21:43:28 +01:00
Mohamed Bouaziz
324bc74a58 utils_vad max duration 2022-11-04 21:42:22 +01:00
Dimitrii Voronin
82d199ff22 Merge pull request #256 from snakers4/adamnsandle
Adamnsandle
2022-10-28 13:57:10 +03:00
adamnsandle
5ba388d894 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2022-10-28 10:55:59 +00:00
adamnsandle
790844ba0f revert to exception 2022-10-28 10:55:46 +00:00
Dimitrii Voronin
51b5245410 Merge pull request #255 from snakers4/adamnsandle
Adamnsandle
2022-10-28 13:33:18 +03:00
adamnsandle
888970e77d Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2022-10-28 10:32:08 +00:00
adamnsandle
cb6d308335 fx 2022-10-28 10:31:55 +00:00
adamnsandle
1b212c6e95 change exception to warning 2022-10-28 10:26:07 +00:00
adamnsandle
452060ad65 fx 2022-10-28 10:13:00 +00:00
Dimitrii Voronin
c7eab751b5 Merge pull request #253 from snakers4/adamnsandle
add torch version check
2022-10-28 13:09:18 +03:00
adamnsandle
d1714a9ff7 add torch version check 2022-10-28 10:08:07 +00:00
Dimitrii Voronin
94c79d899d Merge pull request #251 from snakers4/adamnsandle
v4 hotfix
2022-10-27 20:26:31 +03:00
adamnsandle
1baf307b35 v4 hotfix 2022-10-27 17:25:31 +00:00
Dimitrii Voronin
e324285cdc Merge pull request #247 from snakers4/adamnsandle
Adamnsandle
2022-10-26 19:17:44 +03:00
adamnsandle
13dce2d067 Merge branch 'MASTER' into adamnsandle 2022-10-26 16:13:37 +00:00
adamnsandle
081e6b9886 VAD v4 2022-10-26 16:10:20 +00:00
Alexander Veysov
572134fdf1 Update README.md 2022-10-25 05:52:53 +03:00
Dimitrii Voronin
a799dea837 Merge pull request #244 from owlsometech-kenyang/feature/support-force-onnx-cpu
Suggesting a new kwarg: force_onnx_cpu
2022-10-14 11:53:46 +03:00
ChiehKai Yang
17209e6c4f add new parameter: force_onnx_cpu 2022-10-12 01:56:43 +08:00
adamnsandle
6661cc9691 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2022-06-02 10:41:54 +00:00
Dimitrii Voronin
7c671a75c2 Merge pull request #199 from snakers4/adamnsandle
fx end of chunk may exceed audio length
2022-06-02 13:40:42 +03:00
adamnsandle
622016e672 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2022-06-02 10:40:11 +00:00
adamnsandle
8eba346bc9 fx end of chunk may exceed audio length 2022-06-02 10:39:16 +00:00
Dimitrii Voronin
900c71a109 Merge pull request #198 from snakers4/adamnsandle
fx get_speech ts start of an audio chunk pad
2022-06-02 13:33:36 +03:00
adamnsandle
bf0127e016 fx get_speech ts start of an audio chunk pad 2022-06-02 10:32:32 +00:00
Dimitrii Voronin
ea7af70fe9 Merge pull request #182 from snakers4/adamnsandle
Adamnsandle
2022-04-05 14:36:00 +03:00
adamnsandle
8cdc8d36c9 fx 2022-04-05 11:35:23 +00:00
adamnsandle
6e9fd77500 fx stram imitation example bug 2022-04-05 11:33:34 +00:00
Alexander Veysov
6cc08b1077 Merge pull request #170 from gabrielziegler3/169-fix-min-speech-duration-bug
Fix #169
2022-02-10 12:18:23 +03:00
Gabriel Ziegler
0e8e080894 Remove unnecessary if statement 2022-02-09 19:22:04 -03:00
Gabriel Ziegler
af6931d1de Fix bug where min_speech_duration_ms is not checked in the last speech segment
Signed-off-by: Gabriel Ziegler <gabrielziegler3@gmail.com>
2022-02-09 19:18:48 -03:00
Alexander Veysov
76687cbe25 Update README.md 2021-12-21 14:43:36 +03:00
Dimitrii Voronin
b2329fa5f2 Merge pull request #144 from snakers4/adamnsandle
Update README.md
2021-12-21 14:25:56 +03:00
Dimitrii Voronin
005886e7eb Update README.md 2021-12-21 13:25:14 +02:00
Dimitrii Voronin
f6b1294cb2 Merge pull request #143 from snakers4/adamnsandle
Adamnsandle
2021-12-21 14:02:25 +03:00
adamnsandle
2392ea33f4 Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle 2021-12-21 11:01:25 +00:00
adamnsandle
45d72863b6 add multiple of 16k sr support 2021-12-21 11:01:07 +00:00
Alexander Veysov
f40cc128a4 Update utils_vad.py 2021-12-21 08:24:48 +03:00
Alexander Veysov
0d61e4cee1 Update README.md 2021-12-17 22:03:49 +03:00
Alexander Veysov
011268e492 Polish the copy a bit 2021-12-17 22:00:36 +03:00
97 changed files with 8110 additions and 849 deletions

40
.github/workflows/python-publish.yml vendored Normal file
View File

@@ -0,0 +1,40 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.
name: Upload Python Package
on:
push:
tags:
- '*'
permissions:
contents: read
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build
- name: Build package
run: python -m build
- name: Publish package
uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
with:
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}

40
.github/workflows/test.yml vendored Normal file
View File

@@ -0,0 +1,40 @@
name: Test Package
on:
workflow_dispatch: # запуск вручную
jobs:
test:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
python-version: ["3.8","3.9","3.10","3.11","3.12","3.13"]
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install build hatchling pytest soundfile
pip install .[test]
- name: Build package
run: python -m build --wheel --outdir dist
- name: Install package
run: |
import glob, subprocess, sys
whl = glob.glob("dist/*.whl")[0]
subprocess.check_call([sys.executable, "-m", "pip", "install", whl])
shell: python
- name: Run tests
run: pytest tests

20
CITATION.cff Normal file
View File

@@ -0,0 +1,20 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Silero VAD"
authors:
- family-names: "Silero Team"
email: "hello@silero.ai"
type: software
repository-code: "https://github.com/snakers4/silero-vad"
license: MIT
abstract: "Pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier"
preferred-citation:
type: software
authors:
- family-names: "Silero Team"
email: "hello@silero.ai"
title: "Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier"
year: 2024
publisher: "GitHub"
journal: "GitHub repository"
howpublished: "https://github.com/snakers4/silero-vad"

114
README.md
View File

@@ -1,6 +1,6 @@
[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/silero_speech) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE)
[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/silero_speech) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE) [![downloads](https://img.shields.io/pypi/dm/silero-vad?style=for-the-badge)](https://pypi.org/project/silero-vad/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) [![Test Package](https://github.com/snakers4/silero-vad/actions/workflows/test.yml/badge.svg)](https://github.com/snakers4/silero-vad/actions/workflows/test.yml) [![Pypi version](https://img.shields.io/pypi/v/silero-vad)](https://pypi.org/project/silero-vad/) [![Python version](https://img.shields.io/pypi/pyversions/silero-vad)](https://pypi.org/project/silero-vad)
![header](https://user-images.githubusercontent.com/12515440/89997349-b3523080-dc94-11ea-9906-ca2e8bc50535.png)
@@ -10,50 +10,120 @@
**Silero VAD** - pre-trained enterprise-grade [Voice Activity Detector](https://en.wikipedia.org/wiki/Voice_activity_detection) (also see our [STT models](https://github.com/snakers4/silero-models)).
This repository also includes Number Detector and Language classifier [models](https://github.com/snakers4/silero-vad/wiki/Other-Models)
<br/>
<p align="center">
<img src="https://user-images.githubusercontent.com/36505480/145563071-681b57e3-06b5-4cd0-bdee-e2ade3d50a60.png" />
<img src="https://github.com/user-attachments/assets/f2940867-0a51-4bdb-8c14-1129d3c44e64" />
</p>
<details>
<summary>Real Time Example</summary>
https://user-images.githubusercontent.com/36505480/144874384-95f80f6d-a4f1-42cc-9be7-004c891dd481.mp4
Please note, that video loads only if you are logged in your GitHub account.
</details>
<br/>
<h2 align="center">Fast start</h2>
<br/>
<details>
<summary>Dependencies</summary>
System requirements to run python examples on `x86-64` systems:
- `python 3.8+`;
- 1G+ RAM;
- A modern CPU with AVX, AVX2, AVX-512 or AMX instruction sets.
Dependencies:
- `torch>=1.12.0`;
- `torchaudio>=0.12.0` (for I/O only);
- `onnxruntime>=1.16.1` (for ONNX model usage).
Silero VAD uses torchaudio library for audio I/O (`torchaudio.info`, `torchaudio.load`, and `torchaudio.save`), so a proper audio backend is required:
- Option №1 - [**FFmpeg**](https://www.ffmpeg.org/) backend. `conda install -c conda-forge 'ffmpeg<7'`;
- Option №2 - [**sox_io**](https://pypi.org/project/sox/) backend. `apt-get install sox`, TorchAudio is tested on libsox 14.4.2;
- Option №3 - [**soundfile**](https://pypi.org/project/soundfile/) backend. `pip install soundfile`.
If you are planning to run the VAD using solely the `onnx-runtime`, it will run on any other system architectures where onnx-runtume is [supported](https://onnxruntime.ai/getting-started). In this case please note that:
- You will have to implement the I/O;
- You will have to adapt the existing wrappers / examples / post-processing for your use-case.
</details>
**Using pip**:
`pip install silero-vad`
```python3
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
model = load_silero_vad()
wav = read_audio('path_to_audio_file')
speech_timestamps = get_speech_timestamps(
wav,
model,
return_seconds=True, # Return speech timestamps in seconds (default is samples)
)
```
**Using torch.hub**:
```python3
import torch
torch.set_num_threads(1)
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utils
wav = read_audio('path_to_audio_file')
speech_timestamps = get_speech_timestamps(
wav,
model,
return_seconds=True, # Return speech timestamps in seconds (default is samples)
)
```
<br/>
<h2 align="center">Key Features</h2>
<br/>
- **High accuracy**
- **Stellar accuracy**
Silero VAD has [excellent results](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#vs-other-available-solutions) on speech detection tasks.
- **Fast**
One audio chunk (30+ ms) [takes](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics#silero-vad-performance-metrics) around **1ms** to be processed on a single CPU thread. Using batching or GPU can also improve performance considerably.
One audio chunk (30+ ms) [takes](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics#silero-vad-performance-metrics) less than **1ms** to be processed on a single CPU thread. Using batching or GPU can also improve performance considerably. Under certain conditions ONNX may even run up to 4-5x faster.
- **Lightweight**
JIT model is less than one megabyte in size.
JIT model is around two megabytes in size.
- **General**
Silero VAD was trained on huge corpora that include over **100** languages and it performs well on audios from different domains with various background noise and quality levels.
Silero VAD was trained on huge corpora that include over **6000** languages and it performs well on audios from different domains with various background noise and quality levels.
- **Flexible sampling rate**
Silero VAD [supports](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#sample-rate-comparison) **8000 Hz** and **16000 Hz** (JIT) and **16000 Hz** (ONNX) [sampling rates](https://en.wikipedia.org/wiki/Sampling_(signal_processing)#Sampling_rate).
Silero VAD [supports](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#sample-rate-comparison) **8000 Hz** and **16000 Hz** [sampling rates](https://en.wikipedia.org/wiki/Sampling_(signal_processing)#Sampling_rate).
- **Flexible chunk size**
- **Highly Portable**
Model was trained on audio chunks of different lengths. **30 ms**, **60 ms** and **100 ms** long chunks are supported directly, others may work as well.
Silero VAD reaps benefits from the rich ecosystems built around **PyTorch** and **ONNX** running everywhere where these runtimes are available.
- **No Strings Attached**
Published under permissive license (MIT) Silero VAD has zero strings attached - no telemetry, no keys, no registration, no built-in expiration, no keys or vendor lock.
<br/>
<h2 align="center">Typical Use Cases</h2>
<br/>
@@ -70,9 +140,9 @@ https://user-images.githubusercontent.com/36505480/144874384-95f80f6d-a4f1-42cc-
- [Examples and Dependencies](https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#dependencies)
- [Quality Metrics](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics)
- [Performance Metrics](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics)
- Number Detector and Language classifier [models](https://github.com/snakers4/silero-vad/wiki/Other-Models)
- [Versions and Available Models](https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models)
- [Further reading](https://github.com/snakers4/silero-models#further-reading)
- [FAQ](https://github.com/snakers4/silero-vad/wiki/FAQ)
<br/>
<h2 align="center">Get In Touch</h2>
@@ -80,7 +150,7 @@ https://user-images.githubusercontent.com/36505480/144874384-95f80f6d-a4f1-42cc-
Try our models, create an [issue](https://github.com/snakers4/silero-vad/issues/new), start a [discussion](https://github.com/snakers4/silero-vad/discussions/new), join our telegram [chat](https://t.me/silero_speech), [email](mailto:hello@silero.ai) us, read our [news](https://t.me/silero_news).
Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers](https://github.com/snakers4/silero-models/wiki/Licensing-and-Tiers) for relevant information and [email](mailto:hello@silero.ai) us directly.
Please see our [wiki](https://github.com/snakers4/silero-models/wiki) for relevant information and [email](mailto:hello@silero.ai) us directly.
**Citations**
@@ -88,7 +158,7 @@ Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers
@misc{Silero VAD,
author = {Silero Team},
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
year = {2021},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad}},
@@ -96,3 +166,13 @@ Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers
email = {hello@silero.ai}
}
```
<br/>
<h2 align="center">Examples and VAD-based Community Apps</h2>
<br/>
- Example of VAD ONNX Runtime model usage in [C++](https://github.com/snakers4/silero-vad/tree/master/examples/cpp)
- Voice activity detection for the [browser](https://github.com/ricky0123/vad) using ONNX Runtime Web
- [Rust](https://github.com/snakers4/silero-vad/tree/master/examples/rust-example), [Go](https://github.com/snakers4/silero-vad/tree/master/examples/go), [Java](https://github.com/snakers4/silero-vad/tree/master/examples/java-example), [C++](https://github.com/snakers4/silero-vad/tree/master/examples/cpp), [C#](https://github.com/snakers4/silero-vad/tree/master/examples/csharp) and [other](https://github.com/snakers4/silero-vad/tree/master/examples) community examples

84
datasets/README.md Normal file
View File

@@ -0,0 +1,84 @@
# Датасет Silero-VAD
> Датасет создан при поддержке Фонда содействия инновациям в рамках федерального проекта «Искусственный
интеллект» национальной программы «Цифровая экономика Российской Федерации».
По ссылкам ниже представлены `.feather` файлы, содержащие размеченные с помощью Silero VAD открытые наборы аудиоданных, а также короткое описание каждого набора данных с примерами загрузки. `.feather` файлы можно открыть с помощью библиотеки `pandas`:
```python3
import pandas as pd
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)
```
Каждый `.feather` файл с разметкой содержит следующие колонки:
- `speech_timings` - разметка данного аудио. Это список, содержащий словари вида `{'start': START_SECOND, 'end': END_SECOND}`, где `START_SECOND` и `END_SECOND` - время начала и конца речи в секундах. Количество данных словарей равно количеству речевых аудио отрывков, найденных в данном аудио;
- `language` - ISO код языка данного аудио.
Колонки, содержащие информацию о загрузке аудио файла различаются и описаны для каждого набора данных ниже.
**Все данные размечены при временной дискретизации в ~30 миллисекунд (`num_samples` - 512)**
| Название | Число часов | Число языков | Ссылка | Лицензия | md5sum |
|----------------------|-------------|-------------|--------|----------|----------|
| **Bible.is** | 53,138 | 1,596 | [URL](https://live.bible.is/) | [Уникальная](https://live.bible.is/terms) | ea404eeaf2cd283b8223f63002be11f9 |
| **globalrecordings.net** | 9,743 | 6,171[^1] | [URL](https://globalrecordings.net/en) | CC BY-NC-SA 4.0 | 3c5c0f31b0abd9fe94ddbe8b1e2eb326 |
| **VoxLingua107** | 6,628 | 107 | [URL](https://bark.phon.ioc.ee/voxlingua107/) | CC BY 4.0 | 5dfef33b4d091b6d399cfaf3d05f2140 |
| **Common Voice** | 30,329 | 120 | [URL](https://commonvoice.mozilla.org/en/datasets) | CC0 | 5e30a85126adf74a5fd1496e6ac8695d |
| **MLS** | 50,709 | 8 | [URL](https://www.openslr.org/94/) | CC BY 4.0 | a339d0e94bdf41bba3c003756254ac4e |
| **Итого** | **150,547** | **6,171+** | | | |
## Bible.is
[Ссылка на `.feather` файл с разметкой](https://models.silero.ai/vad_datasets/BibleIs.feather)
- Колонка `audio_link` содержит ссылки на конкретные аудио файлы.
## globalrecordings.net
[Ссылка на `.feather` файл с разметкой](https://models.silero.ai/vad_datasets/globalrecordings.feather)
- Колонка `folder_link` содержит ссылки на скачивание `.zip` архива для конкретного языка. Внимание! Ссылки на архивы дублируются, т.к каждый архив может содержать множество аудио.
- Колонка `audio_path` содержит пути до конкретного аудио после распаковки соответствующего архива из колонки `folder_link`
``Количество уникальных ISO кодов данного датасета не совпадает с фактическим количеством представленных языков, т.к некоторые близкие языки могут кодироваться одним и тем же ISO кодом.``
## VoxLingua107
[Ссылка на `.feather` файл с разметкой](https://models.silero.ai/vad_datasets/VoxLingua107.feather)
- Колонка `folder_link` содержит ссылки на скачивание `.zip` архива для конкретного языка. Внимание! Ссылки на архивы дублируются, т.к каждый архив может содержать множество аудио.
- Колонка `audio_path` содержит пути до конкретного аудио после распаковки соответствующего архива из колонки `folder_link`
## Common Voice
[Ссылка на `.feather` файл с разметкой](https://models.silero.ai/vad_datasets/common_voice.feather)
Этот датасет невозможно скачать по статичным ссылкам. Для загрузки необходимо перейти по [ссылке](https://commonvoice.mozilla.org/en/datasets) и, получив доступ в соответствующей форме, скачать архивы для каждого доступного языка. Внимание! Представленная разметка актуальна для версии исходного датасета `Common Voice Corpus 16.1`.
- Колонка `audio_path` содержит уникальные названия `.mp3` файлов, полученных после скачивания соответствующего датасета.
## MLS
[Ссылка на `.feather` файл с разметкой](https://models.silero.ai/vad_datasets/MLS.feather)
- Колонка `folder_link` содержит ссылки на скачивание `.zip` архива для конкретного языка. Внимание! Ссылки на архивы дублируются, т.к каждый архив может содержать множество аудио.
- Колонка `audio_path` содержит пути до конкретного аудио после распаковки соответствующего архива из колонки `folder_link`
## Лицензия
Данный датасет распространяется под [лицензией](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) `CC BY-NC-SA 4.0`.
## Цитирование
```
@misc{Silero VAD Dataset,
author = {Silero Team},
title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad/datasets/README.md}},
email = {hello@silero.ai}
}
```
[^1]: ``Количество уникальных ISO кодов данного датасета не совпадает с фактическим количеством представленных языков, т.к некоторые близкие языки могут кодироваться одним и тем же ISO кодом.``

49
examples/c++/README.md Normal file
View File

@@ -0,0 +1,49 @@
# Silero-VAD V6 in C++ (based on LibTorch)
This is the source code for Silero-VAD V6 in C++, utilizing LibTorch & Onnxruntime.
You should compare its results with the Python version.
Results at 16 and 8kHz have been tested. Batch and CUDA inference options are deprecated.
## Requirements
- GCC 11.4.0 (GCC >= 5.1)
- Onnxruntime 1.11.0 (other versions are also acceptable)
- LibTorch 1.13.0 (other versions are also acceptable)
## Download LibTorch
```bash
-Onnxruntime
$wget https://github.com/microsoft/onnxruntime/releases/download/v1.11.1/onnxruntime-linux-x64-1.11.1.tgz
$tar -xvf onnxruntime-linux-x64-1.11.1.tgz
$ln -s onnxruntime-linux-x64-1.11.1 onnxruntime-linux #soft-link
-Libtorch
$wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-1.13.0%2Bcpu.zip
$unzip libtorch-shared-with-deps-1.13.0+cpu.zip
```
## Compilation
```bash
-ONNX-build
$g++ main.cc silero.cc -I ./onnxruntime-linux/include/ -L ./onnxruntime-linux/lib/ -lonnxruntime -Wl,-rpath,./onnxruntime-linux/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0 -DUSE_ONNX
-TORCH-build
$g++ main.cc silero.cc -I ./libtorch/include/ -I ./libtorch/include/torch/csrc/api/include -L ./libtorch/lib/ -ltorch -ltorch_cpu -lc10 -Wl,-rpath,./libtorch/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0 -DUSE_TORCH
```
## Optional Compilation Flags
-DUSE_TORCH
-DUSE_ONNX
## Run the Program
To run the program, use the following command:
`./silero <sample.wav> <SampleRate> <threshold>`
`./silero aepyx.wav 16000 0.5`
`./silero aepyx_8k.wav 8000 0.5`
The sample file aepyx.wav is part of the Voxconverse dataset.
File details: aepyx.wav is a 16kHz, 16-bit audio file.
File details: aepyx_8k.wav is a 8kHz, 16-bit audio file.

BIN
examples/c++/aepyx.wav Normal file

Binary file not shown.

BIN
examples/c++/aepyx_8k.wav Normal file

Binary file not shown.

61
examples/c++/main.cc Normal file
View File

@@ -0,0 +1,61 @@
#include <iostream>
#include "silero.h"
#include "wav.h"
int main(int argc, char* argv[]) {
if(argc != 4){
std::cerr<<"Usage : "<<argv[0]<<" <wav.path> <SampleRate> <Threshold>"<<std::endl;
std::cerr<<"Usage : "<<argv[0]<<" sample.wav 16000 0.5"<<std::endl;
return 1;
}
std::string wav_path = argv[1];
float sample_rate = std::stof(argv[2]);
float threshold = std::stof(argv[3]);
if (sample_rate != 16000 && sample_rate != 8000) {
std::cout<<"Unsupported sample rate (only 16000 or 8000)."<<std::endl;
exit (0);
}
//Load Model
#ifdef USE_TORCH
std::string model_path = "../../src/silero_vad/data/silero_vad.jit";
#elif USE_ONNX
std::string model_path = "../../src/silero_vad/data/silero_vad.onnx";
#endif
silero::VadIterator vad(model_path);
vad.threshold=threshold; //(Default:0.5)
vad.sample_rate=sample_rate; //16000Hz,8000Hz. (Default:16000)
vad.print_as_samples=false; //if true, it prints time-stamp with samples. otherwise, in seconds
//(Default:false)
vad.SetVariables();
// Read wav
wav::WavReader wav_reader(wav_path);
std::vector<float> input_wav(wav_reader.num_samples());
for (int i = 0; i < wav_reader.num_samples(); i++)
{
input_wav[i] = static_cast<float>(*(wav_reader.data() + i));
}
vad.SpeechProbs(input_wav);
std::vector<silero::Interval> speeches = vad.GetSpeechTimestamps();
for(const auto& speech : speeches){
if(vad.print_as_samples){
std::cout<<"{'start': "<<static_cast<int>(speech.start)<<", 'end': "<<static_cast<int>(speech.end)<<"}"<<std::endl;
}
else{
std::cout<<"{'start': "<<speech.start<<", 'end': "<<speech.end<<"}"<<std::endl;
}
}
return 0;
}

273
examples/c++/silero.cc Normal file
View File

@@ -0,0 +1,273 @@
// silero.cc
// Author : NathanJHLee
// Created On : 2025-11-10
// Description : silero 6.2 system for onnx-runtime(c++) and torch-script(c++)
// Version : 1.3
#include "silero.h"
namespace silero {
#ifdef USE_TORCH
VadIterator::VadIterator(const std::string &model_path,
float threshold,
int sample_rate,
int window_size_ms,
int speech_pad_ms,
int min_silence_duration_ms,
int min_speech_duration_ms,
int max_duration_merge_ms,
bool print_as_samples)
: threshold(threshold), sample_rate(sample_rate), window_size_ms(window_size_ms),
speech_pad_ms(speech_pad_ms), min_silence_duration_ms(min_silence_duration_ms),
min_speech_duration_ms(min_speech_duration_ms), max_duration_merge_ms(max_duration_merge_ms),
print_as_samples(print_as_samples)
{
init_torch_model(model_path);
}
VadIterator::~VadIterator(){
}
void VadIterator::init_torch_model(const std::string& model_path) {
at::set_num_threads(1);
model = torch::jit::load(model_path);
model.eval();
torch::NoGradGuard no_grad;
std::cout<<"Silero libtorch-Model loaded successfully"<<std::endl;
}
void VadIterator::SpeechProbs(std::vector<float>& input_wav) {
int num_samples = input_wav.size();
int num_chunks = num_samples / window_size_samples;
int remainder_samples = num_samples % window_size_samples;
total_sample_size += num_samples;
std::vector<torch::Tensor> chunks;
for (int i = 0; i < num_chunks; i++) {
float* chunk_start = input_wav.data() + i * window_size_samples;
torch::Tensor chunk = torch::from_blob(chunk_start, {1, window_size_samples}, torch::kFloat32);
chunks.push_back(chunk);
if (i == num_chunks - 1 && remainder_samples > 0) {
int remaining_samples = num_samples - num_chunks * window_size_samples;
float* chunk_start_remainder = input_wav.data() + num_chunks * window_size_samples;
torch::Tensor remainder_chunk = torch::from_blob(chunk_start_remainder, {1, remaining_samples}, torch::kFloat32);
torch::Tensor padded_chunk = torch::cat({remainder_chunk, torch::zeros({1, window_size_samples - remaining_samples}, torch::kFloat32)}, 1);
chunks.push_back(padded_chunk);
}
}
if (!chunks.empty()) {
std::vector<torch::Tensor> outputs;
torch::Tensor batched_chunks = torch::stack(chunks);
for (size_t i = 0; i < chunks.size(); i++) {
torch::NoGradGuard no_grad;
std::vector<torch::jit::IValue> inputs;
inputs.push_back(batched_chunks[i]);
inputs.push_back(sample_rate);
torch::Tensor output = model.forward(inputs).toTensor();
outputs.push_back(output);
}
torch::Tensor all_outputs = torch::stack(outputs);
for (size_t i = 0; i < chunks.size(); i++) {
float output_f = all_outputs[i].item<float>();
outputs_prob.push_back(output_f);
//////To print Probs by libtorch
//std::cout << "Chunk " << i << " prob: " << output_f<< "\n";
}
}
}
#elif USE_ONNX
VadIterator::VadIterator(const std::string &model_path,
float threshold,
int sample_rate,
int window_size_ms,
int speech_pad_ms,
int min_silence_duration_ms,
int min_speech_duration_ms,
int max_duration_merge_ms,
bool print_as_samples)
:sample_rate(sample_rate), threshold(threshold), window_size_ms(window_size_ms),
speech_pad_ms(speech_pad_ms), min_silence_duration_ms(min_silence_duration_ms),
min_speech_duration_ms(min_speech_duration_ms), max_duration_merge_ms(max_duration_merge_ms),
print_as_samples(print_as_samples),
env(ORT_LOGGING_LEVEL_ERROR, "Vad"), session_options(), session(nullptr), allocator(),
memory_info(Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeCPU)), context_samples(64),
_context(64, 0.0f), current_sample(0), size_state(2 * 1 * 128),
input_node_names({"input", "state", "sr"}), output_node_names({"output", "stateN"}),
state_node_dims{2, 1, 128}, sr_node_dims{1}
{
init_onnx_model(model_path);
}
VadIterator::~VadIterator(){
}
void VadIterator::init_onnx_model(const std::string& model_path) {
int inter_threads=1;
int intra_threads=1;
session_options.SetIntraOpNumThreads(intra_threads);
session_options.SetInterOpNumThreads(inter_threads);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
session = std::make_shared<Ort::Session>(env, model_path.c_str(), session_options);
std::cout<<"Silero onnx-Model loaded successfully"<<std::endl;
}
float VadIterator::predict(const std::vector<float>& data_chunk) {
// _context와 현재 청크를 결합하여 입력 데이터 구성
std::vector<float> new_data(effective_window_size, 0.0f);
std::copy(_context.begin(), _context.end(), new_data.begin());
std::copy(data_chunk.begin(), data_chunk.end(), new_data.begin() + context_samples);
input = new_data;
Ort::Value input_ort = Ort::Value::CreateTensor<float>(
memory_info, input.data(), input.size(), input_node_dims, 2);
Ort::Value state_ort = Ort::Value::CreateTensor<float>(
memory_info, _state.data(), _state.size(), state_node_dims, 3);
Ort::Value sr_ort = Ort::Value::CreateTensor<int64_t>(
memory_info, sr.data(), sr.size(), sr_node_dims, 1);
ort_inputs.clear();
ort_inputs.push_back(std::move(input_ort));
ort_inputs.push_back(std::move(state_ort));
ort_inputs.push_back(std::move(sr_ort));
ort_outputs = session->Run(
Ort::RunOptions{ nullptr },
input_node_names.data(), ort_inputs.data(), ort_inputs.size(),
output_node_names.data(), output_node_names.size());
float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0]; // ONNX 출력: 첫 번째 값이 음성 확률
float* stateN = ort_outputs[1].GetTensorMutableData<float>(); // 두 번째 출력값: 상태 업데이트
std::memcpy(_state.data(), stateN, size_state * sizeof(float));
std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
// _context 업데이트: new_data의 마지막 context_samples 유지
return speech_prob;
}
void VadIterator::SpeechProbs(std::vector<float>& input_wav) {
reset_states();
total_sample_size = static_cast<int>(input_wav.size());
for (size_t j = 0; j < static_cast<size_t>(total_sample_size); j += window_size_samples) {
if (j + window_size_samples > static_cast<size_t>(total_sample_size))
break;
std::vector<float> chunk(input_wav.begin() + j, input_wav.begin() + j + window_size_samples);
float speech_prob = predict(chunk);
outputs_prob.push_back(speech_prob);
}
}
#endif
void VadIterator::reset_states() {
triggered = false;
current_sample = 0;
temp_end = 0;
outputs_prob.clear();
total_sample_size = 0;
#ifdef USE_TORCH
model.run_method("reset_states"); // Reset model states if applicable
#elif USE_ONNX
std::memset(_state.data(), 0, _state.size() * sizeof(float));
std::fill(_context.begin(), _context.end(), 0.0f);
#endif
}
std::vector<Interval> VadIterator::GetSpeechTimestamps() {
std::vector<Interval> speeches = DoVad();
if(!print_as_samples){
for (auto& speech : speeches) {
speech.start /= sample_rate;
speech.end /= sample_rate;
}
}
return speeches;
}
void VadIterator::SetVariables(){
// Initialize internal engine parameters
init_engine(window_size_ms);
}
void VadIterator::init_engine(int window_size_ms) {
min_silence_samples = sample_rate * min_silence_duration_ms / 1000;
speech_pad_samples = sample_rate * speech_pad_ms / 1000;
window_size_samples = sample_rate / 1000 * window_size_ms;
min_speech_samples = sample_rate * min_speech_duration_ms / 1000;
#ifdef USE_ONNX
//for ONNX
context_samples=window_size_samples / 8;
_context.assign(context_samples, 0.0f);
effective_window_size = window_size_samples + context_samples; // 예: 512 + 64 = 576 samples
input_node_dims[0] = 1;
input_node_dims[1] = effective_window_size;
_state.resize(size_state);
sr.resize(1);
sr[0] = sample_rate;
#endif
}
std::vector<Interval> VadIterator::DoVad() {
std::vector<Interval> speeches;
for (size_t i = 0; i < outputs_prob.size(); ++i) {
float speech_prob = outputs_prob[i];
current_sample += window_size_samples;
if (speech_prob >= threshold && temp_end != 0) {
temp_end = 0;
}
if (speech_prob >= threshold) {
if (!triggered) {
triggered = true;
Interval segment;
segment.start = std::max(0, current_sample - speech_pad_samples - window_size_samples);
speeches.push_back(segment);
}
}else {
if (triggered) {
if (speech_prob < threshold - 0.15f) {
if (temp_end == 0) {
temp_end = current_sample;
}
if (current_sample - temp_end >= min_silence_samples) {
Interval& segment = speeches.back();
segment.end = temp_end + speech_pad_samples - window_size_samples;
temp_end = 0;
triggered = false;
}
}
}
}
}
if (triggered) {
std::cout<<"Finalizing active speech segment at stream end."<<std::endl;
Interval& segment = speeches.back();
segment.end = total_sample_size;
triggered = false;
}
speeches.erase(std::remove_if(speeches.begin(), speeches.end(),
[this](const Interval& speech) {
return ((speech.end - this->speech_pad_samples) - (speech.start + this->speech_pad_samples) < min_speech_samples);
}), speeches.end());
reset_states();
return speeches;
}
} // namespace silero

123
examples/c++/silero.h Normal file
View File

@@ -0,0 +1,123 @@
#ifndef SILERO_H
#define SILERO_H
// silero.h
// Author : NathanJHLee
// Created On : 2025-11-10
// Description : silero 6.2 system for onnx-runtime(c++) and torch-script(c++)
// Version : 1.3
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <chrono>
#include <algorithm>
#include <cstring>
#ifdef USE_TORCH
#include <torch/torch.h>
#include <torch/script.h>
#elif USE_ONNX
#include "onnxruntime_cxx_api.h"
#endif
namespace silero {
struct Interval {
float start;
float end;
int numberOfSubseg;
void initialize() {
start = 0;
end = 0;
numberOfSubseg = 0;
}
};
class VadIterator {
public:
VadIterator(const std::string &model_path,
float threshold = 0.5,
int sample_rate = 16000,
int window_size_ms = 32,
int speech_pad_ms = 30,
int min_silence_duration_ms = 100,
int min_speech_duration_ms = 250,
int max_duration_merge_ms = 300,
bool print_as_samples = false);
~VadIterator();
// Batch (non-streaming) interface (for backward compatibility)
void SpeechProbs(std::vector<float>& input_wav);
std::vector<Interval> GetSpeechTimestamps();
void SetVariables();
// Public parameters (can be modified by user)
float threshold;
int sample_rate;
int window_size_ms;
int min_speech_duration_ms;
int max_duration_merge_ms;
bool print_as_samples;
private:
#ifdef USE_TORCH
torch::jit::script::Module model;
void init_torch_model(const std::string& model_path);
#elif USE_ONNX
Ort::Env env; // 환경 객체
Ort::SessionOptions session_options; // 세션 옵션
std::shared_ptr<Ort::Session> session; // ONNX 세션
Ort::AllocatorWithDefaultOptions allocator; // 기본 할당자
Ort::MemoryInfo memory_info; // 메모리 정보 (CPU)
void init_onnx_model(const std::string& model_path);
float predict(const std::vector<float>& data_chunk);
//const int context_samples; // 예: 64 samples
int context_samples; // 예: 64 samples
std::vector<float> _context; // 초기값 모두 0
int effective_window_size;
// ONNX 입력/출력 관련 버퍼 및 노드 이름들
std::vector<Ort::Value> ort_inputs;
std::vector<const char*> input_node_names;
std::vector<float> input;
unsigned int size_state; // 고정값: 2*1*128
std::vector<float> _state;
std::vector<int64_t> sr;
int64_t input_node_dims[2]; // [1, effective_window_size]
const int64_t state_node_dims[3]; // [ 2, 1, 128 ]
const int64_t sr_node_dims[1]; // [ 1 ]
std::vector<Ort::Value> ort_outputs;
std::vector<const char*> output_node_names; // 기본값: [ "output", "stateN" ]
#endif
std::vector<float> outputs_prob; // used in batch mode
int min_silence_samples;
int min_speech_samples;
int speech_pad_samples;
int window_size_samples;
int duration_merge_samples;
int current_sample = 0;
int total_sample_size = 0;
int min_silence_duration_ms;
int speech_pad_ms;
bool triggered = false;
int temp_end = 0;
int global_end = 0;
int erase_tail_count = 0;
void init_engine(int window_size_ms);
void reset_states();
std::vector<Interval> DoVad();
};
} // namespace silero
#endif // SILERO_H

237
examples/c++/wav.h Normal file
View File

@@ -0,0 +1,237 @@
// Copyright (c) 2016 Personal (Binbin Zhang)
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef FRONTEND_WAV_H_
#define FRONTEND_WAV_H_
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
// #include "utils/log.h"
namespace wav {
struct WavHeader {
char riff[4]; // "riff"
unsigned int size;
char wav[4]; // "WAVE"
char fmt[4]; // "fmt "
unsigned int fmt_size;
uint16_t format;
uint16_t channels;
unsigned int sample_rate;
unsigned int bytes_per_second;
uint16_t block_size;
uint16_t bit;
char data[4]; // "data"
unsigned int data_size;
};
class WavReader {
public:
WavReader() : data_(nullptr) {}
explicit WavReader(const std::string& filename) { Open(filename); }
bool Open(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "rb"); //文件读取
if (NULL == fp) {
std::cout << "Error in read " << filename;
return false;
}
WavHeader header;
fread(&header, 1, sizeof(header), fp);
if (header.fmt_size < 16) {
printf("WaveData: expect PCM format data "
"to have fmt chunk of at least size 16.\n");
return false;
} else if (header.fmt_size > 16) {
int offset = 44 - 8 + header.fmt_size - 16;
fseek(fp, offset, SEEK_SET);
fread(header.data, 8, sizeof(char), fp);
}
// check "riff" "WAVE" "fmt " "data"
// Skip any sub-chunks between "fmt" and "data". Usually there will
// be a single "fact" sub chunk, but on Windows there can also be a
// "list" sub chunk.
while (0 != strncmp(header.data, "data", 4)) {
// We will just ignore the data in these chunks.
fseek(fp, header.data_size, SEEK_CUR);
// read next sub chunk
fread(header.data, 8, sizeof(char), fp);
}
if (header.data_size == 0) {
int offset = ftell(fp);
fseek(fp, 0, SEEK_END);
header.data_size = ftell(fp) - offset;
fseek(fp, offset, SEEK_SET);
}
num_channel_ = header.channels;
sample_rate_ = header.sample_rate;
bits_per_sample_ = header.bit;
int num_data = header.data_size / (bits_per_sample_ / 8);
data_ = new float[num_data]; // Create 1-dim array
num_samples_ = num_data / num_channel_;
std::cout << "num_channel_ :" << num_channel_ << std::endl;
std::cout << "sample_rate_ :" << sample_rate_ << std::endl;
std::cout << "bits_per_sample_:" << bits_per_sample_ << std::endl;
std::cout << "num_samples :" << num_data << std::endl;
std::cout << "num_data_size :" << header.data_size << std::endl;
switch (bits_per_sample_) {
case 8: {
char sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(char), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 16: {
int16_t sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int16_t), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 32:
{
if (header.format == 1) //S32
{
int sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
}
else if (header.format == 3) // IEEE-float
{
float sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(float), fp);
data_[i] = static_cast<float>(sample);
}
}
else {
printf("unsupported quantization bits\n");
}
break;
}
default:
printf("unsupported quantization bits\n");
break;
}
fclose(fp);
return true;
}
int num_channel() const { return num_channel_; }
int sample_rate() const { return sample_rate_; }
int bits_per_sample() const { return bits_per_sample_; }
int num_samples() const { return num_samples_; }
~WavReader() {
delete[] data_;
}
const float* data() const { return data_; }
private:
int num_channel_;
int sample_rate_;
int bits_per_sample_;
int num_samples_; // sample points per channel
float* data_;
};
class WavWriter {
public:
WavWriter(const float* data, int num_samples, int num_channel,
int sample_rate, int bits_per_sample)
: data_(data),
num_samples_(num_samples),
num_channel_(num_channel),
sample_rate_(sample_rate),
bits_per_sample_(bits_per_sample) {}
void Write(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "w");
// init char 'riff' 'WAVE' 'fmt ' 'data'
WavHeader header;
char wav_header[44] = {0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00, 0x57,
0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20, 0x10, 0x00,
0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x64, 0x61, 0x74, 0x61, 0x00, 0x00, 0x00, 0x00};
memcpy(&header, wav_header, sizeof(header));
header.channels = num_channel_;
header.bit = bits_per_sample_;
header.sample_rate = sample_rate_;
header.data_size = num_samples_ * num_channel_ * (bits_per_sample_ / 8);
header.size = sizeof(header) - 8 + header.data_size;
header.bytes_per_second =
sample_rate_ * num_channel_ * (bits_per_sample_ / 8);
header.block_size = num_channel_ * (bits_per_sample_ / 8);
fwrite(&header, 1, sizeof(header), fp);
for (int i = 0; i < num_samples_; ++i) {
for (int j = 0; j < num_channel_; ++j) {
switch (bits_per_sample_) {
case 8: {
char sample = static_cast<char>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 16: {
int16_t sample = static_cast<int16_t>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 32: {
int sample = static_cast<int>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
}
}
}
fclose(fp);
}
private:
const float* data_;
int num_samples_; // total float points in data_
int num_channel_;
int sample_rate_;
int bits_per_sample_;
};
} // namespace wav
#endif // FRONTEND_WAV_H_

View File

@@ -17,6 +17,7 @@
},
"outputs": [],
"source": [
"#!apt install ffmpeg\n",
"!pip -q install pydub\n",
"from google.colab import output\n",
"from base64 import b64decode, b64encode\n",
@@ -37,13 +38,12 @@
" model='silero_vad',\n",
" force_reload=True)\n",
"\n",
"def int2float(sound):\n",
" abs_max = np.abs(sound).max()\n",
" sound = sound.astype('float32')\n",
" if abs_max > 0:\n",
" sound *= 1/abs_max\n",
" sound = sound.squeeze()\n",
" return sound\n",
"def int2float(audio):\n",
" samples = audio.get_array_of_samples()\n",
" new_sound = audio._spawn(samples)\n",
" arr = np.array(samples).astype(np.float32)\n",
" arr = arr / np.abs(arr).max()\n",
" return arr\n",
"\n",
"AUDIO_HTML = \"\"\"\n",
"<script>\n",
@@ -68,10 +68,10 @@
" //bitsPerSecond: 8000, //chrome seems to ignore, always 48k\n",
" mimeType : 'audio/webm;codecs=opus'\n",
" //mimeType : 'audio/webm;codecs=pcm'\n",
" }; \n",
" };\n",
" //recorder = new MediaRecorder(stream, options);\n",
" recorder = new MediaRecorder(stream);\n",
" recorder.ondataavailable = function(e) { \n",
" recorder.ondataavailable = function(e) {\n",
" var url = URL.createObjectURL(e.data);\n",
" // var preview = document.createElement('audio');\n",
" // preview.controls = true;\n",
@@ -79,7 +79,7 @@
" // document.body.appendChild(preview);\n",
"\n",
" reader = new FileReader();\n",
" reader.readAsDataURL(e.data); \n",
" reader.readAsDataURL(e.data);\n",
" reader.onloadend = function() {\n",
" base64data = reader.result;\n",
" //console.log(\"Inside FileReader:\" + base64data);\n",
@@ -121,7 +121,7 @@
"\n",
"}\n",
"});\n",
" \n",
"\n",
"</script>\n",
"\"\"\"\n",
"\n",
@@ -133,8 +133,8 @@
" audio.export('test.mp3', format='mp3')\n",
" audio = audio.set_channels(1)\n",
" audio = audio.set_frame_rate(16000)\n",
" audio_float = int2float(np.array(audio.get_array_of_samples()))\n",
" audio_tens = torch.tensor(audio_float )\n",
" audio_float = int2float(audio)\n",
" audio_tens = torch.tensor(audio_float)\n",
" return audio_tens\n",
"\n",
"def make_animation(probs, audio_duration, interval=40):\n",
@@ -154,19 +154,18 @@
" def animate(i):\n",
" x = i * interval / 1000 - 0.04\n",
" y = np.linspace(0, 1.02, 2)\n",
" \n",
"\n",
" line.set_data(x, y)\n",
" line.set_color('#990000')\n",
" return line,\n",
" anim = FuncAnimation(fig, animate, init_func=init, interval=interval, save_count=int(audio_duration / (interval / 1000)))\n",
"\n",
" anim = FuncAnimation(fig, animate, init_func=init, interval=interval, save_count=audio_duration / (interval / 1000))\n",
"\n",
" f = r\"animation.mp4\" \n",
" writervideo = FFMpegWriter(fps=1000/interval) \n",
" f = r\"animation.mp4\"\n",
" writervideo = FFMpegWriter(fps=1000/interval)\n",
" anim.save(f, writer=writervideo)\n",
" plt.close('all')\n",
"\n",
"def combine_audio(vidname, audname, outname, fps=25): \n",
"def combine_audio(vidname, audname, outname, fps=25):\n",
" my_clip = mpe.VideoFileClip(vidname, verbose=False)\n",
" audio_background = mpe.AudioFileClip(audname)\n",
" final_clip = my_clip.set_audio(audio_background)\n",
@@ -174,15 +173,10 @@
"\n",
"def record_make_animation():\n",
" tensor = record()\n",
"\n",
" print('Calculating probabilities...')\n",
" speech_probs = []\n",
" window_size_samples = 512\n",
" for i in range(0, len(tensor), window_size_samples):\n",
" if len(tensor[i: i+ window_size_samples]) < window_size_samples:\n",
" break\n",
" speech_prob = model(tensor[i: i+ window_size_samples], 16000).item()\n",
" speech_probs.append(speech_prob)\n",
" speech_probs = model.audio_forward(tensor, sr=16000)[0].tolist()\n",
" model.reset_states()\n",
" print('Making animation...')\n",
" make_animation(speech_probs, len(tensor) / 16000)\n",
@@ -196,7 +190,9 @@
" <video width=800 controls>\n",
" <source src=\"%s\" type=\"video/mp4\">\n",
" </video>\n",
" \"\"\" % data_url))"
" \"\"\" % data_url))\n",
"\n",
" return speech_probs"
]
},
{
@@ -216,7 +212,7 @@
},
"outputs": [],
"source": [
"record_make_animation()"
"speech_probs = record_make_animation()"
]
}
],

43
examples/cpp/README.md Normal file
View File

@@ -0,0 +1,43 @@
# Stream example in C++
Here's a simple example of the vad model in c++ onnxruntime.
## Requirements
Code are tested in the environments bellow, feel free to try others.
- WSL2 + Debian-bullseye (docker)
- gcc 12.2.0
- onnxruntime-linux-x64-1.12.1
## Usage
1. Install gcc 12.2.0, or just pull the docker image with `docker pull gcc:12.2.0-bullseye`
2. Install onnxruntime-linux-x64-1.12.1
- Download lib onnxruntime:
`wget https://github.com/microsoft/onnxruntime/releases/download/v1.12.1/onnxruntime-linux-x64-1.12.1.tgz`
- Unzip. Assume the path is `/root/onnxruntime-linux-x64-1.12.1`
3. Modify wav path & Test configs in main function
`wav::WavReader wav_reader("${path_to_your_wav_file}");`
test sample rate, frame per ms, threshold...
4. Build with gcc and run
```bash
# Build
g++ silero-vad-onnx.cpp -I /root/onnxruntime-linux-x64-1.12.1/include/ -L /root/onnxruntime-linux-x64-1.12.1/lib/ -lonnxruntime -Wl,-rpath,/root/onnxruntime-linux-x64-1.12.1/lib/ -o test
# Run
./test
```

View File

@@ -0,0 +1,367 @@
#ifndef _CRT_SECURE_NO_WARNINGS
#define _CRT_SECURE_NO_WARNINGS
#endif
#include <iostream>
#include <vector>
#include <sstream>
#include <cstring>
#include <limits>
#include <chrono>
#include <iomanip>
#include <memory>
#include <string>
#include <stdexcept>
#include <cstdio>
#include <cstdarg>
#include <cmath> // for std::rint
#if __cplusplus < 201703L
#include <memory>
#endif
//#define __DEBUG_SPEECH_PROB___
#include "onnxruntime_cxx_api.h"
#include "wav.h" // For reading WAV files
// timestamp_t class: stores the start and end (in samples) of a speech segment.
class timestamp_t {
public:
int start;
int end;
timestamp_t(int start = -1, int end = -1)
: start(start), end(end) { }
timestamp_t& operator=(const timestamp_t& a) {
start = a.start;
end = a.end;
return *this;
}
bool operator==(const timestamp_t& a) const {
return (start == a.start && end == a.end);
}
// Returns a formatted string of the timestamp.
std::string c_str() const {
return format("{start:%08d, end:%08d}", start, end);
}
private:
// Helper function for formatting.
std::string format(const char* fmt, ...) const {
char buf[256];
va_list args;
va_start(args, fmt);
const auto r = std::vsnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
if (r < 0)
return {};
const size_t len = r;
if (len < sizeof(buf))
return std::string(buf, len);
#if __cplusplus >= 201703L
std::string s(len, '\0');
va_start(args, fmt);
std::vsnprintf(s.data(), len + 1, fmt, args);
va_end(args);
return s;
#else
auto vbuf = std::unique_ptr<char[]>(new char[len + 1]);
va_start(args, fmt);
std::vsnprintf(vbuf.get(), len + 1, fmt, args);
va_end(args);
return std::string(vbuf.get(), len);
#endif
}
};
// VadIterator class: uses ONNX Runtime to detect speech segments.
class VadIterator {
private:
// ONNX Runtime resources
Ort::Env env;
Ort::SessionOptions session_options;
std::shared_ptr<Ort::Session> session = nullptr;
Ort::AllocatorWithDefaultOptions allocator;
Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeCPU);
// ----- Context-related additions -----
const int context_samples = 64; // For 16kHz, 64 samples are added as context.
std::vector<float> _context; // Holds the last 64 samples from the previous chunk (initialized to zero).
// Original window size (e.g., 32ms corresponds to 512 samples)
int window_size_samples;
// Effective window size = window_size_samples + context_samples
int effective_window_size;
// Additional declaration: samples per millisecond
int sr_per_ms;
// ONNX Runtime input/output buffers
std::vector<Ort::Value> ort_inputs;
std::vector<const char*> input_node_names = { "input", "state", "sr" };
std::vector<float> input;
unsigned int size_state = 2 * 1 * 128;
std::vector<float> _state;
std::vector<int64_t> sr;
int64_t input_node_dims[2] = {};
const int64_t state_node_dims[3] = { 2, 1, 128 };
const int64_t sr_node_dims[1] = { 1 };
std::vector<Ort::Value> ort_outputs;
std::vector<const char*> output_node_names = { "output", "stateN" };
// Model configuration parameters
int sample_rate;
float threshold;
int min_silence_samples;
int min_silence_samples_at_max_speech;
int min_speech_samples;
float max_speech_samples;
int speech_pad_samples;
int audio_length_samples;
// State management
bool triggered = false;
unsigned int temp_end = 0;
unsigned int current_sample = 0;
int prev_end;
int next_start = 0;
std::vector<timestamp_t> speeches;
timestamp_t current_speech;
// Loads the ONNX model.
void init_onnx_model(const std::wstring& model_path) {
init_engine_threads(1, 1);
session = std::make_shared<Ort::Session>(env, model_path.c_str(), session_options);
}
// Initializes threading settings.
void init_engine_threads(int inter_threads, int intra_threads) {
session_options.SetIntraOpNumThreads(intra_threads);
session_options.SetInterOpNumThreads(inter_threads);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
}
// Resets internal state (_state, _context, etc.)
void reset_states() {
std::memset(_state.data(), 0, _state.size() * sizeof(float));
triggered = false;
temp_end = 0;
current_sample = 0;
prev_end = next_start = 0;
speeches.clear();
current_speech = timestamp_t();
std::fill(_context.begin(), _context.end(), 0.0f);
}
// Inference: runs inference on one chunk of input data.
// data_chunk is expected to have window_size_samples samples.
void predict(const std::vector<float>& data_chunk) {
// Build new input: first context_samples from _context, followed by the current chunk (window_size_samples).
std::vector<float> new_data(effective_window_size, 0.0f);
std::copy(_context.begin(), _context.end(), new_data.begin());
std::copy(data_chunk.begin(), data_chunk.end(), new_data.begin() + context_samples);
input = new_data;
// Create input tensor (input_node_dims[1] is already set to effective_window_size).
Ort::Value input_ort = Ort::Value::CreateTensor<float>(
memory_info, input.data(), input.size(), input_node_dims, 2);
Ort::Value state_ort = Ort::Value::CreateTensor<float>(
memory_info, _state.data(), _state.size(), state_node_dims, 3);
Ort::Value sr_ort = Ort::Value::CreateTensor<int64_t>(
memory_info, sr.data(), sr.size(), sr_node_dims, 1);
ort_inputs.clear();
ort_inputs.emplace_back(std::move(input_ort));
ort_inputs.emplace_back(std::move(state_ort));
ort_inputs.emplace_back(std::move(sr_ort));
// Run inference.
ort_outputs = session->Run(
Ort::RunOptions{ nullptr },
input_node_names.data(), ort_inputs.data(), ort_inputs.size(),
output_node_names.data(), output_node_names.size());
float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0];
float* stateN = ort_outputs[1].GetTensorMutableData<float>();
std::memcpy(_state.data(), stateN, size_state * sizeof(float));
current_sample += static_cast<unsigned int>(window_size_samples); // Advance by the original window size.
// If speech is detected (probability >= threshold)
if (speech_prob >= threshold) {
#ifdef __DEBUG_SPEECH_PROB___
float speech = current_sample - window_size_samples;
printf("{ start: %.3f s (%.3f) %08d}\n", 1.0f * speech / sample_rate, speech_prob, current_sample - window_size_samples);
#endif
if (temp_end != 0) {
temp_end = 0;
if (next_start < prev_end)
next_start = current_sample - window_size_samples;
}
if (!triggered) {
triggered = true;
current_speech.start = current_sample - window_size_samples;
}
// Update context: copy the last context_samples from new_data.
std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
return;
}
// If the speech segment becomes too long.
if (triggered && ((current_sample - current_speech.start) > max_speech_samples)) {
if (prev_end > 0) {
current_speech.end = prev_end;
speeches.push_back(current_speech);
current_speech = timestamp_t();
if (next_start < prev_end)
triggered = false;
else
current_speech.start = next_start;
prev_end = 0;
next_start = 0;
temp_end = 0;
}
else {
current_speech.end = current_sample;
speeches.push_back(current_speech);
current_speech = timestamp_t();
prev_end = 0;
next_start = 0;
temp_end = 0;
triggered = false;
}
std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
return;
}
if ((speech_prob >= (threshold - 0.15)) && (speech_prob < threshold)) {
// When the speech probability temporarily drops but is still in speech, update context without changing state.
std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
return;
}
if (speech_prob < (threshold - 0.15)) {
#ifdef __DEBUG_SPEECH_PROB___
float speech = current_sample - window_size_samples - speech_pad_samples;
printf("{ end: %.3f s (%.3f) %08d}\n", 1.0f * speech / sample_rate, speech_prob, current_sample - window_size_samples);
#endif
if (triggered) {
if (temp_end == 0)
temp_end = current_sample;
if (current_sample - temp_end > min_silence_samples_at_max_speech)
prev_end = temp_end;
if ((current_sample - temp_end) >= min_silence_samples) {
current_speech.end = temp_end;
if (current_speech.end - current_speech.start > min_speech_samples) {
speeches.push_back(current_speech);
current_speech = timestamp_t();
prev_end = 0;
next_start = 0;
temp_end = 0;
triggered = false;
}
}
}
std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
return;
}
}
public:
// Process the entire audio input.
void process(const std::vector<float>& input_wav) {
reset_states();
audio_length_samples = static_cast<int>(input_wav.size());
// Process audio in chunks of window_size_samples (e.g., 512 samples)
for (size_t j = 0; j < static_cast<size_t>(audio_length_samples); j += static_cast<size_t>(window_size_samples)) {
if (j + static_cast<size_t>(window_size_samples) > static_cast<size_t>(audio_length_samples))
break;
std::vector<float> chunk(&input_wav[j], &input_wav[j] + window_size_samples);
predict(chunk);
}
if (current_speech.start >= 0) {
current_speech.end = audio_length_samples;
speeches.push_back(current_speech);
current_speech = timestamp_t();
prev_end = 0;
next_start = 0;
temp_end = 0;
triggered = false;
}
}
// Returns the detected speech timestamps.
const std::vector<timestamp_t> get_speech_timestamps() const {
return speeches;
}
// Public method to reset the internal state.
void reset() {
reset_states();
}
public:
// Constructor: sets model path, sample rate, window size (ms), and other parameters.
// The parameters are set to match the Python version.
VadIterator(const std::wstring ModelPath,
int Sample_rate = 16000, int windows_frame_size = 32,
float Threshold = 0.5, int min_silence_duration_ms = 100,
int speech_pad_ms = 30, int min_speech_duration_ms = 250,
float max_speech_duration_s = std::numeric_limits<float>::infinity())
: sample_rate(Sample_rate), threshold(Threshold), speech_pad_samples(speech_pad_ms), prev_end(0)
{
sr_per_ms = sample_rate / 1000; // e.g., 16000 / 1000 = 16
window_size_samples = windows_frame_size * sr_per_ms; // e.g., 32ms * 16 = 512 samples
effective_window_size = window_size_samples + context_samples; // e.g., 512 + 64 = 576 samples
input_node_dims[0] = 1;
input_node_dims[1] = effective_window_size;
_state.resize(size_state);
sr.resize(1);
sr[0] = sample_rate;
_context.assign(context_samples, 0.0f);
min_speech_samples = sr_per_ms * min_speech_duration_ms;
max_speech_samples = (sample_rate * max_speech_duration_s - window_size_samples - 2 * speech_pad_samples);
min_silence_samples = sr_per_ms * min_silence_duration_ms;
min_silence_samples_at_max_speech = sr_per_ms * 98;
init_onnx_model(ModelPath);
}
};
int main() {
// Read the WAV file (expects 16000 Hz, mono, PCM).
wav::WavReader wav_reader("audio/recorder.wav"); // File located in the "audio" folder.
int numSamples = wav_reader.num_samples();
std::vector<float> input_wav(static_cast<size_t>(numSamples));
for (size_t i = 0; i < static_cast<size_t>(numSamples); i++) {
input_wav[i] = static_cast<float>(*(wav_reader.data() + i));
}
// Set the ONNX model path (file located in the "model" folder).
std::wstring model_path = L"model/silero_vad.onnx";
// Initialize the VadIterator.
VadIterator vad(model_path);
// Process the audio.
vad.process(input_wav);
// Retrieve the speech timestamps (in samples).
std::vector<timestamp_t> stamps = vad.get_speech_timestamps();
// Convert timestamps to seconds and round to one decimal place (for 16000 Hz).
const float sample_rate_float = 16000.0f;
for (size_t i = 0; i < stamps.size(); i++) {
float start_sec = std::rint((stamps[i].start / sample_rate_float) * 10.0f) / 10.0f;
float end_sec = std::rint((stamps[i].end / sample_rate_float) * 10.0f) / 10.0f;
std::cout << "Speech detected from "
<< std::fixed << std::setprecision(1) << start_sec
<< " s to "
<< std::fixed << std::setprecision(1) << end_sec
<< " s" << std::endl;
}
// Optionally, reset the internal state.
vad.reset();
return 0;
}

237
examples/cpp/wav.h Normal file
View File

@@ -0,0 +1,237 @@
// Copyright (c) 2016 Personal (Binbin Zhang)
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef FRONTEND_WAV_H_
#define FRONTEND_WAV_H_
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <iostream>
// #include "utils/log.h"
namespace wav {
struct WavHeader {
char riff[4]; // "riff"
unsigned int size;
char wav[4]; // "WAVE"
char fmt[4]; // "fmt "
unsigned int fmt_size;
uint16_t format;
uint16_t channels;
unsigned int sample_rate;
unsigned int bytes_per_second;
uint16_t block_size;
uint16_t bit;
char data[4]; // "data"
unsigned int data_size;
};
class WavReader {
public:
WavReader() : data_(nullptr) {}
explicit WavReader(const std::string& filename) { Open(filename); }
bool Open(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "rb"); //文件读取
if (NULL == fp) {
std::cout << "Error in read " << filename;
return false;
}
WavHeader header;
fread(&header, 1, sizeof(header), fp);
if (header.fmt_size < 16) {
printf("WaveData: expect PCM format data "
"to have fmt chunk of at least size 16.\n");
return false;
} else if (header.fmt_size > 16) {
int offset = 44 - 8 + header.fmt_size - 16;
fseek(fp, offset, SEEK_SET);
fread(header.data, 8, sizeof(char), fp);
}
// check "riff" "WAVE" "fmt " "data"
// Skip any sub-chunks between "fmt" and "data". Usually there will
// be a single "fact" sub chunk, but on Windows there can also be a
// "list" sub chunk.
while (0 != strncmp(header.data, "data", 4)) {
// We will just ignore the data in these chunks.
fseek(fp, header.data_size, SEEK_CUR);
// read next sub chunk
fread(header.data, 8, sizeof(char), fp);
}
if (header.data_size == 0) {
int offset = ftell(fp);
fseek(fp, 0, SEEK_END);
header.data_size = ftell(fp) - offset;
fseek(fp, offset, SEEK_SET);
}
num_channel_ = header.channels;
sample_rate_ = header.sample_rate;
bits_per_sample_ = header.bit;
int num_data = header.data_size / (bits_per_sample_ / 8);
data_ = new float[num_data]; // Create 1-dim array
num_samples_ = num_data / num_channel_;
std::cout << "num_channel_ :" << num_channel_ << std::endl;
std::cout << "sample_rate_ :" << sample_rate_ << std::endl;
std::cout << "bits_per_sample_:" << bits_per_sample_ << std::endl;
std::cout << "num_samples :" << num_data << std::endl;
std::cout << "num_data_size :" << header.data_size << std::endl;
switch (bits_per_sample_) {
case 8: {
char sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(char), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 16: {
int16_t sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int16_t), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 32:
{
if (header.format == 1) //S32
{
int sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
}
else if (header.format == 3) // IEEE-float
{
float sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(float), fp);
data_[i] = static_cast<float>(sample);
}
}
else {
printf("unsupported quantization bits\n");
}
break;
}
default:
printf("unsupported quantization bits\n");
break;
}
fclose(fp);
return true;
}
int num_channel() const { return num_channel_; }
int sample_rate() const { return sample_rate_; }
int bits_per_sample() const { return bits_per_sample_; }
int num_samples() const { return num_samples_; }
~WavReader() {
delete[] data_;
}
const float* data() const { return data_; }
private:
int num_channel_;
int sample_rate_;
int bits_per_sample_;
int num_samples_; // sample points per channel
float* data_;
};
class WavWriter {
public:
WavWriter(const float* data, int num_samples, int num_channel,
int sample_rate, int bits_per_sample)
: data_(data),
num_samples_(num_samples),
num_channel_(num_channel),
sample_rate_(sample_rate),
bits_per_sample_(bits_per_sample) {}
void Write(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "w");
// init char 'riff' 'WAVE' 'fmt ' 'data'
WavHeader header;
char wav_header[44] = {0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00, 0x57,
0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20, 0x10, 0x00,
0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x64, 0x61, 0x74, 0x61, 0x00, 0x00, 0x00, 0x00};
memcpy(&header, wav_header, sizeof(header));
header.channels = num_channel_;
header.bit = bits_per_sample_;
header.sample_rate = sample_rate_;
header.data_size = num_samples_ * num_channel_ * (bits_per_sample_ / 8);
header.size = sizeof(header) - 8 + header.data_size;
header.bytes_per_second =
sample_rate_ * num_channel_ * (bits_per_sample_ / 8);
header.block_size = num_channel_ * (bits_per_sample_ / 8);
fwrite(&header, 1, sizeof(header), fp);
for (int i = 0; i < num_samples_; ++i) {
for (int j = 0; j < num_channel_; ++j) {
switch (bits_per_sample_) {
case 8: {
char sample = static_cast<char>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 16: {
int16_t sample = static_cast<int16_t>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 32: {
int sample = static_cast<int>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
}
}
}
fclose(fp);
}
private:
const float* data_;
int num_samples_; // total float points in data_
int num_channel_;
int sample_rate_;
int bits_per_sample_;
};
} // namespace wav
#endif // FRONTEND_WAV_H_

View File

@@ -0,0 +1,45 @@
# Silero-VAD V5 in C++ (based on LibTorch)
This is the source code for Silero-VAD V5 in C++, utilizing LibTorch. The primary implementation is CPU-based, and you should compare its results with the Python version. Only results at 16kHz have been tested.
Additionally, batch and CUDA inference options are available if you want to explore further. Note that when using batch inference, the speech probabilities may slightly differ from the standard version, likely due to differences in caching. Unlike individual input processing, batch inference may not use the cache from previous chunks. Despite this, batch inference offers significantly faster processing. For optimal performance, consider adjusting the threshold when using batch inference.
## Requirements
- GCC 11.4.0 (GCC >= 5.1)
- LibTorch 1.13.0 (other versions are also acceptable)
## Download LibTorch
```bash
-CPU Version
wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-1.13.0%2Bcpu.zip
unzip libtorch-shared-with-deps-1.13.0+cpu.zip'
-CUDA Version
wget https://download.pytorch.org/libtorch/cu116/libtorch-shared-with-deps-1.13.0%2Bcu116.zip
unzip libtorch-shared-with-deps-1.13.0+cu116.zip
```
## Compilation
```bash
-CPU Version
g++ main.cc silero_torch.cc -I ./libtorch/include/ -I ./libtorch/include/torch/csrc/api/include -L ./libtorch/lib/ -ltorch -ltorch_cpu -lc10 -Wl,-rpath,./libtorch/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0
-CUDA Version
g++ main.cc silero_torch.cc -I ./libtorch/include/ -I ./libtorch/include/torch/csrc/api/include -L ./libtorch/lib/ -ltorch -ltorch_cuda -ltorch_cpu -lc10 -Wl,-rpath,./libtorch/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0 -DUSE_GPU
```
## Optional Compilation Flags
-DUSE_BATCH: Enable batch inference
-DUSE_GPU: Use GPU for inference
## Run the Program
To run the program, use the following command:
`./silero aepyx.wav 16000 0.5`
The sample file aepyx.wav is part of the Voxconverse dataset.
File details: aepyx.wav is a 16kHz, 16-bit audio file.

Binary file not shown.

View File

@@ -0,0 +1,54 @@
#include <iostream>
#include "silero_torch.h"
#include "wav.h"
int main(int argc, char* argv[]) {
if(argc != 4){
std::cerr<<"Usage : "<<argv[0]<<" <wav.path> <SampleRate> <Threshold>"<<std::endl;
std::cerr<<"Usage : "<<argv[0]<<" sample.wav 16000 0.5"<<std::endl;
return 1;
}
std::string wav_path = argv[1];
float sample_rate = std::stof(argv[2]);
float threshold = std::stof(argv[3]);
//Load Model
std::string model_path = "../../src/silero_vad/data/silero_vad.jit";
silero::VadIterator vad(model_path);
vad.threshold=threshold; //(Default:0.5)
vad.sample_rate=sample_rate; //16000Hz,8000Hz. (Default:16000)
vad.print_as_samples=true; //if true, it prints time-stamp with samples. otherwise, in seconds
//(Default:false)
vad.SetVariables();
// Read wav
wav::WavReader wav_reader(wav_path);
std::vector<float> input_wav(wav_reader.num_samples());
for (int i = 0; i < wav_reader.num_samples(); i++)
{
input_wav[i] = static_cast<float>(*(wav_reader.data() + i));
}
vad.SpeechProbs(input_wav);
std::vector<silero::SpeechSegment> speeches = vad.GetSpeechTimestamps();
for(const auto& speech : speeches){
if(vad.print_as_samples){
std::cout<<"{'start': "<<static_cast<int>(speech.start)<<", 'end': "<<static_cast<int>(speech.end)<<"}"<<std::endl;
}
else{
std::cout<<"{'start': "<<speech.start<<", 'end': "<<speech.end<<"}"<<std::endl;
}
}
return 0;
}

BIN
examples/cpp_libtorch/silero Executable file

Binary file not shown.

View File

@@ -0,0 +1,285 @@
//Author : Nathan Lee
//Created On : 2024-11-18
//Description : silero 5.1 system for torch-script(c++).
//Version : 1.0
#include "silero_torch.h"
namespace silero {
VadIterator::VadIterator(const std::string &model_path, float threshold, int sample_rate, int window_size_ms, int speech_pad_ms, int min_silence_duration_ms, int min_speech_duration_ms, int max_duration_merge_ms, bool print_as_samples)
:sample_rate(sample_rate), threshold(threshold), window_size_ms(window_size_ms), speech_pad_ms(speech_pad_ms), min_silence_duration_ms(min_silence_duration_ms), min_speech_duration_ms(min_speech_duration_ms), max_duration_merge_ms(max_duration_merge_ms), print_as_samples(print_as_samples)
{
init_torch_model(model_path);
//init_engine(window_size_ms);
}
VadIterator::~VadIterator(){
}
void VadIterator::SpeechProbs(std::vector<float>& input_wav){
// Set the sample rate (must match the model's expected sample rate)
// Process the waveform in chunks of 512 samples
int num_samples = input_wav.size();
int num_chunks = num_samples / window_size_samples;
int remainder_samples = num_samples % window_size_samples;
total_sample_size += num_samples;
torch::Tensor output;
std::vector<torch::Tensor> chunks;
for (int i = 0; i < num_chunks; i++) {
float* chunk_start = input_wav.data() + i *window_size_samples;
torch::Tensor chunk = torch::from_blob(chunk_start, {1,window_size_samples}, torch::kFloat32);
//std::cout<<"chunk size : "<<chunk.sizes()<<std::endl;
chunks.push_back(chunk);
if(i==num_chunks-1 && remainder_samples>0){//마지막 chunk && 나머지가 존재
int remaining_samples = num_samples - num_chunks * window_size_samples;
//std::cout<<"Remainder size : "<<remaining_samples;
float* chunk_start_remainder = input_wav.data() + num_chunks *window_size_samples;
torch::Tensor remainder_chunk = torch::from_blob(chunk_start_remainder, {1,remaining_samples},
torch::kFloat32);
// Pad the remainder chunk to match window_size_samples
torch::Tensor padded_chunk = torch::cat({remainder_chunk, torch::zeros({1, window_size_samples
- remaining_samples}, torch::kFloat32)}, 1);
//std::cout<<", padded_chunk size : "<<padded_chunk.size(1)<<std::endl;
chunks.push_back(padded_chunk);
}
}
if (!chunks.empty()) {
#ifdef USE_BATCH
torch::Tensor batched_chunks = torch::stack(chunks); // Stack all chunks into a single tensor
//batched_chunks = batched_chunks.squeeze(1);
batched_chunks = torch::cat({batched_chunks.squeeze(1)});
#ifdef USE_GPU
batched_chunks = batched_chunks.to(at::kCUDA); // Move the entire batch to GPU once
#endif
// Prepare input for model
std::vector<torch::jit::IValue> inputs;
inputs.push_back(batched_chunks); // Batch of chunks
inputs.push_back(sample_rate); // Assuming sample_rate is a valid input for the model
// Run inference on the batch
torch::NoGradGuard no_grad;
torch::Tensor output = model.forward(inputs).toTensor();
#ifdef USE_GPU
output = output.to(at::kCPU); // Move the output back to CPU once
#endif
// Collect output probabilities
for (int i = 0; i < chunks.size(); i++) {
float output_f = output[i].item<float>();
outputs_prob.push_back(output_f);
//std::cout << "Chunk " << i << " prob: " << output_f<< "\n";
}
#else
std::vector<torch::Tensor> outputs;
torch::Tensor batched_chunks = torch::stack(chunks);
#ifdef USE_GPU
batched_chunks = batched_chunks.to(at::kCUDA);
#endif
for (int i = 0; i < chunks.size(); i++) {
torch::NoGradGuard no_grad;
std::vector<torch::jit::IValue> inputs;
inputs.push_back(batched_chunks[i]);
inputs.push_back(sample_rate);
torch::Tensor output = model.forward(inputs).toTensor();
outputs.push_back(output);
}
torch::Tensor all_outputs = torch::stack(outputs);
#ifdef USE_GPU
all_outputs = all_outputs.to(at::kCPU);
#endif
for (int i = 0; i < chunks.size(); i++) {
float output_f = all_outputs[i].item<float>();
outputs_prob.push_back(output_f);
}
#endif
}
}
std::vector<SpeechSegment> VadIterator::GetSpeechTimestamps() {
std::vector<SpeechSegment> speeches = DoVad();
#ifdef USE_BATCH
//When you use BATCH inference. You would better use 'mergeSpeeches' function to arrage time stamp.
//It could be better get reasonable output because of distorted probs.
duration_merge_samples = sample_rate * max_duration_merge_ms / 1000;
std::vector<SpeechSegment> speeches_merge = mergeSpeeches(speeches, duration_merge_samples);
if(!print_as_samples){
for (auto& speech : speeches_merge) { //samples to second
speech.start /= sample_rate;
speech.end /= sample_rate;
}
}
return speeches_merge;
#else
if(!print_as_samples){
for (auto& speech : speeches) { //samples to second
speech.start /= sample_rate;
speech.end /= sample_rate;
}
}
return speeches;
#endif
}
void VadIterator::SetVariables(){
init_engine(window_size_ms);
}
void VadIterator::init_engine(int window_size_ms) {
min_silence_samples = sample_rate * min_silence_duration_ms / 1000;
speech_pad_samples = sample_rate * speech_pad_ms / 1000;
window_size_samples = sample_rate / 1000 * window_size_ms;
min_speech_samples = sample_rate * min_speech_duration_ms / 1000;
}
void VadIterator::init_torch_model(const std::string& model_path) {
at::set_num_threads(1);
model = torch::jit::load(model_path);
#ifdef USE_GPU
if (!torch::cuda::is_available()) {
std::cout<<"CUDA is not available! Please check your GPU settings"<<std::endl;
throw std::runtime_error("CUDA is not available!");
model.to(at::Device(at::kCPU));
} else {
std::cout<<"CUDA available! Running on '0'th GPU"<<std::endl;
model.to(at::Device(at::kCUDA, 0)); //select 0'th machine
}
#endif
model.eval();
torch::NoGradGuard no_grad;
std::cout << "Model loaded successfully"<<std::endl;
}
void VadIterator::reset_states() {
triggered = false;
current_sample = 0;
temp_end = 0;
outputs_prob.clear();
model.run_method("reset_states");
total_sample_size = 0;
}
std::vector<SpeechSegment> VadIterator::DoVad() {
std::vector<SpeechSegment> speeches;
for (size_t i = 0; i < outputs_prob.size(); ++i) {
float speech_prob = outputs_prob[i];
//std::cout << speech_prob << std::endl;
//std::cout << "Chunk " << i << " Prob: " << speech_prob << "\n";
//std::cout << speech_prob << " ";
current_sample += window_size_samples;
if (speech_prob >= threshold && temp_end != 0) {
temp_end = 0;
}
if (speech_prob >= threshold && !triggered) {
triggered = true;
SpeechSegment segment;
segment.start = std::max(static_cast<int>(0), current_sample - speech_pad_samples - window_size_samples);
speeches.push_back(segment);
continue;
}
if (speech_prob < threshold - 0.15f && triggered) {
if (temp_end == 0) {
temp_end = current_sample;
}
if (current_sample - temp_end < min_silence_samples) {
continue;
} else {
SpeechSegment& segment = speeches.back();
segment.end = temp_end + speech_pad_samples - window_size_samples;
temp_end = 0;
triggered = false;
}
}
}
if (triggered) { //만약 낮은 확률을 보이다가 마지막프레임 prbos만 딱 확률이 높게 나오면 위에서 triggerd = true 메핑과 동시에 segment start가 돼서 문제가 될것 같은데? start = end 같은값? 후처리가 있으니 문제가 없으려나?
std::cout<<"when last triggered is keep working until last Probs"<<std::endl;
SpeechSegment& segment = speeches.back();
segment.end = total_sample_size; // 현재 샘플을 마지막 구간의 종료 시간으로 설정
triggered = false; // VAD 상태 초기화
}
speeches.erase(
std::remove_if(
speeches.begin(),
speeches.end(),
[this](const SpeechSegment& speech) {
return ((speech.end - this->speech_pad_samples) - (speech.start + this->speech_pad_samples) < min_speech_samples);
//min_speech_samples is 4000samples(0.25sec)
//여기서 포인트!! 계산 할때는 start,end sample에'speech_pad_samples' 사이즈를 추가한후 길이를 측정함.
}
),
speeches.end()
);
//std::cout<<std::endl;
//std::cout<<"outputs_prob.size : "<<outputs_prob.size()<<std::endl;
reset_states();
return speeches;
}
std::vector<SpeechSegment> VadIterator::mergeSpeeches(const std::vector<SpeechSegment>& speeches, int duration_merge_samples) {
std::vector<SpeechSegment> mergedSpeeches;
if (speeches.empty()) {
return mergedSpeeches; // 빈 벡터 반환
}
// 첫 번째 구간으로 초기화
SpeechSegment currentSegment = speeches[0];
for (size_t i = 1; i < speeches.size(); ++i) { //첫번째 start,end 정보 건너뛰기. 그래서 i=1부터
// 두 구간의 차이가 threshold(duration_merge_samples)보다 작은 경우, 합침
if (speeches[i].start - currentSegment.end < duration_merge_samples) {
// 현재 구간의 끝점을 업데이트
currentSegment.end = speeches[i].end;
} else {
// 차이가 threshold(duration_merge_samples) 이상이면 현재 구간을 저장하고 새로운 구간 시작
mergedSpeeches.push_back(currentSegment);
currentSegment = speeches[i];
}
}
// 마지막 구간 추가
mergedSpeeches.push_back(currentSegment);
return mergedSpeeches;
}
}

View File

@@ -0,0 +1,75 @@
//Author : Nathan Lee
//Created On : 2024-11-18
//Description : silero 5.1 system for torch-script(c++).
//Version : 1.0
#ifndef SILERO_TORCH_H
#define SILERO_TORCH_H
#include <string>
#include <memory>
#include <stdexcept>
#include <iostream>
#include <memory>
#include <vector>
#include <fstream>
#include <chrono>
#include <torch/torch.h>
#include <torch/script.h>
namespace silero{
struct SpeechSegment{
int start;
int end;
};
class VadIterator{
public:
VadIterator(const std::string &model_path, float threshold = 0.5, int sample_rate = 16000,
int window_size_ms = 32, int speech_pad_ms = 30, int min_silence_duration_ms = 100,
int min_speech_duration_ms = 250, int max_duration_merge_ms = 300, bool print_as_samples = false);
~VadIterator();
void SpeechProbs(std::vector<float>& input_wav);
std::vector<silero::SpeechSegment> GetSpeechTimestamps();
void SetVariables();
float threshold;
int sample_rate;
int window_size_ms;
int min_speech_duration_ms;
int max_duration_merge_ms;
bool print_as_samples;
private:
torch::jit::script::Module model;
std::vector<float> outputs_prob;
int min_silence_samples;
int min_speech_samples;
int speech_pad_samples;
int window_size_samples;
int duration_merge_samples;
int current_sample = 0;
int total_sample_size=0;
int min_silence_duration_ms;
int speech_pad_ms;
bool triggered = false;
int temp_end = 0;
void init_engine(int window_size_ms);
void init_torch_model(const std::string& model_path);
void reset_states();
std::vector<SpeechSegment> DoVad();
std::vector<SpeechSegment> mergeSpeeches(const std::vector<SpeechSegment>& speeches, int duration_merge_samples);
};
}
#endif // SILERO_TORCH_H

235
examples/cpp_libtorch/wav.h Normal file
View File

@@ -0,0 +1,235 @@
// Copyright (c) 2016 Personal (Binbin Zhang)
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef FRONTEND_WAV_H_
#define FRONTEND_WAV_H_
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
// #include "utils/log.h"
namespace wav {
struct WavHeader {
char riff[4]; // "riff"
unsigned int size;
char wav[4]; // "WAVE"
char fmt[4]; // "fmt "
unsigned int fmt_size;
uint16_t format;
uint16_t channels;
unsigned int sample_rate;
unsigned int bytes_per_second;
uint16_t block_size;
uint16_t bit;
char data[4]; // "data"
unsigned int data_size;
};
class WavReader {
public:
WavReader() : data_(nullptr) {}
explicit WavReader(const std::string& filename) { Open(filename); }
bool Open(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "rb"); //文件读取
if (NULL == fp) {
std::cout << "Error in read " << filename;
return false;
}
WavHeader header;
fread(&header, 1, sizeof(header), fp);
if (header.fmt_size < 16) {
printf("WaveData: expect PCM format data "
"to have fmt chunk of at least size 16.\n");
return false;
} else if (header.fmt_size > 16) {
int offset = 44 - 8 + header.fmt_size - 16;
fseek(fp, offset, SEEK_SET);
fread(header.data, 8, sizeof(char), fp);
}
// check "riff" "WAVE" "fmt " "data"
// Skip any sub-chunks between "fmt" and "data". Usually there will
// be a single "fact" sub chunk, but on Windows there can also be a
// "list" sub chunk.
while (0 != strncmp(header.data, "data", 4)) {
// We will just ignore the data in these chunks.
fseek(fp, header.data_size, SEEK_CUR);
// read next sub chunk
fread(header.data, 8, sizeof(char), fp);
}
if (header.data_size == 0) {
int offset = ftell(fp);
fseek(fp, 0, SEEK_END);
header.data_size = ftell(fp) - offset;
fseek(fp, offset, SEEK_SET);
}
num_channel_ = header.channels;
sample_rate_ = header.sample_rate;
bits_per_sample_ = header.bit;
int num_data = header.data_size / (bits_per_sample_ / 8);
data_ = new float[num_data]; // Create 1-dim array
num_samples_ = num_data / num_channel_;
std::cout << "num_channel_ :" << num_channel_ << std::endl;
std::cout << "sample_rate_ :" << sample_rate_ << std::endl;
std::cout << "bits_per_sample_:" << bits_per_sample_ << std::endl;
std::cout << "num_samples :" << num_data << std::endl;
std::cout << "num_data_size :" << header.data_size << std::endl;
switch (bits_per_sample_) {
case 8: {
char sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(char), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 16: {
int16_t sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int16_t), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 32:
{
if (header.format == 1) //S32
{
int sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
}
else if (header.format == 3) // IEEE-float
{
float sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(float), fp);
data_[i] = static_cast<float>(sample);
}
}
else {
printf("unsupported quantization bits\n");
}
break;
}
default:
printf("unsupported quantization bits\n");
break;
}
fclose(fp);
return true;
}
int num_channel() const { return num_channel_; }
int sample_rate() const { return sample_rate_; }
int bits_per_sample() const { return bits_per_sample_; }
int num_samples() const { return num_samples_; }
~WavReader() {
delete[] data_;
}
const float* data() const { return data_; }
private:
int num_channel_;
int sample_rate_;
int bits_per_sample_;
int num_samples_; // sample points per channel
float* data_;
};
class WavWriter {
public:
WavWriter(const float* data, int num_samples, int num_channel,
int sample_rate, int bits_per_sample)
: data_(data),
num_samples_(num_samples),
num_channel_(num_channel),
sample_rate_(sample_rate),
bits_per_sample_(bits_per_sample) {}
void Write(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "w");
// init char 'riff' 'WAVE' 'fmt ' 'data'
WavHeader header;
char wav_header[44] = {0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00, 0x57,
0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20, 0x10, 0x00,
0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x64, 0x61, 0x74, 0x61, 0x00, 0x00, 0x00, 0x00};
memcpy(&header, wav_header, sizeof(header));
header.channels = num_channel_;
header.bit = bits_per_sample_;
header.sample_rate = sample_rate_;
header.data_size = num_samples_ * num_channel_ * (bits_per_sample_ / 8);
header.size = sizeof(header) - 8 + header.data_size;
header.bytes_per_second =
sample_rate_ * num_channel_ * (bits_per_sample_ / 8);
header.block_size = num_channel_ * (bits_per_sample_ / 8);
fwrite(&header, 1, sizeof(header), fp);
for (int i = 0; i < num_samples_; ++i) {
for (int j = 0; j < num_channel_; ++j) {
switch (bits_per_sample_) {
case 8: {
char sample = static_cast<char>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 16: {
int16_t sample = static_cast<int16_t>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 32: {
int sample = static_cast<int>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
}
}
}
fclose(fp);
}
private:
const float* data_;
int num_samples_; // total float points in data_
int num_channel_;
int sample_rate_;
int bits_per_sample_;
};
} // namespace wenet
#endif // FRONTEND_WAV_H_

View File

@@ -0,0 +1,45 @@
# Silero-VAD V5 in C++ (based on LibTorch)
This is the source code for Silero-VAD V5 in C++, utilizing LibTorch. The primary implementation is CPU-based, and you should compare its results with the Python version. Only results at 16kHz have been tested.
Additionally, batch and CUDA inference options are available if you want to explore further. Note that when using batch inference, the speech probabilities may slightly differ from the standard version, likely due to differences in caching. Unlike individual input processing, batch inference may not use the cache from previous chunks. Despite this, batch inference offers significantly faster processing. For optimal performance, consider adjusting the threshold when using batch inference.
## Requirements
- GCC 11.4.0 (GCC >= 5.1)
- LibTorch 1.13.0 (other versions are also acceptable)
## Download LibTorch
```bash
-CPU Version
wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-1.13.0%2Bcpu.zip
unzip libtorch-shared-with-deps-1.13.0+cpu.zip'
-CUDA Version
wget https://download.pytorch.org/libtorch/cu116/libtorch-shared-with-deps-1.13.0%2Bcu116.zip
unzip libtorch-shared-with-deps-1.13.0+cu116.zip
```
## Compilation
```bash
-CPU Version
g++ main.cc silero_torch.cc -I ./libtorch/include/ -I ./libtorch/include/torch/csrc/api/include -L ./libtorch/lib/ -ltorch -ltorch_cpu -lc10 -Wl,-rpath,./libtorch/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0
-CUDA Version
g++ main.cc silero_torch.cc -I ./libtorch/include/ -I ./libtorch/include/torch/csrc/api/include -L ./libtorch/lib/ -ltorch -ltorch_cuda -ltorch_cpu -lc10 -Wl,-rpath,./libtorch/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0 -DUSE_GPU
```
## Optional Compilation Flags
-DUSE_BATCH: Enable batch inference
-DUSE_GPU: Use GPU for inference
## Run the Program
To run the program, use the following command:
`./silero aepyx.wav 16000 0.5`
The sample file aepyx.wav is part of the Voxconverse dataset.
File details: aepyx.wav is a 16kHz, 16-bit audio file.

Binary file not shown.

View File

@@ -0,0 +1,54 @@
#include <iostream>
#include "silero_torch.h"
#include "wav.h"
int main(int argc, char* argv[]) {
if(argc != 4){
std::cerr<<"Usage : "<<argv[0]<<" <wav.path> <SampleRate> <Threshold>"<<std::endl;
std::cerr<<"Usage : "<<argv[0]<<" sample.wav 16000 0.5"<<std::endl;
return 1;
}
std::string wav_path = argv[1];
float sample_rate = std::stof(argv[2]);
float threshold = std::stof(argv[3]);
//Load Model
std::string model_path = "../../src/silero_vad/data/silero_vad.jit";
silero::VadIterator vad(model_path);
vad.threshold=threshold; //(Default:0.5)
vad.sample_rate=sample_rate; //16000Hz,8000Hz. (Default:16000)
vad.print_as_samples=true; //if true, it prints time-stamp with samples. otherwise, in seconds
//(Default:false)
vad.SetVariables();
// Read wav
wav::WavReader wav_reader(wav_path);
std::vector<float> input_wav(wav_reader.num_samples());
for (int i = 0; i < wav_reader.num_samples(); i++)
{
input_wav[i] = static_cast<float>(*(wav_reader.data() + i));
}
vad.SpeechProbs(input_wav);
std::vector<silero::SpeechSegment> speeches = vad.GetSpeechTimestamps();
for(const auto& speech : speeches){
if(vad.print_as_samples){
std::cout<<"{'start': "<<static_cast<int>(speech.start)<<", 'end': "<<static_cast<int>(speech.end)<<"}"<<std::endl;
}
else{
std::cout<<"{'start': "<<speech.start<<", 'end': "<<speech.end<<"}"<<std::endl;
}
}
return 0;
}

Binary file not shown.

View File

@@ -0,0 +1,285 @@
//Author : Nathan Lee
//Created On : 2024-11-18
//Description : silero 5.1 system for torch-script(c++).
//Version : 1.0
#include "silero_torch.h"
namespace silero {
VadIterator::VadIterator(const std::string &model_path, float threshold, int sample_rate, int window_size_ms, int speech_pad_ms, int min_silence_duration_ms, int min_speech_duration_ms, int max_duration_merge_ms, bool print_as_samples)
:sample_rate(sample_rate), threshold(threshold), window_size_ms(window_size_ms), speech_pad_ms(speech_pad_ms), min_silence_duration_ms(min_silence_duration_ms), min_speech_duration_ms(min_speech_duration_ms), max_duration_merge_ms(max_duration_merge_ms), print_as_samples(print_as_samples)
{
init_torch_model(model_path);
//init_engine(window_size_ms);
}
VadIterator::~VadIterator(){
}
void VadIterator::SpeechProbs(std::vector<float>& input_wav){
// Set the sample rate (must match the model's expected sample rate)
// Process the waveform in chunks of 512 samples
int num_samples = input_wav.size();
int num_chunks = num_samples / window_size_samples;
int remainder_samples = num_samples % window_size_samples;
total_sample_size += num_samples;
torch::Tensor output;
std::vector<torch::Tensor> chunks;
for (int i = 0; i < num_chunks; i++) {
float* chunk_start = input_wav.data() + i *window_size_samples;
torch::Tensor chunk = torch::from_blob(chunk_start, {1,window_size_samples}, torch::kFloat32);
//std::cout<<"chunk size : "<<chunk.sizes()<<std::endl;
chunks.push_back(chunk);
if(i==num_chunks-1 && remainder_samples>0){//마지막 chunk && 나머지가 존재
int remaining_samples = num_samples - num_chunks * window_size_samples;
//std::cout<<"Remainder size : "<<remaining_samples;
float* chunk_start_remainder = input_wav.data() + num_chunks *window_size_samples;
torch::Tensor remainder_chunk = torch::from_blob(chunk_start_remainder, {1,remaining_samples},
torch::kFloat32);
// Pad the remainder chunk to match window_size_samples
torch::Tensor padded_chunk = torch::cat({remainder_chunk, torch::zeros({1, window_size_samples
- remaining_samples}, torch::kFloat32)}, 1);
//std::cout<<", padded_chunk size : "<<padded_chunk.size(1)<<std::endl;
chunks.push_back(padded_chunk);
}
}
if (!chunks.empty()) {
#ifdef USE_BATCH
torch::Tensor batched_chunks = torch::stack(chunks); // Stack all chunks into a single tensor
//batched_chunks = batched_chunks.squeeze(1);
batched_chunks = torch::cat({batched_chunks.squeeze(1)});
#ifdef USE_GPU
batched_chunks = batched_chunks.to(at::kCUDA); // Move the entire batch to GPU once
#endif
// Prepare input for model
std::vector<torch::jit::IValue> inputs;
inputs.push_back(batched_chunks); // Batch of chunks
inputs.push_back(sample_rate); // Assuming sample_rate is a valid input for the model
// Run inference on the batch
torch::NoGradGuard no_grad;
torch::Tensor output = model.forward(inputs).toTensor();
#ifdef USE_GPU
output = output.to(at::kCPU); // Move the output back to CPU once
#endif
// Collect output probabilities
for (int i = 0; i < chunks.size(); i++) {
float output_f = output[i].item<float>();
outputs_prob.push_back(output_f);
//std::cout << "Chunk " << i << " prob: " << output_f<< "\n";
}
#else
std::vector<torch::Tensor> outputs;
torch::Tensor batched_chunks = torch::stack(chunks);
#ifdef USE_GPU
batched_chunks = batched_chunks.to(at::kCUDA);
#endif
for (int i = 0; i < chunks.size(); i++) {
torch::NoGradGuard no_grad;
std::vector<torch::jit::IValue> inputs;
inputs.push_back(batched_chunks[i]);
inputs.push_back(sample_rate);
torch::Tensor output = model.forward(inputs).toTensor();
outputs.push_back(output);
}
torch::Tensor all_outputs = torch::stack(outputs);
#ifdef USE_GPU
all_outputs = all_outputs.to(at::kCPU);
#endif
for (int i = 0; i < chunks.size(); i++) {
float output_f = all_outputs[i].item<float>();
outputs_prob.push_back(output_f);
}
#endif
}
}
std::vector<SpeechSegment> VadIterator::GetSpeechTimestamps() {
std::vector<SpeechSegment> speeches = DoVad();
#ifdef USE_BATCH
//When you use BATCH inference. You would better use 'mergeSpeeches' function to arrage time stamp.
//It could be better get reasonable output because of distorted probs.
duration_merge_samples = sample_rate * max_duration_merge_ms / 1000;
std::vector<SpeechSegment> speeches_merge = mergeSpeeches(speeches, duration_merge_samples);
if(!print_as_samples){
for (auto& speech : speeches_merge) { //samples to second
speech.start /= sample_rate;
speech.end /= sample_rate;
}
}
return speeches_merge;
#else
if(!print_as_samples){
for (auto& speech : speeches) { //samples to second
speech.start /= sample_rate;
speech.end /= sample_rate;
}
}
return speeches;
#endif
}
void VadIterator::SetVariables(){
init_engine(window_size_ms);
}
void VadIterator::init_engine(int window_size_ms) {
min_silence_samples = sample_rate * min_silence_duration_ms / 1000;
speech_pad_samples = sample_rate * speech_pad_ms / 1000;
window_size_samples = sample_rate / 1000 * window_size_ms;
min_speech_samples = sample_rate * min_speech_duration_ms / 1000;
}
void VadIterator::init_torch_model(const std::string& model_path) {
at::set_num_threads(1);
model = torch::jit::load(model_path);
#ifdef USE_GPU
if (!torch::cuda::is_available()) {
std::cout<<"CUDA is not available! Please check your GPU settings"<<std::endl;
throw std::runtime_error("CUDA is not available!");
model.to(at::Device(at::kCPU));
} else {
std::cout<<"CUDA available! Running on '0'th GPU"<<std::endl;
model.to(at::Device(at::kCUDA, 0)); //select 0'th machine
}
#endif
model.eval();
torch::NoGradGuard no_grad;
std::cout << "Model loaded successfully"<<std::endl;
}
void VadIterator::reset_states() {
triggered = false;
current_sample = 0;
temp_end = 0;
outputs_prob.clear();
model.run_method("reset_states");
total_sample_size = 0;
}
std::vector<SpeechSegment> VadIterator::DoVad() {
std::vector<SpeechSegment> speeches;
for (size_t i = 0; i < outputs_prob.size(); ++i) {
float speech_prob = outputs_prob[i];
//std::cout << speech_prob << std::endl;
//std::cout << "Chunk " << i << " Prob: " << speech_prob << "\n";
//std::cout << speech_prob << " ";
current_sample += window_size_samples;
if (speech_prob >= threshold && temp_end != 0) {
temp_end = 0;
}
if (speech_prob >= threshold && !triggered) {
triggered = true;
SpeechSegment segment;
segment.start = std::max(static_cast<int>(0), current_sample - speech_pad_samples - window_size_samples);
speeches.push_back(segment);
continue;
}
if (speech_prob < threshold - 0.15f && triggered) {
if (temp_end == 0) {
temp_end = current_sample;
}
if (current_sample - temp_end < min_silence_samples) {
continue;
} else {
SpeechSegment& segment = speeches.back();
segment.end = temp_end + speech_pad_samples - window_size_samples;
temp_end = 0;
triggered = false;
}
}
}
if (triggered) { //만약 낮은 확률을 보이다가 마지막프레임 prbos만 딱 확률이 높게 나오면 위에서 triggerd = true 메핑과 동시에 segment start가 돼서 문제가 될것 같은데? start = end 같은값? 후처리가 있으니 문제가 없으려나?
std::cout<<"when last triggered is keep working until last Probs"<<std::endl;
SpeechSegment& segment = speeches.back();
segment.end = total_sample_size; // 현재 샘플을 마지막 구간의 종료 시간으로 설정
triggered = false; // VAD 상태 초기화
}
speeches.erase(
std::remove_if(
speeches.begin(),
speeches.end(),
[this](const SpeechSegment& speech) {
return ((speech.end - this->speech_pad_samples) - (speech.start + this->speech_pad_samples) < min_speech_samples);
//min_speech_samples is 4000samples(0.25sec)
//여기서 포인트!! 계산 할때는 start,end sample에'speech_pad_samples' 사이즈를 추가한후 길이를 측정함.
}
),
speeches.end()
);
//std::cout<<std::endl;
//std::cout<<"outputs_prob.size : "<<outputs_prob.size()<<std::endl;
reset_states();
return speeches;
}
std::vector<SpeechSegment> VadIterator::mergeSpeeches(const std::vector<SpeechSegment>& speeches, int duration_merge_samples) {
std::vector<SpeechSegment> mergedSpeeches;
if (speeches.empty()) {
return mergedSpeeches; // 빈 벡터 반환
}
// 첫 번째 구간으로 초기화
SpeechSegment currentSegment = speeches[0];
for (size_t i = 1; i < speeches.size(); ++i) { //첫번째 start,end 정보 건너뛰기. 그래서 i=1부터
// 두 구간의 차이가 threshold(duration_merge_samples)보다 작은 경우, 합침
if (speeches[i].start - currentSegment.end < duration_merge_samples) {
// 현재 구간의 끝점을 업데이트
currentSegment.end = speeches[i].end;
} else {
// 차이가 threshold(duration_merge_samples) 이상이면 현재 구간을 저장하고 새로운 구간 시작
mergedSpeeches.push_back(currentSegment);
currentSegment = speeches[i];
}
}
// 마지막 구간 추가
mergedSpeeches.push_back(currentSegment);
return mergedSpeeches;
}
}

View File

@@ -0,0 +1,75 @@
//Author : Nathan Lee
//Created On : 2024-11-18
//Description : silero 5.1 system for torch-script(c++).
//Version : 1.0
#ifndef SILERO_TORCH_H
#define SILERO_TORCH_H
#include <string>
#include <memory>
#include <stdexcept>
#include <iostream>
#include <memory>
#include <vector>
#include <fstream>
#include <chrono>
#include <torch/torch.h>
#include <torch/script.h>
namespace silero{
struct SpeechSegment{
int start;
int end;
};
class VadIterator{
public:
VadIterator(const std::string &model_path, float threshold = 0.5, int sample_rate = 16000,
int window_size_ms = 32, int speech_pad_ms = 30, int min_silence_duration_ms = 100,
int min_speech_duration_ms = 250, int max_duration_merge_ms = 300, bool print_as_samples = false);
~VadIterator();
void SpeechProbs(std::vector<float>& input_wav);
std::vector<silero::SpeechSegment> GetSpeechTimestamps();
void SetVariables();
float threshold;
int sample_rate;
int window_size_ms;
int min_speech_duration_ms;
int max_duration_merge_ms;
bool print_as_samples;
private:
torch::jit::script::Module model;
std::vector<float> outputs_prob;
int min_silence_samples;
int min_speech_samples;
int speech_pad_samples;
int window_size_samples;
int duration_merge_samples;
int current_sample = 0;
int total_sample_size=0;
int min_silence_duration_ms;
int speech_pad_ms;
bool triggered = false;
int temp_end = 0;
void init_engine(int window_size_ms);
void init_torch_model(const std::string& model_path);
void reset_states();
std::vector<SpeechSegment> DoVad();
std::vector<SpeechSegment> mergeSpeeches(const std::vector<SpeechSegment>& speeches, int duration_merge_samples);
};
}
#endif // SILERO_TORCH_H

View File

@@ -0,0 +1,235 @@
// Copyright (c) 2016 Personal (Binbin Zhang)
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef FRONTEND_WAV_H_
#define FRONTEND_WAV_H_
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
// #include "utils/log.h"
namespace wav {
struct WavHeader {
char riff[4]; // "riff"
unsigned int size;
char wav[4]; // "WAVE"
char fmt[4]; // "fmt "
unsigned int fmt_size;
uint16_t format;
uint16_t channels;
unsigned int sample_rate;
unsigned int bytes_per_second;
uint16_t block_size;
uint16_t bit;
char data[4]; // "data"
unsigned int data_size;
};
class WavReader {
public:
WavReader() : data_(nullptr) {}
explicit WavReader(const std::string& filename) { Open(filename); }
bool Open(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "rb"); //文件读取
if (NULL == fp) {
std::cout << "Error in read " << filename;
return false;
}
WavHeader header;
fread(&header, 1, sizeof(header), fp);
if (header.fmt_size < 16) {
printf("WaveData: expect PCM format data "
"to have fmt chunk of at least size 16.\n");
return false;
} else if (header.fmt_size > 16) {
int offset = 44 - 8 + header.fmt_size - 16;
fseek(fp, offset, SEEK_SET);
fread(header.data, 8, sizeof(char), fp);
}
// check "riff" "WAVE" "fmt " "data"
// Skip any sub-chunks between "fmt" and "data". Usually there will
// be a single "fact" sub chunk, but on Windows there can also be a
// "list" sub chunk.
while (0 != strncmp(header.data, "data", 4)) {
// We will just ignore the data in these chunks.
fseek(fp, header.data_size, SEEK_CUR);
// read next sub chunk
fread(header.data, 8, sizeof(char), fp);
}
if (header.data_size == 0) {
int offset = ftell(fp);
fseek(fp, 0, SEEK_END);
header.data_size = ftell(fp) - offset;
fseek(fp, offset, SEEK_SET);
}
num_channel_ = header.channels;
sample_rate_ = header.sample_rate;
bits_per_sample_ = header.bit;
int num_data = header.data_size / (bits_per_sample_ / 8);
data_ = new float[num_data]; // Create 1-dim array
num_samples_ = num_data / num_channel_;
std::cout << "num_channel_ :" << num_channel_ << std::endl;
std::cout << "sample_rate_ :" << sample_rate_ << std::endl;
std::cout << "bits_per_sample_:" << bits_per_sample_ << std::endl;
std::cout << "num_samples :" << num_data << std::endl;
std::cout << "num_data_size :" << header.data_size << std::endl;
switch (bits_per_sample_) {
case 8: {
char sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(char), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 16: {
int16_t sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int16_t), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
break;
}
case 32:
{
if (header.format == 1) //S32
{
int sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(int), fp);
data_[i] = static_cast<float>(sample) / 32768;
}
}
else if (header.format == 3) // IEEE-float
{
float sample;
for (int i = 0; i < num_data; ++i) {
fread(&sample, 1, sizeof(float), fp);
data_[i] = static_cast<float>(sample);
}
}
else {
printf("unsupported quantization bits\n");
}
break;
}
default:
printf("unsupported quantization bits\n");
break;
}
fclose(fp);
return true;
}
int num_channel() const { return num_channel_; }
int sample_rate() const { return sample_rate_; }
int bits_per_sample() const { return bits_per_sample_; }
int num_samples() const { return num_samples_; }
~WavReader() {
delete[] data_;
}
const float* data() const { return data_; }
private:
int num_channel_;
int sample_rate_;
int bits_per_sample_;
int num_samples_; // sample points per channel
float* data_;
};
class WavWriter {
public:
WavWriter(const float* data, int num_samples, int num_channel,
int sample_rate, int bits_per_sample)
: data_(data),
num_samples_(num_samples),
num_channel_(num_channel),
sample_rate_(sample_rate),
bits_per_sample_(bits_per_sample) {}
void Write(const std::string& filename) {
FILE* fp = fopen(filename.c_str(), "w");
// init char 'riff' 'WAVE' 'fmt ' 'data'
WavHeader header;
char wav_header[44] = {0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00, 0x57,
0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20, 0x10, 0x00,
0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x64, 0x61, 0x74, 0x61, 0x00, 0x00, 0x00, 0x00};
memcpy(&header, wav_header, sizeof(header));
header.channels = num_channel_;
header.bit = bits_per_sample_;
header.sample_rate = sample_rate_;
header.data_size = num_samples_ * num_channel_ * (bits_per_sample_ / 8);
header.size = sizeof(header) - 8 + header.data_size;
header.bytes_per_second =
sample_rate_ * num_channel_ * (bits_per_sample_ / 8);
header.block_size = num_channel_ * (bits_per_sample_ / 8);
fwrite(&header, 1, sizeof(header), fp);
for (int i = 0; i < num_samples_; ++i) {
for (int j = 0; j < num_channel_; ++j) {
switch (bits_per_sample_) {
case 8: {
char sample = static_cast<char>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 16: {
int16_t sample = static_cast<int16_t>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
case 32: {
int sample = static_cast<int>(data_[i * num_channel_ + j]);
fwrite(&sample, 1, sizeof(sample), fp);
break;
}
}
}
}
fclose(fp);
}
private:
const float* data_;
int num_samples_; // total float points in data_
int num_channel_;
int sample_rate_;
int bits_per_sample_;
};
} // namespace wenet
#endif // FRONTEND_WAV_H_

View File

@@ -0,0 +1,35 @@
using System.Text;
namespace VadDotNet;
class Program
{
private const string MODEL_PATH = "./resources/silero_vad.onnx";
private const string EXAMPLE_WAV_FILE = "./resources/example.wav";
private const int SAMPLE_RATE = 16000;
private const float THRESHOLD = 0.5f;
private const int MIN_SPEECH_DURATION_MS = 250;
private const float MAX_SPEECH_DURATION_SECONDS = float.PositiveInfinity;
private const int MIN_SILENCE_DURATION_MS = 100;
private const int SPEECH_PAD_MS = 30;
public static void Main(string[] args)
{
var vadDetector = new SileroVadDetector(MODEL_PATH, THRESHOLD, SAMPLE_RATE,
MIN_SPEECH_DURATION_MS, MAX_SPEECH_DURATION_SECONDS, MIN_SILENCE_DURATION_MS, SPEECH_PAD_MS);
List<SileroSpeechSegment> speechTimeList = vadDetector.GetSpeechSegmentList(new FileInfo(EXAMPLE_WAV_FILE));
//Console.WriteLine(speechTimeList.ToJson());
StringBuilder sb = new();
foreach (var speechSegment in speechTimeList)
{
sb.Append($"start second: {speechSegment.StartSecond}, end second: {speechSegment.EndSecond}\n");
}
Console.WriteLine(sb.ToString());
}
}

View File

@@ -0,0 +1,21 @@
namespace VadDotNet;
public class SileroSpeechSegment
{
public int? StartOffset { get; set; }
public int? EndOffset { get; set; }
public float? StartSecond { get; set; }
public float? EndSecond { get; set; }
public SileroSpeechSegment()
{
}
public SileroSpeechSegment(int startOffset, int? endOffset, float? startSecond, float? endSecond)
{
StartOffset = startOffset;
EndOffset = endOffset;
StartSecond = startSecond;
EndSecond = endSecond;
}
}

View File

@@ -0,0 +1,249 @@
using NAudio.Wave;
using VADdotnet;
namespace VadDotNet;
public class SileroVadDetector
{
private readonly SileroVadOnnxModel _model;
private readonly float _threshold;
private readonly float _negThreshold;
private readonly int _samplingRate;
private readonly int _windowSizeSample;
private readonly float _minSpeechSamples;
private readonly float _speechPadSamples;
private readonly float _maxSpeechSamples;
private readonly float _minSilenceSamples;
private readonly float _minSilenceSamplesAtMaxSpeech;
private int _audioLengthSamples;
private const float THRESHOLD_GAP = 0.15f;
// ReSharper disable once InconsistentNaming
private const int SAMPLING_RATE_8K = 8000;
// ReSharper disable once InconsistentNaming
private const int SAMPLING_RATE_16K = 16000;
public SileroVadDetector(string onnxModelPath, float threshold, int samplingRate,
int minSpeechDurationMs, float maxSpeechDurationSeconds,
int minSilenceDurationMs, int speechPadMs)
{
if (samplingRate != SAMPLING_RATE_8K && samplingRate != SAMPLING_RATE_16K)
{
throw new ArgumentException("Sampling rate not support, only available for [8000, 16000]");
}
this._model = new SileroVadOnnxModel(onnxModelPath);
this._samplingRate = samplingRate;
this._threshold = threshold;
this._negThreshold = threshold - THRESHOLD_GAP;
this._windowSizeSample = samplingRate == SAMPLING_RATE_16K ? 512 : 256;
this._minSpeechSamples = samplingRate * minSpeechDurationMs / 1000f;
this._speechPadSamples = samplingRate * speechPadMs / 1000f;
this._maxSpeechSamples = samplingRate * maxSpeechDurationSeconds - _windowSizeSample - 2 * _speechPadSamples;
this._minSilenceSamples = samplingRate * minSilenceDurationMs / 1000f;
this._minSilenceSamplesAtMaxSpeech = samplingRate * 98 / 1000f;
this.Reset();
}
public void Reset()
{
_model.ResetStates();
}
public List<SileroSpeechSegment> GetSpeechSegmentList(FileInfo wavFile)
{
Reset();
using var audioFile = new AudioFileReader(wavFile.FullName);
List<float> speechProbList = [];
this._audioLengthSamples = (int)(audioFile.Length / 2);
float[] buffer = new float[this._windowSizeSample];
while (audioFile.Read(buffer, 0, buffer.Length) > 0)
{
float speechProb = _model.Call([buffer], _samplingRate)[0];
speechProbList.Add(speechProb);
}
return CalculateProb(speechProbList);
}
private List<SileroSpeechSegment> CalculateProb(List<float> speechProbList)
{
List<SileroSpeechSegment> result = [];
bool triggered = false;
int tempEnd = 0, prevEnd = 0, nextStart = 0;
SileroSpeechSegment segment = new();
for (int i = 0; i < speechProbList.Count; i++)
{
float speechProb = speechProbList[i];
if (speechProb >= _threshold && (tempEnd != 0))
{
tempEnd = 0;
if (nextStart < prevEnd)
{
nextStart = _windowSizeSample * i;
}
}
if (speechProb >= _threshold && !triggered)
{
triggered = true;
segment.StartOffset = _windowSizeSample * i;
continue;
}
if (triggered && (_windowSizeSample * i) - segment.StartOffset > _maxSpeechSamples)
{
if (prevEnd != 0)
{
segment.EndOffset = prevEnd;
result.Add(segment);
segment = new SileroSpeechSegment();
if (nextStart < prevEnd)
{
triggered = false;
}
else
{
segment.StartOffset = nextStart;
}
prevEnd = 0;
nextStart = 0;
tempEnd = 0;
}
else
{
segment.EndOffset = _windowSizeSample * i;
result.Add(segment);
segment = new SileroSpeechSegment();
prevEnd = 0;
nextStart = 0;
tempEnd = 0;
triggered = false;
continue;
}
}
if (speechProb < _negThreshold && triggered)
{
if (tempEnd == 0)
{
tempEnd = _windowSizeSample * i;
}
if (((_windowSizeSample * i) - tempEnd) > _minSilenceSamplesAtMaxSpeech)
{
prevEnd = tempEnd;
}
if ((_windowSizeSample * i) - tempEnd < _minSilenceSamples)
{
continue;
}
else
{
segment.EndOffset = tempEnd;
if ((segment.EndOffset - segment.StartOffset) > _minSpeechSamples)
{
result.Add(segment);
}
segment = new SileroSpeechSegment();
prevEnd = 0;
nextStart = 0;
tempEnd = 0;
triggered = false;
continue;
}
}
}
if (segment.StartOffset != null && (_audioLengthSamples - segment.StartOffset) > _minSpeechSamples)
{
//segment.EndOffset = _audioLengthSamples;
segment.EndOffset = speechProbList.Count * _windowSizeSample;
result.Add(segment);
}
for (int i = 0; i < result.Count; i++)
{
SileroSpeechSegment item = result[i];
if (i == 0)
{
item.StartOffset = (int)Math.Max(0, item.StartOffset.Value - _speechPadSamples);
}
if (i != result.Count - 1)
{
SileroSpeechSegment nextItem = result[i + 1];
int silenceDuration = nextItem.StartOffset.Value - item.EndOffset.Value;
if (silenceDuration < 2 * _speechPadSamples)
{
item.EndOffset += (silenceDuration / 2);
nextItem.StartOffset = Math.Max(0, nextItem.StartOffset.Value - (silenceDuration / 2));
}
else
{
item.EndOffset = (int)Math.Min(_audioLengthSamples, item.EndOffset.Value + _speechPadSamples);
nextItem.StartOffset = (int)Math.Max(0, nextItem.StartOffset.Value - _speechPadSamples);
}
}
else
{
item.EndOffset = (int)Math.Min(_audioLengthSamples, item.EndOffset.Value + _speechPadSamples);
}
}
return MergeListAndCalculateSecond(result, _samplingRate);
}
private static List<SileroSpeechSegment> MergeListAndCalculateSecond(List<SileroSpeechSegment> original, int samplingRate)
{
List<SileroSpeechSegment> result = [];
if (original == null || original.Count == 0)
{
return result;
}
int left = original[0].StartOffset.Value;
int right = original[0].EndOffset.Value;
if (original.Count > 1)
{
original.Sort((a, b) => a.StartOffset.Value.CompareTo(b.StartOffset.Value));
for (int i = 1; i < original.Count; i++)
{
SileroSpeechSegment segment = original[i];
if (segment.StartOffset > right)
{
result.Add(new SileroSpeechSegment(left, right,
CalculateSecondByOffset(left, samplingRate), CalculateSecondByOffset(right, samplingRate)));
left = segment.StartOffset.Value;
right = segment.EndOffset.Value;
}
else
{
right = Math.Max(right, segment.EndOffset.Value);
}
}
result.Add(new SileroSpeechSegment(left, right,
CalculateSecondByOffset(left, samplingRate), CalculateSecondByOffset(right, samplingRate)));
}
else
{
result.Add(new SileroSpeechSegment(left, right,
CalculateSecondByOffset(left, samplingRate), CalculateSecondByOffset(right, samplingRate)));
}
return result;
}
private static float CalculateSecondByOffset(int offset, int samplingRate)
{
float secondValue = offset * 1.0f / samplingRate;
return (float)Math.Floor(secondValue * 1000.0f) / 1000.0f;
}
}

View File

@@ -0,0 +1,215 @@
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.Linq;
namespace VADdotnet;
public class SileroVadOnnxModel : IDisposable
{
private readonly InferenceSession session;
private float[][][] state;
private float[][] context;
private int lastSr = 0;
private int lastBatchSize = 0;
private static readonly List<int> SAMPLE_RATES = [8000, 16000];
public SileroVadOnnxModel(string modelPath)
{
var sessionOptions = new SessionOptions
{
InterOpNumThreads = 1,
IntraOpNumThreads = 1,
EnableCpuMemArena = true
};
session = new InferenceSession(modelPath, sessionOptions);
ResetStates();
}
public void ResetStates()
{
state = new float[2][][];
state[0] = new float[1][];
state[1] = new float[1][];
state[0][0] = new float[128];
state[1][0] = new float[128];
context = [];
lastSr = 0;
lastBatchSize = 0;
}
public void Dispose()
{
GC.SuppressFinalize(this);
}
public class ValidationResult(float[][] x, int sr)
{
public float[][] X { get; } = x;
public int Sr { get; } = sr;
}
private static ValidationResult ValidateInput(float[][] x, int sr)
{
if (x.Length == 1)
{
x = [x[0]];
}
if (x.Length > 2)
{
throw new ArgumentException($"Incorrect audio data dimension: {x[0].Length}");
}
if (sr != 16000 && (sr % 16000 == 0))
{
int step = sr / 16000;
float[][] reducedX = new float[x.Length][];
for (int i = 0; i < x.Length; i++)
{
float[] current = x[i];
float[] newArr = new float[(current.Length + step - 1) / step];
for (int j = 0, index = 0; j < current.Length; j += step, index++)
{
newArr[index] = current[j];
}
reducedX[i] = newArr;
}
x = reducedX;
sr = 16000;
}
if (!SAMPLE_RATES.Contains(sr))
{
throw new ArgumentException($"Only supports sample rates {string.Join(", ", SAMPLE_RATES)} (or multiples of 16000)");
}
if (((float)sr) / x[0].Length > 31.25)
{
throw new ArgumentException("Input audio is too short");
}
return new ValidationResult(x, sr);
}
private static float[][] Concatenate(float[][] a, float[][] b)
{
if (a.Length != b.Length)
{
throw new ArgumentException("The number of rows in both arrays must be the same.");
}
int rows = a.Length;
int colsA = a[0].Length;
int colsB = b[0].Length;
float[][] result = new float[rows][];
for (int i = 0; i < rows; i++)
{
result[i] = new float[colsA + colsB];
Array.Copy(a[i], 0, result[i], 0, colsA);
Array.Copy(b[i], 0, result[i], colsA, colsB);
}
return result;
}
private static float[][] GetLastColumns(float[][] array, int contextSize)
{
int rows = array.Length;
int cols = array[0].Length;
if (contextSize > cols)
{
throw new ArgumentException("contextSize cannot be greater than the number of columns in the array.");
}
float[][] result = new float[rows][];
for (int i = 0; i < rows; i++)
{
result[i] = new float[contextSize];
Array.Copy(array[i], cols - contextSize, result[i], 0, contextSize);
}
return result;
}
public float[] Call(float[][] x, int sr)
{
var result = ValidateInput(x, sr);
x = result.X;
sr = result.Sr;
int numberSamples = sr == 16000 ? 512 : 256;
if (x[0].Length != numberSamples)
{
throw new ArgumentException($"Provided number of samples is {x[0].Length} (Supported values: 256 for 8000 sample rate, 512 for 16000)");
}
int batchSize = x.Length;
int contextSize = sr == 16000 ? 64 : 32;
if (lastBatchSize == 0)
{
ResetStates();
}
if (lastSr != 0 && lastSr != sr)
{
ResetStates();
}
if (lastBatchSize != 0 && lastBatchSize != batchSize)
{
ResetStates();
}
if (context.Length == 0)
{
context = new float[batchSize][];
for (int i = 0; i < batchSize; i++)
{
context[i] = new float[contextSize];
}
}
x = Concatenate(context, x);
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input", new DenseTensor<float>(x.SelectMany(a => a).ToArray(), [x.Length, x[0].Length])),
NamedOnnxValue.CreateFromTensor("sr", new DenseTensor<long>(new[] { (long)sr }, [1])),
NamedOnnxValue.CreateFromTensor("state", new DenseTensor<float>(state.SelectMany(a => a.SelectMany(b => b)).ToArray(), [state.Length, state[0].Length, state[0][0].Length]))
};
using var outputs = session.Run(inputs);
var output = outputs.First(o => o.Name == "output").AsTensor<float>();
var newState = outputs.First(o => o.Name == "stateN").AsTensor<float>();
context = GetLastColumns(x, contextSize);
lastSr = sr;
lastBatchSize = batchSize;
state = new float[newState.Dimensions[0]][][];
for (int i = 0; i < newState.Dimensions[0]; i++)
{
state[i] = new float[newState.Dimensions[1]][];
for (int j = 0; j < newState.Dimensions[1]; j++)
{
state[i][j] = new float[newState.Dimensions[2]];
for (int k = 0; k < newState.Dimensions[2]; k++)
{
state[i][j][k] = newState[i, j, k];
}
}
}
return [.. output];
}
}

View File

@@ -0,0 +1,25 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net8.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.18.1" />
<PackageReference Include="NAudio" Version="2.2.1" />
</ItemGroup>
<ItemGroup>
<Folder Include="resources\" />
</ItemGroup>
<ItemGroup>
<Content Include="resources\**">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
</ItemGroup>
</Project>

View File

@@ -0,0 +1,25 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio Version 17
VisualStudioVersion = 17.14.36616.10 d17.14
MinimumVisualStudioVersion = 10.0.40219.1
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "VadDotNet", "VadDotNet.csproj", "{F36E1741-EDDB-90C7-7501-4911058F8996}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Release|Any CPU = Release|Any CPU
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{F36E1741-EDDB-90C7-7501-4911058F8996}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{F36E1741-EDDB-90C7-7501-4911058F8996}.Debug|Any CPU.Build.0 = Debug|Any CPU
{F36E1741-EDDB-90C7-7501-4911058F8996}.Release|Any CPU.ActiveCfg = Release|Any CPU
{F36E1741-EDDB-90C7-7501-4911058F8996}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
GlobalSection(ExtensibilityGlobals) = postSolution
SolutionGuid = {DFC4CEE8-1034-46B4-A5F4-D1649B3543E6}
EndGlobalSection
EndGlobal

View File

@@ -0,0 +1 @@
place onnx model file and example.wav file in this folder

19
examples/go/README.md Normal file
View File

@@ -0,0 +1,19 @@
## Golang Example
This is a sample program of how to run speech detection using `silero-vad` from Golang (CGO + ONNX Runtime).
### Requirements
- Golang >= v1.21
- ONNX Runtime
### Usage
```sh
go run ./cmd/main.go test.wav
```
> **_Note_**
>
> Make sure you have the ONNX Runtime library and C headers installed in your path.

63
examples/go/cmd/main.go Normal file
View File

@@ -0,0 +1,63 @@
package main
import (
"log"
"os"
"github.com/streamer45/silero-vad-go/speech"
"github.com/go-audio/wav"
)
func main() {
sd, err := speech.NewDetector(speech.DetectorConfig{
ModelPath: "../../src/silero_vad/data/silero_vad.onnx",
SampleRate: 16000,
Threshold: 0.5,
MinSilenceDurationMs: 100,
SpeechPadMs: 30,
})
if err != nil {
log.Fatalf("failed to create speech detector: %s", err)
}
if len(os.Args) != 2 {
log.Fatalf("invalid arguments provided: expecting one file path")
}
f, err := os.Open(os.Args[1])
if err != nil {
log.Fatalf("failed to open sample audio file: %s", err)
}
defer f.Close()
dec := wav.NewDecoder(f)
if ok := dec.IsValidFile(); !ok {
log.Fatalf("invalid WAV file")
}
buf, err := dec.FullPCMBuffer()
if err != nil {
log.Fatalf("failed to get PCM buffer")
}
pcmBuf := buf.AsFloat32Buffer()
segments, err := sd.Detect(pcmBuf.Data)
if err != nil {
log.Fatalf("Detect failed: %s", err)
}
for _, s := range segments {
log.Printf("speech starts at %0.2fs", s.SpeechStartAt)
if s.SpeechEndAt > 0 {
log.Printf("speech ends at %0.2fs", s.SpeechEndAt)
}
}
err = sd.Destroy()
if err != nil {
log.Fatalf("failed to destroy detector: %s", err)
}
}

13
examples/go/go.mod Normal file
View File

@@ -0,0 +1,13 @@
module silero
go 1.21.4
require (
github.com/go-audio/wav v1.1.0
github.com/streamer45/silero-vad-go v0.2.1
)
require (
github.com/go-audio/audio v1.0.0 // indirect
github.com/go-audio/riff v1.0.0 // indirect
)

18
examples/go/go.sum Normal file
View File

@@ -0,0 +1,18 @@
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/go-audio/audio v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
github.com/go-audio/audio v1.0.0/go.mod h1:6uAu0+H2lHkwdGsAY+j2wHPNPpPoeg5AaEFh9FlA+Zs=
github.com/go-audio/riff v1.0.0 h1:d8iCGbDvox9BfLagY94fBynxSPHO80LmZCaOsmKxokA=
github.com/go-audio/riff v1.0.0/go.mod h1:l3cQwc85y79NQFCRB7TiPoNiaijp6q8Z0Uv38rVG498=
github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/streamer45/silero-vad-go v0.2.0 h1:bbRTa6cQuc7VI88y0qicx375UyWoxE6wlVOF+mUg0+g=
github.com/streamer45/silero-vad-go v0.2.0/go.mod h1:B+2FXs/5fZ6pzl6unUZYhZqkYdOB+3saBVzjOzdZnUs=
github.com/streamer45/silero-vad-go v0.2.1 h1:Li1/tTC4H/3cyw6q4weX+U8GWwEL3lTekK/nYa1Cvuk=
github.com/streamer45/silero-vad-go v0.2.1/go.mod h1:B+2FXs/5fZ6pzl6unUZYhZqkYdOB+3saBVzjOzdZnUs=
github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=

View File

@@ -0,0 +1,13 @@
# Haskell example
To run the example, make sure you put an ``example.wav`` in this directory, and then run the following:
```bash
stack run
```
The ``example.wav`` file must have the following requirements:
- Must be 16khz sample rate.
- Must be mono channel.
- Must be 16-bit audio.
This uses the [silero-vad](https://hackage.haskell.org/package/silero-vad) package, a haskell implementation based on the C# example.

View File

@@ -0,0 +1,22 @@
module Main (main) where
import qualified Data.Vector.Storable as Vector
import Data.WAVE
import Data.Function
import Silero
main :: IO ()
main =
withModel $ \model -> do
wav <- getWAVEFile "example.wav"
let samples =
concat (waveSamples wav)
& Vector.fromList
& Vector.map (realToFrac . sampleToDouble)
let vad =
(defaultVad model)
{ startThreshold = 0.5
, endThreshold = 0.35
}
segments <- detectSegments vad samples
print segments

View File

@@ -0,0 +1,23 @@
cabal-version: 1.12
-- This file has been generated from package.yaml by hpack version 0.37.0.
--
-- see: https://github.com/sol/hpack
name: example
version: 0.1.0.0
build-type: Simple
executable example-exe
main-is: Main.hs
other-modules:
Paths_example
hs-source-dirs:
app
ghc-options: -Wall -Wcompat -Widentities -Wincomplete-record-updates -Wincomplete-uni-patterns -Wmissing-export-lists -Wmissing-home-modules -Wpartial-fields -Wredundant-constraints -threaded -rtsopts -with-rtsopts=-N
build-depends:
WAVE
, base >=4.7 && <5
, silero-vad
, vector
default-language: Haskell2010

View File

@@ -0,0 +1,28 @@
name: example
version: 0.1.0.0
dependencies:
- base >= 4.7 && < 5
- silero-vad
- WAVE
- vector
ghc-options:
- -Wall
- -Wcompat
- -Widentities
- -Wincomplete-record-updates
- -Wincomplete-uni-patterns
- -Wmissing-export-lists
- -Wmissing-home-modules
- -Wpartial-fields
- -Wredundant-constraints
executables:
example-exe:
main: Main.hs
source-dirs: app
ghc-options:
- -threaded
- -rtsopts
- -with-rtsopts=-N

View File

@@ -0,0 +1,11 @@
snapshot:
url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/20/26.yaml
packages:
- .
extra-deps:
- silero-vad-0.1.0.4@sha256:2bff95be978a2782915b250edc795760d4cf76838e37bb7d4a965dc32566eb0f,5476
- WAVE-0.1.6@sha256:f744ff68f5e3a0d1f84fab373ea35970659085d213aef20860357512d0458c5c,1016
- derive-storable-0.3.1.0@sha256:bd1c51c155a00e2be18325d553d6764dd678904a85647d6ba952af998e70aa59,2313
- vector-0.13.2.0@sha256:98f5cb3080a3487527476e3c272dcadaba1376539f2aa0646f2f19b3af6b2f67,8481

View File

@@ -0,0 +1,41 @@
# This file was autogenerated by Stack.
# You should not edit this file by hand.
# For more information, please see the documentation at:
# https://docs.haskellstack.org/en/stable/lock_files
packages:
- completed:
hackage: silero-vad-0.1.0.4@sha256:2bff95be978a2782915b250edc795760d4cf76838e37bb7d4a965dc32566eb0f,5476
pantry-tree:
sha256: a62e813f978d32c87769796fded981d25fcf2875bb2afdf60ed6279f931ccd7f
size: 1391
original:
hackage: silero-vad-0.1.0.4@sha256:2bff95be978a2782915b250edc795760d4cf76838e37bb7d4a965dc32566eb0f,5476
- completed:
hackage: WAVE-0.1.6@sha256:f744ff68f5e3a0d1f84fab373ea35970659085d213aef20860357512d0458c5c,1016
pantry-tree:
sha256: ee5ccd70fa7fe6ffc360ebd762b2e3f44ae10406aa27f3842d55b8cbd1a19498
size: 405
original:
hackage: WAVE-0.1.6@sha256:f744ff68f5e3a0d1f84fab373ea35970659085d213aef20860357512d0458c5c,1016
- completed:
hackage: derive-storable-0.3.1.0@sha256:bd1c51c155a00e2be18325d553d6764dd678904a85647d6ba952af998e70aa59,2313
pantry-tree:
sha256: 48e35a72d1bb593173890616c8d7efd636a650a306a50bb3e1513e679939d27e
size: 902
original:
hackage: derive-storable-0.3.1.0@sha256:bd1c51c155a00e2be18325d553d6764dd678904a85647d6ba952af998e70aa59,2313
- completed:
hackage: vector-0.13.2.0@sha256:98f5cb3080a3487527476e3c272dcadaba1376539f2aa0646f2f19b3af6b2f67,8481
pantry-tree:
sha256: 2176fd677a02a4c47337f7dca5aeca2745dbb821a6ea5c7099b3a991ecd7f4f0
size: 4478
original:
hackage: vector-0.13.2.0@sha256:98f5cb3080a3487527476e3c272dcadaba1376539f2aa0646f2f19b3af6b2f67,8481
snapshots:
- completed:
sha256: 5a59b2a405b3aba3c00188453be172b85893cab8ebc352b1ef58b0eae5d248a2
size: 650475
url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/20/26.yaml
original:
url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/20/26.yaml

View File

@@ -0,0 +1,31 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>java-example</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>sliero-vad-example</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/com.microsoft.onnxruntime/onnxruntime -->
<dependency>
<groupId>com.microsoft.onnxruntime</groupId>
<artifactId>onnxruntime</artifactId>
<version>1.23.1</version>
</dependency>
</dependencies>
</project>

View File

@@ -0,0 +1,264 @@
package org.example;
import ai.onnxruntime.OrtException;
import javax.sound.sampled.*;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
* Silero VAD Java Example
* Voice Activity Detection using ONNX model
*
* @author VvvvvGH
*/
public class App {
// ONNX model path - using the model file from the project
private static final String MODEL_PATH = "../../src/silero_vad/data/silero_vad.onnx";
// Test audio file path
private static final String AUDIO_FILE_PATH = "../../en_example.wav";
// Sampling rate
private static final int SAMPLE_RATE = 16000;
// Speech threshold (consistent with Python default)
private static final float THRESHOLD = 0.5f;
// Negative threshold (used to determine speech end)
private static final float NEG_THRESHOLD = 0.35f; // threshold - 0.15
// Minimum speech duration (milliseconds)
private static final int MIN_SPEECH_DURATION_MS = 250;
// Minimum silence duration (milliseconds)
private static final int MIN_SILENCE_DURATION_MS = 100;
// Speech padding (milliseconds)
private static final int SPEECH_PAD_MS = 30;
// Window size (samples) - 512 samples for 16kHz
private static final int WINDOW_SIZE_SAMPLES = 512;
public static void main(String[] args) {
System.out.println("=".repeat(60));
System.out.println("Silero VAD Java ONNX Example");
System.out.println("=".repeat(60));
// Load ONNX model
SlieroVadOnnxModel model;
try {
System.out.println("Loading ONNX model: " + MODEL_PATH);
model = new SlieroVadOnnxModel(MODEL_PATH);
System.out.println("Model loaded successfully!");
} catch (OrtException e) {
System.err.println("Failed to load model: " + e.getMessage());
e.printStackTrace();
return;
}
// Read WAV file
float[] audioData;
try {
System.out.println("\nReading audio file: " + AUDIO_FILE_PATH);
audioData = readWavFileAsFloatArray(AUDIO_FILE_PATH);
System.out.println("Audio file read successfully, samples: " + audioData.length);
System.out.println("Audio duration: " + String.format("%.2f", (audioData.length / (float) SAMPLE_RATE)) + " seconds");
} catch (Exception e) {
System.err.println("Failed to read audio file: " + e.getMessage());
e.printStackTrace();
return;
}
// Get speech timestamps (batch mode, consistent with Python's get_speech_timestamps)
System.out.println("\nDetecting speech segments...");
List<Map<String, Integer>> speechTimestamps;
try {
speechTimestamps = getSpeechTimestamps(
audioData,
model,
THRESHOLD,
SAMPLE_RATE,
MIN_SPEECH_DURATION_MS,
MIN_SILENCE_DURATION_MS,
SPEECH_PAD_MS,
NEG_THRESHOLD
);
} catch (OrtException e) {
System.err.println("Failed to detect speech timestamps: " + e.getMessage());
e.printStackTrace();
return;
}
// Output detection results
System.out.println("\nDetected speech timestamps (in samples):");
for (Map<String, Integer> timestamp : speechTimestamps) {
System.out.println(timestamp);
}
// Output summary
System.out.println("\n" + "=".repeat(60));
System.out.println("Detection completed!");
System.out.println("Total detected " + speechTimestamps.size() + " speech segments");
System.out.println("=".repeat(60));
// Close model
try {
model.close();
} catch (OrtException e) {
System.err.println("Error closing model: " + e.getMessage());
}
}
/**
* Get speech timestamps
* Implements the same logic as Python's get_speech_timestamps
*
* @param audio Audio data (float array)
* @param model ONNX model
* @param threshold Speech threshold
* @param samplingRate Sampling rate
* @param minSpeechDurationMs Minimum speech duration (milliseconds)
* @param minSilenceDurationMs Minimum silence duration (milliseconds)
* @param speechPadMs Speech padding (milliseconds)
* @param negThreshold Negative threshold (used to determine speech end)
* @return List of speech timestamps
*/
private static List<Map<String, Integer>> getSpeechTimestamps(
float[] audio,
SlieroVadOnnxModel model,
float threshold,
int samplingRate,
int minSpeechDurationMs,
int minSilenceDurationMs,
int speechPadMs,
float negThreshold) throws OrtException {
// Reset model states
model.resetStates();
// Calculate parameters
int minSpeechSamples = samplingRate * minSpeechDurationMs / 1000;
int speechPadSamples = samplingRate * speechPadMs / 1000;
int minSilenceSamples = samplingRate * minSilenceDurationMs / 1000;
int windowSizeSamples = samplingRate == 16000 ? 512 : 256;
int audioLengthSamples = audio.length;
// Calculate speech probabilities for all audio chunks
List<Float> speechProbs = new ArrayList<>();
for (int currentStart = 0; currentStart < audioLengthSamples; currentStart += windowSizeSamples) {
float[] chunk = new float[windowSizeSamples];
int chunkLength = Math.min(windowSizeSamples, audioLengthSamples - currentStart);
System.arraycopy(audio, currentStart, chunk, 0, chunkLength);
// Pad with zeros if chunk is shorter than window size
if (chunkLength < windowSizeSamples) {
for (int i = chunkLength; i < windowSizeSamples; i++) {
chunk[i] = 0.0f;
}
}
float speechProb = model.call(new float[][]{chunk}, samplingRate)[0];
speechProbs.add(speechProb);
}
// Detect speech segments using the same algorithm as Python
boolean triggered = false;
List<Map<String, Integer>> speeches = new ArrayList<>();
Map<String, Integer> currentSpeech = null;
int tempEnd = 0;
for (int i = 0; i < speechProbs.size(); i++) {
float speechProb = speechProbs.get(i);
// Reset temporary end if speech probability exceeds threshold
if (speechProb >= threshold && tempEnd != 0) {
tempEnd = 0;
}
// Detect speech start
if (speechProb >= threshold && !triggered) {
triggered = true;
currentSpeech = new HashMap<>();
currentSpeech.put("start", windowSizeSamples * i);
continue;
}
// Detect speech end
if (speechProb < negThreshold && triggered) {
if (tempEnd == 0) {
tempEnd = windowSizeSamples * i;
}
if (windowSizeSamples * i - tempEnd < minSilenceSamples) {
continue;
} else {
currentSpeech.put("end", tempEnd);
if (currentSpeech.get("end") - currentSpeech.get("start") > minSpeechSamples) {
speeches.add(currentSpeech);
}
currentSpeech = null;
tempEnd = 0;
triggered = false;
}
}
}
// Handle the last speech segment
if (currentSpeech != null &&
(audioLengthSamples - currentSpeech.get("start")) > minSpeechSamples) {
currentSpeech.put("end", audioLengthSamples);
speeches.add(currentSpeech);
}
// Add speech padding - same logic as Python
for (int i = 0; i < speeches.size(); i++) {
Map<String, Integer> speech = speeches.get(i);
if (i == 0) {
speech.put("start", Math.max(0, speech.get("start") - speechPadSamples));
}
if (i != speeches.size() - 1) {
int silenceDuration = speeches.get(i + 1).get("start") - speech.get("end");
if (silenceDuration < 2 * speechPadSamples) {
speech.put("end", speech.get("end") + silenceDuration / 2);
speeches.get(i + 1).put("start",
Math.max(0, speeches.get(i + 1).get("start") - silenceDuration / 2));
} else {
speech.put("end", Math.min(audioLengthSamples, speech.get("end") + speechPadSamples));
speeches.get(i + 1).put("start",
Math.max(0, speeches.get(i + 1).get("start") - speechPadSamples));
}
} else {
speech.put("end", Math.min(audioLengthSamples, speech.get("end") + speechPadSamples));
}
}
return speeches;
}
/**
* Read WAV file and return as float array
*
* @param filePath WAV file path
* @return Audio data as float array (normalized to -1.0 to 1.0)
*/
private static float[] readWavFileAsFloatArray(String filePath)
throws UnsupportedAudioFileException, IOException {
File audioFile = new File(filePath);
AudioInputStream audioStream = AudioSystem.getAudioInputStream(audioFile);
// Get audio format information
AudioFormat format = audioStream.getFormat();
System.out.println("Audio format: " + format);
// Read all audio data
byte[] audioBytes = audioStream.readAllBytes();
audioStream.close();
// Convert to float array
float[] audioData = new float[audioBytes.length / 2];
for (int i = 0; i < audioData.length; i++) {
// 16-bit PCM: two bytes per sample (little-endian)
short sample = (short) ((audioBytes[i * 2] & 0xff) | (audioBytes[i * 2 + 1] << 8));
audioData[i] = sample / 32768.0f; // Normalize to -1.0 to 1.0
}
return audioData;
}
}

View File

@@ -0,0 +1,156 @@
package org.example;
import ai.onnxruntime.OrtException;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
/**
* Silero VAD Detector
* Real-time voice activity detection
*
* @author VvvvvGH
*/
public class SlieroVadDetector {
// ONNX model for speech processing
private final SlieroVadOnnxModel model;
// Speech start threshold
private final float startThreshold;
// Speech end threshold
private final float endThreshold;
// Sampling rate
private final int samplingRate;
// Minimum silence samples to determine speech end
private final float minSilenceSamples;
// Speech padding samples for calculating speech boundaries
private final float speechPadSamples;
// Triggered state (whether speech is being detected)
private boolean triggered;
// Temporary speech end sample position
private int tempEnd;
// Current sample position
private int currentSample;
public SlieroVadDetector(String modelPath,
float startThreshold,
float endThreshold,
int samplingRate,
int minSilenceDurationMs,
int speechPadMs) throws OrtException {
// Validate sampling rate
if (samplingRate != 8000 && samplingRate != 16000) {
throw new IllegalArgumentException("Does not support sampling rates other than [8000, 16000]");
}
// Initialize parameters
this.model = new SlieroVadOnnxModel(modelPath);
this.startThreshold = startThreshold;
this.endThreshold = endThreshold;
this.samplingRate = samplingRate;
this.minSilenceSamples = samplingRate * minSilenceDurationMs / 1000f;
this.speechPadSamples = samplingRate * speechPadMs / 1000f;
// Reset state
reset();
}
/**
* Reset detector state
*/
public void reset() {
model.resetStates();
triggered = false;
tempEnd = 0;
currentSample = 0;
}
/**
* Process audio data and detect speech events
*
* @param data Audio data as byte array
* @param returnSeconds Whether to return timestamps in seconds
* @return Speech event (start or end) or empty map if no event
*/
public Map<String, Double> apply(byte[] data, boolean returnSeconds) {
// Convert byte array to float array
float[] audioData = new float[data.length / 2];
for (int i = 0; i < audioData.length; i++) {
audioData[i] = ((data[i * 2] & 0xff) | (data[i * 2 + 1] << 8)) / 32767.0f;
}
// Get window size from audio data length
int windowSizeSamples = audioData.length;
// Update current sample position
currentSample += windowSizeSamples;
// Get speech probability from model
float speechProb = 0;
try {
speechProb = model.call(new float[][]{audioData}, samplingRate)[0];
} catch (OrtException e) {
throw new RuntimeException(e);
}
// Reset temporary end if speech probability exceeds threshold
if (speechProb >= startThreshold && tempEnd != 0) {
tempEnd = 0;
}
// Detect speech start
if (speechProb >= startThreshold && !triggered) {
triggered = true;
int speechStart = (int) (currentSample - speechPadSamples);
speechStart = Math.max(speechStart, 0);
Map<String, Double> result = new HashMap<>();
// Return in seconds or samples based on returnSeconds parameter
if (returnSeconds) {
double speechStartSeconds = speechStart / (double) samplingRate;
double roundedSpeechStart = BigDecimal.valueOf(speechStartSeconds).setScale(1, RoundingMode.HALF_UP).doubleValue();
result.put("start", roundedSpeechStart);
} else {
result.put("start", (double) speechStart);
}
return result;
}
// Detect speech end
if (speechProb < endThreshold && triggered) {
// Initialize or update temporary end position
if (tempEnd == 0) {
tempEnd = currentSample;
}
// Wait for minimum silence duration before confirming speech end
if (currentSample - tempEnd < minSilenceSamples) {
return Collections.emptyMap();
} else {
// Calculate speech end time and reset state
int speechEnd = (int) (tempEnd + speechPadSamples);
tempEnd = 0;
triggered = false;
Map<String, Double> result = new HashMap<>();
if (returnSeconds) {
double speechEndSeconds = speechEnd / (double) samplingRate;
double roundedSpeechEnd = BigDecimal.valueOf(speechEndSeconds).setScale(1, RoundingMode.HALF_UP).doubleValue();
result.put("end", roundedSpeechEnd);
} else {
result.put("end", (double) speechEnd);
}
return result;
}
}
// No speech event detected
return Collections.emptyMap();
}
public void close() throws OrtException {
reset();
model.close();
}
}

View File

@@ -0,0 +1,224 @@
package org.example;
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
* Silero VAD ONNX Model Wrapper
*
* @author VvvvvGH
*/
public class SlieroVadOnnxModel {
// ONNX runtime session
private final OrtSession session;
// Model state - dimensions: [2, batch_size, 128]
private float[][][] state;
// Context - stores the tail of the previous audio chunk
private float[][] context;
// Last sample rate
private int lastSr = 0;
// Last batch size
private int lastBatchSize = 0;
// Supported sample rates
private static final List<Integer> SAMPLE_RATES = Arrays.asList(8000, 16000);
// Constructor
public SlieroVadOnnxModel(String modelPath) throws OrtException {
// Get the ONNX runtime environment
OrtEnvironment env = OrtEnvironment.getEnvironment();
// Create ONNX session options
OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
// Set InterOp thread count to 1 (for parallel processing of different graph operations)
opts.setInterOpNumThreads(1);
// Set IntraOp thread count to 1 (for parallel processing within a single operation)
opts.setIntraOpNumThreads(1);
// Enable CPU execution optimization
opts.addCPU(true);
// Create ONNX session with the environment, model path, and options
session = env.createSession(modelPath, opts);
// Reset states
resetStates();
}
/**
* Reset states with default batch size
*/
void resetStates() {
resetStates(1);
}
/**
* Reset states with specific batch size
*
* @param batchSize Batch size for state initialization
*/
void resetStates(int batchSize) {
state = new float[2][batchSize][128];
context = new float[0][]; // Empty context
lastSr = 0;
lastBatchSize = 0;
}
public void close() throws OrtException {
session.close();
}
/**
* Inner class for validation result
*/
public static class ValidationResult {
public final float[][] x;
public final int sr;
public ValidationResult(float[][] x, int sr) {
this.x = x;
this.sr = sr;
}
}
/**
* Validate input data
*
* @param x Audio data array
* @param sr Sample rate
* @return Validated input data and sample rate
*/
private ValidationResult validateInput(float[][] x, int sr) {
// Ensure input is at least 2D
if (x.length == 1) {
x = new float[][]{x[0]};
}
// Check if input dimension is valid
if (x.length > 2) {
throw new IllegalArgumentException("Incorrect audio data dimension: " + x[0].length);
}
// Downsample if sample rate is a multiple of 16000
if (sr != 16000 && (sr % 16000 == 0)) {
int step = sr / 16000;
float[][] reducedX = new float[x.length][];
for (int i = 0; i < x.length; i++) {
float[] current = x[i];
float[] newArr = new float[(current.length + step - 1) / step];
for (int j = 0, index = 0; j < current.length; j += step, index++) {
newArr[index] = current[j];
}
reducedX[i] = newArr;
}
x = reducedX;
sr = 16000;
}
// Validate sample rate
if (!SAMPLE_RATES.contains(sr)) {
throw new IllegalArgumentException("Only supports sample rates " + SAMPLE_RATES + " (or multiples of 16000)");
}
// Check if audio chunk is too short
if (((float) sr) / x[0].length > 31.25) {
throw new IllegalArgumentException("Input audio is too short");
}
return new ValidationResult(x, sr);
}
/**
* Call the ONNX model for inference
*
* @param x Audio data array
* @param sr Sample rate
* @return Speech probability output
* @throws OrtException If ONNX runtime error occurs
*/
public float[] call(float[][] x, int sr) throws OrtException {
ValidationResult result = validateInput(x, sr);
x = result.x;
sr = result.sr;
int batchSize = x.length;
int numSamples = sr == 16000 ? 512 : 256;
int contextSize = sr == 16000 ? 64 : 32;
// Reset states only when sample rate or batch size changes
if (lastSr != 0 && lastSr != sr) {
resetStates(batchSize);
} else if (lastBatchSize != 0 && lastBatchSize != batchSize) {
resetStates(batchSize);
} else if (lastBatchSize == 0) {
// First call - state is already initialized, just set batch size
lastBatchSize = batchSize;
}
// Initialize context if needed
if (context.length == 0) {
context = new float[batchSize][contextSize];
}
// Concatenate context and input
float[][] xWithContext = new float[batchSize][contextSize + numSamples];
for (int i = 0; i < batchSize; i++) {
// Copy context
System.arraycopy(context[i], 0, xWithContext[i], 0, contextSize);
// Copy input
System.arraycopy(x[i], 0, xWithContext[i], contextSize, numSamples);
}
OrtEnvironment env = OrtEnvironment.getEnvironment();
OnnxTensor inputTensor = null;
OnnxTensor stateTensor = null;
OnnxTensor srTensor = null;
OrtSession.Result ortOutputs = null;
try {
// Create input tensors
inputTensor = OnnxTensor.createTensor(env, xWithContext);
stateTensor = OnnxTensor.createTensor(env, state);
srTensor = OnnxTensor.createTensor(env, new long[]{sr});
Map<String, OnnxTensor> inputs = new HashMap<>();
inputs.put("input", inputTensor);
inputs.put("sr", srTensor);
inputs.put("state", stateTensor);
// Run ONNX model inference
ortOutputs = session.run(inputs);
// Get output results
float[][] output = (float[][]) ortOutputs.get(0).getValue();
state = (float[][][]) ortOutputs.get(1).getValue();
// Update context - save the last contextSize samples from input
for (int i = 0; i < batchSize; i++) {
System.arraycopy(xWithContext[i], xWithContext[i].length - contextSize,
context[i], 0, contextSize);
}
lastSr = sr;
lastBatchSize = batchSize;
return output[0];
} finally {
if (inputTensor != null) {
inputTensor.close();
}
if (stateTensor != null) {
stateTensor.close();
}
if (srTensor != null) {
srTensor.close();
}
if (ortOutputs != null) {
ortOutputs.close();
}
}
}
}

View File

@@ -0,0 +1,37 @@
package org.example;
import ai.onnxruntime.OrtException;
import java.io.File;
import java.util.List;
public class App {
private static final String MODEL_PATH = "/path/silero_vad.onnx";
private static final String EXAMPLE_WAV_FILE = "/path/example.wav";
private static final int SAMPLE_RATE = 16000;
private static final float THRESHOLD = 0.5f;
private static final int MIN_SPEECH_DURATION_MS = 250;
private static final float MAX_SPEECH_DURATION_SECONDS = Float.POSITIVE_INFINITY;
private static final int MIN_SILENCE_DURATION_MS = 100;
private static final int SPEECH_PAD_MS = 30;
public static void main(String[] args) {
// Initialize the Voice Activity Detector
SileroVadDetector vadDetector;
try {
vadDetector = new SileroVadDetector(MODEL_PATH, THRESHOLD, SAMPLE_RATE,
MIN_SPEECH_DURATION_MS, MAX_SPEECH_DURATION_SECONDS, MIN_SILENCE_DURATION_MS, SPEECH_PAD_MS);
fromWavFile(vadDetector, new File(EXAMPLE_WAV_FILE));
} catch (OrtException e) {
System.err.println("Error initializing the VAD detector: " + e.getMessage());
}
}
public static void fromWavFile(SileroVadDetector vadDetector, File wavFile) {
List<SileroSpeechSegment> speechTimeList = vadDetector.getSpeechSegmentList(wavFile);
for (SileroSpeechSegment speechSegment : speechTimeList) {
System.out.println(String.format("start second: %f, end second: %f",
speechSegment.getStartSecond(), speechSegment.getEndSecond()));
}
}
}

View File

@@ -0,0 +1,51 @@
package org.example;
public class SileroSpeechSegment {
private Integer startOffset;
private Integer endOffset;
private Float startSecond;
private Float endSecond;
public SileroSpeechSegment() {
}
public SileroSpeechSegment(Integer startOffset, Integer endOffset, Float startSecond, Float endSecond) {
this.startOffset = startOffset;
this.endOffset = endOffset;
this.startSecond = startSecond;
this.endSecond = endSecond;
}
public Integer getStartOffset() {
return startOffset;
}
public Integer getEndOffset() {
return endOffset;
}
public Float getStartSecond() {
return startSecond;
}
public Float getEndSecond() {
return endSecond;
}
public void setStartOffset(Integer startOffset) {
this.startOffset = startOffset;
}
public void setEndOffset(Integer endOffset) {
this.endOffset = endOffset;
}
public void setStartSecond(Float startSecond) {
this.startSecond = startSecond;
}
public void setEndSecond(Float endSecond) {
this.endSecond = endSecond;
}
}

View File

@@ -0,0 +1,244 @@
package org.example;
import ai.onnxruntime.OrtException;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
public class SileroVadDetector {
private final SileroVadOnnxModel model;
private final float threshold;
private final float negThreshold;
private final int samplingRate;
private final int windowSizeSample;
private final float minSpeechSamples;
private final float speechPadSamples;
private final float maxSpeechSamples;
private final float minSilenceSamples;
private final float minSilenceSamplesAtMaxSpeech;
private int audioLengthSamples;
private static final float THRESHOLD_GAP = 0.15f;
private static final Integer SAMPLING_RATE_8K = 8000;
private static final Integer SAMPLING_RATE_16K = 16000;
/**
* Constructor
* @param onnxModelPath the path of silero-vad onnx model
* @param threshold threshold for speech start
* @param samplingRate audio sampling rate, only available for [8k, 16k]
* @param minSpeechDurationMs Minimum speech length in millis, any speech duration that smaller than this value would not be considered as speech
* @param maxSpeechDurationSeconds Maximum speech length in millis, recommend to be set as Float.POSITIVE_INFINITY
* @param minSilenceDurationMs Minimum silence length in millis, any silence duration that smaller than this value would not be considered as silence
* @param speechPadMs Additional pad millis for speech start and end
* @throws OrtException
*/
public SileroVadDetector(String onnxModelPath, float threshold, int samplingRate,
int minSpeechDurationMs, float maxSpeechDurationSeconds,
int minSilenceDurationMs, int speechPadMs) throws OrtException {
if (samplingRate != SAMPLING_RATE_8K && samplingRate != SAMPLING_RATE_16K) {
throw new IllegalArgumentException("Sampling rate not support, only available for [8000, 16000]");
}
this.model = new SileroVadOnnxModel(onnxModelPath);
this.samplingRate = samplingRate;
this.threshold = threshold;
this.negThreshold = threshold - THRESHOLD_GAP;
if (samplingRate == SAMPLING_RATE_16K) {
this.windowSizeSample = 512;
} else {
this.windowSizeSample = 256;
}
this.minSpeechSamples = samplingRate * minSpeechDurationMs / 1000f;
this.speechPadSamples = samplingRate * speechPadMs / 1000f;
this.maxSpeechSamples = samplingRate * maxSpeechDurationSeconds - windowSizeSample - 2 * speechPadSamples;
this.minSilenceSamples = samplingRate * minSilenceDurationMs / 1000f;
this.minSilenceSamplesAtMaxSpeech = samplingRate * 98 / 1000f;
this.reset();
}
/**
* Method to reset the state
*/
public void reset() {
model.resetStates();
}
/**
* Get speech segment list by given wav-format file
* @param wavFile wav file
* @return list of speech segment
*/
public List<SileroSpeechSegment> getSpeechSegmentList(File wavFile) {
reset();
try (AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(wavFile)){
List<Float> speechProbList = new ArrayList<>();
this.audioLengthSamples = audioInputStream.available() / 2;
byte[] data = new byte[this.windowSizeSample * 2];
int numBytesRead = 0;
while ((numBytesRead = audioInputStream.read(data)) != -1) {
if (numBytesRead <= 0) {
break;
}
// Convert the byte array to a float array
float[] audioData = new float[data.length / 2];
for (int i = 0; i < audioData.length; i++) {
audioData[i] = ((data[i * 2] & 0xff) | (data[i * 2 + 1] << 8)) / 32767.0f;
}
float speechProb = 0;
try {
speechProb = model.call(new float[][]{audioData}, samplingRate)[0];
speechProbList.add(speechProb);
} catch (OrtException e) {
throw e;
}
}
return calculateProb(speechProbList);
} catch (Exception e) {
throw new RuntimeException("SileroVadDetector getSpeechTimeList with error", e);
}
}
/**
* Calculate speech segement by probability
* @param speechProbList speech probability list
* @return list of speech segment
*/
private List<SileroSpeechSegment> calculateProb(List<Float> speechProbList) {
List<SileroSpeechSegment> result = new ArrayList<>();
boolean triggered = false;
int tempEnd = 0, prevEnd = 0, nextStart = 0;
SileroSpeechSegment segment = new SileroSpeechSegment();
for (int i = 0; i < speechProbList.size(); i++) {
Float speechProb = speechProbList.get(i);
if (speechProb >= threshold && (tempEnd != 0)) {
tempEnd = 0;
if (nextStart < prevEnd) {
nextStart = windowSizeSample * i;
}
}
if (speechProb >= threshold && !triggered) {
triggered = true;
segment.setStartOffset(windowSizeSample * i);
continue;
}
if (triggered && (windowSizeSample * i) - segment.getStartOffset() > maxSpeechSamples) {
if (prevEnd != 0) {
segment.setEndOffset(prevEnd);
result.add(segment);
segment = new SileroSpeechSegment();
if (nextStart < prevEnd) {
triggered = false;
}else {
segment.setStartOffset(nextStart);
}
prevEnd = 0;
nextStart = 0;
tempEnd = 0;
}else {
segment.setEndOffset(windowSizeSample * i);
result.add(segment);
segment = new SileroSpeechSegment();
prevEnd = 0;
nextStart = 0;
tempEnd = 0;
triggered = false;
continue;
}
}
if (speechProb < negThreshold && triggered) {
if (tempEnd == 0) {
tempEnd = windowSizeSample * i;
}
if (((windowSizeSample * i) - tempEnd) > minSilenceSamplesAtMaxSpeech) {
prevEnd = tempEnd;
}
if ((windowSizeSample * i) - tempEnd < minSilenceSamples) {
continue;
}else {
segment.setEndOffset(tempEnd);
if ((segment.getEndOffset() - segment.getStartOffset()) > minSpeechSamples) {
result.add(segment);
}
segment = new SileroSpeechSegment();
prevEnd = 0;
nextStart = 0;
tempEnd = 0;
triggered = false;
continue;
}
}
}
if (segment.getStartOffset() != null && (audioLengthSamples - segment.getStartOffset()) > minSpeechSamples) {
segment.setEndOffset(audioLengthSamples);
result.add(segment);
}
for (int i = 0; i < result.size(); i++) {
SileroSpeechSegment item = result.get(i);
if (i == 0) {
item.setStartOffset((int)(Math.max(0,item.getStartOffset() - speechPadSamples)));
}
if (i != result.size() - 1) {
SileroSpeechSegment nextItem = result.get(i + 1);
Integer silenceDuration = nextItem.getStartOffset() - item.getEndOffset();
if(silenceDuration < 2 * speechPadSamples){
item.setEndOffset(item.getEndOffset() + (silenceDuration / 2 ));
nextItem.setStartOffset(Math.max(0, nextItem.getStartOffset() - (silenceDuration / 2)));
} else {
item.setEndOffset((int)(Math.min(audioLengthSamples, item.getEndOffset() + speechPadSamples)));
nextItem.setStartOffset((int)(Math.max(0,nextItem.getStartOffset() - speechPadSamples)));
}
}else {
item.setEndOffset((int)(Math.min(audioLengthSamples, item.getEndOffset() + speechPadSamples)));
}
}
return mergeListAndCalculateSecond(result, samplingRate);
}
private List<SileroSpeechSegment> mergeListAndCalculateSecond(List<SileroSpeechSegment> original, Integer samplingRate) {
List<SileroSpeechSegment> result = new ArrayList<>();
if (original == null || original.size() == 0) {
return result;
}
Integer left = original.get(0).getStartOffset();
Integer right = original.get(0).getEndOffset();
if (original.size() > 1) {
original.sort(Comparator.comparingLong(SileroSpeechSegment::getStartOffset));
for (int i = 1; i < original.size(); i++) {
SileroSpeechSegment segment = original.get(i);
if (segment.getStartOffset() > right) {
result.add(new SileroSpeechSegment(left, right,
calculateSecondByOffset(left, samplingRate), calculateSecondByOffset(right, samplingRate)));
left = segment.getStartOffset();
right = segment.getEndOffset();
} else {
right = Math.max(right, segment.getEndOffset());
}
}
result.add(new SileroSpeechSegment(left, right,
calculateSecondByOffset(left, samplingRate), calculateSecondByOffset(right, samplingRate)));
}else {
result.add(new SileroSpeechSegment(left, right,
calculateSecondByOffset(left, samplingRate), calculateSecondByOffset(right, samplingRate)));
}
return result;
}
private Float calculateSecondByOffset(Integer offset, Integer samplingRate) {
float secondValue = offset * 1.0f / samplingRate;
return (float) Math.floor(secondValue * 1000.0f) / 1000.0f;
}
}

View File

@@ -0,0 +1,234 @@
package org.example;
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class SileroVadOnnxModel {
// Define private variable OrtSession
private final OrtSession session;
private float[][][] state;
private float[][] context;
// Define the last sample rate
private int lastSr = 0;
// Define the last batch size
private int lastBatchSize = 0;
// Define a list of supported sample rates
private static final List<Integer> SAMPLE_RATES = Arrays.asList(8000, 16000);
// Constructor
public SileroVadOnnxModel(String modelPath) throws OrtException {
// Get the ONNX runtime environment
OrtEnvironment env = OrtEnvironment.getEnvironment();
// Create an ONNX session options object
OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
// Set the InterOp thread count to 1, InterOp threads are used for parallel processing of different computation graph operations
opts.setInterOpNumThreads(1);
// Set the IntraOp thread count to 1, IntraOp threads are used for parallel processing within a single operation
opts.setIntraOpNumThreads(1);
// Add a CPU device, setting to false disables CPU execution optimization
opts.addCPU(true);
// Create an ONNX session using the environment, model path, and options
session = env.createSession(modelPath, opts);
// Reset states
resetStates();
}
/**
* Reset states
*/
void resetStates() {
state = new float[2][1][128];
context = new float[0][];
lastSr = 0;
lastBatchSize = 0;
}
public void close() throws OrtException {
session.close();
}
/**
* Define inner class ValidationResult
*/
public static class ValidationResult {
public final float[][] x;
public final int sr;
// Constructor
public ValidationResult(float[][] x, int sr) {
this.x = x;
this.sr = sr;
}
}
/**
* Function to validate input data
*/
private ValidationResult validateInput(float[][] x, int sr) {
// Process the input data with dimension 1
if (x.length == 1) {
x = new float[][]{x[0]};
}
// Throw an exception when the input data dimension is greater than 2
if (x.length > 2) {
throw new IllegalArgumentException("Incorrect audio data dimension: " + x[0].length);
}
// Process the input data when the sample rate is not equal to 16000 and is a multiple of 16000
if (sr != 16000 && (sr % 16000 == 0)) {
int step = sr / 16000;
float[][] reducedX = new float[x.length][];
for (int i = 0; i < x.length; i++) {
float[] current = x[i];
float[] newArr = new float[(current.length + step - 1) / step];
for (int j = 0, index = 0; j < current.length; j += step, index++) {
newArr[index] = current[j];
}
reducedX[i] = newArr;
}
x = reducedX;
sr = 16000;
}
// If the sample rate is not in the list of supported sample rates, throw an exception
if (!SAMPLE_RATES.contains(sr)) {
throw new IllegalArgumentException("Only supports sample rates " + SAMPLE_RATES + " (or multiples of 16000)");
}
// If the input audio block is too short, throw an exception
if (((float) sr) / x[0].length > 31.25) {
throw new IllegalArgumentException("Input audio is too short");
}
// Return the validated result
return new ValidationResult(x, sr);
}
private static float[][] concatenate(float[][] a, float[][] b) {
if (a.length != b.length) {
throw new IllegalArgumentException("The number of rows in both arrays must be the same.");
}
int rows = a.length;
int colsA = a[0].length;
int colsB = b[0].length;
float[][] result = new float[rows][colsA + colsB];
for (int i = 0; i < rows; i++) {
System.arraycopy(a[i], 0, result[i], 0, colsA);
System.arraycopy(b[i], 0, result[i], colsA, colsB);
}
return result;
}
private static float[][] getLastColumns(float[][] array, int contextSize) {
int rows = array.length;
int cols = array[0].length;
if (contextSize > cols) {
throw new IllegalArgumentException("contextSize cannot be greater than the number of columns in the array.");
}
float[][] result = new float[rows][contextSize];
for (int i = 0; i < rows; i++) {
System.arraycopy(array[i], cols - contextSize, result[i], 0, contextSize);
}
return result;
}
/**
* Method to call the ONNX model
*/
public float[] call(float[][] x, int sr) throws OrtException {
ValidationResult result = validateInput(x, sr);
x = result.x;
sr = result.sr;
int numberSamples = 256;
if (sr == 16000) {
numberSamples = 512;
}
if (x[0].length != numberSamples) {
throw new IllegalArgumentException("Provided number of samples is " + x[0].length + " (Supported values: 256 for 8000 sample rate, 512 for 16000)");
}
int batchSize = x.length;
int contextSize = 32;
if (sr == 16000) {
contextSize = 64;
}
if (lastBatchSize == 0) {
resetStates();
}
if (lastSr != 0 && lastSr != sr) {
resetStates();
}
if (lastBatchSize != 0 && lastBatchSize != batchSize) {
resetStates();
}
if (context.length == 0) {
context = new float[batchSize][contextSize];
}
x = concatenate(context, x);
OrtEnvironment env = OrtEnvironment.getEnvironment();
OnnxTensor inputTensor = null;
OnnxTensor stateTensor = null;
OnnxTensor srTensor = null;
OrtSession.Result ortOutputs = null;
try {
// Create input tensors
inputTensor = OnnxTensor.createTensor(env, x);
stateTensor = OnnxTensor.createTensor(env, state);
srTensor = OnnxTensor.createTensor(env, new long[]{sr});
Map<String, OnnxTensor> inputs = new HashMap<>();
inputs.put("input", inputTensor);
inputs.put("sr", srTensor);
inputs.put("state", stateTensor);
// Call the ONNX model for calculation
ortOutputs = session.run(inputs);
// Get the output results
float[][] output = (float[][]) ortOutputs.get(0).getValue();
state = (float[][][]) ortOutputs.get(1).getValue();
context = getLastColumns(x, contextSize);
lastSr = sr;
lastBatchSize = batchSize;
return output[0];
} finally {
if (inputTensor != null) {
inputTensor.close();
}
if (stateTensor != null) {
stateTensor.close();
}
if (srTensor != null) {
srTensor.close();
}
if (ortOutputs != null) {
ortOutputs.close();
}
}
}
}

View File

@@ -186,7 +186,7 @@ if __name__ == '__main__':
help="same as trig_sum, but for switching from triggered to non-triggered state (non-speech)")
parser.add_argument('-N', '--num_steps', type=int, default=8,
help="nubmer of overlapping windows to split audio chunk into (we recommend 4 or 8)")
help="number of overlapping windows to split audio chunk into (we recommend 4 or 8)")
parser.add_argument('-nspw', '--num_samples_per_window', type=int, default=4000,
help="number of samples in each window, our models were trained using 4000 samples (250 ms) per window, so this is preferable value (lesser values reduce quality)")
@@ -198,4 +198,4 @@ if __name__ == '__main__':
help=" minimum silence duration in samples between to separate speech chunks")
ARGS = parser.parse_args()
ARGS.rate=DEFAULT_SAMPLE_RATE
main(ARGS)
main(ARGS)

View File

@@ -0,0 +1,161 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install Dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# !pip install -q torchaudio\n",
"SAMPLING_RATE = 16000\n",
"import torch\n",
"from pprint import pprint\n",
"import time\n",
"import shutil\n",
"\n",
"torch.set_num_threads(1)\n",
"NUM_PROCESS=4 # set to the number of CPU cores in the machine\n",
"NUM_COPIES=8\n",
"# download wav files, make multiple copies\n",
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', f\"en_example0.wav\")\n",
"for idx in range(NUM_COPIES-1):\n",
" shutil.copy(f\"en_example0.wav\", f\"en_example{idx+1}.wav\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load VAD model from torch hub"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
" model='silero_vad',\n",
" force_reload=True,\n",
" onnx=False)\n",
"\n",
"(get_speech_timestamps,\n",
"save_audio,\n",
"read_audio,\n",
"VADIterator,\n",
"collect_chunks) = utils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define a vad process function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import multiprocessing\n",
"\n",
"vad_models = dict()\n",
"\n",
"def init_model(model):\n",
" pid = multiprocessing.current_process().pid\n",
" model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
" model='silero_vad',\n",
" force_reload=False,\n",
" onnx=False)\n",
" vad_models[pid] = model\n",
"\n",
"def vad_process(audio_file: str):\n",
" \n",
" pid = multiprocessing.current_process().pid\n",
" \n",
" with torch.no_grad():\n",
" wav = read_audio(audio_file, sampling_rate=SAMPLING_RATE)\n",
" return get_speech_timestamps(\n",
" wav,\n",
" vad_models[pid],\n",
" 0.46, # speech prob threshold\n",
" 16000, # sample rate\n",
" 300, # min speech duration in ms\n",
" 20, # max speech duration in seconds\n",
" 600, # min silence duration\n",
" 512, # window size\n",
" 200, # spech pad ms\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parallelization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from concurrent.futures import ProcessPoolExecutor, as_completed\n",
"\n",
"futures = []\n",
"\n",
"with ProcessPoolExecutor(max_workers=NUM_PROCESS, initializer=init_model, initargs=(model,)) as ex:\n",
" for i in range(NUM_COPIES):\n",
" futures.append(ex.submit(vad_process, f\"en_example{idx}.wav\"))\n",
"\n",
"for finished in as_completed(futures):\n",
" pprint(finished.result())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -7,6 +7,8 @@ It has been designed as a low-level example for binary real-time streaming using
Currently, the notebook consits of two examples:
- One that records audio of a predefined length from the microphone, process it with Silero-VAD, and plots it afterwards.
- The other one plots the speech probabilities in real-time (using jupyterplot) and records the audio until you press enter.
This example does not work in google colab! For local usage only.
## Example Video for the Real-Time Visualization

View File

@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "62a0cccb",
"id": "76aa55ba",
"metadata": {},
"source": [
"# Pyaudio Microphone Streaming Examples\n",
@@ -12,12 +12,14 @@
"I created it as an example on how binary data from a stream could be feed into Silero VAD.\n",
"\n",
"\n",
"Has been tested on Ubuntu 21.04 (x86). After you installed the dependencies below, no additional setup is required."
"Has been tested on Ubuntu 21.04 (x86). After you installed the dependencies below, no additional setup is required.\n",
"\n",
"This notebook does not work in google colab! For local usage only."
]
},
{
"cell_type": "markdown",
"id": "64cbe1eb",
"id": "4a4e15c2",
"metadata": {},
"source": [
"## Dependencies\n",
@@ -26,22 +28,27 @@
},
{
"cell_type": "code",
"execution_count": null,
"id": "57bc2aac",
"metadata": {},
"execution_count": 1,
"id": "24205cce",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-09T08:47:34.056898Z",
"start_time": "2024-10-09T08:47:34.053418Z"
}
},
"outputs": [],
"source": [
"#!pip install numpy==1.20.2\n",
"#!pip install torch==1.9.0\n",
"#!pip install matplotlib==3.4.2\n",
"#!pip install torchaudio==0.9.0\n",
"#!pip install soundfile==0.10.3.post1\n",
"#!pip install pyaudio==0.2.11"
"#!pip install numpy>=1.24.0\n",
"#!pip install torch>=1.12.0\n",
"#!pip install matplotlib>=3.6.0\n",
"#!pip install torchaudio>=0.12.0\n",
"#!pip install soundfile==0.12.1\n",
"#!apt install python3-pyaudio (linux) or pip install pyaudio (windows)"
]
},
{
"cell_type": "markdown",
"id": "110de761",
"id": "cd22818f",
"metadata": {},
"source": [
"## Imports"
@@ -49,10 +56,27 @@
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a647d8d",
"metadata": {},
"outputs": [],
"execution_count": 2,
"id": "994d7f3a",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-09T08:47:39.005032Z",
"start_time": "2024-10-09T08:47:36.489952Z"
}
},
"outputs": [
{
"ename": "ModuleNotFoundError",
"evalue": "No module named 'pyaudio'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[2], line 8\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpylab\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[0;32m----> 8\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mpyaudio\u001b[39;00m\n",
"\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pyaudio'"
]
}
],
"source": [
"import io\n",
"import numpy as np\n",
@@ -61,14 +85,13 @@
"import torchaudio\n",
"import matplotlib\n",
"import matplotlib.pylab as plt\n",
"torchaudio.set_audio_backend(\"soundfile\")\n",
"import pyaudio"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "725d7066",
"id": "ac5c52f7",
"metadata": {},
"outputs": [],
"source": [
@@ -80,7 +103,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1c0b2ea7",
"id": "ad5919dc",
"metadata": {},
"outputs": [],
"source": [
@@ -93,7 +116,7 @@
},
{
"cell_type": "markdown",
"id": "f9112603",
"id": "784d1ab6",
"metadata": {},
"source": [
"### Helper Methods"
@@ -102,7 +125,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5abc6330",
"id": "af4bca64",
"metadata": {},
"outputs": [],
"source": [
@@ -118,14 +141,14 @@
" abs_max = np.abs(sound).max()\n",
" sound = sound.astype('float32')\n",
" if abs_max > 0:\n",
" sound *= 1/abs_max\n",
" sound *= 1/32768\n",
" sound = sound.squeeze() # depends on the use case\n",
" return sound"
]
},
{
"cell_type": "markdown",
"id": "5124095e",
"id": "ca13e514",
"metadata": {},
"source": [
"## Pyaudio Set-up"
@@ -134,7 +157,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "a845356e",
"id": "75f99022",
"metadata": {},
"outputs": [],
"source": [
@@ -148,7 +171,7 @@
},
{
"cell_type": "markdown",
"id": "0b910c99",
"id": "4da7d2ef",
"metadata": {},
"source": [
"## Simple Example\n",
@@ -158,17 +181,17 @@
{
"cell_type": "code",
"execution_count": null,
"id": "9d3d2c10",
"id": "6fe77661",
"metadata": {},
"outputs": [],
"source": [
"num_samples = 1536"
"num_samples = 512"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3cb44a4a",
"id": "23f4da3e",
"metadata": {},
"outputs": [],
"source": [
@@ -180,6 +203,8 @@
"data = []\n",
"voiced_confidences = []\n",
"\n",
"frames_to_record = 50\n",
"\n",
"print(\"Started Recording\")\n",
"for i in range(0, frames_to_record):\n",
" \n",
@@ -206,7 +231,7 @@
},
{
"cell_type": "markdown",
"id": "a3dda982",
"id": "fd243e8f",
"metadata": {},
"source": [
"## Real Time Visualization\n",
@@ -219,7 +244,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "05ef4100",
"id": "d36980c2",
"metadata": {},
"outputs": [],
"source": [
@@ -229,7 +254,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "d1d4cdd6",
"id": "5607b616",
"metadata": {},
"outputs": [],
"source": [
@@ -286,7 +311,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1e398009",
"id": "dc4f0108",
"metadata": {},
"outputs": [],
"source": [
@@ -296,7 +321,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -310,7 +335,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
"version": "3.10.14"
},
"toc": {
"base_numbering": 1,

2
examples/rust-example/.gitignore vendored Normal file
View File

@@ -0,0 +1,2 @@
target/
recorder.wav

823
examples/rust-example/Cargo.lock generated Normal file
View File

@@ -0,0 +1,823 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4
[[package]]
name = "adler"
version = "1.0.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f26201604c87b1e01bd3d98f8d5d9a8fcbb815e8cedb41ffccbeb4bf593a35fe"
[[package]]
name = "autocfg"
version = "1.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0c4b4d0bd25bd0b74681c0ad21497610ce1b7c91b1022cd21c80c6fbdd9476b0"
[[package]]
name = "base64"
version = "0.22.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6"
[[package]]
name = "base64ct"
version = "1.8.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0e050f626429857a27ddccb31e0aca21356bfa709c04041aefddac081a8f068a"
[[package]]
name = "bitflags"
version = "1.3.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a"
[[package]]
name = "bitflags"
version = "2.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cf4b9d6a944f767f8e5e0db018570623c85f3d925ac718db4e06d0187adb21c1"
[[package]]
name = "block-buffer"
version = "0.10.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3078c7629b62d3f0439517fa394996acacc5cbc91c5a20d8c658e77abd503a71"
dependencies = [
"generic-array",
]
[[package]]
name = "byteorder"
version = "1.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b"
[[package]]
name = "bytes"
version = "1.11.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b35204fbdc0b3f4446b89fc1ac2cf84a8a68971995d0bf2e925ec7cd960f9cb3"
[[package]]
name = "cc"
version = "1.0.98"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "41c270e7540d725e65ac7f1b212ac8ce349719624d7bcff99f8e2e488e8cf03f"
[[package]]
name = "cfg-if"
version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd"
[[package]]
name = "core-foundation"
version = "0.9.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "91e195e091a93c46f7102ec7818a2aa394e1e1771c3ab4825963fa03e45afb8f"
dependencies = [
"core-foundation-sys",
"libc",
]
[[package]]
name = "core-foundation-sys"
version = "0.8.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "773648b94d0e5d620f64f280777445740e61fe701025087ec8b57f45c791888b"
[[package]]
name = "cpufeatures"
version = "0.2.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "53fe5e26ff1b7aef8bca9c6080520cfb8d9333c7568e1829cef191a9723e5504"
dependencies = [
"libc",
]
[[package]]
name = "crc32fast"
version = "1.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a97769d94ddab943e4510d138150169a2758b5ef3eb191a9ee688de3e23ef7b3"
dependencies = [
"cfg-if",
]
[[package]]
name = "crypto-common"
version = "0.1.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1bfb12502f3fc46cca1bb51ac28df9d618d813cdc3d2f25b9fe775a34af26bb3"
dependencies = [
"generic-array",
"typenum",
]
[[package]]
name = "der"
version = "0.7.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e7c1832837b905bbfb5101e07cc24c8deddf52f93225eee6ead5f4d63d53ddcb"
dependencies = [
"pem-rfc7468",
"zeroize",
]
[[package]]
name = "digest"
version = "0.10.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292"
dependencies = [
"block-buffer",
"crypto-common",
]
[[package]]
name = "errno"
version = "0.3.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "534c5cf6194dfab3db3242765c03bbe257cf92f22b38f6bc0c58d59108a820ba"
dependencies = [
"libc",
"windows-sys 0.52.0",
]
[[package]]
name = "fastrand"
version = "2.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be"
[[package]]
name = "filetime"
version = "0.2.23"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1ee447700ac8aa0b2f2bd7bc4462ad686ba06baa6727ac149a2d6277f0d240fd"
dependencies = [
"cfg-if",
"libc",
"redox_syscall",
"windows-sys 0.52.0",
]
[[package]]
name = "flate2"
version = "1.0.30"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5f54427cfd1c7829e2a139fcefea601bf088ebca651d2bf53ebc600eac295dae"
dependencies = [
"crc32fast",
"miniz_oxide",
]
[[package]]
name = "foreign-types"
version = "0.3.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f6f339eb8adc052cd2ca78910fda869aefa38d22d5cb648e6485e4d3fc06f3b1"
dependencies = [
"foreign-types-shared",
]
[[package]]
name = "foreign-types-shared"
version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "00b0228411908ca8685dba7fc2cdd70ec9990a6e753e89b6ac91a84c40fbaf4b"
[[package]]
name = "generic-array"
version = "0.14.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "85649ca51fd72272d7821adaf274ad91c288277713d9c18820d8499a7ff69e9a"
dependencies = [
"typenum",
"version_check",
]
[[package]]
name = "hound"
version = "3.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62adaabb884c94955b19907d60019f4e145d091c75345379e70d1ee696f7854f"
[[package]]
name = "http"
version = "1.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3ba2a386d7f85a81f119ad7498ebe444d2e22c2af0b86b069416ace48b3311a"
dependencies = [
"bytes",
"itoa",
]
[[package]]
name = "httparse"
version = "1.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87"
[[package]]
name = "itoa"
version = "1.0.17"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "92ecc6618181def0457392ccd0ee51198e065e016d1d527a7ac1b6dc7c1f09d2"
[[package]]
name = "libc"
version = "0.2.155"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "97b3888a4aecf77e811145cadf6eef5901f4782c53886191b2f693f24761847c"
[[package]]
name = "linux-raw-sys"
version = "0.4.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78b3ae25bc7c8c38cec158d1f2757ee79e9b3740fbc7ccf0e59e4b08d793fa89"
[[package]]
name = "log"
version = "0.4.29"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897"
[[package]]
name = "matrixmultiply"
version = "0.3.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7574c1cf36da4798ab73da5b215bbf444f50718207754cb522201d78d1cd0ff2"
dependencies = [
"autocfg",
"rawpointer",
]
[[package]]
name = "miniz_oxide"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "87dfd01fe195c66b572b37921ad8803d010623c0aca821bea2302239d155cdae"
dependencies = [
"adler",
]
[[package]]
name = "native-tls"
version = "0.2.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "87de3442987e9dbec73158d5c715e7ad9072fda936bb03d19d7fa10e00520f0e"
dependencies = [
"libc",
"log",
"openssl",
"openssl-probe",
"openssl-sys",
"schannel",
"security-framework",
"security-framework-sys",
"tempfile",
]
[[package]]
name = "ndarray"
version = "0.16.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "882ed72dce9365842bf196bdeedf5055305f11fc8c03dee7bb0194a6cad34841"
dependencies = [
"matrixmultiply",
"num-complex",
"num-integer",
"num-traits",
"portable-atomic",
"portable-atomic-util",
"rawpointer",
]
[[package]]
name = "num-complex"
version = "0.4.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495"
dependencies = [
"num-traits",
]
[[package]]
name = "num-integer"
version = "0.1.46"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7969661fd2958a5cb096e56c8e1ad0444ac2bbcd0061bd28660485a44879858f"
dependencies = [
"num-traits",
]
[[package]]
name = "num-traits"
version = "0.2.19"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "071dfc062690e90b734c0b2273ce72ad0ffa95f0c74596bc250dcfd960262841"
dependencies = [
"autocfg",
]
[[package]]
name = "once_cell"
version = "1.19.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3fdb12b2476b595f9358c5161aa467c2438859caa136dec86c26fdd2efe17b92"
[[package]]
name = "openssl"
version = "0.10.75"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "08838db121398ad17ab8531ce9de97b244589089e290a384c900cb9ff7434328"
dependencies = [
"bitflags 2.5.0",
"cfg-if",
"foreign-types",
"libc",
"once_cell",
"openssl-macros",
"openssl-sys",
]
[[package]]
name = "openssl-macros"
version = "0.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a948666b637a0f465e8564c73e89d4dde00d72d4d473cc972f390fc3dcee7d9c"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "openssl-probe"
version = "0.1.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d05e27ee213611ffe7d6348b942e8f942b37114c00cc03cec254295a4a17852e"
[[package]]
name = "openssl-sys"
version = "0.9.111"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "82cab2d520aa75e3c58898289429321eb788c3106963d0dc886ec7a5f4adc321"
dependencies = [
"cc",
"libc",
"pkg-config",
"vcpkg",
]
[[package]]
name = "ort"
version = "2.0.0-rc.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1fa7e49bd669d32d7bc2a15ec540a527e7764aec722a45467814005725bcd721"
dependencies = [
"ndarray",
"ort-sys",
"smallvec",
"tracing",
]
[[package]]
name = "ort-sys"
version = "2.0.0-rc.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e2aba9f5c7c479925205799216e7e5d07cc1d4fa76ea8058c60a9a30f6a4e890"
dependencies = [
"flate2",
"pkg-config",
"sha2",
"tar",
"ureq",
]
[[package]]
name = "pem-rfc7468"
version = "0.7.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "88b39c9bfcfc231068454382784bb460aae594343fb030d46e9f50a645418412"
dependencies = [
"base64ct",
]
[[package]]
name = "percent-encoding"
version = "2.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3148f5046208a5d56bcfc03053e3ca6334e51da8dfb19b6cdc8b306fae3283e"
[[package]]
name = "pin-project-lite"
version = "0.2.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bda66fc9667c18cb2758a2ac84d1167245054bcf85d5d1aaa6923f45801bdd02"
[[package]]
name = "pkg-config"
version = "0.3.32"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c"
[[package]]
name = "portable-atomic"
version = "1.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f89776e4d69bb58bc6993e99ffa1d11f228b839984854c7daeb5d37f87cbe950"
[[package]]
name = "portable-atomic-util"
version = "0.2.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d8a2f0d8d040d7848a709caf78912debcc3f33ee4b3cac47d73d1e1069e83507"
dependencies = [
"portable-atomic",
]
[[package]]
name = "proc-macro2"
version = "1.0.84"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ec96c6a92621310b51366f1e28d05ef11489516e93be030060e5fc12024a49d6"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.36"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0fa76aaf39101c457836aec0ce2316dbdc3ab723cdda1c6bd4e6ad4208acaca7"
dependencies = [
"proc-macro2",
]
[[package]]
name = "rawpointer"
version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "60a357793950651c4ed0f3f52338f53b2f809f32d83a07f72909fa13e4c6c1e3"
[[package]]
name = "redox_syscall"
version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4722d768eff46b75989dd134e5c353f0d6296e5aaa3132e776cbdb56be7731aa"
dependencies = [
"bitflags 1.3.2",
]
[[package]]
name = "rust-example"
version = "0.1.0"
dependencies = [
"hound",
"ndarray",
"ort",
]
[[package]]
name = "rustix"
version = "0.38.34"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "70dc5ec042f7a43c4a73241207cecc9873a06d45debb38b329f8541d85c2730f"
dependencies = [
"bitflags 2.5.0",
"errno",
"libc",
"linux-raw-sys",
"windows-sys 0.52.0",
]
[[package]]
name = "rustls-pki-types"
version = "1.13.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "21e6f2ab2928ca4291b86736a8bd920a277a399bba1589409d72154ff87c1282"
dependencies = [
"zeroize",
]
[[package]]
name = "schannel"
version = "0.1.28"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "891d81b926048e76efe18581bf793546b4c0eaf8448d72be8de2bbee5fd166e1"
dependencies = [
"windows-sys 0.61.2",
]
[[package]]
name = "security-framework"
version = "2.11.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c627723fd09706bacdb5cf41499e95098555af3c3c29d014dc3c458ef6be11c0"
dependencies = [
"bitflags 2.5.0",
"core-foundation",
"core-foundation-sys",
"libc",
"security-framework-sys",
]
[[package]]
name = "security-framework-sys"
version = "2.15.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cc1f0cbffaac4852523ce30d8bd3c5cdc873501d96ff467ca09b6767bb8cd5c0"
dependencies = [
"core-foundation-sys",
"libc",
]
[[package]]
name = "sha2"
version = "0.10.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "793db75ad2bcafc3ffa7c68b215fee268f537982cd901d132f89c6343f3a3dc8"
dependencies = [
"cfg-if",
"cpufeatures",
"digest",
]
[[package]]
name = "smallvec"
version = "2.0.0-alpha.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "51d44cfb396c3caf6fbfd0ab422af02631b69ddd96d2eff0b0f0724f9024051b"
[[package]]
name = "socks"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0c3dbbd9ae980613c6dd8e28a9407b50509d3803b57624d5dfe8315218cd58b"
dependencies = [
"byteorder",
"libc",
"winapi",
]
[[package]]
name = "syn"
version = "2.0.66"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c42f3f41a2de00b01c0aaad383c5a45241efc8b2d1eda5661812fda5f3cdcff5"
dependencies = [
"proc-macro2",
"quote",
"unicode-ident",
]
[[package]]
name = "tar"
version = "0.4.40"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b16afcea1f22891c49a00c751c7b63b2233284064f11a200fc624137c51e2ddb"
dependencies = [
"filetime",
"libc",
"xattr",
]
[[package]]
name = "tempfile"
version = "3.12.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "04cbcdd0c794ebb0d4cf35e88edd2f7d2c4c3e9a5a6dab322839b321c6a87a64"
dependencies = [
"cfg-if",
"fastrand",
"once_cell",
"rustix",
"windows-sys 0.59.0",
]
[[package]]
name = "tracing"
version = "0.1.40"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c3523ab5a71916ccf420eebdf5521fcef02141234bbc0b8a49f2fdc4544364ef"
dependencies = [
"pin-project-lite",
"tracing-core",
]
[[package]]
name = "tracing-core"
version = "0.1.32"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c06d3da6113f116aaee68e4d601191614c9053067f9ab7f6edbcb161237daa54"
dependencies = [
"once_cell",
]
[[package]]
name = "typenum"
version = "1.17.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "42ff0bf0c66b8238c6f3b578df37d0b7848e55df8577b3f74f92a69acceeb825"
[[package]]
name = "unicode-ident"
version = "1.0.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3354b9ac3fae1ff6755cb6db53683adb661634f67557942dea4facebec0fee4b"
[[package]]
name = "ureq"
version = "3.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d39cb1dbab692d82a977c0392ffac19e188bd9186a9f32806f0aaa859d75585a"
dependencies = [
"base64",
"der",
"log",
"native-tls",
"percent-encoding",
"rustls-pki-types",
"socks",
"ureq-proto",
"utf-8",
"webpki-root-certs",
]
[[package]]
name = "ureq-proto"
version = "0.5.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d81f9efa9df032be5934a46a068815a10a042b494b6a58cb0a1a97bb5467ed6f"
dependencies = [
"base64",
"http",
"httparse",
"log",
]
[[package]]
name = "utf-8"
version = "0.7.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09cc8ee72d2a9becf2f2febe0205bbed8fc6615b7cb429ad062dc7b7ddd036a9"
[[package]]
name = "vcpkg"
version = "0.2.15"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "accd4ea62f7bb7a82fe23066fb0957d48ef677f6eeb8215f372f52e48bb32426"
[[package]]
name = "version_check"
version = "0.9.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f"
[[package]]
name = "webpki-root-certs"
version = "1.0.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ee3e3b5f5e80bc89f30ce8d0343bf4e5f12341c51f3e26cbeecbc7c85443e85b"
dependencies = [
"rustls-pki-types",
]
[[package]]
name = "winapi"
version = "0.3.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419"
dependencies = [
"winapi-i686-pc-windows-gnu",
"winapi-x86_64-pc-windows-gnu",
]
[[package]]
name = "winapi-i686-pc-windows-gnu"
version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6"
[[package]]
name = "winapi-x86_64-pc-windows-gnu"
version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f"
[[package]]
name = "windows-link"
version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5"
[[package]]
name = "windows-sys"
version = "0.52.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d"
dependencies = [
"windows-targets",
]
[[package]]
name = "windows-sys"
version = "0.59.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b"
dependencies = [
"windows-targets",
]
[[package]]
name = "windows-sys"
version = "0.61.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ae137229bcbd6cdf0f7b80a31df61766145077ddf49416a728b02cb3921ff3fc"
dependencies = [
"windows-link",
]
[[package]]
name = "windows-targets"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973"
dependencies = [
"windows_aarch64_gnullvm",
"windows_aarch64_msvc",
"windows_i686_gnu",
"windows_i686_gnullvm",
"windows_i686_msvc",
"windows_x86_64_gnu",
"windows_x86_64_gnullvm",
"windows_x86_64_msvc",
]
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3"
[[package]]
name = "windows_aarch64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469"
[[package]]
name = "windows_i686_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b"
[[package]]
name = "windows_i686_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66"
[[package]]
name = "windows_i686_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66"
[[package]]
name = "windows_x86_64_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78"
[[package]]
name = "windows_x86_64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d"
[[package]]
name = "windows_x86_64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec"
[[package]]
name = "xattr"
version = "1.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8da84f1a25939b27f6820d92aed108f83ff920fdf11a7b19366c27c4cda81d4f"
dependencies = [
"libc",
"linux-raw-sys",
"rustix",
]
[[package]]
name = "zeroize"
version = "1.8.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ced3678a2879b30306d323f4542626697a464a97c0a07c9aebf7ebca65cd4dde"

View File

@@ -0,0 +1,9 @@
[package]
name = "rust-example"
version = "0.1.0"
edition = "2021"
[dependencies]
ort = { version = "=2.0.0-rc.10", features = ["ndarray"] }
ndarray = "0.16"
hound = "3"

View File

@@ -0,0 +1,19 @@
# Stream example in Rust
Made after [C++ stream example](https://github.com/snakers4/silero-vad/tree/master/examples/cpp)
## Dependencies
- To build Rust crate `ort` you need `cc` installed.
## Usage
Just
```
cargo run
```
If you run example outside of this repo adjust environment variable
```
SILERO_MODEL_PATH=/path/to/silero_vad.onnx cargo run
```
If you need to test against other wav file, not `recorder.wav`, specify it as the first argument
```
cargo run -- /path/to/audio/file.wav
```

View File

@@ -0,0 +1,36 @@
mod silero;
mod utils;
mod vad_iter;
fn main() {
let model_path = std::env::var("SILERO_MODEL_PATH")
.unwrap_or_else(|_| String::from("../../src/silero_vad/data/silero_vad.onnx"));
let audio_path = std::env::args()
.nth(1)
.unwrap_or_else(|| String::from("recorder.wav"));
let mut wav_reader = hound::WavReader::open(audio_path).unwrap();
let sample_rate = match wav_reader.spec().sample_rate {
8000 => utils::SampleRate::EightkHz,
16000 => utils::SampleRate::SixteenkHz,
_ => panic!("Unsupported sample rate. Expect 8 kHz or 16 kHz."),
};
if wav_reader.spec().sample_format != hound::SampleFormat::Int {
panic!("Unsupported sample format. Expect Int.");
}
let content = wav_reader
.samples()
.filter_map(|x| x.ok())
.collect::<Vec<i16>>();
assert!(!content.is_empty());
let silero = silero::Silero::new(sample_rate, model_path).unwrap();
let vad_params = utils::VadParams {
sample_rate: sample_rate.into(),
..Default::default()
};
let mut vad_iterator = vad_iter::VadIter::new(silero, vad_params);
vad_iterator.process(&content).unwrap();
for timestamp in vad_iterator.speeches() {
println!("{}", timestamp);
}
println!("Finished.");
}

View File

@@ -0,0 +1,84 @@
use crate::utils;
use ndarray::{Array, Array1, Array2, ArrayBase, ArrayD, Dim, IxDynImpl, OwnedRepr};
use ort::session::Session;
use ort::value::Value;
use std::mem::take;
use std::path::Path;
#[derive(Debug)]
pub struct Silero {
session: Session,
sample_rate: ArrayBase<OwnedRepr<i64>, Dim<[usize; 1]>>,
state: ArrayBase<OwnedRepr<f32>, Dim<IxDynImpl>>,
context: Array1<f32>,
context_size: usize,
}
impl Silero {
pub fn new(
sample_rate: utils::SampleRate,
model_path: impl AsRef<Path>,
) -> Result<Self, ort::Error> {
let session = Session::builder()?.commit_from_file(model_path)?;
let state = ArrayD::<f32>::zeros([2, 1, 128].as_slice());
let sample_rate_val: i64 = sample_rate.into();
let context_size = if sample_rate_val == 16000 { 64 } else { 32 };
let context = Array1::<f32>::zeros(context_size);
let sample_rate = Array::from_shape_vec([1], vec![sample_rate_val]).unwrap();
Ok(Self {
session,
sample_rate,
state,
context,
context_size,
})
}
pub fn reset(&mut self) {
self.state = ArrayD::<f32>::zeros([2, 1, 128].as_slice());
self.context = Array1::<f32>::zeros(self.context_size);
}
pub fn calc_level(&mut self, audio_frame: &[i16]) -> Result<f32, ort::Error> {
let data = audio_frame
.iter()
.map(|x| (*x as f32) / (i16::MAX as f32))
.collect::<Vec<_>>();
// Concatenate context with input
let mut input_with_context = Vec::with_capacity(self.context_size + data.len());
input_with_context.extend_from_slice(self.context.as_slice().unwrap());
input_with_context.extend_from_slice(&data);
let frame =
Array2::<f32>::from_shape_vec([1, input_with_context.len()], input_with_context)
.unwrap();
let frame_value = Value::from_array(frame)?;
let state_value = Value::from_array(take(&mut self.state))?;
let sr_value = Value::from_array(self.sample_rate.clone())?;
let res = self.session.run([
(&frame_value).into(),
(&state_value).into(),
(&sr_value).into(),
])?;
let (shape, state_data) = res["stateN"].try_extract_tensor::<f32>()?;
let shape_usize: Vec<usize> = shape.as_ref().iter().map(|&d| d as usize).collect();
self.state = ArrayD::from_shape_vec(shape_usize.as_slice(), state_data.to_vec()).unwrap();
// Update context with last context_size samples from the input
if data.len() >= self.context_size {
self.context = Array1::from_vec(data[data.len() - self.context_size..].to_vec());
}
let prob = *res["output"]
.try_extract_tensor::<f32>()
.unwrap()
.1
.first()
.unwrap();
Ok(prob)
}
}

View File

@@ -0,0 +1,60 @@
#[derive(Debug, Clone, Copy)]
pub enum SampleRate {
EightkHz,
SixteenkHz,
}
impl From<SampleRate> for i64 {
fn from(value: SampleRate) -> Self {
match value {
SampleRate::EightkHz => 8000,
SampleRate::SixteenkHz => 16000,
}
}
}
impl From<SampleRate> for usize {
fn from(value: SampleRate) -> Self {
match value {
SampleRate::EightkHz => 8000,
SampleRate::SixteenkHz => 16000,
}
}
}
#[derive(Debug)]
pub struct VadParams {
pub frame_size: usize,
pub threshold: f32,
pub min_silence_duration_ms: usize,
pub speech_pad_ms: usize,
pub min_speech_duration_ms: usize,
pub max_speech_duration_s: f32,
pub sample_rate: usize,
}
impl Default for VadParams {
fn default() -> Self {
Self {
frame_size: 32, // 32ms for 512 samples at 16kHz
threshold: 0.5,
min_silence_duration_ms: 0,
speech_pad_ms: 64,
min_speech_duration_ms: 64,
max_speech_duration_s: f32::INFINITY,
sample_rate: 16000,
}
}
}
#[derive(Debug, Default)]
pub struct TimeStamp {
pub start: i64,
pub end: i64,
}
impl std::fmt::Display for TimeStamp {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "[start:{:08}, end:{:08}]", self.start, self.end)
}
}

View File

@@ -0,0 +1,223 @@
use crate::{silero, utils};
const DEBUG_SPEECH_PROB: bool = true;
#[derive(Debug)]
pub struct VadIter {
silero: silero::Silero,
params: Params,
state: State,
}
impl VadIter {
pub fn new(silero: silero::Silero, params: utils::VadParams) -> Self {
Self {
silero,
params: Params::from(params),
state: State::new(),
}
}
pub fn process(&mut self, samples: &[i16]) -> Result<(), ort::Error> {
self.reset_states();
for audio_frame in samples.chunks_exact(self.params.frame_size_samples) {
let speech_prob: f32 = self.silero.calc_level(audio_frame)?;
self.state.update(&self.params, speech_prob);
}
self.state.check_for_last_speech(samples.len());
Ok(())
}
pub fn speeches(&self) -> &[utils::TimeStamp] {
&self.state.speeches
}
}
impl VadIter {
fn reset_states(&mut self) {
self.silero.reset();
self.state = State::new()
}
}
#[allow(unused)]
#[derive(Debug)]
struct Params {
frame_size: usize,
threshold: f32,
min_silence_duration_ms: usize,
speech_pad_ms: usize,
min_speech_duration_ms: usize,
max_speech_duration_s: f32,
sample_rate: usize,
sr_per_ms: usize,
frame_size_samples: usize,
min_speech_samples: usize,
speech_pad_samples: usize,
max_speech_samples: f32,
min_silence_samples: usize,
min_silence_samples_at_max_speech: usize,
}
impl From<utils::VadParams> for Params {
fn from(value: utils::VadParams) -> Self {
let frame_size = value.frame_size;
let threshold = value.threshold;
let min_silence_duration_ms = value.min_silence_duration_ms;
let speech_pad_ms = value.speech_pad_ms;
let min_speech_duration_ms = value.min_speech_duration_ms;
let max_speech_duration_s = value.max_speech_duration_s;
let sample_rate = value.sample_rate;
let sr_per_ms = sample_rate / 1000;
let frame_size_samples = frame_size * sr_per_ms;
let min_speech_samples = sr_per_ms * min_speech_duration_ms;
let speech_pad_samples = sr_per_ms * speech_pad_ms;
let max_speech_samples = sample_rate as f32 * max_speech_duration_s
- frame_size_samples as f32
- 2.0 * speech_pad_samples as f32;
let min_silence_samples = sr_per_ms * min_silence_duration_ms;
let min_silence_samples_at_max_speech = sr_per_ms * 98;
Self {
frame_size,
threshold,
min_silence_duration_ms,
speech_pad_ms,
min_speech_duration_ms,
max_speech_duration_s,
sample_rate,
sr_per_ms,
frame_size_samples,
min_speech_samples,
speech_pad_samples,
max_speech_samples,
min_silence_samples,
min_silence_samples_at_max_speech,
}
}
}
#[derive(Debug, Default)]
struct State {
current_sample: usize,
temp_end: usize,
next_start: usize,
prev_end: usize,
triggered: bool,
current_speech: utils::TimeStamp,
speeches: Vec<utils::TimeStamp>,
}
impl State {
fn new() -> Self {
Default::default()
}
fn update(&mut self, params: &Params, speech_prob: f32) {
self.current_sample += params.frame_size_samples;
if speech_prob > params.threshold {
if self.temp_end != 0 {
self.temp_end = 0;
if self.next_start < self.prev_end {
self.next_start = self
.current_sample
.saturating_sub(params.frame_size_samples)
}
}
if !self.triggered {
self.debug(speech_prob, params, "start");
self.triggered = true;
self.current_speech.start =
self.current_sample as i64 - params.frame_size_samples as i64;
}
return;
}
if self.triggered
&& (self.current_sample as i64 - self.current_speech.start) as f32
> params.max_speech_samples
{
if self.prev_end > 0 {
self.current_speech.end = self.prev_end as _;
self.take_speech();
if self.next_start < self.prev_end {
self.triggered = false
} else {
self.current_speech.start = self.next_start as _;
}
self.prev_end = 0;
self.next_start = 0;
self.temp_end = 0;
} else {
self.current_speech.end = self.current_sample as _;
self.take_speech();
self.prev_end = 0;
self.next_start = 0;
self.temp_end = 0;
self.triggered = false;
}
return;
}
if speech_prob >= (params.threshold - 0.15) && (speech_prob < params.threshold) {
if self.triggered {
self.debug(speech_prob, params, "speaking")
} else {
self.debug(speech_prob, params, "silence")
}
}
if self.triggered && speech_prob < (params.threshold - 0.15) {
self.debug(speech_prob, params, "end");
if self.temp_end == 0 {
self.temp_end = self.current_sample;
}
if self.current_sample.saturating_sub(self.temp_end)
> params.min_silence_samples_at_max_speech
{
self.prev_end = self.temp_end;
}
if self.current_sample.saturating_sub(self.temp_end) >= params.min_silence_samples {
self.current_speech.end = self.temp_end as _;
if self.current_speech.end - self.current_speech.start
> params.min_speech_samples as _
{
self.take_speech();
self.prev_end = 0;
self.next_start = 0;
self.temp_end = 0;
self.triggered = false;
}
}
}
}
fn take_speech(&mut self) {
self.speeches.push(std::mem::take(&mut self.current_speech)); // current speech becomes TimeStamp::default() due to take()
}
fn check_for_last_speech(&mut self, last_sample: usize) {
if self.current_speech.start > 0 {
self.current_speech.end = last_sample as _;
self.take_speech();
self.prev_end = 0;
self.next_start = 0;
self.temp_end = 0;
self.triggered = false;
}
}
fn debug(&self, speech_prob: f32, params: &Params, title: &str) {
if DEBUG_SPEECH_PROB {
let speech = self.current_sample as f32
- params.frame_size_samples as f32
- if title == "end" {
params.speech_pad_samples
} else {
0
} as f32; // minus window_size_samples to get precise start time point.
println!(
"[{:10}: {:.3} s ({:.3}) {:8}]",
title,
speech / params.sample_rate as f32,
speech_prob,
self.current_sample - params.frame_size_samples,
);
}
}
}

View File

@@ -1 +0,0 @@
{"59": "mg, Malagasy", "76": "tk, Turkmen", "20": "lb, Luxembourgish, Letzeburgesch", "62": "or, Oriya", "30": "en, English", "26": "oc, Occitan", "69": "no, Norwegian", "77": "sr, Serbian", "90": "bs, Bosnian", "71": "el, Greek, Modern (1453\u2013)", "15": "az, Azerbaijani", "12": "lo, Lao", "85": "zh-HK, Chinese", "79": "cs, Czech", "43": "sv, Swedish", "37": "mn, Mongolian", "32": "fi, Finnish", "51": "tg, Tajik", "46": "am, Amharic", "17": "nn, Norwegian Nynorsk", "40": "ja, Japanese", "8": "it, Italian", "21": "ha, Hausa", "11": "as, Assamese", "29": "fa, Persian", "82": "bn, Bengali", "54": "mk, Macedonian", "31": "sw, Swahili", "45": "vi, Vietnamese", "41": "ur, Urdu", "74": "bo, Tibetan", "4": "hi, Hindi", "86": "mr, Marathi", "3": "fy-NL, Western Frisian", "65": "sk, Slovak", "2": "ln, Lingala", "92": "gl, Galician", "53": "sn, Shona", "87": "su, Sundanese", "35": "tt, Tatar", "93": "kn, Kannada", "6": "yo, Yoruba", "27": "ps, Pashto, Pushto", "34": "hy, Armenian", "25": "pa-IN, Punjabi, Panjabi", "23": "nl, Dutch, Flemish", "48": "th, Thai", "73": "mt, Maltese", "55": "ar, Arabic", "89": "ba, Bashkir", "78": "bg, Bulgarian", "42": "yi, Yiddish", "5": "ru, Russian", "84": "sv-SE, Swedish", "80": "tr, Turkish", "33": "sq, Albanian", "38": "kk, Kazakh", "50": "pl, Polish", "9": "hr, Croatian", "66": "ky, Kirghiz, Kyrgyz", "49": "hu, Hungarian", "10": "si, Sinhala, Sinhalese", "56": "la, Latin", "75": "de, German", "14": "ko, Korean", "22": "id, Indonesian", "47": "sl, Slovenian", "57": "be, Belarusian", "36": "ta, Tamil", "7": "da, Danish", "91": "sd, Sindhi", "28": "et, Estonian", "63": "pt, Portuguese", "60": "ne, Nepali", "94": "zh-TW, Chinese", "18": "zh-CN, Chinese", "88": "rw, Kinyarwanda", "19": "es, Spanish, Castilian", "39": "ht, Haitian, Haitian Creole", "64": "tl, Tagalog", "83": "ms, Malay", "70": "ro, Romanian, Moldavian, Moldovan", "68": "pa, Punjabi, Panjabi", "52": "uz, Uzbek", "58": "km, Central Khmer", "67": "my, Burmese", "0": "fr, French", "24": "af, Afrikaans", "16": "gu, Gujarati", "81": "so, Somali", "13": "uk, Ukrainian", "44": "ca, Catalan, Valencian", "72": "ml, Malayalam", "61": "te, Telugu", "1": "zh, Chinese"}

View File

@@ -1 +0,0 @@
{"0": ["Afrikaans", "Dutch, Flemish", "Western Frisian"], "1": ["Turkish", "Azerbaijani"], "2": ["Russian", "Slovak", "Ukrainian", "Czech", "Polish", "Belarusian"], "3": ["Bulgarian", "Macedonian", "Serbian", "Croatian", "Bosnian", "Slovenian"], "4": ["Norwegian Nynorsk", "Swedish", "Danish", "Norwegian"], "5": ["English"], "6": ["Finnish", "Estonian"], "7": ["Yiddish", "Luxembourgish, Letzeburgesch", "German"], "8": ["Spanish", "Occitan", "Portuguese", "Catalan, Valencian", "Galician", "Spanish, Castilian", "Italian"], "9": ["Maltese", "Arabic"], "10": ["Marathi"], "11": ["Hindi", "Urdu"], "12": ["Lao", "Thai"], "13": ["Malay", "Indonesian"], "14": ["Romanian, Moldavian, Moldovan"], "15": ["Tagalog"], "16": ["Tajik", "Persian"], "17": ["Kazakh", "Uzbek", "Kirghiz, Kyrgyz"], "18": ["Kinyarwanda"], "19": ["Tatar", "Bashkir"], "20": ["French"], "21": ["Chinese"], "22": ["Lingala"], "23": ["Yoruba"], "24": ["Sinhala, Sinhalese"], "25": ["Assamese"], "26": ["Korean"], "27": ["Gujarati"], "28": ["Hausa"], "29": ["Punjabi, Panjabi"], "30": ["Pashto, Pushto"], "31": ["Swahili"], "32": ["Albanian"], "33": ["Armenian"], "34": ["Mongolian"], "35": ["Tamil"], "36": ["Haitian, Haitian Creole"], "37": ["Japanese"], "38": ["Vietnamese"], "39": ["Amharic"], "40": ["Hungarian"], "41": ["Shona"], "42": ["Latin"], "43": ["Central Khmer"], "44": ["Malagasy"], "45": ["Nepali"], "46": ["Telugu"], "47": ["Oriya"], "48": ["Burmese"], "49": ["Greek, Modern (1453\u2013)"], "50": ["Malayalam"], "51": ["Tibetan"], "52": ["Turkmen"], "53": ["Somali"], "54": ["Bengali"], "55": ["Sundanese"], "56": ["Sindhi"], "57": ["Kannada"]}

Binary file not shown.

Binary file not shown.

View File

@@ -1,29 +1,50 @@
dependencies = ['torch', 'torchaudio']
import torch
import os
import json
from utils_vad import (init_jit_model,
get_speech_timestamps,
get_number_ts,
get_language,
get_language_and_group,
save_audio,
read_audio,
VADIterator,
collect_chunks,
drop_chunks,
Validator,
OnnxWrapper)
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'src'))
from silero_vad.utils_vad import (init_jit_model,
get_speech_timestamps,
save_audio,
read_audio,
VADIterator,
collect_chunks,
OnnxWrapper)
def silero_vad(onnx=False):
def versiontuple(v):
splitted = v.split('+')[0].split(".")
version_list = []
for i in splitted:
try:
version_list.append(int(i))
except:
version_list.append(0)
return tuple(version_list)
def silero_vad(onnx=False, force_onnx_cpu=False, opset_version=16):
"""Silero Voice Activity Detector
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
model_dir = os.path.join(os.path.dirname(__file__), 'files')
available_ops = [15, 16]
if onnx and opset_version not in available_ops:
raise Exception(f'Available ONNX opset_version: {available_ops}')
if not onnx:
installed_version = torch.__version__
supported_version = '1.12.0'
if versiontuple(installed_version) < versiontuple(supported_version):
raise Exception(f'Please install torch {supported_version} or greater ({installed_version} installed)')
model_dir = os.path.join(os.path.dirname(__file__), 'src', 'silero_vad', 'data')
if onnx:
model = OnnxWrapper(os.path.join(model_dir, 'silero_vad.onnx'))
if opset_version == 16:
model_name = 'silero_vad.onnx'
else:
model_name = f'silero_vad_16k_op{opset_version}.onnx'
model = OnnxWrapper(os.path.join(model_dir, model_name), force_onnx_cpu)
else:
model = init_jit_model(os.path.join(model_dir, 'silero_vad.jit'))
utils = (get_speech_timestamps,
@@ -33,62 +54,3 @@ def silero_vad(onnx=False):
collect_chunks)
return model, utils
def silero_number_detector(onnx=False):
"""Silero Number Detector
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
if onnx:
url = 'https://models.silero.ai/vad_models/number_detector.onnx'
else:
url = 'https://models.silero.ai/vad_models/number_detector.jit'
model = Validator(url)
utils = (get_number_ts,
save_audio,
read_audio,
collect_chunks,
drop_chunks)
return model, utils
def silero_lang_detector(onnx=False):
"""Silero Language Classifier
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
if onnx:
url = 'https://models.silero.ai/vad_models/number_detector.onnx'
else:
url = 'https://models.silero.ai/vad_models/number_detector.jit'
model = Validator(url)
utils = (get_language,
read_audio)
return model, utils
def silero_lang_detector_95(onnx=False):
"""Silero Language Classifier (95 languages)
Returns a model with a set of utils
Please see https://github.com/snakers4/silero-vad for usage examples
"""
if onnx:
url = 'https://models.silero.ai/vad_models/lang_classifier_95.onnx'
else:
url = 'https://models.silero.ai/vad_models/lang_classifier_95.jit'
model = Validator(url)
model_dir = os.path.join(os.path.dirname(__file__), 'files')
with open(os.path.join(model_dir, 'lang_dict_95.json'), 'r') as f:
lang_dict = json.load(f)
with open(os.path.join(model_dir, 'lang_group_dict_95.json'), 'r') as f:
lang_group_dict = json.load(f)
utils = (get_language_and_group, read_audio)
return model, lang_dict, lang_group_dict, utils

46
pyproject.toml Normal file
View File

@@ -0,0 +1,46 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "silero-vad"
version = "6.2.0"
authors = [
{name="Silero Team", email="hello@silero.ai"},
]
description = "Voice Activity Detector (VAD) by Silero"
readme = "README.md"
requires-python = ">=3.8"
classifiers = [
"Development Status :: 5 - Production/Stable",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
"Intended Audience :: Science/Research",
"Intended Audience :: Developers",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: 3.14",
"Programming Language :: Python :: 3.15",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Topic :: Scientific/Engineering",
]
dependencies = [
"packaging",
"torch>=1.12.0",
"torchaudio>=0.12.0",
"onnxruntime>=1.16.1",
]
[project.urls]
Homepage = "https://github.com/snakers4/silero-vad"
Issues = "https://github.com/snakers4/silero-vad/issues"
[project.optional-dependencies]
test = [
"pytest",
"soundfile",
"torch<2.9",
]

View File

@@ -1,14 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "FpMplOCA2Fwp"
},
"source": [
"#VAD"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -52,20 +43,30 @@
},
"outputs": [],
"source": [
"USE_PIP = True # download model using pip package or torch.hub\n",
"USE_ONNX = False # change this to True if you want to test onnx model\n",
"if USE_ONNX:\n",
" !pip install -q onnxruntime\n",
" \n",
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
" model='silero_vad',\n",
" force_reload=True,\n",
" onnx=USE_ONNX)\n",
"if USE_PIP:\n",
" !pip install -q silero-vad\n",
" from silero_vad import (load_silero_vad,\n",
" read_audio,\n",
" get_speech_timestamps,\n",
" save_audio,\n",
" VADIterator,\n",
" collect_chunks)\n",
" model = load_silero_vad(onnx=USE_ONNX)\n",
"else:\n",
" model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
" model='silero_vad',\n",
" force_reload=True,\n",
" onnx=USE_ONNX)\n",
"\n",
"(get_speech_timestamps,\n",
" save_audio,\n",
" read_audio,\n",
" VADIterator,\n",
" collect_chunks) = utils"
" (get_speech_timestamps,\n",
" save_audio,\n",
" read_audio,\n",
" VADIterator,\n",
" collect_chunks) = utils"
]
},
{
@@ -74,16 +75,7 @@
"id": "fXbbaUO3jsrw"
},
"source": [
"## Full Audio"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RAfJPb_a-Auj"
},
"source": [
"**Speech timestapms from full audio**"
"## Speech timestapms from full audio"
]
},
{
@@ -110,10 +102,33 @@
"source": [
"# merge all speech chunks to one audio\n",
"save_audio('only_speech.wav',\n",
" collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE) \n",
" collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE)\n",
"Audio('only_speech.wav')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zeO1xCqxUC6w"
},
"source": [
"## Entire audio inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LjZBcsaTT7Mk"
},
"outputs": [],
"source": [
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
"# audio is being splitted into 31.25 ms long pieces\n",
"# so output length equals ceil(input_length * 31.25 / SAMPLING_RATE)\n",
"predicts = model.audio_forward(wav, sr=SAMPLING_RATE)"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -133,12 +148,15 @@
"source": [
"## using VADIterator class\n",
"\n",
"vad_iterator = VADIterator(model)\n",
"vad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)\n",
"wav = read_audio(f'en_example.wav', sampling_rate=SAMPLING_RATE)\n",
"\n",
"window_size_samples = 1536 # number of samples in a single audio chunk\n",
"window_size_samples = 512 if SAMPLING_RATE == 16000 else 256\n",
"for i in range(0, len(wav), window_size_samples):\n",
" speech_dict = vad_iterator(wav[i: i+ window_size_samples], return_seconds=True)\n",
" chunk = wav[i: i+ window_size_samples]\n",
" if len(chunk) < window_size_samples:\n",
" break\n",
" speech_dict = vad_iterator(chunk, return_seconds=True)\n",
" if speech_dict:\n",
" print(speech_dict, end=' ')\n",
"vad_iterator.reset_states() # reset model states after each audio"
@@ -156,246 +174,17 @@
"\n",
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
"speech_probs = []\n",
"window_size_samples = 1536\n",
"window_size_samples = 512 if SAMPLING_RATE == 16000 else 256\n",
"for i in range(0, len(wav), window_size_samples):\n",
" speech_prob = model(wav[i: i+ window_size_samples], SAMPLING_RATE).item()\n",
" chunk = wav[i: i+ window_size_samples]\n",
" if len(chunk) < window_size_samples:\n",
" break\n",
" speech_prob = model(chunk, SAMPLING_RATE).item()\n",
" speech_probs.append(speech_prob)\n",
"vad_iterator.reset_states() # reset model states after each audio\n",
"\n",
"print(speech_probs[:10]) # first 10 chunks predicts"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "36jY0niD2Fww"
},
"source": [
"# Number detector"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "scd1DlS42Fwx"
},
"source": [
"## Install Dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "Kq5gQuYq2Fwx"
},
"outputs": [],
"source": [
"#@title Install and Import Dependencies\n",
"\n",
"# this assumes that you have a relevant version of PyTorch installed\n",
"!pip install -q torchaudio\n",
"\n",
"SAMPLING_RATE = 16000\n",
"\n",
"import torch\n",
"torch.set_num_threads(1)\n",
"\n",
"from IPython.display import Audio\n",
"from pprint import pprint\n",
"# download example\n",
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en_num.wav', 'en_number_example.wav')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dPwCFHmFycUF"
},
"outputs": [],
"source": [
"USE_ONNX = False # change this to True if you want to test onnx model\n",
"if USE_ONNX:\n",
" !pip install -q onnxruntime\n",
" \n",
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
" model='silero_number_detector',\n",
" force_reload=True,\n",
" onnx=USE_ONNX)\n",
"\n",
"(get_number_ts,\n",
" save_audio,\n",
" read_audio,\n",
" collect_chunks,\n",
" drop_chunks) = utils\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "qhPa30ij2Fwy"
},
"source": [
"## Full audio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "EXpau6xq2Fwy"
},
"outputs": [],
"source": [
"wav = read_audio('en_number_example.wav', sampling_rate=SAMPLING_RATE)\n",
"# get number timestamps from full audio file\n",
"number_timestamps = get_number_ts(wav, model)\n",
"pprint(number_timestamps)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "u-KfXRhZ2Fwy"
},
"outputs": [],
"source": [
"# convert ms in timestamps to samples\n",
"for timestamp in number_timestamps:\n",
" timestamp['start'] = int(timestamp['start'] * SAMPLING_RATE / 1000)\n",
" timestamp['end'] = int(timestamp['end'] * SAMPLING_RATE / 1000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "iwYEC4aZ2Fwy"
},
"outputs": [],
"source": [
"# merge all number chunks to one audio\n",
"save_audio('only_numbers.wav',\n",
" collect_chunks(number_timestamps, wav), SAMPLING_RATE) \n",
"Audio('only_numbers.wav')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "fHaYejX12Fwy"
},
"outputs": [],
"source": [
"# drop all number chunks from audio\n",
"save_audio('no_numbers.wav',\n",
" drop_chunks(number_timestamps, wav), SAMPLING_RATE) \n",
"Audio('no_numbers.wav')"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "PnKtJKbq2Fwz"
},
"source": [
"# Language detector"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "F5cAmMbP2Fwz"
},
"source": [
"## Install Dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "Zu9D0t6n2Fwz"
},
"outputs": [],
"source": [
"#@title Install and Import Dependencies\n",
"\n",
"# this assumes that you have a relevant version of PyTorch installed\n",
"!pip install -q torchaudio\n",
"\n",
"SAMPLING_RATE = 16000\n",
"\n",
"import torch\n",
"torch.set_num_threads(1)\n",
"\n",
"from IPython.display import Audio\n",
"from pprint import pprint\n",
"# download example\n",
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', 'en_example.wav')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JfRKDZiRztFe"
},
"outputs": [],
"source": [
"USE_ONNX = False # change this to True if you want to test onnx model\n",
"if USE_ONNX:\n",
" !pip install -q onnxruntime\n",
" \n",
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
" model='silero_lang_detector',\n",
" force_reload=True,\n",
" onnx=USE_ONNX)\n",
"\n",
"get_language, read_audio = utils"
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"hidden": true,
"id": "iC696eMX2Fwz"
},
"source": [
"## Full audio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hidden": true,
"id": "c8UYnYBF2Fw0"
},
"outputs": [],
"source": [
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
"lang = get_language(wav, model)\n",
"print(lang)"
]
}
],
"metadata": {

View File

@@ -0,0 +1,13 @@
from importlib.metadata import version
try:
__version__ = version(__name__)
except:
pass
from silero_vad.model import load_silero_vad
from silero_vad.utils_vad import (get_speech_timestamps,
save_audio,
read_audio,
VADIterator,
collect_chunks,
drop_chunks)

View File

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

36
src/silero_vad/model.py Normal file
View File

@@ -0,0 +1,36 @@
from .utils_vad import init_jit_model, OnnxWrapper
import torch
torch.set_num_threads(1)
def load_silero_vad(onnx=False, opset_version=16):
available_ops = [15, 16]
if onnx and opset_version not in available_ops:
raise Exception(f'Available ONNX opset_version: {available_ops}')
if onnx:
if opset_version == 16:
model_name = 'silero_vad.onnx'
else:
model_name = f'silero_vad_16k_op{opset_version}.onnx'
else:
model_name = 'silero_vad.jit'
package_path = "silero_vad.data"
try:
import importlib_resources as impresources
model_file_path = str(impresources.files(package_path).joinpath(model_name))
except:
from importlib import resources as impresources
try:
with impresources.path(package_path, model_name) as f:
model_file_path = f
except:
model_file_path = str(impresources.files(package_path).joinpath(model_name))
if onnx:
model = OnnxWrapper(str(model_file_path), force_onnx_cpu=True)
else:
model = init_jit_model(model_file_path)
return model

View File

@@ -0,0 +1,71 @@
from tinygrad import nn
class TinySileroVAD:
def __init__(self):
"""
from tinygrad.nn.state import safe_load, load_state_dict
tiny_model = TinySileroVAD()
state_dict = safe_load('data/silero_vad_16k.safetensors')
load_state_dict(tiny_model, state_dict)
"""
self.n_fft = 256
self.stride = 128
self.pad = 64
self.cutoff = int(self.n_fft // 2) + 1
self.stft_conv = nn.Conv1d(1, 258, kernel_size=256, stride=self.stride, padding=0, bias=False)
self.conv1 = nn.Conv1d(129, 128, kernel_size=3, stride=1, padding=1)
self.conv2 = nn.Conv1d(128, 64, kernel_size=3, stride=2, padding=1)
self.conv3 = nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1)
self.conv4 = nn.Conv1d(64, 128, kernel_size=3, stride=1, padding=1)
self.lstm_cell = nn.LSTMCell(128, 128)
self.final_conv = nn.Conv1d(128, 1, 1)
def __call__(self, x, state=None):
"""
# full audio example:
import torch
from tinygrad import Tensor
wav = read_audio(audio_path, sampling_rate=16000).unsqueeze(0)
num_samples = 512
context_size = 64
context = Tensor(np.zeros((1, context_size))).float()
outs = []
state = None
if wav.shape[1] % num_samples:
pad_num = num_samples - (wav.shape[1] % num_samples)
wav = torch.nn.functional.pad(wav, (0, pad_num), 'constant', value=0.0)
wav = torch.nn.functional.pad(wav, (context_size, 0))
wav = Tensor(wav.numpy()).float()
for i in tqdm(range(context_size, wav.shape[1], num_samples)):
wavs_batch = wav[:, i-context_size:i+num_samples]
out_chunk, state = tiny_model(wavs_batch, state)
#outs.append(out_chunk.numpy())
outs.append(out_chunk)
predict = outs[0].cat(*outs[1:], dim=1).numpy()
"""
if state is not None:
state = (state[0], state[1])
x = x.pad((0, self.pad), "reflect").unsqueeze(1)
x = self.stft_conv(x)
x = (x[:, :self.cutoff, :]**2 + x[:, self.cutoff:, :]**2).sqrt()
x = self.conv1(x).relu()
x = self.conv2(x).relu()
x = self.conv3(x).relu()
x = self.conv4(x).relu().squeeze(-1)
h, c = self.lstm_cell(x, state)
x = h.unsqueeze(-1)
state = h.stack(c, dim=0)
x = x.relu()
x = self.final_conv(x).sigmoid()
x = x.squeeze(1).mean(axis=1).unsqueeze(1)
return x, state

655
src/silero_vad/utils_vad.py Normal file
View File

@@ -0,0 +1,655 @@
import torch
import torchaudio
from typing import Callable, List
import warnings
from packaging import version
languages = ['ru', 'en', 'de', 'es']
class OnnxWrapper():
def __init__(self, path, force_onnx_cpu=False):
import numpy as np
global np
import onnxruntime
opts = onnxruntime.SessionOptions()
opts.inter_op_num_threads = 1
opts.intra_op_num_threads = 1
if force_onnx_cpu and 'CPUExecutionProvider' in onnxruntime.get_available_providers():
self.session = onnxruntime.InferenceSession(path, providers=['CPUExecutionProvider'], sess_options=opts)
else:
self.session = onnxruntime.InferenceSession(path, sess_options=opts)
self.reset_states()
if '16k' in path:
warnings.warn('This model support only 16000 sampling rate!')
self.sample_rates = [16000]
else:
self.sample_rates = [8000, 16000]
def _validate_input(self, x, sr: int):
if x.dim() == 1:
x = x.unsqueeze(0)
if x.dim() > 2:
raise ValueError(f"Too many dimensions for input audio chunk {x.dim()}")
if sr != 16000 and (sr % 16000 == 0):
step = sr // 16000
x = x[:,::step]
sr = 16000
if sr not in self.sample_rates:
raise ValueError(f"Supported sampling rates: {self.sample_rates} (or multiply of 16000)")
if sr / x.shape[1] > 31.25:
raise ValueError("Input audio chunk is too short")
return x, sr
def reset_states(self, batch_size=1):
self._state = torch.zeros((2, batch_size, 128)).float()
self._context = torch.zeros(0)
self._last_sr = 0
self._last_batch_size = 0
def __call__(self, x, sr: int):
x, sr = self._validate_input(x, sr)
num_samples = 512 if sr == 16000 else 256
if x.shape[-1] != num_samples:
raise ValueError(f"Provided number of samples is {x.shape[-1]} (Supported values: 256 for 8000 sample rate, 512 for 16000)")
batch_size = x.shape[0]
context_size = 64 if sr == 16000 else 32
if not self._last_batch_size:
self.reset_states(batch_size)
if (self._last_sr) and (self._last_sr != sr):
self.reset_states(batch_size)
if (self._last_batch_size) and (self._last_batch_size != batch_size):
self.reset_states(batch_size)
if not len(self._context):
self._context = torch.zeros(batch_size, context_size)
x = torch.cat([self._context, x], dim=1)
if sr in [8000, 16000]:
ort_inputs = {'input': x.numpy(), 'state': self._state.numpy(), 'sr': np.array(sr, dtype='int64')}
ort_outs = self.session.run(None, ort_inputs)
out, state = ort_outs
self._state = torch.from_numpy(state)
else:
raise ValueError()
self._context = x[..., -context_size:]
self._last_sr = sr
self._last_batch_size = batch_size
out = torch.from_numpy(out)
return out
def audio_forward(self, x, sr: int):
outs = []
x, sr = self._validate_input(x, sr)
self.reset_states()
num_samples = 512 if sr == 16000 else 256
if x.shape[1] % num_samples:
pad_num = num_samples - (x.shape[1] % num_samples)
x = torch.nn.functional.pad(x, (0, pad_num), 'constant', value=0.0)
for i in range(0, x.shape[1], num_samples):
wavs_batch = x[:, i:i+num_samples]
out_chunk = self.__call__(wavs_batch, sr)
outs.append(out_chunk)
stacked = torch.cat(outs, dim=1)
return stacked.cpu()
class Validator():
def __init__(self, url, force_onnx_cpu):
self.onnx = True if url.endswith('.onnx') else False
torch.hub.download_url_to_file(url, 'inf.model')
if self.onnx:
import onnxruntime
if force_onnx_cpu and 'CPUExecutionProvider' in onnxruntime.get_available_providers():
self.model = onnxruntime.InferenceSession('inf.model', providers=['CPUExecutionProvider'])
else:
self.model = onnxruntime.InferenceSession('inf.model')
else:
self.model = init_jit_model(model_path='inf.model')
def __call__(self, inputs: torch.Tensor):
with torch.no_grad():
if self.onnx:
ort_inputs = {'input': inputs.cpu().numpy()}
outs = self.model.run(None, ort_inputs)
outs = [torch.Tensor(x) for x in outs]
else:
outs = self.model(inputs)
return outs
def read_audio(path: str, sampling_rate: int = 16000) -> torch.Tensor:
ta_ver = version.parse(torchaudio.__version__)
if ta_ver < version.parse("2.9"):
try:
effects = [['channels', '1'],['rate', str(sampling_rate)]]
wav, sr = torchaudio.sox_effects.apply_effects_file(path, effects=effects)
except:
wav, sr = torchaudio.load(path)
else:
try:
wav, sr = torchaudio.load(path)
except:
try:
from torchcodec.decoders import AudioDecoder
samples = AudioDecoder(path).get_all_samples()
wav = samples.data
sr = samples.sample_rate
except ImportError:
raise RuntimeError(
f"torchaudio version {torchaudio.__version__} requires torchcodec for audio I/O. "
+ "Install torchcodec or pin torchaudio < 2.9"
)
if wav.ndim > 1 and wav.size(0) > 1:
wav = wav.mean(dim=0, keepdim=True)
if sr != sampling_rate:
wav = torchaudio.transforms.Resample(sr, sampling_rate)(wav)
return wav.squeeze(0)
def save_audio(path: str, tensor: torch.Tensor, sampling_rate: int = 16000):
tensor = tensor.detach().cpu()
if tensor.ndim == 1:
tensor = tensor.unsqueeze(0)
ta_ver = version.parse(torchaudio.__version__)
try:
torchaudio.save(path, tensor, sampling_rate, bits_per_sample=16)
except Exception:
if ta_ver >= version.parse("2.9"):
try:
from torchcodec.encoders import AudioEncoder
encoder = AudioEncoder(tensor, sample_rate=16000)
encoder.to_file(path)
except ImportError:
raise RuntimeError(
f"torchaudio version {torchaudio.__version__} requires torchcodec for saving. "
+ "Install torchcodec or pin torchaudio < 2.9"
)
else:
raise
def init_jit_model(model_path: str,
device=torch.device('cpu')):
model = torch.jit.load(model_path, map_location=device)
model.eval()
return model
def make_visualization(probs, step):
import pandas as pd
pd.DataFrame({'probs': probs},
index=[x * step for x in range(len(probs))]).plot(figsize=(16, 8),
kind='area', ylim=[0, 1.05], xlim=[0, len(probs) * step],
xlabel='seconds',
ylabel='speech probability',
colormap='tab20')
@torch.no_grad()
def get_speech_timestamps(audio: torch.Tensor,
model,
threshold: float = 0.5,
sampling_rate: int = 16000,
min_speech_duration_ms: int = 250,
max_speech_duration_s: float = float('inf'),
min_silence_duration_ms: int = 100,
speech_pad_ms: int = 30,
return_seconds: bool = False,
time_resolution: int = 1,
visualize_probs: bool = False,
progress_tracking_callback: Callable[[float], None] = None,
neg_threshold: float = None,
window_size_samples: int = 512,
min_silence_at_max_speech: int = 98,
use_max_poss_sil_at_max_speech: bool = True):
"""
This method is used for splitting long audios into speech chunks using silero VAD
Parameters
----------
audio: torch.Tensor, one dimensional
One dimensional float torch.Tensor, other types are casted to torch if possible
model: preloaded .jit/.onnx silero VAD model
threshold: float (default - 0.5)
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
sampling_rate: int (default - 16000)
Currently silero VAD models support 8000 and 16000 (or multiply of 16000) sample rates
min_speech_duration_ms: int (default - 250 milliseconds)
Final speech chunks shorter min_speech_duration_ms are thrown out
max_speech_duration_s: int (default - inf)
Maximum duration of speech chunks in seconds
Chunks longer than max_speech_duration_s will be split at the timestamp of the last silence that lasts more than 100ms (if any), to prevent aggressive cutting.
Otherwise, they will be split aggressively just before max_speech_duration_s.
min_silence_duration_ms: int (default - 100 milliseconds)
In the end of each speech chunk wait for min_silence_duration_ms before separating it
speech_pad_ms: int (default - 30 milliseconds)
Final speech chunks are padded by speech_pad_ms each side
return_seconds: bool (default - False)
whether return timestamps in seconds (default - samples)
time_resolution: bool (default - 1)
time resolution of speech coordinates when requested as seconds
visualize_probs: bool (default - False)
whether draw prob hist or not
progress_tracking_callback: Callable[[float], None] (default - None)
callback function taking progress in percents as an argument
neg_threshold: float (default = threshold - 0.15)
Negative threshold (noise or exit threshold). If model's current state is SPEECH, values BELOW this value are considered as NON-SPEECH.
min_silence_at_max_speech: int (default - 98ms)
Minimum silence duration in ms which is used to avoid abrupt cuts when max_speech_duration_s is reached
use_max_poss_sil_at_max_speech: bool (default - True)
Whether to use the maximum possible silence at max_speech_duration_s or not. If not, the last silence is used.
window_size_samples: int (default - 512 samples)
!!! DEPRECATED, DOES NOTHING !!!
Returns
----------
speeches: list of dicts
list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
"""
if not torch.is_tensor(audio):
try:
audio = torch.Tensor(audio)
except:
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
if len(audio.shape) > 1:
for i in range(len(audio.shape)): # trying to squeeze empty dimensions
audio = audio.squeeze(0)
if len(audio.shape) > 1:
raise ValueError("More than one dimension in audio. Are you trying to process audio with 2 channels?")
if sampling_rate > 16000 and (sampling_rate % 16000 == 0):
step = sampling_rate // 16000
sampling_rate = 16000
audio = audio[::step]
warnings.warn('Sampling rate is a multiply of 16000, casting to 16000 manually!')
else:
step = 1
if sampling_rate not in [8000, 16000]:
raise ValueError("Currently silero VAD models support 8000 and 16000 (or multiply of 16000) sample rates")
window_size_samples = 512 if sampling_rate == 16000 else 256
model.reset_states()
min_speech_samples = sampling_rate * min_speech_duration_ms / 1000
speech_pad_samples = sampling_rate * speech_pad_ms / 1000
max_speech_samples = sampling_rate * max_speech_duration_s - window_size_samples - 2 * speech_pad_samples
min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
min_silence_samples_at_max_speech = sampling_rate * min_silence_at_max_speech / 1000
audio_length_samples = len(audio)
speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
chunk = audio[current_start_sample: current_start_sample + window_size_samples]
if len(chunk) < window_size_samples:
chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
speech_prob = model(chunk, sampling_rate).item()
speech_probs.append(speech_prob)
# calculate progress and send it to callback function
progress = current_start_sample + window_size_samples
if progress > audio_length_samples:
progress = audio_length_samples
progress_percent = (progress / audio_length_samples) * 100
if progress_tracking_callback:
progress_tracking_callback(progress_percent)
triggered = False
speeches = []
current_speech = {}
if neg_threshold is None:
neg_threshold = max(threshold - 0.15, 0.01)
temp_end = 0 # to save potential segment end (and tolerate some silence)
prev_end = next_start = 0 # to save potential segment limits in case of maximum segment size reached
possible_ends = []
for i, speech_prob in enumerate(speech_probs):
cur_sample = window_size_samples * i
# If speech returns after a temp_end, record candidate silence if long enough and clear temp_end
if (speech_prob >= threshold) and temp_end:
sil_dur = cur_sample - temp_end
if sil_dur > min_silence_samples_at_max_speech:
possible_ends.append((temp_end, sil_dur))
temp_end = 0
if next_start < prev_end:
next_start = cur_sample
# Start of speech
if (speech_prob >= threshold) and not triggered:
triggered = True
current_speech['start'] = cur_sample
continue
# Max speech length reached: decide where to cut
if triggered and (cur_sample - current_speech['start'] > max_speech_samples):
if use_max_poss_sil_at_max_speech and possible_ends:
prev_end, dur = max(possible_ends, key=lambda x: x[1]) # use the longest possible silence segment in the current speech chunk
current_speech['end'] = prev_end
speeches.append(current_speech)
current_speech = {}
next_start = prev_end + dur
if next_start < prev_end + cur_sample: # previously reached silence (< neg_thres) and is still not speech (< thres)
current_speech['start'] = next_start
else:
triggered = False
prev_end = next_start = temp_end = 0
possible_ends = []
else:
# Legacy max-speech cut (use_max_poss_sil_at_max_speech=False): prefer last valid silence (prev_end) if available
if prev_end:
current_speech['end'] = prev_end
speeches.append(current_speech)
current_speech = {}
if next_start < prev_end:
triggered = False
else:
current_speech['start'] = next_start
prev_end = next_start = temp_end = 0
possible_ends = []
else:
# No prev_end -> fallback to cutting at current sample
current_speech['end'] = cur_sample
speeches.append(current_speech)
current_speech = {}
prev_end = next_start = temp_end = 0
triggered = False
possible_ends = []
continue
# Silence detection while in speech
if (speech_prob < neg_threshold) and triggered:
if not temp_end:
temp_end = cur_sample
sil_dur_now = cur_sample - temp_end
if not use_max_poss_sil_at_max_speech and sil_dur_now > min_silence_samples_at_max_speech: # condition to avoid cutting in very short silence
prev_end = temp_end
if sil_dur_now < min_silence_samples:
continue
else:
current_speech['end'] = temp_end
if (current_speech['end'] - current_speech['start']) > min_speech_samples:
speeches.append(current_speech)
current_speech = {}
prev_end = next_start = temp_end = 0
triggered = False
possible_ends = []
continue
if current_speech and (audio_length_samples - current_speech['start']) > min_speech_samples:
current_speech['end'] = audio_length_samples
speeches.append(current_speech)
for i, speech in enumerate(speeches):
if i == 0:
speech['start'] = int(max(0, speech['start'] - speech_pad_samples))
if i != len(speeches) - 1:
silence_duration = speeches[i+1]['start'] - speech['end']
if silence_duration < 2 * speech_pad_samples:
speech['end'] += int(silence_duration // 2)
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - silence_duration // 2))
else:
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - speech_pad_samples))
else:
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
if return_seconds:
audio_length_seconds = audio_length_samples / sampling_rate
for speech_dict in speeches:
speech_dict['start'] = max(round(speech_dict['start'] / sampling_rate, time_resolution), 0)
speech_dict['end'] = min(round(speech_dict['end'] / sampling_rate, time_resolution), audio_length_seconds)
elif step > 1:
for speech_dict in speeches:
speech_dict['start'] *= step
speech_dict['end'] *= step
if visualize_probs:
make_visualization(speech_probs, window_size_samples / sampling_rate)
return speeches
class VADIterator:
def __init__(self,
model,
threshold: float = 0.5,
sampling_rate: int = 16000,
min_silence_duration_ms: int = 100,
speech_pad_ms: int = 30
):
"""
Class for stream imitation
Parameters
----------
model: preloaded .jit/.onnx silero VAD model
threshold: float (default - 0.5)
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
sampling_rate: int (default - 16000)
Currently silero VAD models support 8000 and 16000 sample rates
min_silence_duration_ms: int (default - 100 milliseconds)
In the end of each speech chunk wait for min_silence_duration_ms before separating it
speech_pad_ms: int (default - 30 milliseconds)
Final speech chunks are padded by speech_pad_ms each side
"""
self.model = model
self.threshold = threshold
self.sampling_rate = sampling_rate
if sampling_rate not in [8000, 16000]:
raise ValueError('VADIterator does not support sampling rates other than [8000, 16000]')
self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
self.reset_states()
def reset_states(self):
self.model.reset_states()
self.triggered = False
self.temp_end = 0
self.current_sample = 0
@torch.no_grad()
def __call__(self, x, return_seconds=False, time_resolution: int = 1):
"""
x: torch.Tensor
audio chunk (see examples in repo)
return_seconds: bool (default - False)
whether return timestamps in seconds (default - samples)
time_resolution: int (default - 1)
time resolution of speech coordinates when requested as seconds
"""
if not torch.is_tensor(x):
try:
x = torch.Tensor(x)
except:
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
window_size_samples = len(x[0]) if x.dim() == 2 else len(x)
self.current_sample += window_size_samples
speech_prob = self.model(x, self.sampling_rate).item()
if (speech_prob >= self.threshold) and self.temp_end:
self.temp_end = 0
if (speech_prob >= self.threshold) and not self.triggered:
self.triggered = True
speech_start = max(0, self.current_sample - self.speech_pad_samples - window_size_samples)
return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, time_resolution)}
if (speech_prob < self.threshold - 0.15) and self.triggered:
if not self.temp_end:
self.temp_end = self.current_sample
if self.current_sample - self.temp_end < self.min_silence_samples:
return None
else:
speech_end = self.temp_end + self.speech_pad_samples - window_size_samples
self.temp_end = 0
self.triggered = False
return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sampling_rate, time_resolution)}
return None
def collect_chunks(tss: List[dict],
wav: torch.Tensor,
seconds: bool = False,
sampling_rate: int = None) -> torch.Tensor:
"""Collect audio chunks from a longer audio clip
This method extracts audio chunks from an audio clip, using a list of
provided coordinates, and concatenates them together. Coordinates can be
passed either as sample numbers or in seconds, in which case the audio
sampling rate is also needed.
Parameters
----------
tss: List[dict]
Coordinate list of the clips to collect from the audio.
wav: torch.Tensor, one dimensional
One dimensional float torch.Tensor, containing the audio to clip.
seconds: bool (default - False)
Whether input coordinates are passed as seconds or samples.
sampling_rate: int (default - None)
Input audio sampling rate. Required if seconds is True.
Returns
-------
torch.Tensor, one dimensional
One dimensional float torch.Tensor of the concatenated clipped audio
chunks.
Raises
------
ValueError
Raised if sampling_rate is not provided when seconds is True.
"""
if seconds and not sampling_rate:
raise ValueError('sampling_rate must be provided when seconds is True')
chunks = list()
_tss = _seconds_to_samples_tss(tss, sampling_rate) if seconds else tss
for i in _tss:
chunks.append(wav[i['start']:i['end']])
return torch.cat(chunks)
def drop_chunks(tss: List[dict],
wav: torch.Tensor,
seconds: bool = False,
sampling_rate: int = None) -> torch.Tensor:
"""Drop audio chunks from a longer audio clip
This method extracts audio chunks from an audio clip, using a list of
provided coordinates, and drops them. Coordinates can be passed either as
sample numbers or in seconds, in which case the audio sampling rate is also
needed.
Parameters
----------
tss: List[dict]
Coordinate list of the clips to drop from from the audio.
wav: torch.Tensor, one dimensional
One dimensional float torch.Tensor, containing the audio to clip.
seconds: bool (default - False)
Whether input coordinates are passed as seconds or samples.
sampling_rate: int (default - None)
Input audio sampling rate. Required if seconds is True.
Returns
-------
torch.Tensor, one dimensional
One dimensional float torch.Tensor of the input audio minus the dropped
chunks.
Raises
------
ValueError
Raised if sampling_rate is not provided when seconds is True.
"""
if seconds and not sampling_rate:
raise ValueError('sampling_rate must be provided when seconds is True')
chunks = list()
cur_start = 0
_tss = _seconds_to_samples_tss(tss, sampling_rate) if seconds else tss
for i in _tss:
chunks.append((wav[cur_start: i['start']]))
cur_start = i['end']
chunks.append(wav[cur_start:])
return torch.cat(chunks)
def _seconds_to_samples_tss(tss: List[dict], sampling_rate: int) -> List[dict]:
"""Convert coordinates expressed in seconds to sample coordinates.
"""
return [{
'start': round(crd['start']) * sampling_rate,
'end': round(crd['end']) * sampling_rate
} for crd in tss]

BIN
tests/data/test.mp3 Normal file

Binary file not shown.

BIN
tests/data/test.opus Normal file

Binary file not shown.

BIN
tests/data/test.wav Normal file

Binary file not shown.

22
tests/test_basic.py Normal file
View File

@@ -0,0 +1,22 @@
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
import torch
torch.set_num_threads(1)
def test_jit_model():
model = load_silero_vad(onnx=False)
for path in ["tests/data/test.wav", "tests/data/test.opus", "tests/data/test.mp3"]:
audio = read_audio(path, sampling_rate=16000)
speech_timestamps = get_speech_timestamps(audio, model, visualize_probs=False, return_seconds=True)
assert speech_timestamps is not None
out = model.audio_forward(audio, sr=16000)
assert out is not None
def test_onnx_model():
model = load_silero_vad(onnx=True)
for path in ["tests/data/test.wav", "tests/data/test.opus", "tests/data/test.mp3"]:
audio = read_audio(path, sampling_rate=16000)
speech_timestamps = get_speech_timestamps(audio, model, visualize_probs=False, return_seconds=True)
assert speech_timestamps is not None
out = model.audio_forward(audio, sr=16000)
assert out is not None

74
tuning/README.md Normal file
View File

@@ -0,0 +1,74 @@
# Тюнинг Silero-VAD модели
> Код тюнинга создан при поддержке Фонда содействия инновациям в рамках федерального проекта «Искусственный
интеллект» национальной программы «Цифровая экономика Российской Федерации».
Тюнинг используется для улучшения качества детекции речи Silero-VAD модели на кастомных данных.
## Зависимости
Следующие зависимости используются при тюнинге VAD модели:
- `torchaudio>=0.12.0`
- `omegaconf>=2.3.0`
- `sklearn>=1.2.0`
- `torch>=1.12.0`
- `pandas>=2.2.2`
- `tqdm`
## Подготовка данных
Датафреймы для тюнинга должны быть подготовлены и сохранены в формате `.feather`. Следующие колонки в `.feather` файлах тренировки и валидации являются обязательными:
- **audio_path** - абсолютный путь до аудиофайла в дисковой системе. Аудиофайлы должны представлять собой `PCM` данные, предпочтительно в форматах `.wav` или `.opus` (иные популярные форматы аудио тоже поддерживаются). Для ускорения темпа дообучения рекомендуется предварительно выполнить ресемплинг аудиофайлов (изменить частоту дискретизации) до 16000 Гц;
- **speech_ts** - разметка для соответствующего аудиофайла. Список, состоящий из словарей формата `{'start': START_SEC, 'end': 'END_SEC'}`, где `START_SEC` и `END_SEC` - время начало и конца речевого отрезка в секундах соответственно. Для качественного дообучения рекомендуется использовать разметку с точностью до 30 миллисекунд.
Чем больше данных используется на этапе дообучения, тем эффективнее показывает себя адаптированная модель на целевом домене. Длина аудио не ограничена, т.к. каждое аудио будет обрезано до `max_train_length_sec` секунд перед подачей в нейросеть. Длинные аудио лучше предварительно порезать на кусочки длины `max_train_length_sec`.
Пример `.feather` датафрейма можно посмотреть в файле `example_dataframe.feather`
## Файл конфигурации `config.yml`
Файл конфигурации `config.yml` содержит пути до обучающей и валидационной выборки, а также параметры дообучения:
- `train_dataset_path` - абсолютный путь до тренировочного датафрейма в формате `.feather`. Должен содержать колонки `audio_path` и `speech_ts`, описанные в пункте "Подготовка данных". Пример устройства датафрейма можно посмотреть в `example_dataframe.feather`;
- `val_dataset_path` - абсолютный путь до валидационного датафрейма в формате `.feather`. Должен содержать колонки `audio_path` и `speech_ts`, описанные в пункте "Подготовка данных". Пример устройства датафрейма можно посмотреть в `example_dataframe.feather`;
- `jit_model_path` - абсолютный путь до Silero-VAD модели в формате `.jit`. Если оставить это поле пустым, то модель будет загружена из репозитория в зависимости от значения поля `use_torchhub`
- `use_torchhub` - Если `True`, то модель для дообучения будет загружена с помощью torch.hub. Если `False`, то модель для дообучения будет загружена с помощью библиотеки silero-vad (необходимо заранее установить командой `pip install silero-vad`);
- `tune_8k` - данный параметр отвечает, какую голову Silero-VAD дообучать. Если `True`, дообучаться будет голова с 8000 Гц частотой дискретизации, иначе с 16000 Гц;
- `model_save_path` - путь сохранения добученной модели;
- `noise_loss` - коэффициент лосса, применяемый для неречевых окон аудио;
- `max_train_length_sec` - максимальная длина аудио в секундах на этапе дообучения. Более длительные аудио будут обрезаны до этого показателя;
- `aug_prob` - вероятность применения аугментаций к аудиофайлу на этапе дообучения;
- `learning_rate` - темп дообучения;
- `batch_size` - размер батча при дообучении и валидации;
- `num_workers` - количество потоков, используемых для загрузки данных;
- `num_epochs` - количество эпох дообучения. За одну эпоху прогоняются все тренировочные данные;
- `device` - `cpu` или `cuda`.
## Дообучение
Дообучение запускается командой
`python tune.py`
Длится в течение `num_epochs`, лучший чекпоинт по показателю ROC-AUC на валидационной выборке будет сохранен в `model_save_path` в формате jit.
## Поиск пороговых значений
Порог на вход и порог на выход можно подобрать, используя команду
`python search_thresholds`
Данный скрипт использует файл конфигурации, описанный выше. Указанная в конфигурации модель будет использована для поиска оптимальных порогов на валидационном датасете.
## Цитирование
```
@misc{Silero VAD,
author = {Silero Team},
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad}},
commit = {insert_some_commit_here},
email = {hello@silero.ai}
}
```

0
tuning/__init__.py Normal file
View File

17
tuning/config.yml Normal file
View File

@@ -0,0 +1,17 @@
jit_model_path: '' # путь до Silero-VAD модели в формате jit, эта модель будет использована для дообучения. Если оставить поле пустым, то модель будет загружена автоматически
use_torchhub: True # jit модель будет загружена через torchhub, если True, или через pip, если False
tune_8k: False # дообучает 16к голову, если False, и 8к голову, если True
train_dataset_path: 'train_dataset_path.feather' # путь до датасета в формате feather для дообучения, подробности в README
val_dataset_path: 'val_dataset_path.feather' # путь до датасета в формате feather для валидации, подробности в README
model_save_path: 'model_save_path.jit' # путь сохранения дообученной модели
noise_loss: 0.5 # коэффициент, применяемый к лоссу на неречевых окнах
max_train_length_sec: 8 # во время тюнинга аудио длиннее будут обрезаны до данного значения
aug_prob: 0.4 # вероятность применения аугментаций к аудио в процессе дообучения
learning_rate: 5e-4 # темп дообучения модели
batch_size: 128 # размер батча при дообучении и валидации
num_workers: 4 # количество потоков, используемых для даталоадеров
num_epochs: 20 # количество эпох дообучения, 1 эпоха = полный прогон тренировочных данных
device: 'cuda' # cpu или cuda, на чем будет производится дообучение

Binary file not shown.

View File

@@ -0,0 +1,36 @@
from utils import init_jit_model, predict, calculate_best_thresholds, SileroVadDataset, SileroVadPadder
from omegaconf import OmegaConf
import torch
torch.set_num_threads(1)
if __name__ == '__main__':
config = OmegaConf.load('config.yml')
loader = torch.utils.data.DataLoader(SileroVadDataset(config, mode='val'),
batch_size=config.batch_size,
collate_fn=SileroVadPadder,
num_workers=config.num_workers)
if config.jit_model_path:
print(f'Loading model from the local folder: {config.jit_model_path}')
model = init_jit_model(config.jit_model_path, device=config.device)
else:
if config.use_torchhub:
print('Loading model using torch.hub')
model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_vad',
onnx=False,
force_reload=True)
else:
print('Loading model using silero-vad library')
from silero_vad import load_silero_vad
model = load_silero_vad(onnx=False)
print('Model loaded')
model.to(config.device)
print('Making predicts...')
all_predicts, all_gts = predict(model, loader, config.device, sr=8000 if config.tune_8k else 16000)
print('Calculating thresholds...')
best_ths_enter, best_ths_exit, best_acc = calculate_best_thresholds(all_predicts, all_gts)
print(f'Best threshold: {best_ths_enter}\nBest exit threshold: {best_ths_exit}\nBest accuracy: {best_acc}')

65
tuning/tune.py Normal file
View File

@@ -0,0 +1,65 @@
from utils import SileroVadDataset, SileroVadPadder, VADDecoderRNNJIT, train, validate, init_jit_model
from omegaconf import OmegaConf
import torch.nn as nn
import torch
if __name__ == '__main__':
config = OmegaConf.load('config.yml')
train_dataset = SileroVadDataset(config, mode='train')
train_loader = torch.utils.data.DataLoader(train_dataset,
batch_size=config.batch_size,
collate_fn=SileroVadPadder,
num_workers=config.num_workers)
val_dataset = SileroVadDataset(config, mode='val')
val_loader = torch.utils.data.DataLoader(val_dataset,
batch_size=config.batch_size,
collate_fn=SileroVadPadder,
num_workers=config.num_workers)
if config.jit_model_path:
print(f'Loading model from the local folder: {config.jit_model_path}')
model = init_jit_model(config.jit_model_path, device=config.device)
else:
if config.use_torchhub:
print('Loading model using torch.hub')
model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad',
model='silero_vad',
onnx=False,
force_reload=True)
else:
print('Loading model using silero-vad library')
from silero_vad import load_silero_vad
model = load_silero_vad(onnx=False)
print('Model loaded')
model.to(config.device)
decoder = VADDecoderRNNJIT().to(config.device)
decoder.load_state_dict(model._model_8k.decoder.state_dict() if config.tune_8k else model._model.decoder.state_dict())
decoder.train()
params = decoder.parameters()
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, params),
lr=config.learning_rate)
criterion = nn.BCELoss(reduction='none')
best_val_roc = 0
for i in range(config.num_epochs):
print(f'Starting epoch {i + 1}')
train_loss = train(config, train_loader, model, decoder, criterion, optimizer, config.device)
val_loss, val_roc = validate(config, val_loader, model, decoder, criterion, config.device)
print(f'Metrics after epoch {i + 1}:\n'
f'\tTrain loss: {round(train_loss, 3)}\n',
f'\tValidation loss: {round(val_loss, 3)}\n'
f'\tValidation ROC-AUC: {round(val_roc, 3)}')
if val_roc > best_val_roc:
print('New best ROC-AUC, saving model')
best_val_roc = val_roc
if config.tune_8k:
model._model_8k.decoder.load_state_dict(decoder.state_dict())
else:
model._model.decoder.load_state_dict(decoder.state_dict())
torch.jit.save(model, config.model_save_path)
print('Done')

356
tuning/utils.py Normal file
View File

@@ -0,0 +1,356 @@
from sklearn.metrics import roc_auc_score, accuracy_score
from torch.utils.data import Dataset
import torch.nn as nn
from tqdm import tqdm
import pandas as pd
import numpy as np
import torchaudio
import warnings
import random
import torch
import gc
warnings.filterwarnings('ignore')
def read_audio(path: str,
sampling_rate: int = 16000,
normalize=False):
wav, sr = torchaudio.load(path)
if wav.size(0) > 1:
wav = wav.mean(dim=0, keepdim=True)
if sampling_rate:
if sr != sampling_rate:
transform = torchaudio.transforms.Resample(orig_freq=sr,
new_freq=sampling_rate)
wav = transform(wav)
sr = sampling_rate
if normalize and wav.abs().max() != 0:
wav = wav / wav.abs().max()
return wav.squeeze(0)
def build_audiomentations_augs(p):
from audiomentations import SomeOf, AirAbsorption, BandPassFilter, BandStopFilter, ClippingDistortion, HighPassFilter, HighShelfFilter, \
LowPassFilter, LowShelfFilter, Mp3Compression, PeakingFilter, PitchShift, RoomSimulator, SevenBandParametricEQ, \
Aliasing, AddGaussianNoise
transforms = [Aliasing(p=1),
AddGaussianNoise(p=1),
AirAbsorption(p=1),
BandPassFilter(p=1),
BandStopFilter(p=1),
ClippingDistortion(p=1),
HighPassFilter(p=1),
HighShelfFilter(p=1),
LowPassFilter(p=1),
LowShelfFilter(p=1),
Mp3Compression(p=1),
PeakingFilter(p=1),
PitchShift(p=1),
RoomSimulator(p=1, leave_length_unchanged=True),
SevenBandParametricEQ(p=1)]
tr = SomeOf((1, 3), transforms=transforms, p=p)
return tr
class SileroVadDataset(Dataset):
def __init__(self,
config,
mode='train'):
self.num_samples = 512 # constant, do not change
self.sr = 16000 # constant, do not change
self.resample_to_8k = config.tune_8k
self.noise_loss = config.noise_loss
self.max_train_length_sec = config.max_train_length_sec
self.max_train_length_samples = config.max_train_length_sec * self.sr
assert self.max_train_length_samples % self.num_samples == 0
assert mode in ['train', 'val']
dataset_path = config.train_dataset_path if mode == 'train' else config.val_dataset_path
self.dataframe = pd.read_feather(dataset_path).reset_index(drop=True)
self.index_dict = self.dataframe.to_dict('index')
self.mode = mode
print(f'DATASET SIZE : {len(self.dataframe)}')
if mode == 'train':
self.augs = build_audiomentations_augs(p=config.aug_prob)
else:
self.augs = None
def __getitem__(self, idx):
idx = None if self.mode == 'train' else idx
wav, gt, mask = self.load_speech_sample(idx)
if self.mode == 'train':
wav = self.add_augs(wav)
if len(wav) > self.max_train_length_samples:
wav = wav[:self.max_train_length_samples]
gt = gt[:int(self.max_train_length_samples / self.num_samples)]
mask = mask[:int(self.max_train_length_samples / self.num_samples)]
wav = torch.FloatTensor(wav)
if self.resample_to_8k:
transform = torchaudio.transforms.Resample(orig_freq=self.sr,
new_freq=8000)
wav = transform(wav)
return wav, torch.FloatTensor(gt), torch.from_numpy(mask)
def __len__(self):
return len(self.index_dict)
def load_speech_sample(self, idx=None):
if idx is None:
idx = random.randint(0, len(self.index_dict) - 1)
wav = read_audio(self.index_dict[idx]['audio_path'], self.sr).numpy()
if len(wav) % self.num_samples != 0:
pad_num = self.num_samples - (len(wav) % (self.num_samples))
wav = np.pad(wav, (0, pad_num), 'constant', constant_values=0)
gt, mask = self.get_ground_truth_annotated(self.index_dict[idx]['speech_ts'], len(wav))
assert len(gt) == len(wav) / self.num_samples
return wav, gt, mask
def get_ground_truth_annotated(self, annotation, audio_length_samples):
gt = np.zeros(audio_length_samples)
for i in annotation:
gt[int(i['start'] * self.sr): int(i['end'] * self.sr)] = 1
squeezed_predicts = np.average(gt.reshape(-1, self.num_samples), axis=1)
squeezed_predicts = (squeezed_predicts > 0.5).astype(int)
mask = np.ones(len(squeezed_predicts))
mask[squeezed_predicts == 0] = self.noise_loss
return squeezed_predicts, mask
def add_augs(self, wav):
while True:
try:
wav_aug = self.augs(wav, self.sr)
if np.isnan(wav_aug.max()) or np.isnan(wav_aug.min()):
return wav
return wav_aug
except Exception as e:
continue
def SileroVadPadder(batch):
wavs = [batch[i][0] for i in range(len(batch))]
labels = [batch[i][1] for i in range(len(batch))]
masks = [batch[i][2] for i in range(len(batch))]
wavs = torch.nn.utils.rnn.pad_sequence(
wavs, batch_first=True, padding_value=0)
labels = torch.nn.utils.rnn.pad_sequence(
labels, batch_first=True, padding_value=0)
masks = torch.nn.utils.rnn.pad_sequence(
masks, batch_first=True, padding_value=0)
return wavs, labels, masks
class VADDecoderRNNJIT(nn.Module):
def __init__(self):
super(VADDecoderRNNJIT, self).__init__()
self.rnn = nn.LSTMCell(128, 128)
self.decoder = nn.Sequential(nn.Dropout(0.1),
nn.ReLU(),
nn.Conv1d(128, 1, kernel_size=1),
nn.Sigmoid())
def forward(self, x, state=torch.zeros(0)):
x = x.squeeze(-1)
if len(state):
h, c = self.rnn(x, (state[0], state[1]))
else:
h, c = self.rnn(x)
x = h.unsqueeze(-1).float()
state = torch.stack([h, c])
x = self.decoder(x)
return x, state
class AverageMeter(object):
"""Computes and stores the average and current value"""
def __init__(self):
self.reset()
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
def train(config,
loader,
jit_model,
decoder,
criterion,
optimizer,
device):
losses = AverageMeter()
decoder.train()
context_size = 32 if config.tune_8k else 64
num_samples = 256 if config.tune_8k else 512
stft_layer = jit_model._model_8k.stft if config.tune_8k else jit_model._model.stft
encoder_layer = jit_model._model_8k.encoder if config.tune_8k else jit_model._model.encoder
with torch.enable_grad():
for _, (x, targets, masks) in tqdm(enumerate(loader), total=len(loader)):
targets = targets.to(device)
x = x.to(device)
masks = masks.to(device)
x = torch.nn.functional.pad(x, (context_size, 0))
outs = []
state = torch.zeros(0)
for i in range(context_size, x.shape[1], num_samples):
input_ = x[:, i-context_size:i+num_samples]
out = stft_layer(input_)
out = encoder_layer(out)
out, state = decoder(out, state)
outs.append(out)
stacked = torch.cat(outs, dim=2).squeeze(1)
loss = criterion(stacked, targets)
loss = (loss * masks).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
losses.update(loss.item(), masks.numel())
torch.cuda.empty_cache()
gc.collect()
return losses.avg
def validate(config,
loader,
jit_model,
decoder,
criterion,
device):
losses = AverageMeter()
decoder.eval()
predicts = []
gts = []
context_size = 32 if config.tune_8k else 64
num_samples = 256 if config.tune_8k else 512
stft_layer = jit_model._model_8k.stft if config.tune_8k else jit_model._model.stft
encoder_layer = jit_model._model_8k.encoder if config.tune_8k else jit_model._model.encoder
with torch.no_grad():
for _, (x, targets, masks) in tqdm(enumerate(loader), total=len(loader)):
targets = targets.to(device)
x = x.to(device)
masks = masks.to(device)
x = torch.nn.functional.pad(x, (context_size, 0))
outs = []
state = torch.zeros(0)
for i in range(context_size, x.shape[1], num_samples):
input_ = x[:, i-context_size:i+num_samples]
out = stft_layer(input_)
out = encoder_layer(out)
out, state = decoder(out, state)
outs.append(out)
stacked = torch.cat(outs, dim=2).squeeze(1)
predicts.extend(stacked[masks != 0].tolist())
gts.extend(targets[masks != 0].tolist())
loss = criterion(stacked, targets)
loss = (loss * masks).mean()
losses.update(loss.item(), masks.numel())
score = roc_auc_score(gts, predicts)
torch.cuda.empty_cache()
gc.collect()
return losses.avg, round(score, 3)
def init_jit_model(model_path: str,
device=torch.device('cpu')):
torch.set_grad_enabled(False)
model = torch.jit.load(model_path, map_location=device)
model.eval()
return model
def predict(model, loader, device, sr):
with torch.no_grad():
all_predicts = []
all_gts = []
for _, (x, targets, masks) in tqdm(enumerate(loader), total=len(loader)):
x = x.to(device)
out = model.audio_forward(x, sr=sr)
for i, out_chunk in enumerate(out):
predict = out_chunk[masks[i] != 0].cpu().tolist()
gt = targets[i, masks[i] != 0].cpu().tolist()
all_predicts.append(predict)
all_gts.append(gt)
return all_predicts, all_gts
def calculate_best_thresholds(all_predicts, all_gts):
best_acc = 0
for ths_enter in tqdm(np.linspace(0, 1, 20)):
for ths_exit in np.linspace(0, 1, 20):
if ths_exit >= ths_enter:
continue
accs = []
for j, predict in enumerate(all_predicts):
predict_bool = []
is_speech = False
for i in predict:
if i >= ths_enter:
is_speech = True
predict_bool.append(1)
elif i <= ths_exit:
is_speech = False
predict_bool.append(0)
else:
val = 1 if is_speech else 0
predict_bool.append(val)
score = round(accuracy_score(all_gts[j], predict_bool), 4)
accs.append(score)
mean_acc = round(np.mean(accs), 3)
if mean_acc > best_acc:
best_acc = mean_acc
best_ths_enter = round(ths_enter, 2)
best_ths_exit = round(ths_exit, 2)
return best_ths_enter, best_ths_exit, best_acc

View File

@@ -1,424 +0,0 @@
import torch
import torchaudio
from typing import List
import torch.nn.functional as F
import warnings
languages = ['ru', 'en', 'de', 'es']
class OnnxWrapper():
def __init__(self, path):
import numpy as np
global np
import onnxruntime
self.session = onnxruntime.InferenceSession(path)
self.session.intra_op_num_threads = 1
self.session.inter_op_num_threads = 1
self.reset_states()
def reset_states(self):
self._h = np.zeros((2, 1, 64)).astype('float32')
self._c = np.zeros((2, 1, 64)).astype('float32')
def __call__(self, x, sr: int):
if x.dim() == 1:
x = x.unsqueeze(0)
if x.dim() > 2:
raise ValueError(f"Too many dimensions for input audio chunk {x.dim()}")
if x.shape[0] > 1:
raise ValueError("Onnx model does not support batching")
if sr not in [16000]:
raise ValueError(f"Supported sample rates: {[16000]}")
if sr / x.shape[1] > 31.25:
raise ValueError("Input audio chunk is too short")
ort_inputs = {'input': x.numpy(), 'h0': self._h, 'c0': self._c}
ort_outs = self.session.run(None, ort_inputs)
out, self._h, self._c = ort_outs
out = torch.tensor(out).squeeze(2)[:, 1] # make output type match JIT analog
return out
class Validator():
def __init__(self, url):
self.onnx = True if url.endswith('.onnx') else False
torch.hub.download_url_to_file(url, 'inf.model')
if self.onnx:
import onnxruntime
self.model = onnxruntime.InferenceSession('inf.model')
else:
self.model = init_jit_model(model_path='inf.model')
def __call__(self, inputs: torch.Tensor):
with torch.no_grad():
if self.onnx:
ort_inputs = {'input': inputs.cpu().numpy()}
outs = self.model.run(None, ort_inputs)
outs = [torch.Tensor(x) for x in outs]
else:
outs = self.model(inputs)
return outs
def read_audio(path: str,
sampling_rate: int = 16000):
wav, sr = torchaudio.load(path)
if wav.size(0) > 1:
wav = wav.mean(dim=0, keepdim=True)
if sr != sampling_rate:
transform = torchaudio.transforms.Resample(orig_freq=sr,
new_freq=sampling_rate)
wav = transform(wav)
sr = sampling_rate
assert sr == sampling_rate
return wav.squeeze(0)
def save_audio(path: str,
tensor: torch.Tensor,
sampling_rate: int = 16000):
torchaudio.save(path, tensor.unsqueeze(0), sampling_rate)
def init_jit_model(model_path: str,
device=torch.device('cpu')):
torch.set_grad_enabled(False)
model = torch.jit.load(model_path, map_location=device)
model.eval()
return model
def make_visualization(probs, step):
import pandas as pd
pd.DataFrame({'probs': probs},
index=[x * step for x in range(len(probs))]).plot(figsize=(16, 8),
kind='area', ylim=[0, 1.05], xlim=[0, len(probs) * step],
xlabel='seconds',
ylabel='speech probability',
colormap='tab20')
def get_speech_timestamps(audio: torch.Tensor,
model,
threshold: float = 0.5,
sampling_rate: int = 16000,
min_speech_duration_ms: int = 250,
min_silence_duration_ms: int = 100,
window_size_samples: int = 1536,
speech_pad_ms: int = 30,
return_seconds: bool = False,
visualize_probs: bool = False):
"""
This method is used for splitting long audios into speech chunks using silero VAD
Parameters
----------
audio: torch.Tensor, one dimensional
One dimensional float torch.Tensor, other types are casted to torch if possible
model: preloaded .jit silero VAD model
threshold: float (default - 0.5)
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
sampling_rate: int (default - 16000)
Currently silero VAD models support 8000 and 16000 sample rates
min_speech_duration_ms: int (default - 250 milliseconds)
Final speech chunks shorter min_speech_duration_ms are thrown out
min_silence_duration_ms: int (default - 100 milliseconds)
In the end of each speech chunk wait for min_silence_duration_ms before separating it
window_size_samples: int (default - 1536 samples)
Audio chunks of window_size_samples size are fed to the silero VAD model.
WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate.
Values other than these may affect model perfomance!!
speech_pad_ms: int (default - 30 milliseconds)
Final speech chunks are padded by speech_pad_ms each side
return_seconds: bool (default - False)
whether return timestamps in seconds (default - samples)
visualize_probs: bool (default - False)
whether draw prob hist or not
Returns
----------
speeches: list of dicts
list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
"""
if not torch.is_tensor(audio):
try:
audio = torch.Tensor(audio)
except:
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
if len(audio.shape) > 1:
for i in range(len(audio.shape)): # trying to squeeze empty dimensions
audio = audio.squeeze(0)
if len(audio.shape) > 1:
raise ValueError("More than one dimension in audio. Are you trying to process audio with 2 channels?")
if sampling_rate == 8000 and window_size_samples > 768:
warnings.warn('window_size_samples is too big for 8000 sampling_rate! Better set window_size_samples to 256, 512 or 1536 for 8000 sample rate!')
if window_size_samples not in [256, 512, 768, 1024, 1536]:
warnings.warn('Unusual window_size_samples! Supported window_size_samples:\n - [512, 1024, 1536] for 16000 sampling_rate\n - [256, 512, 768] for 8000 sampling_rate')
model.reset_states()
min_speech_samples = sampling_rate * min_speech_duration_ms / 1000
min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
speech_pad_samples = sampling_rate * speech_pad_ms / 1000
audio_length_samples = len(audio)
speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
chunk = audio[current_start_sample: current_start_sample + window_size_samples]
if len(chunk) < window_size_samples:
chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
speech_prob = model(chunk, sampling_rate).item()
speech_probs.append(speech_prob)
triggered = False
speeches = []
current_speech = {}
neg_threshold = threshold - 0.15
temp_end = 0
for i, speech_prob in enumerate(speech_probs):
if (speech_prob >= threshold) and temp_end:
temp_end = 0
if (speech_prob >= threshold) and not triggered:
triggered = True
current_speech['start'] = window_size_samples * i
continue
if (speech_prob < neg_threshold) and triggered:
if not temp_end:
temp_end = window_size_samples * i
if (window_size_samples * i) - temp_end < min_silence_samples:
continue
else:
current_speech['end'] = temp_end
if (current_speech['end'] - current_speech['start']) > min_speech_samples:
speeches.append(current_speech)
temp_end = 0
current_speech = {}
triggered = False
continue
if current_speech:
current_speech['end'] = audio_length_samples
speeches.append(current_speech)
for i, speech in enumerate(speeches):
if i == 0:
speech['start'] = int(max(0, speech['start'] - speech_pad_samples))
if i != len(speeches) - 1:
silence_duration = speeches[i+1]['start'] - speech['end']
if silence_duration < 2 * speech_pad_samples:
speech['end'] += int(silence_duration // 2)
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - silence_duration // 2))
else:
speech['end'] += int(speech_pad_samples)
else:
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
if return_seconds:
for speech_dict in speeches:
speech_dict['start'] = round(speech_dict['start'] / sampling_rate, 1)
speech_dict['end'] = round(speech_dict['end'] / sampling_rate, 1)
if visualize_probs:
make_visualization(speech_probs, window_size_samples / sampling_rate)
return speeches
def get_number_ts(wav: torch.Tensor,
model,
model_stride=8,
hop_length=160,
sample_rate=16000):
wav = torch.unsqueeze(wav, dim=0)
perframe_logits = model(wav)[0]
perframe_preds = torch.argmax(torch.softmax(perframe_logits, dim=1), dim=1).squeeze() # (1, num_frames_strided)
extended_preds = []
for i in perframe_preds:
extended_preds.extend([i.item()] * model_stride)
# len(extended_preds) is *num_frames_real*; for each frame of audio we know if it has a number in it.
triggered = False
timings = []
cur_timing = {}
for i, pred in enumerate(extended_preds):
if pred == 1:
if not triggered:
cur_timing['start'] = int((i * hop_length) / (sample_rate / 1000))
triggered = True
elif pred == 0:
if triggered:
cur_timing['end'] = int((i * hop_length) / (sample_rate / 1000))
timings.append(cur_timing)
cur_timing = {}
triggered = False
if cur_timing:
cur_timing['end'] = int(len(wav) / (sample_rate / 1000))
timings.append(cur_timing)
return timings
def get_language(wav: torch.Tensor,
model):
wav = torch.unsqueeze(wav, dim=0)
lang_logits = model(wav)[2]
lang_pred = torch.argmax(torch.softmax(lang_logits, dim=1), dim=1).item() # from 0 to len(languages) - 1
assert lang_pred < len(languages)
return languages[lang_pred]
def get_language_and_group(wav: torch.Tensor,
model,
lang_dict: dict,
lang_group_dict: dict,
top_n=1):
wav = torch.unsqueeze(wav, dim=0)
lang_logits, lang_group_logits = model(wav)
softm = torch.softmax(lang_logits, dim=1).squeeze()
softm_group = torch.softmax(lang_group_logits, dim=1).squeeze()
srtd = torch.argsort(softm, descending=True)
srtd_group = torch.argsort(softm_group, descending=True)
outs = []
outs_group = []
for i in range(top_n):
prob = round(softm[srtd[i]].item(), 2)
prob_group = round(softm_group[srtd_group[i]].item(), 2)
outs.append((lang_dict[str(srtd[i].item())], prob))
outs_group.append((lang_group_dict[str(srtd_group[i].item())], prob_group))
return outs, outs_group
class VADIterator:
def __init__(self,
model,
threshold: float = 0.5,
sampling_rate: int = 16000,
min_silence_duration_ms: int = 100,
speech_pad_ms: int = 30
):
"""
Class for stream imitation
Parameters
----------
model: preloaded .jit silero VAD model
threshold: float (default - 0.5)
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
sampling_rate: int (default - 16000)
Currently silero VAD models support 8000 and 16000 sample rates
min_silence_duration_ms: int (default - 100 milliseconds)
In the end of each speech chunk wait for min_silence_duration_ms before separating it
speech_pad_ms: int (default - 30 milliseconds)
Final speech chunks are padded by speech_pad_ms each side
"""
self.model = model
self.threshold = threshold
self.sampling_rate = sampling_rate
self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
self.reset_states()
def reset_states(self):
self.model.reset_states()
self.triggered = False
self.temp_end = 0
self.current_sample = 0
def __call__(self, x, return_seconds=False):
"""
x: torch.Tensor
audio chunk (see examples in repo)
return_seconds: bool (default - False)
whether return timestamps in seconds (default - samples)
"""
if not torch.is_tensor(x):
try:
x = torch.Tensor(x)
except:
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
window_size_samples = len(x[0]) if x.dim() == 2 else len(x)
self.current_sample += window_size_samples
speech_prob = self.model(x, self.sampling_rate).item()
if (speech_prob >= self.threshold) and self.temp_end:
self.temp_end = 0
if (speech_prob >= self.threshold) and not self.triggered:
self.triggered = True
speech_start = self.current_sample - self.speech_pad_samples
return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, 1)}
if (speech_prob < self.threshold - 0.15) and self.triggered:
if not self.temp_end:
self.temp_end = self.current_sample
if self.current_sample - self.temp_end < self.min_silence_samples:
return None
else:
speech_end = self.temp_end + self.speech_pad_samples
self.temp_end = 0
self.triggered = False
return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sampling_rate, 1)}
return None
def collect_chunks(tss: List[dict],
wav: torch.Tensor):
chunks = []
for i in tss:
chunks.append(wav[i['start']: i['end']])
return torch.cat(chunks)
def drop_chunks(tss: List[dict],
wav: torch.Tensor):
chunks = []
cur_start = 0
for i in tss:
chunks.append((wav[cur_start: i['start']]))
cur_start = i['end']
return torch.cat(chunks)