Merge pull request #719 from snakers4/adamnsandle

Adamnsandle
fix workflow
2026-02-05 01:49:22 +08:00 · 2025-11-06 11:25:49 +03:00 · 2025-11-06 08:18:46 +00:00 · 2025-11-06 08:04:02 +00:00 · 2025-11-06 07:49:44 +00:00 · 2025-11-06 07:36:38 +00:00
39 changed files with 1881 additions and 592 deletions
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -0,0 +1,40 @@
+name: Test Package
+
+on:
+  workflow_dispatch:       # manual trigger
+
+jobs:
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, windows-latest, macos-latest]
+        python-version: ["3.8","3.9","3.10","3.11","3.12","3.13"]
+
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: ${{ matrix.python-version }}
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install build hatchling pytest soundfile
+        pip install .[test]
+
+    - name: Build package
+      run: python -m build --wheel --outdir dist
+
+    - name: Install package
+      run: |
+        import glob, subprocess, sys
+        whl = glob.glob("dist/*.whl")[0]
+        subprocess.check_call([sys.executable, "-m", "pip", "install", whl])
+      shell: python
+
+    - name: Run tests
+      run: pytest tests
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -0,0 +1,20 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+title: "Silero VAD"
+authors:
+  - family-names: "Silero Team"
+    email: "hello@silero.ai"
+type: software
+repository-code: "https://github.com/snakers4/silero-vad"
+license: MIT
+abstract: "Pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier"
+preferred-citation:
+  type: software
+  authors:
+    - family-names: "Silero Team"
+      email: "hello@silero.ai"
+  title: "Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier"
+  year: 2024
+  publisher: "GitHub"
+  journal: "GitHub repository"
+  howpublished: "https://github.com/snakers4/silero-vad"
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
-[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/silero_speech) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE)
+[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/silero_speech) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE) [![downloads](https://img.shields.io/pypi/dm/silero-vad?style=for-the-badge)](https://pypi.org/project/silero-vad/) 

-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb) [![Test Package](https://github.com/snakers4/silero-vad/actions/workflows/test.yml/badge.svg)](https://github.com/snakers4/silero-vad/actions/workflows/test.yml) [![Pypi version](https://img.shields.io/pypi/v/silero-vad)](https://pypi.org/project/silero-vad/) [![Python version](https://img.shields.io/pypi/pyversions/silero-vad)](https://pypi.org/project/silero-vad)

 ![header](https://user-images.githubusercontent.com/12515440/89997349-b3523080-dc94-11ea-9906-ca2e8bc50535.png)

@@ -13,7 +13,7 @@
 <br/>

 <p align="center">
-  <img src="https://github.com/snakers4/silero-vad/assets/36505480/300bd062-4da5-4f19-9736-9c144a45d7a7" />
+  <img src="https://github.com/user-attachments/assets/f2940867-0a51-4bdb-8c14-1129d3c44e64" />
 </p>


@@ -22,6 +22,8 @@

 https://user-images.githubusercontent.com/36505480/144874384-95f80f6d-a4f1-42cc-9be7-004c891dd481.mp4

+Please note that the video loads only if you are logged in to your GitHub account.
+
 </details>

 <br/>
@@ -64,7 +66,11 @@ If you are planning to run the VAD using solely the `onnx-runtime`, it will run
 from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
 model = load_silero_vad()
 wav = read_audio('path_to_audio_file')
-speech_timestamps = get_speech_timestamps(wav, model)
+speech_timestamps = get_speech_timestamps(
+  wav,
+  model,
+  return_seconds=True,  # Return speech timestamps in seconds (default is samples)
+)
 ```

 **Using torch.hub**:
@@ -76,7 +82,11 @@ model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_v
 (get_speech_timestamps, _, read_audio, _, _) = utils

 wav = read_audio('path_to_audio_file')
-speech_timestamps = get_speech_timestamps(wav, model)
+speech_timestamps = get_speech_timestamps(
+  wav,
+  model,
+  return_seconds=True,  # Return speech timestamps in seconds (default is samples)
+)
 ```

 <br/>
@@ -165,4 +175,4 @@ Please see our [wiki](https://github.com/snakers4/silero-models/wiki) for releva

 - Voice activity detection for the [browser](https://github.com/ricky0123/vad) using ONNX Runtime Web

- [Rust](https://github.com/snakers4/silero-vad/tree/master/examples/rust-example), [Go](https://github.com/snakers4/silero-vad/tree/master/examples/go), [Java](https://github.com/snakers4/silero-vad/tree/master/examples/java-example) and [other](https://github.com/snakers4/silero-vad/tree/master/examples) examples
+- [Rust](https://github.com/snakers4/silero-vad/tree/master/examples/rust-example), [Go](https://github.com/snakers4/silero-vad/tree/master/examples/go), [Java](https://github.com/snakers4/silero-vad/tree/master/examples/java-example), [C++](https://github.com/snakers4/silero-vad/tree/master/examples/cpp), [C#](https://github.com/snakers4/silero-vad/tree/master/examples/csharp) and [other](https://github.com/snakers4/silero-vad/tree/master/examples) community examples
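
Note on the `return_seconds=True` change above: per the README comment, the default output is in samples. The following is a minimal sketch of what the conversion to seconds amounts to. It assumes a 16 kHz sampling rate and a list of dicts with `start` and `end` keys; `samples_to_seconds` is a hypothetical helper written for illustration, not part of the `silero_vad` package, and the library's own output may round differently.

```python
def samples_to_seconds(timestamps, sampling_rate=16000):
    """Convert sample offsets to seconds.

    `timestamps` is assumed to be a list of dicts with integer
    'start' and 'end' sample indices (an assumption for this sketch).
    """
    return [
        {"start": ts["start"] / sampling_rate, "end": ts["end"] / sampling_rate}
        for ts in timestamps
    ]


# Hypothetical sample-based timestamps for a 16 kHz recording.
example = [{"start": 16000, "end": 48000}, {"start": 64000, "end": 72000}]
print(samples_to_seconds(example))
# [{'start': 1.0, 'end': 3.0}, {'start': 4.0, 'end': 4.5}]
```

Keeping timestamps in samples is convenient for slicing the waveform directly, while seconds are easier to read and to pass to downstream tools.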
--- a/examples/colab_record_example.ipynb
+++ b/examples/colab_record_example.ipynb
@@ -17,6 +17,7 @@
   },
   "outputs": [],
   "source": [
+    "#!apt install ffmpeg\n",
    "!pip -q install pydub\n",
    "from google.colab import output\n",
    "from base64 import b64decode, b64encode\n",
@@ -37,13 +38,12 @@
    "                              model='silero_vad',\n",
    "                              force_reload=True)\n",
    "\n",
-    "def int2float(sound):\n",
-    "    abs_max = np.abs(sound).max()\n",
-    "    sound = sound.astype('float32')\n",
-    "    if abs_max > 0:\n",
-    "        sound *= 1/32768\n",
-    "    sound = sound.squeeze()\n",
-    "    return sound\n",
+    "def int2float(audio):\n",
+    "    # Convert a pydub AudioSegment to a peak-normalized float32 array.\n",
+    "    samples = audio.get_array_of_samples()\n",
+    "    arr = np.array(samples).astype(np.float32)\n",
+    "    peak = np.abs(arr).max()\n",
+    "    return arr / peak if peak > 0 else arr\n",
    "\n",
    "AUDIO_HTML = \"\"\"\n",
    "<script>\n",
@@ -68,10 +68,10 @@
    "    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k\n",
    "    mimeType : 'audio/webm;codecs=opus'\n",
    "    //mimeType : 'audio/webm;codecs=pcm'\n",
-    "  };            \n",
+    "  };\n",
    "  //recorder = new MediaRecorder(stream, options);\n",
    "  recorder = new MediaRecorder(stream);\n",
-    "  recorder.ondataavailable = function(e) {            \n",
+    "  recorder.ondataavailable = function(e) {\n",
    "    var url = URL.createObjectURL(e.data);\n",
    "    // var preview = document.createElement('audio');\n",
    "    // preview.controls = true;\n",
@@ -79,7 +79,7 @@
    "    // document.body.appendChild(preview);\n",
    "\n",
    "    reader = new FileReader();\n",
-    "    reader.readAsDataURL(e.data); \n",
+    "    reader.readAsDataURL(e.data);\n",
    "    reader.onloadend = function() {\n",
    "      base64data = reader.result;\n",
    "      //console.log(\"Inside FileReader:\" + base64data);\n",
@@ -121,7 +121,7 @@
    "\n",
    "}\n",
    "});\n",
-    "      \n",
+    "\n",
    "</script>\n",
    "\"\"\"\n",
    "\n",
@@ -133,8 +133,8 @@
    "    audio.export('test.mp3', format='mp3')\n",
    "    audio = audio.set_channels(1)\n",
    "    audio = audio.set_frame_rate(16000)\n",
-    "    audio_float = int2float(np.array(audio.get_array_of_samples()))\n",
-    "    audio_tens = torch.tensor(audio_float )\n",
+    "    audio_float = int2float(audio)\n",
+    "    audio_tens = torch.tensor(audio_float)\n",
    "    return audio_tens\n",
    "\n",
    "def make_animation(probs, audio_duration, interval=40):\n",
@@ -154,19 +154,18 @@
    "    def animate(i):\n",
    "        x = i * interval / 1000 - 0.04\n",
    "        y = np.linspace(0, 1.02, 2)\n",
-    "        \n",
+    "\n",
    "        line.set_data(x, y)\n",
    "        line.set_color('#990000')\n",
    "        return line,\n",
+    "    anim = FuncAnimation(fig, animate, init_func=init, interval=interval, save_count=int(audio_duration / (interval / 1000)))\n",
    "\n",
-    "    anim = FuncAnimation(fig, animate, init_func=init, interval=interval, save_count=audio_duration / (interval / 1000))\n",
-    "\n",
-    "    f = r\"animation.mp4\" \n",
-    "    writervideo = FFMpegWriter(fps=1000/interval) \n",
+    "    f = r\"animation.mp4\"\n",
+    "    writervideo = FFMpegWriter(fps=1000/interval)\n",
    "    anim.save(f, writer=writervideo)\n",
    "    plt.close('all')\n",
    "\n",
-    "def combine_audio(vidname, audname, outname, fps=25): \n",
+    "def combine_audio(vidname, audname, outname, fps=25):\n",
    "    my_clip = mpe.VideoFileClip(vidname, verbose=False)\n",
    "    audio_background = mpe.AudioFileClip(audname)\n",
    "    final_clip = my_clip.set_audio(audio_background)\n",
@@ -174,15 +173,10 @@
    "\n",
    "def record_make_animation():\n",
    "  tensor = record()\n",
-    "\n",
    "  print('Calculating probabilities...')\n",
    "  speech_probs = []\n",
    "  window_size_samples = 512\n",
-    "  for i in range(0, len(tensor), window_size_samples):\n",
-    "      if len(tensor[i: i+ window_size_samples]) < window_size_samples:\n",
-    "        break\n",
-    "      speech_prob = model(tensor[i: i+ window_size_samples], 16000).item()\n",
-    "      speech_probs.append(speech_prob)\n",
+    "  speech_probs = model.audio_forward(tensor, sr=16000)[0].tolist()\n",
    "  model.reset_states()\n",
    "  print('Making animation...')\n",
    "  make_animation(speech_probs, len(tensor) / 16000)\n",
@@ -196,7 +190,9 @@
    "  <video width=800 controls>\n",
    "        <source src=\"%s\" type=\"video/mp4\">\n",
    "  </video>\n",
-    "  \"\"\" % data_url))"
+    "  \"\"\" % data_url))\n",
+    "\n",
+    "  return speech_probs"
   ]
  },
  {
@@ -216,7 +212,7 @@
   },
   "outputs": [],
   "source": [
-    "record_make_animation()"
+    "speech_probs = record_make_animation()"
   ]
  }
 ],
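
Note on the notebook change above: the removed loop walked the tensor in 512-sample windows and stopped at the first incomplete window, so a trailing partial window was never scored. The new code delegates the windowing to `model.audio_forward`. The sketch below only illustrates the windowing idea in plain Python; `split_into_windows` is a hypothetical helper, and zero-padding the last window is an assumption made for this example, since the diff does not show how `audio_forward` treats the tail.

```python
def split_into_windows(samples, window_size=512, pad_value=0.0):
    """Split a 1-D sequence into fixed-size windows.

    The final window is padded with `pad_value` so that every window has
    exactly `window_size` entries (an assumption for this sketch; the
    removed notebook loop dropped the incomplete tail instead).
    """
    windows = []
    for start in range(0, len(samples), window_size):
        window = list(samples[start:start + window_size])
        if len(window) < window_size:
            window.extend([pad_value] * (window_size - len(window)))
        windows.append(window)
    return windows


# 1200 samples -> two full windows plus one padded window.
windows = split_into_windows([0.1] * 1200)
print(len(windows), len(windows[-1]))  # 3 512
```

With padding, the number of probabilities matches `ceil(len(samples) / window_size)`, which keeps the animation length aligned with the audio duration.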
--- a/examples/cpp/silero-vad-onnx.cpp
+++ b/examples/cpp/silero-vad-onnx.cpp
@@ -1,211 +1,227 @@
+#ifndef _CRT_SECURE_NO_WARNINGS
+#define _CRT_SECURE_NO_WARNINGS
+#endif
+
 #include <iostream>
 #include <vector>
 #include <sstream>
 #include <cstring>
 #include <limits>
 #include <chrono>
+#include <iomanip>
 #include <memory>
 #include <string>
 #include <stdexcept>
-#include <iostream>
-#include <string>
-#include "onnxruntime_cxx_api.h"
-#include "wav.h"
 #include <cstdio>
 #include <cstdarg>
+#include <cmath>    // for std::rint
 #if __cplusplus < 201703L
 #include <memory>
 #endif

 //#define __DEBUG_SPEECH_PROB___

-class timestamp_t
-{
+#include "onnxruntime_cxx_api.h"
+#include "wav.h" // For reading WAV files
+
+// timestamp_t class: stores the start and end (in samples) of a speech segment.
+class timestamp_t {
 public:
    int start;
    int end;

-    // default + parameterized constructor
    timestamp_t(int start = -1, int end = -1)
-        : start(start), end(end)
-    {
-    };
+        : start(start), end(end) { }

-    // assignment operator modifies object, therefore non-const
-    timestamp_t& operator=(const timestamp_t& a)
-    {
+    timestamp_t& operator=(const timestamp_t& a) {
        start = a.start;
        end = a.end;
        return *this;
-    };
+    }

-    // equality comparison. doesn't modify object. therefore const.
-    bool operator==(const timestamp_t& a) const
-    {
+    bool operator==(const timestamp_t& a) const {
        return (start == a.start && end == a.end);
-    };
-    std::string c_str()
-    {
-        //return std::format("timestamp {:08d}, {:08d}", start, end);
-        return format("{start:%08d,end:%08d}", start, end);
-    };
+    }
+
+    // Returns a formatted string of the timestamp.
+    std::string c_str() const {
+        return format("{start:%08d, end:%08d}", start, end);
+    }
 private:
-
-    std::string format(const char* fmt, ...)
-    {
+    // Helper function for formatting.
+    std::string format(const char* fmt, ...) const {
        char buf[256];
-
        va_list args;
        va_start(args, fmt);
-        const auto r = std::vsnprintf(buf, sizeof buf, fmt, args);
+        const auto r = std::vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);
-
        if (r < 0)
-            // conversion failed
            return {};
-
        const size_t len = r;
-        if (len < sizeof buf)
-            // we fit in the buffer
-            return { buf, len };
-
+        if (len < sizeof(buf))
+            return std::string(buf, len);
 #if __cplusplus >= 201703L
-        // C++17: Create a string and write to its underlying array
        std::string s(len, '\0');
        va_start(args, fmt);
        std::vsnprintf(s.data(), len + 1, fmt, args);
        va_end(args);
-
        return s;
 #else
-        // C++11 or C++14: We need to allocate scratch memory
        auto vbuf = std::unique_ptr<char[]>(new char[len + 1]);
        va_start(args, fmt);
        std::vsnprintf(vbuf.get(), len + 1, fmt, args);
        va_end(args);
-
-        return { vbuf.get(), len };
+        return std::string(vbuf.get(), len);
 #endif
-    };
+    }
 };

-
-class VadIterator
-{
+// VadIterator class: uses ONNX Runtime to detect speech segments.
+class VadIterator {
 private:
-    // OnnxRuntime resources
+    // ONNX Runtime resources
    Ort::Env env;
    Ort::SessionOptions session_options;
    std::shared_ptr<Ort::Session> session = nullptr;
    Ort::AllocatorWithDefaultOptions allocator;
    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeCPU);

-private:
-    void init_engine_threads(int inter_threads, int intra_threads)
-    {
-        // The method should be called in each thread/proc in multi-thread/proc work
+    // ----- Context-related additions -----
+    const int context_samples = 64;  // For 16kHz, 64 samples are added as context.
+    std::vector<float> _context;     // Holds the last 64 samples from the previous chunk (initialized to zero).
+
+    // Original window size (e.g., 32ms corresponds to 512 samples)
+    int window_size_samples;
+    // Effective window size = window_size_samples + context_samples
+    int effective_window_size;
+
+    // Samples per millisecond (sample_rate / 1000), e.g. 16 for 16 kHz.
+    int sr_per_ms;
+
+    // ONNX Runtime input/output buffers
+    std::vector<Ort::Value> ort_inputs;
+    std::vector<const char*> input_node_names = { "input", "state", "sr" };
+    std::vector<float> input;
+    unsigned int size_state = 2 * 1 * 128;
+    std::vector<float> _state;
+    std::vector<int64_t> sr;
+    int64_t input_node_dims[2] = {};
+    const int64_t state_node_dims[3] = { 2, 1, 128 };
+    const int64_t sr_node_dims[1] = { 1 };
+    std::vector<Ort::Value> ort_outputs;
+    std::vector<const char*> output_node_names = { "output", "stateN" };
+
+    // Model configuration parameters
+    int sample_rate;
+    float threshold;
+    int min_silence_samples;
+    int min_silence_samples_at_max_speech;
+    int min_speech_samples;
+    float max_speech_samples;
+    int speech_pad_samples;
+    int audio_length_samples;
+
+    // State management
+    bool triggered = false;
+    unsigned int temp_end = 0;
+    unsigned int current_sample = 0;
+    int prev_end;
+    int next_start = 0;
+    std::vector<timestamp_t> speeches;
+    timestamp_t current_speech;
+
+    // Loads the ONNX model.
+    void init_onnx_model(const std::wstring& model_path) {
+        init_engine_threads(1, 1);
+        session = std::make_shared<Ort::Session>(env, model_path.c_str(), session_options);
+    }
+
+    // Initializes threading settings.
+    void init_engine_threads(int inter_threads, int intra_threads) {
        session_options.SetIntraOpNumThreads(intra_threads);
        session_options.SetInterOpNumThreads(inter_threads);
        session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
-    };
+    }

-    void init_onnx_model(const std::wstring& model_path)
-    {
-        // Init threads = 1 for 
-        init_engine_threads(1, 1);
-        // Load model
-        session = std::make_shared<Ort::Session>(env, model_path.c_str(), session_options);
-    };
-
-    void reset_states()
-    {
-        // Call reset before each audio start
-        std::memset(_state.data(), 0.0f, _state.size() * sizeof(float));
+    // Resets internal state (_state, _context, etc.)
+    void reset_states() {
+        std::memset(_state.data(), 0, _state.size() * sizeof(float));
        triggered = false;
        temp_end = 0;
        current_sample = 0;
-
        prev_end = next_start = 0;
-
        speeches.clear();
        current_speech = timestamp_t();
-    };
+        std::fill(_context.begin(), _context.end(), 0.0f);
+    }

-    void predict(const std::vector<float> &data)
-    {
-        // Infer
-        // Create ort tensors
-        input.assign(data.begin(), data.end());
+    // Inference: runs inference on one chunk of input data.
+    // data_chunk is expected to have window_size_samples samples.
+    void predict(const std::vector<float>& data_chunk) {
+        // Build new input: first context_samples from _context, followed by the current chunk (window_size_samples).
+        std::vector<float> new_data(effective_window_size, 0.0f);
+        std::copy(_context.begin(), _context.end(), new_data.begin());
+        std::copy(data_chunk.begin(), data_chunk.end(), new_data.begin() + context_samples);
+        input = new_data;
+
+        // Create input tensor (input_node_dims[1] is already set to effective_window_size).
        Ort::Value input_ort = Ort::Value::CreateTensor<float>(
            memory_info, input.data(), input.size(), input_node_dims, 2);
        Ort::Value state_ort = Ort::Value::CreateTensor<float>(
            memory_info, _state.data(), _state.size(), state_node_dims, 3);
        Ort::Value sr_ort = Ort::Value::CreateTensor<int64_t>(
            memory_info, sr.data(), sr.size(), sr_node_dims, 1);
-
-        // Clear and add inputs
        ort_inputs.clear();
        ort_inputs.emplace_back(std::move(input_ort));
        ort_inputs.emplace_back(std::move(state_ort));
        ort_inputs.emplace_back(std::move(sr_ort));

-        // Infer
+        // Run inference.
        ort_outputs = session->Run(
-            Ort::RunOptions{nullptr},
+            Ort::RunOptions{ nullptr },
            input_node_names.data(), ort_inputs.data(), ort_inputs.size(),
            output_node_names.data(), output_node_names.size());

-        // Output probability & update h,c recursively
        float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0];
-        float *stateN = ort_outputs[1].GetTensorMutableData<float>();
+        float* stateN = ort_outputs[1].GetTensorMutableData<float>();
        std::memcpy(_state.data(), stateN, size_state * sizeof(float));
+        current_sample += static_cast<unsigned int>(window_size_samples); // Advance by the original window size.

-        // Push forward sample index
-        current_sample += window_size_samples;
-
-        // Reset temp_end when > threshold 
-        if ((speech_prob >= threshold))
-        {
+        // If speech is detected (probability >= threshold)
+        if (speech_prob >= threshold) {
 #ifdef __DEBUG_SPEECH_PROB___
-            float speech = current_sample - window_size_samples; // minus window_size_samples to get precise start time point.
-            printf("{    start: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample- window_size_samples);
-#endif //__DEBUG_SPEECH_PROB___
-            if (temp_end != 0)
-            {
+            float speech = current_sample - window_size_samples;
+            printf("{ start: %.3f s (%.3f) %08d}\n", 1.0f * speech / sample_rate, speech_prob, current_sample - window_size_samples);
+#endif
+            if (temp_end != 0) {
                temp_end = 0;
                if (next_start < prev_end)
                    next_start = current_sample - window_size_samples;
            }
-            if (triggered == false)
-            {
+            if (!triggered) {
                triggered = true;
-
                current_speech.start = current_sample - window_size_samples;
            }
+            // Update context: copy the last context_samples from new_data.
+            std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
            return;
        }

-        if (
-            (triggered == true)
-            && ((current_sample - current_speech.start) > max_speech_samples)
-            ) {
+        // If the speech segment becomes too long.
+        if (triggered && ((current_sample - current_speech.start) > max_speech_samples)) {
            if (prev_end > 0) {
                current_speech.end = prev_end;
                speeches.push_back(current_speech);
                current_speech = timestamp_t();
-                
-                // previously reached silence(< neg_thres) and is still not speech(< thres)
                if (next_start < prev_end)
                    triggered = false;
-                else{
+                else
                    current_speech.start = next_start;
-                }
                prev_end = 0;
                next_start = 0;
                temp_end = 0;
-
            }
-            else{ 
+            else {
                current_speech.end = current_sample;
                speeches.push_back(current_speech);
                current_speech = timestamp_t();
@@ -214,53 +230,29 @@ private:
                temp_end = 0;
                triggered = false;
            }
+            std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
            return;
-
        }
-        if ((speech_prob >= (threshold - 0.15)) && (speech_prob < threshold))
-        {
+
+        if ((speech_prob >= (threshold - 0.15)) && (speech_prob < threshold)) {
+            // Probability is in the hysteresis band [threshold - 0.15, threshold): keep the current state and only update the context.
+            std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
+            return;
+        }
+
+        if (speech_prob < (threshold - 0.15)) {
+#ifdef __DEBUG_SPEECH_PROB___
+            float speech = current_sample - window_size_samples - speech_pad_samples;
+            printf("{ end: %.3f s (%.3f) %08d}\n", 1.0f * speech / sample_rate, speech_prob, current_sample - window_size_samples);
+#endif
            if (triggered) {
-#ifdef __DEBUG_SPEECH_PROB___
-                float speech = current_sample - window_size_samples; // minus window_size_samples to get precise start time point.
-                printf("{ speeking: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample - window_size_samples);
-#endif //__DEBUG_SPEECH_PROB___
-            }
-            else {
-#ifdef __DEBUG_SPEECH_PROB___
-                float speech = current_sample - window_size_samples; // minus window_size_samples to get precise start time point.
-                printf("{  silence: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample - window_size_samples);
-#endif //__DEBUG_SPEECH_PROB___
-            }
-            return;
-        }
-
-
-        // 4) End 
-        if ((speech_prob < (threshold - 0.15)))
-        {
-#ifdef __DEBUG_SPEECH_PROB___
-            float speech = current_sample - window_size_samples - speech_pad_samples; // minus window_size_samples to get precise start time point.
-            printf("{      end: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample - window_size_samples);
-#endif //__DEBUG_SPEECH_PROB___
-            if (triggered == true)
-            {
                if (temp_end == 0)
-                {
                    temp_end = current_sample;
-                }
                if (current_sample - temp_end > min_silence_samples_at_max_speech)
                    prev_end = temp_end;
-                // a. silence < min_slience_samples, continue speaking 
-                if ((current_sample - temp_end) < min_silence_samples)
-                {
-
-                }
-                // b. silence >= min_slience_samples, end speaking
-                else
-                {
+                if ((current_sample - temp_end) >= min_silence_samples) {
                    current_speech.end = temp_end;
-                    if (current_speech.end - current_speech.start > min_speech_samples)
-                    {
+                    if (current_speech.end - current_speech.start > min_speech_samples) {
                        speeches.push_back(current_speech);
                        current_speech = timestamp_t();
                        prev_end = 0;
@@ -270,27 +262,23 @@ private:
                    }
                }
            }
-            else {
-                // may first windows see end state.
-            }
+            std::copy(new_data.end() - context_samples, new_data.end(), _context.begin());
            return;
        }
-    };
+    }
+
 public:
-    void process(const std::vector<float>& input_wav)
-    {
+    // Process the entire audio input.
+    void process(const std::vector<float>& input_wav) {
        reset_states();
-
-        audio_length_samples = input_wav.size();
-
-        for (int j = 0; j < audio_length_samples; j += window_size_samples)
-        {
-            if (j + window_size_samples > audio_length_samples)
+        audio_length_samples = static_cast<int>(input_wav.size());
+        // Process audio in chunks of window_size_samples (e.g., 512 samples)
+        for (size_t j = 0; j < static_cast<size_t>(audio_length_samples); j += static_cast<size_t>(window_size_samples)) {
+            if (j + static_cast<size_t>(window_size_samples) > static_cast<size_t>(audio_length_samples))
                break;
-            std::vector<float> r{ &input_wav[0] + j, &input_wav[0] + j + window_size_samples };
-            predict(r);
+            std::vector<float> chunk(&input_wav[j], &input_wav[j] + window_size_samples);
+            predict(chunk);
        }
-
        if (current_speech.start >= 0) {
            current_speech.end = audio_length_samples;
            speeches.push_back(current_speech);
@@ -300,179 +288,80 @@ public:
            temp_end = 0;
            triggered = false;
        }
-    };
-
-    void process(const std::vector<float>& input_wav, std::vector<float>& output_wav)
-    {
-        process(input_wav);
-        collect_chunks(input_wav, output_wav);
    }

-    void collect_chunks(const std::vector<float>& input_wav, std::vector<float>& output_wav)
-    {
-        output_wav.clear();
-        for (int i = 0; i < speeches.size(); i++) {
-#ifdef __DEBUG_SPEECH_PROB___
-            std::cout << speeches[i].c_str() << std::endl;
-#endif //#ifdef __DEBUG_SPEECH_PROB___
-            std::vector<float> slice(&input_wav[speeches[i].start], &input_wav[speeches[i].end]);
-            output_wav.insert(output_wav.end(),slice.begin(),slice.end());
-        }
-    };
-
-    const std::vector<timestamp_t> get_speech_timestamps() const
-    {
+    // Returns the detected speech timestamps.
+    const std::vector<timestamp_t> get_speech_timestamps() const {
        return speeches;
    }

-    void drop_chunks(const std::vector<float>& input_wav, std::vector<float>& output_wav)
-    {
-        output_wav.clear();
-        int current_start = 0;
-        for (int i = 0; i < speeches.size(); i++) {
-
-            std::vector<float> slice(&input_wav[current_start],&input_wav[speeches[i].start]);
-            output_wav.insert(output_wav.end(), slice.begin(), slice.end());
-            current_start = speeches[i].end;
-        }
-
-        std::vector<float> slice(&input_wav[current_start], &input_wav[input_wav.size()]);
-        output_wav.insert(output_wav.end(), slice.begin(), slice.end());
-    };
-
-private:
-    // model config
-    int64_t window_size_samples;  // Assign when init, support 256 512 768 for 8k; 512 1024 1536 for 16k.
-    int sample_rate;  //Assign when init support 16000 or 8000      
-    int sr_per_ms;   // Assign when init, support 8 or 16
-    float threshold; 
-    int min_silence_samples; // sr_per_ms * #ms
-    int min_silence_samples_at_max_speech; // sr_per_ms * #98
-    int min_speech_samples; // sr_per_ms * #ms
-    float max_speech_samples;
-    int speech_pad_samples; // usually a 
-    int audio_length_samples;
-
-    // model states
-    bool triggered = false;
-    unsigned int temp_end = 0;
-    unsigned int current_sample = 0;    
-    // MAX 4294967295 samples / 8sample per ms / 1000 / 60 = 8947 minutes  
-    int prev_end;
-    int next_start = 0;
-
-    //Output timestamp
-    std::vector<timestamp_t> speeches;
-    timestamp_t current_speech;
-
-
-    // Onnx model
-    // Inputs
-    std::vector<Ort::Value> ort_inputs;
-    
-    std::vector<const char *> input_node_names = {"input", "state", "sr"};
-    std::vector<float> input;
-    unsigned int size_state = 2 * 1 * 128; // It's FIXED.
-    std::vector<float> _state;
-    std::vector<int64_t> sr;
-
-    int64_t input_node_dims[2] = {};
-    const int64_t state_node_dims[3] = {2, 1, 128}; 
-    const int64_t sr_node_dims[1] = {1};
-
-    // Outputs
-    std::vector<Ort::Value> ort_outputs;
-    std::vector<const char *> output_node_names = {"output", "stateN"};
+    // Public method to reset the internal state.
+    void reset() {
+        reset_states();
+    }

 public:
-    // Construction
+    // Constructor: sets model path, sample rate, window size (ms), and other parameters.
+    // The parameters are set to match the Python version.
    VadIterator(const std::wstring ModelPath,
        int Sample_rate = 16000, int windows_frame_size = 32,
-        float Threshold = 0.5, int min_silence_duration_ms = 0,
-        int speech_pad_ms = 32, int min_speech_duration_ms = 32,
+        float Threshold = 0.5, int min_silence_duration_ms = 100,
+        int speech_pad_ms = 30, int min_speech_duration_ms = 250,
        float max_speech_duration_s = std::numeric_limits<float>::infinity())
+        : sample_rate(Sample_rate), threshold(Threshold), speech_pad_samples(speech_pad_ms * (Sample_rate / 1000)), prev_end(0) // speech_pad: ms -> samples
    {
-        init_onnx_model(ModelPath);
-        threshold = Threshold;
-        sample_rate = Sample_rate;
-        sr_per_ms = sample_rate / 1000;
-
-        window_size_samples = windows_frame_size * sr_per_ms;
-
-        min_speech_samples = sr_per_ms * min_speech_duration_ms;
-        speech_pad_samples = sr_per_ms * speech_pad_ms;
-
-        max_speech_samples = (
-            sample_rate * max_speech_duration_s
-            - window_size_samples
-            - 2 * speech_pad_samples
-            );
-
-        min_silence_samples = sr_per_ms * min_silence_duration_ms;
-        min_silence_samples_at_max_speech = sr_per_ms * 98;
-
-        input.resize(window_size_samples);
+        sr_per_ms = sample_rate / 1000;  // e.g., 16000 / 1000 = 16
+        window_size_samples = windows_frame_size * sr_per_ms; // e.g., 32ms * 16 = 512 samples
+        effective_window_size = window_size_samples + context_samples; // e.g., 512 + 64 = 576 samples
        input_node_dims[0] = 1;
-        input_node_dims[1] = window_size_samples;
-
+        input_node_dims[1] = effective_window_size;
        _state.resize(size_state);
        sr.resize(1);
        sr[0] = sample_rate;
-    };
+        _context.assign(context_samples, 0.0f);
+        min_speech_samples = sr_per_ms * min_speech_duration_ms;
+        max_speech_samples = (sample_rate * max_speech_duration_s - window_size_samples - 2 * speech_pad_samples);
+        min_silence_samples = sr_per_ms * min_silence_duration_ms;
+        min_silence_samples_at_max_speech = sr_per_ms * 98;
+        init_onnx_model(ModelPath);
+    }
 };

-int main()
-{
-    std::vector<timestamp_t> stamps;
-
-    // Read wav
-    wav::WavReader wav_reader("recorder.wav"); //16000,1,32float
-    std::vector<float> input_wav(wav_reader.num_samples());
-    std::vector<float> output_wav;
-
-    for (int i = 0; i < wav_reader.num_samples(); i++)
-    {
+int main() {
+    // Read the WAV file (expects 16000 Hz, mono, PCM).
+    wav::WavReader wav_reader("audio/recorder.wav"); // File located in the "audio" folder.
+    int numSamples = wav_reader.num_samples();
+    std::vector<float> input_wav(static_cast<size_t>(numSamples));
+    for (size_t i = 0; i < static_cast<size_t>(numSamples); i++) {
        input_wav[i] = static_cast<float>(*(wav_reader.data() + i));
    }

+    // Set the ONNX model path (file located in the "model" folder).
+    std::wstring model_path = L"model/silero_vad.onnx";

+    // Initialize the VadIterator.
+    VadIterator vad(model_path);

-    // ===== Test configs =====
-    std::wstring path = L"silero_vad.onnx";
-    VadIterator vad(path);
-
-    // ==============================================
-    // ==== = Example 1 of full function  ===== 
-    // ==============================================
+    // Process the audio.
    vad.process(input_wav);

-    // 1.a get_speech_timestamps
-    stamps = vad.get_speech_timestamps();
-    for (int i = 0; i < stamps.size(); i++) {
+    // Retrieve the speech timestamps (in samples).
+    std::vector<timestamp_t> stamps = vad.get_speech_timestamps();

-        std::cout << stamps[i].c_str() << std::endl;
+    // Convert timestamps to seconds and round to one decimal place (for 16000 Hz).
+    const float sample_rate_float = 16000.0f;
+    for (size_t i = 0; i < stamps.size(); i++) {
+        float start_sec = std::rint((stamps[i].start / sample_rate_float) * 10.0f) / 10.0f;
+        float end_sec = std::rint((stamps[i].end / sample_rate_float) * 10.0f) / 10.0f;
+        std::cout << "Speech detected from "
+            << std::fixed << std::setprecision(1) << start_sec
+            << " s to "
+            << std::fixed << std::setprecision(1) << end_sec
+            << " s" << std::endl;
    }

-    // 1.b collect_chunks output wav
-    vad.collect_chunks(input_wav, output_wav);
+    // Optionally, reset the internal state.
+    vad.reset();

-    // 1.c drop_chunks output wav
-    vad.drop_chunks(input_wav, output_wav);
-
-    // ==============================================
-    // ===== Example 2 of simple full function  =====
-    // ==============================================
-    vad.process(input_wav, output_wav);
-
-    stamps = vad.get_speech_timestamps();
-    for (int i = 0; i < stamps.size(); i++) {
-
-        std::cout << stamps[i].c_str() << std::endl;
-    }
-
-    // ==============================================
-    // ===== Example 3 of full function  =====
-    // ==============================================
-    for(int i = 0; i<2; i++)
-        vad.process(input_wav, output_wav);
+    return 0;
 }
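Review note: the constructor derives every sample count from a duration in milliseconds and `sr_per_ms`. The sketch below restates that arithmetic in isolation so the numbers in the comments (16, 512, 576) can be checked. `VadSampleCounts` and `to_sample_counts` are hypothetical names introduced here, and `context_samples` is passed in as a parameter because its definition is not visible in this hunk (the comment in the diff suggests 64 at 16 kHz).

```cpp
#include <cassert>

// Hypothetical container for the derived sample counts (names are illustrative).
struct VadSampleCounts {
    int sr_per_ms;
    int window_size_samples;
    int effective_window_size;
    int min_speech_samples;
    int min_silence_samples;
    int speech_pad_samples;
};

// Converts millisecond parameters into sample counts for a given sample rate.
// Assumption: context_samples is supplied by the caller.
inline VadSampleCounts to_sample_counts(int sample_rate, int window_ms, int context_samples,
                                        int min_speech_ms, int min_silence_ms,
                                        int speech_pad_ms) {
    assert(sample_rate > 0 && sample_rate % 1000 == 0);  // keeps sr_per_ms exact
    VadSampleCounts c{};
    c.sr_per_ms = sample_rate / 1000;
    c.window_size_samples = window_ms * c.sr_per_ms;
    c.effective_window_size = c.window_size_samples + context_samples;
    c.min_speech_samples = min_speech_ms * c.sr_per_ms;
    c.min_silence_samples = min_silence_ms * c.sr_per_ms;
    c.speech_pad_samples = speech_pad_ms * c.sr_per_ms;
    return c;
}
```

With the defaults from this diff, `to_sample_counts(16000, 32, 64, 250, 100, 30)` gives a 512-sample window, a 576-sample model input, and a 480-sample speech pad.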
--- a/examples/cpp/wav.h
+++ b/examples/cpp/wav.h
@@ -12,10 +12,10 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.

-
 #ifndef FRONTEND_WAV_H_
 #define FRONTEND_WAV_H_

+
 #include <assert.h>
 #include <stdint.h>
 #include <stdio.h>
@@ -24,6 +24,8 @@

 #include <string>

+#include <iostream>
+
 // #include "utils/log.h"

 namespace wav {
@@ -230,6 +232,6 @@ class WavWriter {
  int bits_per_sample_;
 };

-}  // namespace wenet
+}  // namespace wav

 #endif  // FRONTEND_WAV_H_
--- a/examples/cpp_libtorch/README.md
+++ b/examples/cpp_libtorch/README.md
@@ -0,0 +1,45 @@
+# Silero-VAD V5 in C++ (based on LibTorch)
+
+This is a C++ implementation of Silero-VAD V5 built on LibTorch. The primary code path runs on the CPU; compare its output against the Python version before relying on it. Only 16 kHz audio has been tested.
+
+Batch and CUDA inference options are also available. With batch inference, the speech probabilities may differ slightly from those of the default (sequential) mode, likely because batched chunks do not reuse the cached state from previous chunks. In exchange, batch inference is significantly faster. If you use it, consider adjusting the threshold.
+
+## Requirements
+
+- GCC 11.4.0 (GCC >= 5.1)
+- LibTorch 1.13.0 (other versions are also acceptable)
+
+## Download LibTorch
+
+```bash
+# CPU version
+wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-1.13.0%2Bcpu.zip
+unzip libtorch-shared-with-deps-1.13.0+cpu.zip
+
+# CUDA version
+wget https://download.pytorch.org/libtorch/cu116/libtorch-shared-with-deps-1.13.0%2Bcu116.zip
+unzip libtorch-shared-with-deps-1.13.0+cu116.zip
+```
+
+## Compilation
+
+```bash
+# CPU version
+g++ main.cc silero_torch.cc -I ./libtorch/include/ -I ./libtorch/include/torch/csrc/api/include -L ./libtorch/lib/ -ltorch -ltorch_cpu -lc10 -Wl,-rpath,./libtorch/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0
+
+# CUDA version
+g++ main.cc silero_torch.cc -I ./libtorch/include/ -I ./libtorch/include/torch/csrc/api/include -L ./libtorch/lib/ -ltorch -ltorch_cuda -ltorch_cpu -lc10 -Wl,-rpath,./libtorch/lib/ -o silero -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=0 -DUSE_GPU
+```
+
+
+## Optional Compilation Flags
+- `-DUSE_BATCH`: enable batch inference
+- `-DUSE_GPU`: use the GPU for inference
+
+## Run the Program
+To run the program, use the following command:
+
+`./silero aepyx.wav 16000 0.5`
+
+The sample file `aepyx.wav` is part of the VoxConverse dataset.
+File details: `aepyx.wav` is a 16 kHz, 16-bit audio file.
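Review note: the README attributes the batch/sequential difference to cached state. The toy below illustrates that effect only. `ToyStatefulModel` is hypothetical (an exponential moving average), not the Silero model, and `run_sequential` / `run_batched` are illustrative names. It shows how outputs diverge when each chunk starts from a fresh state instead of the state left by the previous chunk.

```cpp
#include <vector>

// Toy stand-in for a stateful model: the state carries over between calls.
struct ToyStatefulModel {
    float state = 0.0f;
    float step(float x) {
        state = 0.5f * state + 0.5f * x;  // exponential moving average
        return state;
    }
};

// Sequential mode: one model instance sees the chunks in order.
inline std::vector<float> run_sequential(const std::vector<float>& xs) {
    ToyStatefulModel model;
    std::vector<float> out;
    for (float x : xs) out.push_back(model.step(x));
    return out;
}

// Batch-like mode: every chunk is evaluated from a fresh state.
inline std::vector<float> run_batched(const std::vector<float>& xs) {
    std::vector<float> out;
    for (float x : xs) {
        ToyStatefulModel fresh;
        out.push_back(fresh.step(x));
    }
    return out;
}
```

For the input `{1, 1, 1}` the sequential outputs climb (0.5, 0.75, 0.875) while the batch-like outputs stay at 0.5, which is one reason a different threshold may suit batch mode.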
--- a/examples/cpp_libtorch/aepyx.wav
+++ b/examples/cpp_libtorch/aepyx.wav
--- a/examples/cpp_libtorch/main.cc
+++ b/examples/cpp_libtorch/main.cc
@@ -0,0 +1,54 @@
+#include <iostream>
+#include "silero_torch.h"
+#include "wav.h"
+
+int main(int argc, char* argv[]) {
+
+	if(argc != 4){
+		std::cerr<<"Usage : "<<argv[0]<<" <wav.path> <SampleRate> <Threshold>"<<std::endl;
+		std::cerr<<"Example: "<<argv[0]<<" sample.wav 16000 0.5"<<std::endl;
+		return 1;
+	}
+
+	std::string wav_path = argv[1];
+	int sample_rate = std::stoi(argv[2]);
+	float threshold = std::stof(argv[3]);
+
+
+	//Load Model
+	std::string model_path = "../../src/silero_vad/data/silero_vad.jit";
+	silero::VadIterator vad(model_path);
+
+	vad.threshold=threshold;	//(Default:0.5)
+	vad.sample_rate=sample_rate;	//16000Hz,8000Hz. (Default:16000)
+	vad.print_as_samples=true;	//If true, timestamps are printed in samples; otherwise in seconds.
+					//(Default:false)
+
+	vad.SetVariables();
+
+	// Read wav
+	wav::WavReader wav_reader(wav_path); 
+	std::vector<float> input_wav(wav_reader.num_samples());
+
+	for (int i = 0; i < wav_reader.num_samples(); i++)
+	{
+		input_wav[i] = static_cast<float>(*(wav_reader.data() + i));
+	}
+
+	vad.SpeechProbs(input_wav);
+
+	std::vector<silero::SpeechSegment> speeches = vad.GetSpeechTimestamps();
+	for(const auto& speech : speeches){
+		if(vad.print_as_samples){
+			std::cout<<"{'start': "<<static_cast<int>(speech.start)<<", 'end': "<<static_cast<int>(speech.end)<<"}"<<std::endl;
+		}
+		else{
+			std::cout<<"{'start': "<<speech.start<<", 'end': "<<speech.end<<"}"<<std::endl;
+		}
+	}	
+
+
+	return 0;
+	}
+
+
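Review note: `main.cc` converts its arguments with `std::stof` and no validation, so a malformed value ends in an uncaught exception. A minimal sketch of a stricter parser follows. `CliArgs` and `parse_args` are hypothetical names, the accepted sample rates (8000 and 16000) come from the comment in `main.cc`, and the `[0, 1]` threshold range is an assumption.

```cpp
#include <exception>
#include <string>

// Hypothetical holder for the three command-line values.
struct CliArgs {
    std::string wav_path;
    int sample_rate = 16000;
    float threshold = 0.5f;
};

// Returns true and fills `out` only when all three arguments are present and valid.
inline bool parse_args(int argc, const char* const argv[], CliArgs& out) {
    if (argc != 4) return false;
    try {
        CliArgs parsed;
        parsed.wav_path = argv[1];
        parsed.sample_rate = std::stoi(argv[2]);  // throws on non-numeric input
        parsed.threshold = std::stof(argv[3]);
        // Sample rates listed in main.cc's comment.
        if (parsed.sample_rate != 8000 && parsed.sample_rate != 16000) return false;
        // Assumption: the threshold is a probability.
        if (parsed.threshold < 0.0f || parsed.threshold > 1.0f) return false;
        out = parsed;
        return true;
    } catch (const std::exception&) {
        return false;  // std::invalid_argument or std::out_of_range
    }
}
```

A caller would print the usage text and return 1 whenever `parse_args` returns false, which keeps the existing behavior for a wrong argument count.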
--- a/examples/cpp_libtorch/silero
+++ b/examples/cpp_libtorch/silero
--- a/examples/cpp_libtorch/silero_torch.cc
+++ b/examples/cpp_libtorch/silero_torch.cc
@@ -0,0 +1,285 @@
+//Author      : Nathan Lee
+//Created On  : 2024-11-18
+//Description : silero 5.1 system for torch-script(c++).
+//Version     : 1.0
+
+
+#include "silero_torch.h"
+
+namespace silero {
+
+	VadIterator::VadIterator(const std::string &model_path, float threshold, int sample_rate, int window_size_ms, int speech_pad_ms, int min_silence_duration_ms, int min_speech_duration_ms, int max_duration_merge_ms, bool print_as_samples)
+		:sample_rate(sample_rate), threshold(threshold), window_size_ms(window_size_ms), speech_pad_ms(speech_pad_ms), min_silence_duration_ms(min_silence_duration_ms), min_speech_duration_ms(min_speech_duration_ms), max_duration_merge_ms(max_duration_merge_ms), print_as_samples(print_as_samples)
+	{
+		init_torch_model(model_path);
+		//init_engine(window_size_ms);
+	}
+	VadIterator::~VadIterator(){
+	}
+
+
+	void VadIterator::SpeechProbs(std::vector<float>& input_wav){
+		// Split the waveform into chunks of window_size_samples (512 samples at 16 kHz with a 32 ms window)
+		// and run the model on each chunk; a trailing partial chunk is zero-padded.
+		int num_samples = input_wav.size();
+		int num_chunks = num_samples / window_size_samples;
+		int remainder_samples = num_samples % window_size_samples;
+
+		total_sample_size += num_samples;
+
+		torch::Tensor output;
+		std::vector<torch::Tensor> chunks;
+
+		for (int i = 0; i < num_chunks; i++) {
+
+			float* chunk_start = input_wav.data() + i *window_size_samples;
+			torch::Tensor chunk = torch::from_blob(chunk_start, {1,window_size_samples}, torch::kFloat32);
+			//std::cout<<"chunk size : "<<chunk.sizes()<<std::endl;
+			chunks.push_back(chunk);
+
+
+			if(i==num_chunks-1 && remainder_samples>0){// last chunk && a remainder exists
+				int remaining_samples = num_samples - num_chunks * window_size_samples;
+				//std::cout<<"Remainder size : "<<remaining_samples;
+				float* chunk_start_remainder = input_wav.data() + num_chunks *window_size_samples;
+
+				torch::Tensor remainder_chunk = torch::from_blob(chunk_start_remainder, {1,remaining_samples},
+						torch::kFloat32);
+				// Pad the remainder chunk to match window_size_samples
+				torch::Tensor padded_chunk = torch::cat({remainder_chunk, torch::zeros({1, window_size_samples
+							- remaining_samples}, torch::kFloat32)}, 1);
+				//std::cout<<", padded_chunk size : "<<padded_chunk.size(1)<<std::endl;
+
+				chunks.push_back(padded_chunk);
+			}
+		}
+
+		if (!chunks.empty()) {
+
+#ifdef USE_BATCH
+			torch::Tensor batched_chunks = torch::stack(chunks);  // Stack all chunks into a single tensor
+			//batched_chunks = batched_chunks.squeeze(1);
+			batched_chunks = torch::cat({batched_chunks.squeeze(1)});
+
+#ifdef USE_GPU
+			batched_chunks = batched_chunks.to(at::kCUDA);        // Move the entire batch to GPU once
+#endif
+			// Prepare input for model
+			std::vector<torch::jit::IValue> inputs;
+			inputs.push_back(batched_chunks);  // Batch of chunks
+			inputs.push_back(sample_rate);     // Assuming sample_rate is a valid input for the model
+
+			// Run inference on the batch
+			torch::NoGradGuard no_grad;
+			torch::Tensor output = model.forward(inputs).toTensor();
+#ifdef USE_GPU
+			output = output.to(at::kCPU);      // Move the output back to CPU once
+#endif
+			// Collect output probabilities
+			for (int i = 0; i < chunks.size(); i++) {
+				float output_f = output[i].item<float>();
+				outputs_prob.push_back(output_f);
+				//std::cout << "Chunk " << i << " prob: " << output_f<< "\n";
+			}
+#else
+
+			std::vector<torch::Tensor> outputs;
+			torch::Tensor batched_chunks = torch::stack(chunks);
+#ifdef USE_GPU
+			batched_chunks = batched_chunks.to(at::kCUDA);
+#endif
+			for (int i = 0; i < chunks.size(); i++) {
+				torch::NoGradGuard no_grad;
+				std::vector<torch::jit::IValue> inputs;
+				inputs.push_back(batched_chunks[i]);
+				inputs.push_back(sample_rate);
+
+				torch::Tensor output = model.forward(inputs).toTensor();
+				outputs.push_back(output);
+			}
+			torch::Tensor all_outputs = torch::stack(outputs);
+#ifdef USE_GPU
+			all_outputs = all_outputs.to(at::kCPU);
+#endif
+			for (int i = 0; i < chunks.size(); i++) {
+				float output_f = all_outputs[i].item<float>();
+				outputs_prob.push_back(output_f);
+			}
+
+
+
+#endif
+
+		}
+
+
+	}
+
+
+	std::vector<SpeechSegment> VadIterator::GetSpeechTimestamps() {
+		std::vector<SpeechSegment> speeches = DoVad();
+
+#ifdef USE_BATCH
+		//With batch inference, use the 'mergeSpeeches' function to tidy up the timestamps.
+		//Batch probabilities are slightly distorted, so merging nearby segments gives more reasonable output.
+		duration_merge_samples = sample_rate * max_duration_merge_ms / 1000;
+		std::vector<SpeechSegment> speeches_merge = mergeSpeeches(speeches, duration_merge_samples);
+		if(!print_as_samples){
+			for (auto& speech : speeches_merge) { //samples to second
+				speech.start /= sample_rate;
+				speech.end /= sample_rate;
+			}
+		}
+
+		return speeches_merge;
+#else
+
+		if(!print_as_samples){
+			for (auto& speech : speeches) { //samples to second
+				speech.start /= sample_rate;
+				speech.end /= sample_rate;
+			}
+		}
+
+		return speeches;
+
+#endif
+
+	}
+	void VadIterator::SetVariables(){
+		init_engine(window_size_ms);
+	}
+
+	void VadIterator::init_engine(int window_size_ms) {
+		min_silence_samples = sample_rate * min_silence_duration_ms / 1000;
+		speech_pad_samples = sample_rate * speech_pad_ms / 1000;
+		window_size_samples = sample_rate / 1000 * window_size_ms;
+		min_speech_samples = sample_rate * min_speech_duration_ms / 1000;
+	}
+
+	void VadIterator::init_torch_model(const std::string& model_path) {
+		at::set_num_threads(1);
+		model = torch::jit::load(model_path);
+
+#ifdef USE_GPU
+		if (!torch::cuda::is_available()) {
+			std::cout<<"CUDA is not available! Please check your GPU settings"<<std::endl;
+			throw std::runtime_error("CUDA is not available!");
+			// No CPU fallback here: the exception above aborts initialization.
+
+		} else {
+			std::cout<<"CUDA available! Running on '0'th GPU"<<std::endl;
+			model.to(at::Device(at::kCUDA, 0));        //select 0'th machine 
+		}
+#endif
+
+
+		model.eval();
+		torch::NoGradGuard no_grad;
+		std::cout << "Model loaded successfully"<<std::endl;
+	}
+
+	void VadIterator::reset_states() {
+		triggered = false;
+		current_sample = 0;
+		temp_end = 0;
+		outputs_prob.clear();
+		model.run_method("reset_states");
+		total_sample_size = 0;
+	}
+
+	std::vector<SpeechSegment> VadIterator::DoVad() {
+		std::vector<SpeechSegment> speeches;
+
+		for (size_t i = 0; i < outputs_prob.size(); ++i) {
+			float speech_prob = outputs_prob[i];
+			//std::cout << speech_prob << std::endl;
+			//std::cout << "Chunk " << i << " Prob: " << speech_prob << "\n";
+			//std::cout << speech_prob << " ";
+			current_sample += window_size_samples;
+
+			if (speech_prob >= threshold && temp_end != 0) {
+				temp_end = 0;
+			}
+
+			if (speech_prob >= threshold && !triggered) {
+				triggered = true;
+				SpeechSegment segment;
+				segment.start = std::max(static_cast<int>(0), current_sample - speech_pad_samples - window_size_samples);
+				speeches.push_back(segment);
+				continue;
+			}
+
+			if (speech_prob < threshold - 0.15f && triggered) {
+				if (temp_end == 0) {
+					temp_end = current_sample;
+				}
+
+				if (current_sample - temp_end < min_silence_samples) {
+					continue;
+				} else {
+					SpeechSegment& segment = speeches.back();
+					segment.end = temp_end + speech_pad_samples - window_size_samples;
+					temp_end = 0;
+					triggered = false;
+				}
+			}
+		}
+
+		if (triggered) { // Open question: if the probabilities stay low and only the last frame's prob is high, triggered = true is set above and the segment start is created in the same step, which might be a problem (start == end?). The post-processing below may make this harmless.
+			std::cout<<"Speech still active at end of input; closing the last segment"<<std::endl;
+			SpeechSegment& segment = speeches.back();
+			segment.end = total_sample_size;  // Close the last segment at the final sample
+			triggered = false;  // Reset the VAD state
+		}
+
+		speeches.erase(
+                		std::remove_if(
+                        speeches.begin(),
+                        speeches.end(),
+                        [this](const SpeechSegment& speech) {
+                        return ((speech.end - this->speech_pad_samples) - (speech.start + this->speech_pad_samples) < min_speech_samples);
+			//With the defaults, min_speech_samples is 4000 samples (0.25 s at 16 kHz).
+			//Key point: the length is measured after adjusting the start and end samples by 'speech_pad_samples', i.e. with the padding removed.
+                        }
+                ),
+                speeches.end()
+              );
+
+
+		//std::cout<<std::endl;
+		//std::cout<<"outputs_prob.size : "<<outputs_prob.size()<<std::endl;
+
+		reset_states();
+		return speeches;
+	}
+
+	std::vector<SpeechSegment> VadIterator::mergeSpeeches(const std::vector<SpeechSegment>& speeches, int duration_merge_samples) {
+		std::vector<SpeechSegment> mergedSpeeches;
+
+		if (speeches.empty()) {
+			return mergedSpeeches; // Return an empty vector
+		}
+
+		// Initialize with the first segment
+		SpeechSegment currentSegment = speeches[0];
+
+		for (size_t i = 1; i < speeches.size(); ++i) {	// The first segment is already in currentSegment, so start from i=1
+			// If the gap between the two segments is smaller than the threshold (duration_merge_samples), merge them
+			if (speeches[i].start - currentSegment.end < duration_merge_samples) {
+				// Extend the end of the current segment
+				currentSegment.end = speeches[i].end;
+			} else {
+				// If the gap is at least the threshold (duration_merge_samples), store the current segment and start a new one
+				mergedSpeeches.push_back(currentSegment);
+				currentSegment = speeches[i];
+			}
+		}
+
+		// Append the last segment
+		mergedSpeeches.push_back(currentSegment);
+
+		return mergedSpeeches;
+	}
+
+	}
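Review note: `SpeechProbs` splits the waveform into fixed-size windows and zero-pads the tail. The sketch below shows the same idea without LibTorch so it can be tested on its own. `split_into_windows` is a hypothetical helper, and it copies samples into new vectors, whereas the code above wraps the input buffer with `torch::from_blob`. It also covers input shorter than one window.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Splits `wav` into windows of `window` samples.
// A trailing partial window is zero-padded to the full window length.
inline std::vector<std::vector<float>> split_into_windows(const std::vector<float>& wav,
                                                          std::size_t window) {
    std::vector<std::vector<float>> chunks;
    if (window == 0) return chunks;  // nothing sensible to do
    for (std::size_t start = 0; start < wav.size(); start += window) {
        const std::size_t len = std::min(window, wav.size() - start);
        std::vector<float> chunk(window, 0.0f);  // pre-filled with the padding value
        const auto first = wav.begin() + static_cast<std::ptrdiff_t>(start);
        std::copy(first, first + static_cast<std::ptrdiff_t>(len), chunk.begin());
        chunks.push_back(std::move(chunk));
    }
    return chunks;
}
```

For example, five samples with a window of four produce two chunks, the second holding one real sample followed by three zeros.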
--- a/examples/cpp_libtorch/silero_torch.h
+++ b/examples/cpp_libtorch/silero_torch.h
@@ -0,0 +1,75 @@
+//Author      : Nathan Lee
+//Created On  : 2024-11-18
+//Description : silero 5.1 system for torch-script(c++).
+//Version     : 1.0
+
+#ifndef SILERO_TORCH_H
+#define SILERO_TORCH_H
+
+#include <string>
+#include <memory>
+#include <stdexcept>
+#include <iostream>
+#include <algorithm>
+#include <vector>
+#include <fstream>
+#include <chrono>
+
+#include <torch/torch.h>
+#include <torch/script.h>
+
+
+namespace silero{
+
+	struct SpeechSegment{
+		int start;
+		int end;
+	};
+
+	class VadIterator{
+		public:
+
+			VadIterator(const std::string &model_path, float threshold = 0.5, int sample_rate = 16000, 
+				int window_size_ms = 32, int speech_pad_ms = 30, int min_silence_duration_ms = 100, 
+				int min_speech_duration_ms = 250, int max_duration_merge_ms = 300, bool print_as_samples = false);
+			~VadIterator(); 
+
+
+			void SpeechProbs(std::vector<float>& input_wav);
+			std::vector<silero::SpeechSegment> GetSpeechTimestamps();
+			void SetVariables();
+
+			float threshold;
+			int sample_rate;
+			int window_size_ms;
+			int min_speech_duration_ms;
+			int max_duration_merge_ms;
+			bool print_as_samples;
+
+		private:
+			torch::jit::script::Module model;
+			std::vector<float> outputs_prob;
+			int min_silence_samples;
+			int min_speech_samples;
+			int speech_pad_samples;
+			int window_size_samples;
+			int duration_merge_samples;
+			int current_sample = 0;
+
+			int total_sample_size=0;
+
+			int min_silence_duration_ms;
+			int speech_pad_ms;
+			bool triggered = false;
+			int temp_end = 0;
+
+			void init_engine(int window_size_ms);
+			void init_torch_model(const std::string& model_path);
+			void reset_states();
+			std::vector<SpeechSegment> DoVad();
+			std::vector<SpeechSegment> mergeSpeeches(const std::vector<SpeechSegment>& speeches, int duration_merge_samples);
+
+	};
+
+}
+#endif // SILERO_TORCH_H
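Review note: the post-processing declared in this header (`DoVad`, `mergeSpeeches`) is a two-threshold state machine followed by a gap merge. The sketch below is a simplified version for reading alongside the real code. `Segment`, `detect`, and `merge_close` are hypothetical names, speech padding and the minimum-speech filter are left out, and silence is measured from the start of the first quiet window, which differs slightly from `DoVad`.

```cpp
#include <vector>

struct Segment {
    int start;  // sample offset, inclusive
    int end;    // sample offset, exclusive
};

// Two-threshold (hysteresis) detection over per-window speech probabilities.
// A segment opens at p >= on_threshold and closes once p < off_threshold
// has lasted for at least min_silence_samples.
inline std::vector<Segment> detect(const std::vector<float>& probs, int window,
                                   float on_threshold, float off_threshold,
                                   int min_silence_samples) {
    std::vector<Segment> out;
    bool triggered = false;
    int silence_start = -1;  // -1 means no pending silence
    int pos = 0;             // sample offset just past the current window
    for (float p : probs) {
        pos += window;
        if (p >= on_threshold) {
            silence_start = -1;  // speech resumed, cancel the pending end
            if (!triggered) {
                triggered = true;
                out.push_back({pos - window, -1});
            }
        } else if (p < off_threshold && triggered) {
            if (silence_start < 0) silence_start = pos - window;
            if (pos - silence_start >= min_silence_samples) {
                out.back().end = silence_start;
                triggered = false;
                silence_start = -1;
            }
        }
    }
    if (triggered) out.back().end = pos;  // input ended during speech
    return out;
}

// Merges neighbouring segments whose gap is smaller than max_gap samples.
inline std::vector<Segment> merge_close(const std::vector<Segment>& segs, int max_gap) {
    std::vector<Segment> merged;
    for (const Segment& s : segs) {
        if (!merged.empty() && s.start - merged.back().end < max_gap) {
            merged.back().end = s.end;
        } else {
            merged.push_back(s);
        }
    }
    return merged;
}
```

Probabilities between the two thresholds leave the state unchanged, which is what keeps a segment from flickering on and off around a single cut-off.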
--- a/examples/cpp_libtorch/wav.h
+++ b/examples/cpp_libtorch/wav.h
@@ -0,0 +1,235 @@
+// Copyright (c) 2016 Personal (Binbin Zhang)
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+
+#ifndef FRONTEND_WAV_H_
+#define FRONTEND_WAV_H_
+
+#include <assert.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <string>
+
+#include <iostream>  // std::cout is used in WavReader::Open
+
+namespace wav {
+
+struct WavHeader {
+  char riff[4];  // "RIFF"
+  unsigned int size;
+  char wav[4];  // "WAVE"
+  char fmt[4];  // "fmt "
+  unsigned int fmt_size;
+  uint16_t format;
+  uint16_t channels;
+  unsigned int sample_rate;
+  unsigned int bytes_per_second;
+  uint16_t block_size;
+  uint16_t bit;
+  char data[4];  // "data"
+  unsigned int data_size;
+};
+
+class WavReader {
+ public:
+  WavReader() : data_(nullptr) {}
+  explicit WavReader(const std::string& filename) { Open(filename); }
+
+  bool Open(const std::string& filename) {
+    FILE* fp = fopen(filename.c_str(), "rb"); // open the file for reading (binary mode)
+    if (NULL == fp) {
+      std::cout << "Error in read " << filename;
+      return false;
+    }
+
+    WavHeader header;
+    fread(&header, 1, sizeof(header), fp);
+    if (header.fmt_size < 16) {
+      printf("WaveData: expect PCM format data "
+              "to have fmt chunk of at least size 16.\n");
+      return false;
+    } else if (header.fmt_size > 16) {
+      int offset = 44 - 8 + header.fmt_size - 16;
+      fseek(fp, offset, SEEK_SET);
+      fread(header.data, 8, sizeof(char), fp);
+    }
+    // check "riff" "WAVE" "fmt " "data"
+
+    // Skip any sub-chunks between "fmt" and "data".  Usually there will
+    // be a single "fact" sub chunk, but on Windows there can also be a
+    // "list" sub chunk.
+    while (0 != strncmp(header.data, "data", 4)) {
+      // We will just ignore the data in these chunks.
+      fseek(fp, header.data_size, SEEK_CUR);
+      // read next sub chunk
+      fread(header.data, 8, sizeof(char), fp);
+    }
+
+    if (header.data_size == 0) {
+        int offset = ftell(fp);
+        fseek(fp, 0, SEEK_END);
+        header.data_size = ftell(fp) - offset;
+        fseek(fp, offset, SEEK_SET);
+    }
+
+    num_channel_ = header.channels;
+    sample_rate_ = header.sample_rate;
+    bits_per_sample_ = header.bit;
+    int num_data = header.data_size / (bits_per_sample_ / 8);
+    data_ = new float[num_data]; // Create 1-dim array
+    num_samples_ = num_data / num_channel_;
+
+    std::cout << "num_channel_    :" << num_channel_ << std::endl;
+    std::cout << "sample_rate_    :" << sample_rate_ << std::endl;
+    std::cout << "bits_per_sample_:" << bits_per_sample_ << std::endl;
+    std::cout << "num_samples     :" << num_data << std::endl;
+    std::cout << "num_data_size   :" << header.data_size << std::endl;
+
+    switch (bits_per_sample_) {
+        case 8: {
+            char sample;
+            for (int i = 0; i < num_data; ++i) {
+                fread(&sample, 1, sizeof(char), fp);
+                data_[i] = (static_cast<float>(static_cast<unsigned char>(sample)) - 128.0f) / 128.0f;  // 8-bit WAV PCM is unsigned, centered at 128
+            }
+            break;
+        }
+        case 16: {
+            int16_t sample;
+            for (int i = 0; i < num_data; ++i) {
+                fread(&sample, 1, sizeof(int16_t), fp);
+                data_[i] = static_cast<float>(sample) / 32768;
+            }
+            break;
+        }
+        case 32:
+        {
+            if (header.format == 1) //S32
+            {
+                int sample;
+                for (int i = 0; i < num_data; ++i) {
+                    fread(&sample, 1, sizeof(int), fp);
+                    data_[i] = static_cast<float>(sample) / 2147483648.0f;  // full scale for 32-bit PCM is 2^31
+                }
+            }
+            else if (header.format == 3) // IEEE-float
+            {
+                float sample;
+                for (int i = 0; i < num_data; ++i) {
+                    fread(&sample, 1, sizeof(float), fp);
+                    data_[i] = static_cast<float>(sample);
+                }
+            }
+            else {
+                printf("unsupported quantization bits\n");
+            }
+            break;
+        }
+        default:
+            printf("unsupported quantization bits\n");
+            break;
+    }
+
+    fclose(fp);
+    return true;
+  }
+
+  int num_channel() const { return num_channel_; }
+  int sample_rate() const { return sample_rate_; }
+  int bits_per_sample() const { return bits_per_sample_; }
+  int num_samples() const { return num_samples_; }
+
+  ~WavReader() {
+    delete[] data_;
+  }
+
+  const float* data() const { return data_; }
+
+ private:
+  int num_channel_;
+  int sample_rate_;
+  int bits_per_sample_;
+  int num_samples_;  // sample points per channel
+  float* data_;
+};
+
+class WavWriter {
+ public:
+  WavWriter(const float* data, int num_samples, int num_channel,
+            int sample_rate, int bits_per_sample)
+      : data_(data),
+        num_samples_(num_samples),
+        num_channel_(num_channel),
+        sample_rate_(sample_rate),
+        bits_per_sample_(bits_per_sample) {}
+
+  void Write(const std::string& filename) {
+    FILE* fp = fopen(filename.c_str(), "wb");  // binary mode, so bytes are not translated on Windows
+    // init char 'riff' 'WAVE' 'fmt ' 'data'
+    WavHeader header;
+    char wav_header[44] = {0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00, 0x57,
+                           0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20, 0x10, 0x00,
+                           0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+                           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+                           0x64, 0x61, 0x74, 0x61, 0x00, 0x00, 0x00, 0x00};
+    memcpy(&header, wav_header, sizeof(header));
+    header.channels = num_channel_;
+    header.bit = bits_per_sample_;
+    header.sample_rate = sample_rate_;
+    header.data_size = num_samples_ * num_channel_ * (bits_per_sample_ / 8);
+    header.size = sizeof(header) - 8 + header.data_size;
+    header.bytes_per_second =
+        sample_rate_ * num_channel_ * (bits_per_sample_ / 8);
+    header.block_size = num_channel_ * (bits_per_sample_ / 8);
+
+    fwrite(&header, 1, sizeof(header), fp);
+
+    for (int i = 0; i < num_samples_; ++i) {
+      for (int j = 0; j < num_channel_; ++j) {
+        switch (bits_per_sample_) {
+          case 8: {
+            char sample = static_cast<char>(data_[i * num_channel_ + j]);
+            fwrite(&sample, 1, sizeof(sample), fp);
+            break;
+          }
+          case 16: {
+            int16_t sample = static_cast<int16_t>(data_[i * num_channel_ + j]);
+            fwrite(&sample, 1, sizeof(sample), fp);
+            break;
+          }
+          case 32: {
+            int sample = static_cast<int>(data_[i * num_channel_ + j]);
+            fwrite(&sample, 1, sizeof(sample), fp);
+            break;
+          }
+        }
+      }
+    }
+    fclose(fp);
+  }
+
+ private:
+  const float* data_;
+  int num_samples_;  // total float points in data_
+  int num_channel_;
+  int sample_rate_;
+  int bits_per_sample_;
+};
+
+}  // namespace wav
+
+#endif  // FRONTEND_WAV_H_
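Review note: a sketch of per-bit-depth scaling from integer PCM to floats in [-1, 1), assuming the usual WAV conventions (8-bit samples are unsigned with a midpoint of 128, 16-bit and 32-bit samples are signed). The `normalize_*` helpers are hypothetical and not part of `wav.h`.

```cpp
#include <cstdint>

// 8-bit WAV PCM is unsigned: 128 is silence, so shift before scaling.
inline float normalize_u8(std::uint8_t s) {
    return (static_cast<float>(s) - 128.0f) / 128.0f;
}

// 16-bit PCM is signed; full scale is 2^15.
inline float normalize_s16(std::int16_t s) {
    return static_cast<float>(s) / 32768.0f;
}

// 32-bit integer PCM is signed; full scale is 2^31.
inline float normalize_s32(std::int32_t s) {
    return static_cast<float>(s) / 2147483648.0f;
}
```

Each helper maps the most negative input to exactly -1.0, so samples of different bit depths land on the same scale before they reach the model.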
--- a/examples/haskell/README.md
+++ b/examples/haskell/README.md
@@ -0,0 +1,13 @@
+# Haskell example
+
+To run the example, put an ``example.wav`` file in this directory, then run:
+```bash
+stack run
+```
+
+The ``example.wav`` file must meet the following requirements:
+- 16 kHz sample rate.
+- Mono (single channel).
+- 16-bit audio.
+
+This uses the [silero-vad](https://hackage.haskell.org/package/silero-vad) package, a Haskell implementation based on the C# example.
--- a/examples/haskell/app/Main.hs
+++ b/examples/haskell/app/Main.hs
@@ -0,0 +1,22 @@
+module Main (main) where
+
+import qualified Data.Vector.Storable as Vector
+import Data.WAVE
+import Data.Function
+import Silero
+
+main :: IO ()
+main =
+  withModel $ \model -> do
+    wav <- getWAVEFile "example.wav"
+    let samples =
+          concat (waveSamples wav)
+            & Vector.fromList
+            & Vector.map (realToFrac . sampleToDouble)
+    let vad =
+          (defaultVad model)
+            { startThreshold = 0.5
+            , endThreshold = 0.35
+            }
+    segments <- detectSegments vad samples
+    print segments
--- a/examples/haskell/example.cabal
+++ b/examples/haskell/example.cabal
@@ -0,0 +1,23 @@
+cabal-version: 1.12
+
+-- This file has been generated from package.yaml by hpack version 0.37.0.
+--
+-- see: https://github.com/sol/hpack
+
+name:           example
+version:        0.1.0.0
+build-type:     Simple
+
+executable example-exe
+  main-is: Main.hs
+  other-modules:
+      Paths_example
+  hs-source-dirs:
+      app
+  ghc-options: -Wall -Wcompat -Widentities -Wincomplete-record-updates -Wincomplete-uni-patterns -Wmissing-export-lists -Wmissing-home-modules -Wpartial-fields -Wredundant-constraints -threaded -rtsopts -with-rtsopts=-N
+  build-depends:
+      WAVE
+    , base >=4.7 && <5
+    , silero-vad
+    , vector
+  default-language: Haskell2010
--- a/examples/haskell/package.yaml
+++ b/examples/haskell/package.yaml
@@ -0,0 +1,28 @@
+name: example
+version: 0.1.0.0
+
+dependencies:
+- base >= 4.7 && < 5
+- silero-vad
+- WAVE
+- vector
+
+ghc-options:
+- -Wall
+- -Wcompat
+- -Widentities
+- -Wincomplete-record-updates
+- -Wincomplete-uni-patterns
+- -Wmissing-export-lists
+- -Wmissing-home-modules
+- -Wpartial-fields
+- -Wredundant-constraints
+
+executables:
+  example-exe:
+    main: Main.hs
+    source-dirs: app
+    ghc-options:
+    - -threaded
+    - -rtsopts
+    - -with-rtsopts=-N
--- a/examples/haskell/stack.yaml
+++ b/examples/haskell/stack.yaml
@@ -0,0 +1,11 @@
+snapshot:
+  url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/20/26.yaml
+
+packages:
+- .
+
+extra-deps:
+  - silero-vad-0.1.0.4@sha256:2bff95be978a2782915b250edc795760d4cf76838e37bb7d4a965dc32566eb0f,5476
+  - WAVE-0.1.6@sha256:f744ff68f5e3a0d1f84fab373ea35970659085d213aef20860357512d0458c5c,1016
+  - derive-storable-0.3.1.0@sha256:bd1c51c155a00e2be18325d553d6764dd678904a85647d6ba952af998e70aa59,2313
+  - vector-0.13.2.0@sha256:98f5cb3080a3487527476e3c272dcadaba1376539f2aa0646f2f19b3af6b2f67,8481
--- a/examples/haskell/stack.yaml.lock
+++ b/examples/haskell/stack.yaml.lock
@@ -0,0 +1,41 @@
+# This file was autogenerated by Stack.
+# You should not edit this file by hand.
+# For more information, please see the documentation at:
+#   https://docs.haskellstack.org/en/stable/lock_files
+
+packages:
+- completed:
+    hackage: silero-vad-0.1.0.4@sha256:2bff95be978a2782915b250edc795760d4cf76838e37bb7d4a965dc32566eb0f,5476
+    pantry-tree:
+      sha256: a62e813f978d32c87769796fded981d25fcf2875bb2afdf60ed6279f931ccd7f
+      size: 1391
+  original:
+    hackage: silero-vad-0.1.0.4@sha256:2bff95be978a2782915b250edc795760d4cf76838e37bb7d4a965dc32566eb0f,5476
+- completed:
+    hackage: WAVE-0.1.6@sha256:f744ff68f5e3a0d1f84fab373ea35970659085d213aef20860357512d0458c5c,1016
+    pantry-tree:
+      sha256: ee5ccd70fa7fe6ffc360ebd762b2e3f44ae10406aa27f3842d55b8cbd1a19498
+      size: 405
+  original:
+    hackage: WAVE-0.1.6@sha256:f744ff68f5e3a0d1f84fab373ea35970659085d213aef20860357512d0458c5c,1016
+- completed:
+    hackage: derive-storable-0.3.1.0@sha256:bd1c51c155a00e2be18325d553d6764dd678904a85647d6ba952af998e70aa59,2313
+    pantry-tree:
+      sha256: 48e35a72d1bb593173890616c8d7efd636a650a306a50bb3e1513e679939d27e
+      size: 902
+  original:
+    hackage: derive-storable-0.3.1.0@sha256:bd1c51c155a00e2be18325d553d6764dd678904a85647d6ba952af998e70aa59,2313
+- completed:
+    hackage: vector-0.13.2.0@sha256:98f5cb3080a3487527476e3c272dcadaba1376539f2aa0646f2f19b3af6b2f67,8481
+    pantry-tree:
+      sha256: 2176fd677a02a4c47337f7dca5aeca2745dbb821a6ea5c7099b3a991ecd7f4f0
+      size: 4478
+  original:
+    hackage: vector-0.13.2.0@sha256:98f5cb3080a3487527476e3c272dcadaba1376539f2aa0646f2f19b3af6b2f67,8481
+snapshots:
+- completed:
+    sha256: 5a59b2a405b3aba3c00188453be172b85893cab8ebc352b1ef58b0eae5d248a2
+    size: 650475
+    url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/20/26.yaml
+  original:
+    url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/20/26.yaml
--- a/examples/java-example/pom.xml
+++ b/examples/java-example/pom.xml
@@ -1,30 +1,31 @@
 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
-  <modelVersion>4.0.0</modelVersion>
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>

-  <groupId>org.example</groupId>
-  <artifactId>java-example</artifactId>
-  <version>1.0-SNAPSHOT</version>
-  <packaging>jar</packaging>
+    <groupId>org.example</groupId>
+    <artifactId>java-example</artifactId>
+    <version>1.0-SNAPSHOT</version>
+    <packaging>jar</packaging>

-  <name>sliero-vad-example</name>
-  <url>http://maven.apache.org</url>
+    <name>sliero-vad-example</name>
+    <url>http://maven.apache.org</url>

-  <properties>
-    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-  </properties>
+    <properties>
+        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+    </properties>

-  <dependencies>
-    <dependency>
-      <groupId>junit</groupId>
-      <artifactId>junit</artifactId>
-      <version>3.8.1</version>
-      <scope>test</scope>
-    </dependency>
-    <dependency>
-      <groupId>com.microsoft.onnxruntime</groupId>
-      <artifactId>onnxruntime</artifactId>
-      <version>1.16.0-rc1</version>
-    </dependency>
-  </dependencies>
+    <dependencies>
+        <dependency>
+            <groupId>junit</groupId>
+            <artifactId>junit</artifactId>
+            <version>3.8.1</version>
+            <scope>test</scope>
+        </dependency>
+        <!-- https://mvnrepository.com/artifact/com.microsoft.onnxruntime/onnxruntime -->
+        <dependency>
+            <groupId>com.microsoft.onnxruntime</groupId>
+            <artifactId>onnxruntime</artifactId>
+            <version>1.23.1</version>
+        </dependency>
+    </dependencies>
 </project>
--- a/examples/java-example/src/main/java/org/example/App.java
+++ b/examples/java-example/src/main/java/org/example/App.java
@@ -2,68 +2,263 @@ package org.example;

 import ai.onnxruntime.OrtException;
 import javax.sound.sampled.*;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
 import java.util.Map;

+/**
+ * Silero VAD Java Example
+ * Voice Activity Detection using ONNX model
+ * 
+ * @author VvvvvGH
+ */
 public class App {

-    private static final String MODEL_PATH = "src/main/resources/silero_vad.onnx";
+    // ONNX model path - using the model file from the project
+    private static final String MODEL_PATH = "../../src/silero_vad/data/silero_vad.onnx";
+    // Test audio file path
+    private static final String AUDIO_FILE_PATH = "../../en_example.wav";
+    // Sampling rate
    private static final int SAMPLE_RATE = 16000;
-    private static final float START_THRESHOLD = 0.6f;
-    private static final float END_THRESHOLD = 0.45f;
-    private static final int MIN_SILENCE_DURATION_MS = 600;
-    private static final int SPEECH_PAD_MS = 500;
-    private static final int WINDOW_SIZE_SAMPLES = 2048;
+    // Speech threshold (consistent with Python default)
+    private static final float THRESHOLD = 0.5f;
+    // Negative threshold (used to determine speech end)
+    private static final float NEG_THRESHOLD = 0.35f; // threshold - 0.15
+    // Minimum speech duration (milliseconds)
+    private static final int MIN_SPEECH_DURATION_MS = 250;
+    // Minimum silence duration (milliseconds)
+    private static final int MIN_SILENCE_DURATION_MS = 100;
+    // Speech padding (milliseconds)
+    private static final int SPEECH_PAD_MS = 30;
+    // Window size (samples) - 512 samples for 16kHz
+    private static final int WINDOW_SIZE_SAMPLES = 512;

    public static void main(String[] args) {
-        // Initialize the Voice Activity Detector
-        SlieroVadDetector vadDetector;
+        System.out.println("=".repeat(60));
+        System.out.println("Silero VAD Java ONNX Example");
+        System.out.println("=".repeat(60));
+        
+        // Load ONNX model
+        SlieroVadOnnxModel model;
        try {
-            vadDetector = new SlieroVadDetector(MODEL_PATH, START_THRESHOLD, END_THRESHOLD, SAMPLE_RATE, MIN_SILENCE_DURATION_MS, SPEECH_PAD_MS);
+            System.out.println("Loading ONNX model: " + MODEL_PATH);
+            model = new SlieroVadOnnxModel(MODEL_PATH);
+            System.out.println("Model loaded successfully!");
        } catch (OrtException e) {
-            System.err.println("Error initializing the VAD detector: " + e.getMessage());
+            System.err.println("Failed to load model: " + e.getMessage());
+            e.printStackTrace();
            return;
        }

-        // Set audio format
-        AudioFormat format = new AudioFormat(SAMPLE_RATE, 16, 1, true, false);
-        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
-
-        // Get the target data line and open it with the specified format
-        TargetDataLine targetDataLine;
+        // Read WAV file
+        float[] audioData;
        try {
-            targetDataLine = (TargetDataLine) AudioSystem.getLine(info);
-            targetDataLine.open(format);
-            targetDataLine.start();
-        } catch (LineUnavailableException e) {
-            System.err.println("Error opening target data line: " + e.getMessage());
+            System.out.println("\nReading audio file: " + AUDIO_FILE_PATH);
+            audioData = readWavFileAsFloatArray(AUDIO_FILE_PATH);
+            System.out.println("Audio file read successfully, samples: " + audioData.length);
+            System.out.println("Audio duration: " + String.format("%.2f", (audioData.length / (float) SAMPLE_RATE)) + " seconds");
+        } catch (Exception e) {
+            System.err.println("Failed to read audio file: " + e.getMessage());
+            e.printStackTrace();
            return;
        }

-        // Main loop to continuously read data and apply Voice Activity Detection
-        while (targetDataLine.isOpen()) {
-            byte[] data = new byte[WINDOW_SIZE_SAMPLES];
+        // Get speech timestamps (batch mode, consistent with Python's get_speech_timestamps)
+        System.out.println("\nDetecting speech segments...");
+        List<Map<String, Integer>> speechTimestamps;
+        try {
+            speechTimestamps = getSpeechTimestamps(
+                audioData,
+                model,
+                THRESHOLD,
+                SAMPLE_RATE,
+                MIN_SPEECH_DURATION_MS,
+                MIN_SILENCE_DURATION_MS,
+                SPEECH_PAD_MS,
+                NEG_THRESHOLD
+            );
+        } catch (OrtException e) {
+            System.err.println("Failed to detect speech timestamps: " + e.getMessage());
+            e.printStackTrace();
+            return;
+        }

-            int numBytesRead = targetDataLine.read(data, 0, data.length);
-            if (numBytesRead <= 0) {
-                System.err.println("Error reading data from target data line.");
+        // Output detection results
+        System.out.println("\nDetected speech timestamps (in samples):");
+        for (Map<String, Integer> timestamp : speechTimestamps) {
+            System.out.println(timestamp);
+        }
+
+        // Output summary
+        System.out.println("\n" + "=".repeat(60));
+        System.out.println("Detection completed!");
+        System.out.println("Detected " + speechTimestamps.size() + " speech segments in total");
+        System.out.println("=".repeat(60));
+        
+        // Close model
+        try {
+            model.close();
+        } catch (OrtException e) {
+            System.err.println("Error closing model: " + e.getMessage());
+        }
+    }
+
+    /**
+     * Get speech timestamps
+     * Implements the same logic as Python's get_speech_timestamps
+     * 
+     * @param audio Audio data (float array)
+     * @param model ONNX model
+     * @param threshold Speech threshold
+     * @param samplingRate Sampling rate
+     * @param minSpeechDurationMs Minimum speech duration (milliseconds)
+     * @param minSilenceDurationMs Minimum silence duration (milliseconds)
+     * @param speechPadMs Speech padding (milliseconds)
+     * @param negThreshold Negative threshold (used to determine speech end)
+     * @return List of speech timestamps
+     */
+    private static List<Map<String, Integer>> getSpeechTimestamps(
+            float[] audio,
+            SlieroVadOnnxModel model,
+            float threshold,
+            int samplingRate,
+            int minSpeechDurationMs,
+            int minSilenceDurationMs,
+            int speechPadMs,
+            float negThreshold) throws OrtException {
+        
+        // Reset model states
+        model.resetStates();
+        
+        // Calculate parameters
+        int minSpeechSamples = samplingRate * minSpeechDurationMs / 1000;
+        int speechPadSamples = samplingRate * speechPadMs / 1000;
+        int minSilenceSamples = samplingRate * minSilenceDurationMs / 1000;
+        int windowSizeSamples = samplingRate == 16000 ? 512 : 256;
+        int audioLengthSamples = audio.length;
+        
+        // Calculate speech probabilities for all audio chunks
+        List<Float> speechProbs = new ArrayList<>();
+        for (int currentStart = 0; currentStart < audioLengthSamples; currentStart += windowSizeSamples) {
+            float[] chunk = new float[windowSizeSamples];
+            int chunkLength = Math.min(windowSizeSamples, audioLengthSamples - currentStart);
+            System.arraycopy(audio, currentStart, chunk, 0, chunkLength);
+            
+            // Pad with zeros if chunk is shorter than window size
+            if (chunkLength < windowSizeSamples) {
+                for (int i = chunkLength; i < windowSizeSamples; i++) {
+                    chunk[i] = 0.0f;
+                }
+            }
+            
+            float speechProb = model.call(new float[][]{chunk}, samplingRate)[0];
+            speechProbs.add(speechProb);
+        }
+        
+        // Detect speech segments using the same algorithm as Python
+        boolean triggered = false;
+        List<Map<String, Integer>> speeches = new ArrayList<>();
+        Map<String, Integer> currentSpeech = null;
+        int tempEnd = 0;
+        
+        for (int i = 0; i < speechProbs.size(); i++) {
+            float speechProb = speechProbs.get(i);
+            
+            // Reset temporary end if speech probability exceeds threshold
+            if (speechProb >= threshold && tempEnd != 0) {
+                tempEnd = 0;
+            }
+            
+            // Detect speech start
+            if (speechProb >= threshold && !triggered) {
+                triggered = true;
+                currentSpeech = new HashMap<>();
+                currentSpeech.put("start", windowSizeSamples * i);
                continue;
            }
            
-            // Apply the Voice Activity Detector to the data and get the result
-            Map<String, Double> detectResult;
-            try {
-                detectResult = vadDetector.apply(data, true);
-            } catch (Exception e) {
-                System.err.println("Error applying VAD detector: " + e.getMessage());
-                continue;
-            }
-
-            if (!detectResult.isEmpty()) {
-                System.out.println(detectResult);
+            // Detect speech end
+            if (speechProb < negThreshold && triggered) {
+                if (tempEnd == 0) {
+                    tempEnd = windowSizeSamples * i;
+                }
+                if (windowSizeSamples * i - tempEnd < minSilenceSamples) {
+                    continue;
+                } else {
+                    currentSpeech.put("end", tempEnd);
+                    if (currentSpeech.get("end") - currentSpeech.get("start") > minSpeechSamples) {
+                        speeches.add(currentSpeech);
+                    }
+                    currentSpeech = null;
+                    tempEnd = 0;
+                    triggered = false;
+                }
            }
        }
        
-        // Close the target data line to release audio resources
-        targetDataLine.close();
+        // Handle the last speech segment
+        if (currentSpeech != null && 
+            (audioLengthSamples - currentSpeech.get("start")) > minSpeechSamples) {
+            currentSpeech.put("end", audioLengthSamples);
+            speeches.add(currentSpeech);
+        }
+        
+        // Add speech padding - same logic as Python
+        for (int i = 0; i < speeches.size(); i++) {
+            Map<String, Integer> speech = speeches.get(i);
+            if (i == 0) {
+                speech.put("start", Math.max(0, speech.get("start") - speechPadSamples));
+            }
+            if (i != speeches.size() - 1) {
+                int silenceDuration = speeches.get(i + 1).get("start") - speech.get("end");
+                if (silenceDuration < 2 * speechPadSamples) {
+                    speech.put("end", speech.get("end") + silenceDuration / 2);
+                    speeches.get(i + 1).put("start", 
+                        Math.max(0, speeches.get(i + 1).get("start") - silenceDuration / 2));
+                } else {
+                    speech.put("end", Math.min(audioLengthSamples, speech.get("end") + speechPadSamples));
+                    speeches.get(i + 1).put("start", 
+                        Math.max(0, speeches.get(i + 1).get("start") - speechPadSamples));
+                }
+            } else {
+                speech.put("end", Math.min(audioLengthSamples, speech.get("end") + speechPadSamples));
+            }
+        }
+        
+        return speeches;
    }
+
+    /**
+     * Read WAV file and return as float array
+     * 
+     * @param filePath WAV file path
+     * @return Audio data as float array (normalized to -1.0 to 1.0)
+     */
+    private static float[] readWavFileAsFloatArray(String filePath) 
+            throws UnsupportedAudioFileException, IOException {
+        File audioFile = new File(filePath);
+        AudioInputStream audioStream = AudioSystem.getAudioInputStream(audioFile);
+        
+        // Get audio format information
+        AudioFormat format = audioStream.getFormat();
+        System.out.println("Audio format: " + format);
+        
+        // Read all audio data
+        byte[] audioBytes = audioStream.readAllBytes();
+        audioStream.close();
+        
+        // Convert to float array
+        float[] audioData = new float[audioBytes.length / 2];
+        for (int i = 0; i < audioData.length; i++) {
+            // 16-bit PCM: two bytes per sample (little-endian)
+            short sample = (short) ((audioBytes[i * 2] & 0xff) | (audioBytes[i * 2 + 1] << 8));
+            audioData[i] = sample / 32768.0f; // Normalize to -1.0 to 1.0
+        }
+        
+        return audioData;
+    }
+
 }
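The start/end logic in `getSpeechTimestamps` above is a two-threshold (hysteresis) state machine. The following is a minimal sketch of just that part, run on a plain array of per-window probabilities. The class name `HysteresisSketch` and its method are hypothetical and not part of this PR. The sketch omits the minimum-speech filter and the padding pass, and it closes a trailing segment at the end of the last window rather than at the exact audio length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of the PR.
final class HysteresisSketch {

    /** Returns {start, end} sample positions for each detected speech run. */
    static List<int[]> segments(float[] probs, int windowSize, float threshold,
                                float negThreshold, int minSilenceSamples) {
        List<int[]> out = new ArrayList<>();
        boolean triggered = false;
        int start = 0;
        int tempEnd = 0; // 0 means "no pending end", as in the diff above

        for (int i = 0; i < probs.length; i++) {
            int pos = windowSize * i;
            if (probs[i] >= threshold) {
                tempEnd = 0;              // speech resumed: cancel any pending end
                if (!triggered) {
                    triggered = true;     // speech start
                    start = pos;
                }
            } else if (probs[i] < negThreshold && triggered) {
                if (tempEnd == 0) {
                    tempEnd = pos;        // first quiet window: candidate end
                }
                if (pos - tempEnd >= minSilenceSamples) {
                    out.add(new int[]{start, tempEnd}); // silence lasted long enough
                    triggered = false;
                    tempEnd = 0;
                }
            }
        }
        if (triggered) {
            // Audio ended while still in speech: close at the end of the last window
            out.add(new int[]{start, windowSize * probs.length});
        }
        return out;
    }

    public static void main(String[] args) {
        float[] probs = {0.1f, 0.9f, 0.9f, 0.2f, 0.2f, 0.2f, 0.9f, 0.1f};
        for (int[] s : segments(probs, 512, 0.5f, 0.35f, 1024)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

Probabilities between `negThreshold` and `threshold` change nothing, so a brief dip does not split a segment.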
--- a/examples/java-example/src/main/java/org/example/SlieroVadDetector.java
+++ b/examples/java-example/src/main/java/org/example/SlieroVadDetector.java
@@ -8,25 +8,30 @@ import java.util.Collections;
 import java.util.HashMap;
 import java.util.Map;

-
+/**
+ * Silero VAD Detector
+ * Real-time voice activity detection
+ * 
+ * @author VvvvvGH
+ */
 public class SlieroVadDetector {
-    // OnnxModel model used for speech processing
+    // ONNX model for speech processing
    private final SlieroVadOnnxModel model;
-    // Threshold for speech start
+    // Speech start threshold
    private final float startThreshold;
-    // Threshold for speech end
+    // Speech end threshold
    private final float endThreshold;
    // Sampling rate
    private final int samplingRate;
-    // Minimum number of silence samples to determine the end threshold of speech
+    // Minimum silence samples to determine speech end
    private final float minSilenceSamples;
-    // Additional number of samples for speech start or end to calculate speech start or end time
+    // Speech padding samples for calculating speech boundaries
    private final float speechPadSamples;
-    // Whether in the triggered state (i.e. whether speech is being detected)
+    // Triggered state (whether speech is being detected)
    private boolean triggered;
-    // Temporarily stored number of speech end samples
+    // Temporary speech end sample position
    private int tempEnd;
-    // Number of samples currently being processed
+    // Current sample position
    private int currentSample;


@@ -36,23 +41,25 @@ public class SlieroVadDetector {
                             int samplingRate,
                             int minSilenceDurationMs,
                             int speechPadMs) throws OrtException {
-        // Check if the sampling rate is 8000 or 16000, if not, throw an exception
+        // Validate sampling rate
        if (samplingRate != 8000 && samplingRate != 16000) {
-            throw new IllegalArgumentException("does not support sampling rates other than [8000, 16000]");
+            throw new IllegalArgumentException("Does not support sampling rates other than [8000, 16000]");
        }

-        // Initialize the parameters
+        // Initialize parameters
        this.model = new SlieroVadOnnxModel(modelPath);
        this.startThreshold = startThreshold;
        this.endThreshold = endThreshold;
        this.samplingRate = samplingRate;
        this.minSilenceSamples = samplingRate * minSilenceDurationMs / 1000f;
        this.speechPadSamples = samplingRate * speechPadMs / 1000f;
-        // Reset the state
+        // Reset state
        reset();
    }

-    // Method to reset the state, including the model state, trigger state, temporary end time, and current sample count
+    /**
+     * Reset detector state
+     */
    public void reset() {
        model.resetStates();
        triggered = false;
@@ -60,21 +67,27 @@ public class SlieroVadDetector {
        currentSample = 0;
    }

-    // apply method for processing the audio array, returning possible speech start or end times
+    /**
+     * Process audio data and detect speech events
+     * 
+     * @param data Audio data as byte array
+     * @param returnSeconds Whether to return timestamps in seconds
+     * @return Speech event (start or end) or empty map if no event
+     */
    public Map<String, Double> apply(byte[] data, boolean returnSeconds) {

-        // Convert the byte array to a float array
+        // Convert byte array to float array
        float[] audioData = new float[data.length / 2];
        for (int i = 0; i < audioData.length; i++) {
            audioData[i] = ((data[i * 2] & 0xff) | (data[i * 2 + 1] << 8)) / 32767.0f;
        }

-        // Get the length of the audio array as the window size
+        // Get window size from audio data length
        int windowSizeSamples = audioData.length;
-        // Update the current sample count
+        // Update current sample position
        currentSample += windowSizeSamples;

-        // Call the model to get the prediction probability of speech
+        // Get speech probability from model
        float speechProb = 0;
        try {
            speechProb = model.call(new float[][]{audioData}, samplingRate)[0];
@@ -82,19 +95,18 @@ public class SlieroVadDetector {
            throw new RuntimeException(e);
        }

-        // If the speech probability is greater than the threshold and the temporary end time is not 0, reset the temporary end time
-        // This indicates that the speech duration has exceeded expectations and needs to recalculate the end time
+        // Reset temporary end if speech probability exceeds threshold
        if (speechProb >= startThreshold && tempEnd != 0) {
            tempEnd = 0;
        }

-        // If the speech probability is greater than the threshold and not in the triggered state, set to triggered state and calculate the speech start time
+        // Detect speech start
        if (speechProb >= startThreshold && !triggered) {
            triggered = true;
            int speechStart = (int) (currentSample - speechPadSamples);
            speechStart = Math.max(speechStart, 0);
            Map<String, Double> result = new HashMap<>();
-            // Decide whether to return the result in seconds or sample count based on the returnSeconds parameter
+            // Return in seconds or samples based on returnSeconds parameter
            if (returnSeconds) {
                double speechStartSeconds = speechStart / (double) samplingRate;
                double roundedSpeechStart = BigDecimal.valueOf(speechStartSeconds).setScale(1, RoundingMode.HALF_UP).doubleValue();
@@ -106,18 +118,17 @@ public class SlieroVadDetector {
            return result;
        }

-        // If the speech probability is less than a certain threshold and in the triggered state, calculate the speech end time
+        // Detect speech end
        if (speechProb < endThreshold && triggered) {
-            // Initialize or update the temporary end time
+            // Initialize or update temporary end position
            if (tempEnd == 0) {
                tempEnd = currentSample;
            }
-            // If the number of silence samples between the current sample and the temporary end time is less than the minimum silence samples, return null
-            // This indicates that it is not yet possible to determine whether the speech has ended
+            // Wait for minimum silence duration before confirming speech end
            if (currentSample - tempEnd < minSilenceSamples) {
                return Collections.emptyMap();
            } else {
-                // Calculate the speech end time, reset the trigger state and temporary end time
+                // Calculate speech end time and reset state
                int speechEnd = (int) (tempEnd + speechPadSamples);
                tempEnd = 0;
                triggered = false;
@@ -134,7 +145,7 @@ public class SlieroVadDetector {
            }
        }

-        // If the above conditions are not met, return null by default
+        // No speech event detected
        return Collections.emptyMap();
    }
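Both `SlieroVadDetector.apply` and `App.readWavFileAsFloatArray` turn raw bytes into floats by hand. Below is a minimal sketch of that conversion, assuming 16-bit signed little-endian mono PCM. The class name `Pcm16Sketch` is hypothetical. It divides by 32768, as `App.java` does, whereas `apply` divides by 32767.

```java
// Hypothetical illustration class, not part of the PR.
final class Pcm16Sketch {

    /** Converts 16-bit signed little-endian PCM bytes to floats in [-1.0, 1.0). */
    static float[] toFloats(byte[] pcm) {
        float[] out = new float[pcm.length / 2]; // a trailing odd byte is ignored
        for (int i = 0; i < out.length; i++) {
            int lo = pcm[2 * i] & 0xff;  // low byte, masked to stay unsigned
            int hi = pcm[2 * i + 1];     // high byte, sign-extended by Java
            short sample = (short) ((hi << 8) | lo);
            out[i] = sample / 32768.0f;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] pcm = {0x00, 0x00, 0x00, (byte) 0x80, 0x00, 0x40};
        for (float f : toFloats(pcm)) {
            System.out.println(f);
        }
    }
}
```

Masking the low byte with `0xff` matters: without it, a low byte of `0x80` or above would be sign-extended and corrupt the high bits of the sample.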

--- a/examples/java-example/src/main/java/org/example/SlieroVadOnnxModel.java
+++ b/examples/java-example/src/main/java/org/example/SlieroVadOnnxModel.java
@@ -9,42 +9,58 @@ import java.util.HashMap;
 import java.util.List;
 import java.util.Map;

+/**
+ * Silero VAD ONNX Model Wrapper
+ * 
+ * @author VvvvvGH
+ */
 public class SlieroVadOnnxModel {
-    // Define private variable OrtSession
+    // ONNX runtime session
    private final OrtSession session;
-    private float[][][] h;
-    private float[][][] c;
-    // Define the last sample rate
+    // Model state - dimensions: [2, batch_size, 128]
+    private float[][][] state;
+    // Context - stores the tail of the previous audio chunk
+    private float[][] context;
+    // Last sample rate
    private int lastSr = 0;
-    // Define the last batch size
+    // Last batch size
    private int lastBatchSize = 0;
-    // Define a list of supported sample rates
+    // Supported sample rates
    private static final List<Integer> SAMPLE_RATES = Arrays.asList(8000, 16000);

    // Constructor
    public SlieroVadOnnxModel(String modelPath) throws OrtException {
        // Get the ONNX runtime environment
        OrtEnvironment env = OrtEnvironment.getEnvironment();
-        // Create an ONNX session options object
+        // Create ONNX session options
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
-        // Set the InterOp thread count to 1, InterOp threads are used for parallel processing of different computation graph operations
+        // Set InterOp thread count to 1 (for parallel processing of different graph operations)
        opts.setInterOpNumThreads(1);
-        // Set the IntraOp thread count to 1, IntraOp threads are used for parallel processing within a single operation
+        // Set IntraOp thread count to 1 (for parallel processing within a single operation)
        opts.setIntraOpNumThreads(1);
-        // Add a CPU device, setting to false disables CPU execution optimization
+        // Add the CPU execution provider; true enables the arena allocator
        opts.addCPU(true);
-        // Create an ONNX session using the environment, model path, and options
+        // Create ONNX session with the environment, model path, and options
        session = env.createSession(modelPath, opts);
        // Reset states
        resetStates();
    }

    /**
-     * Reset states
+     * Reset states with default batch size
     */
    void resetStates() {
-        h = new float[2][1][64];
-        c = new float[2][1][64];
+        resetStates(1);
+    }
+    
+    /**
+     * Reset states with specific batch size
+     * 
+     * @param batchSize Batch size for state initialization
+     */
+    void resetStates(int batchSize) {
+        state = new float[2][batchSize][128];
+        context = new float[0][]; // Empty context
        lastSr = 0;
        lastBatchSize = 0;
    }
@@ -54,13 +70,12 @@ public class SlieroVadOnnxModel {
    }

    /**
-     * Define inner class ValidationResult
+     * Inner class for validation result
     */
    public static class ValidationResult {
        public final float[][] x;
        public final int sr;

-        // Constructor
        public ValidationResult(float[][] x, int sr) {
            this.x = x;
            this.sr = sr;
@@ -68,19 +83,23 @@ public class SlieroVadOnnxModel {
    }

    /**
-     * Function to validate input data
+     * Validate input data
+     * 
+     * @param x Audio data array
+     * @param sr Sample rate
+     * @return Validated input data and sample rate
     */
    private ValidationResult validateInput(float[][] x, int sr) {
-        // Process the input data with dimension 1
+        // Ensure input is at least 2D
        if (x.length == 1) {
            x = new float[][]{x[0]};
        }
-        // Throw an exception when the input data dimension is greater than 2
+        // Check if input dimension is valid
        if (x.length > 2) {
            throw new IllegalArgumentException("Incorrect audio data dimension: " + x[0].length);
        }

-        // Process the input data when the sample rate is not equal to 16000 and is a multiple of 16000
+        // Downsample if sample rate is a multiple of 16000
        if (sr != 16000 && (sr % 16000 == 0)) {
            int step = sr / 16000;
            float[][] reducedX = new float[x.length][];
@@ -100,22 +119,26 @@ public class SlieroVadOnnxModel {
            sr = 16000;
        }

-        // If the sample rate is not in the list of supported sample rates, throw an exception
+        // Validate sample rate
        if (!SAMPLE_RATES.contains(sr)) {
            throw new IllegalArgumentException("Only supports sample rates " + SAMPLE_RATES + " (or multiples of 16000)");
        }

-        // If the input audio block is too short, throw an exception
+        // Check if audio chunk is too short
        if (((float) sr) / x[0].length > 31.25) {
            throw new IllegalArgumentException("Input audio is too short");
        }

-        // Return the validated result
        return new ValidationResult(x, sr);
    }

    /**
-     * Method to call the ONNX model
+     * Call the ONNX model for inference
+     * 
+     * @param x Audio data array
+     * @param sr Sample rate
+     * @return Speech probability output
+     * @throws OrtException If ONNX runtime error occurs
     */
    public float[] call(float[][] x, int sr) throws OrtException {
        ValidationResult result = validateInput(x, sr);
@@ -123,38 +146,62 @@ public class SlieroVadOnnxModel {
        sr = result.sr;

        int batchSize = x.length;
+        int numSamples = sr == 16000 ? 512 : 256;
+        int contextSize = sr == 16000 ? 64 : 32;

-        if (lastBatchSize == 0 || lastSr != sr || lastBatchSize != batchSize) {
-            resetStates();
+        // Reset states only when sample rate or batch size changes
+        if (lastSr != 0 && lastSr != sr) {
+            resetStates(batchSize);
+        } else if (lastBatchSize != 0 && lastBatchSize != batchSize) {
+            resetStates(batchSize);
+        } else if (lastBatchSize == 0 && state[0].length != batchSize) {
+            // First call - state was sized for another batch size, so re-create it
+            resetStates(batchSize);
+        }
+
+        // Initialize context if needed
+        if (context.length == 0) {
+            context = new float[batchSize][contextSize];
+        }
+
+        // Concatenate context and input
+        float[][] xWithContext = new float[batchSize][contextSize + numSamples];
+        for (int i = 0; i < batchSize; i++) {
+            // Copy context
+            System.arraycopy(context[i], 0, xWithContext[i], 0, contextSize);
+            // Copy input
+            System.arraycopy(x[i], 0, xWithContext[i], contextSize, numSamples);
        }

        OrtEnvironment env = OrtEnvironment.getEnvironment();

        OnnxTensor inputTensor = null;
-        OnnxTensor hTensor = null;
-        OnnxTensor cTensor = null;
+        OnnxTensor stateTensor = null;
        OnnxTensor srTensor = null;
        OrtSession.Result ortOutputs = null;

        try {
            // Create input tensors
-            inputTensor = OnnxTensor.createTensor(env, x);
-            hTensor = OnnxTensor.createTensor(env, h);
-            cTensor = OnnxTensor.createTensor(env, c);
+            inputTensor = OnnxTensor.createTensor(env, xWithContext);
+            stateTensor = OnnxTensor.createTensor(env, state);
            srTensor = OnnxTensor.createTensor(env, new long[]{sr});

            Map<String, OnnxTensor> inputs = new HashMap<>();
            inputs.put("input", inputTensor);
            inputs.put("sr", srTensor);
-            inputs.put("h", hTensor);
-            inputs.put("c", cTensor);
+            inputs.put("state", stateTensor);

-            // Call the ONNX model for calculation
+            // Run ONNX model inference
            ortOutputs = session.run(inputs);
-            // Get the output results
+            // Get output results
            float[][] output = (float[][]) ortOutputs.get(0).getValue();
-            h = (float[][][]) ortOutputs.get(1).getValue();
-            c = (float[][][]) ortOutputs.get(2).getValue();
+            state = (float[][][]) ortOutputs.get(1).getValue();
+
+            // Update context - save the last contextSize samples from input
+            for (int i = 0; i < batchSize; i++) {
+                System.arraycopy(xWithContext[i], xWithContext[i].length - contextSize, 
+                               context[i], 0, contextSize);
+            }

            lastSr = sr;
            lastBatchSize = batchSize;
@@ -163,11 +210,8 @@ public class SlieroVadOnnxModel {
            if (inputTensor != null) {
                inputTensor.close();
            }
-            if (hTensor != null) {
-                hTensor.close();
-            }
-            if (cTensor != null) {
-                cTensor.close();
+            if (stateTensor != null) {
+                stateTensor.close();
            }
            if (srTensor != null) {
                srTensor.close();
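
The Java change above prepends the saved `context` to each incoming chunk and, after inference, keeps the last `contextSize` samples of the combined input for the next call. A minimal Python sketch of that carry-over, using plain lists instead of tensors. The function name `step_with_context` and the context length of 4 are hypothetical and chosen only for illustration; the real context size is not shown in this hunk.

```python
from typing import List, Tuple


def step_with_context(chunk: List[float],
                      context: List[float]) -> Tuple[List[float], List[float]]:
    """Return (model_input, next_context) for one streaming step.

    model_input is the previous context followed by the new chunk;
    next_context is the tail of model_input with the same length as context.
    """
    model_input = context + chunk
    # Slice from an explicit index so an empty context yields an empty tail.
    next_context = model_input[len(model_input) - len(context):]
    return model_input, next_context


# Hypothetical context length of 4 samples, starting from silence.
context = [0.0] * 4
first_input, context = step_with_context([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], context)
second_input, context = step_with_context([7.0, 8.0], context)
print(first_input)
print(second_input)
```

The second call starts with the last four samples of the first input, so the model input overlaps across chunk boundaries in the same way as `xWithContext` does.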
--- a/examples/parallel_example.ipynb
+++ b/examples/parallel_example.ipynb
@@ -1,7 +1,6 @@
 {
 "cells": [
  {
-   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
@@ -18,17 +17,19 @@
    "SAMPLING_RATE = 16000\n",
    "import torch\n",
    "from pprint import pprint\n",
+    "import time\n",
+    "import shutil\n",
    "\n",
    "torch.set_num_threads(1)\n",
    "NUM_PROCESS=4 # set to the number of CPU cores in the machine\n",
    "NUM_COPIES=8\n",
    "# download wav files, make multiple copies\n",
-    "for idx in range(NUM_COPIES):\n",
-    "    torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', f\"en_example{idx}.wav\")\n"
+    "torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', \"en_example0.wav\")\n",
+    "for idx in range(NUM_COPIES-1):\n",
+    "    shutil.copy(\"en_example0.wav\", f\"en_example{idx+1}.wav\")"
   ]
  },
  {
-   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
@@ -54,7 +55,6 @@
   ]
  },
  {
-   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
@@ -99,7 +99,6 @@
   ]
  },
  {
-   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
@@ -127,7 +126,7 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "diarization",
+   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
@@ -141,7 +140,20 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.15"
+   "version": "3.10.14"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": false
  }
 },
 "nbformat": 4,
--- a/examples/pyaudio-streaming/README.md
+++ b/examples/pyaudio-streaming/README.md
@@ -8,6 +8,8 @@ Currently, the notebook consists of two examples:
 - One that records audio of a predefined length from the microphone, processes it with Silero-VAD, and plots it afterwards.
 - The other one plots the speech probabilities in real-time (using jupyterplot) and records the audio until you press enter.
 
+ This example does not work in Google Colab. It is for local use only.
+
 ## Example Video for the Real-Time Visualization


--- a/examples/pyaudio-streaming/pyaudio-streaming-examples.ipynb
+++ b/examples/pyaudio-streaming/pyaudio-streaming-examples.ipynb
@@ -2,7 +2,7 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "id": "62a0cccb",
+   "id": "76aa55ba",
   "metadata": {},
   "source": [
    "# Pyaudio Microphone Streaming Examples\n",
@@ -12,12 +12,14 @@
    "I created it as an example of how binary data from a stream could be fed into Silero VAD.\n",
    "\n",
    "\n",
-    "Has been tested on Ubuntu 21.04 (x86). After you installed the dependencies below, no additional setup is required."
+    "Tested on Ubuntu 21.04 (x86). After you have installed the dependencies below, no additional setup is required.\n",
+    "\n",
+    "This notebook does not work in Google Colab. It is for local use only."
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "64cbe1eb",
+   "id": "4a4e15c2",
   "metadata": {},
   "source": [
    "## Dependencies\n",
@@ -26,22 +28,27 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "id": "57bc2aac",
-   "metadata": {},
+   "execution_count": 1,
+   "id": "24205cce",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-10-09T08:47:34.056898Z",
+     "start_time": "2024-10-09T08:47:34.053418Z"
+    }
+   },
   "outputs": [],
   "source": [
-    "#!pip install numpy==2.0.2\n",
-    "#!pip install torch==2.4.1\n",
-    "#!pip install matplotlib==3.9.2\n",
-    "#!pip install torchaudio==2.4.1\n",
+    "#!pip install 'numpy>=1.24.0'\n",
+    "#!pip install 'torch>=1.12.0'\n",
+    "#!pip install 'matplotlib>=3.6.0'\n",
+    "#!pip install 'torchaudio>=0.12.0'\n",
    "#!pip install soundfile==0.12.1\n",
-    "#!pip install pyaudio==0.2.11"
+    "#!apt install python3-pyaudio (linux) or pip install pyaudio (windows)"
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "110de761",
+   "id": "cd22818f",
   "metadata": {},
   "source": [
    "## Imports"
@@ -49,10 +56,27 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "id": "5a647d8d",
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 2,
+   "id": "994d7f3a",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-10-09T08:47:39.005032Z",
+     "start_time": "2024-10-09T08:47:36.489952Z"
+    }
+   },
+   "outputs": [
+    {
+     "ename": "ModuleNotFoundError",
+     "evalue": "No module named 'pyaudio'",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
+      "Cell \u001b[0;32mIn[2], line 8\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\n\u001b[1;32m      7\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpylab\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[0;32m----> 8\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mpyaudio\u001b[39;00m\n",
+      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pyaudio'"
+     ]
+    }
+   ],
   "source": [
    "import io\n",
    "import numpy as np\n",
@@ -67,7 +91,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "725d7066",
+   "id": "ac5c52f7",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -79,7 +103,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "1c0b2ea7",
+   "id": "ad5919dc",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -92,7 +116,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "f9112603",
+   "id": "784d1ab6",
   "metadata": {},
   "source": [
    "### Helper Methods"
@@ -101,7 +125,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "5abc6330",
+   "id": "af4bca64",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -124,7 +148,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "5124095e",
+   "id": "ca13e514",
   "metadata": {},
   "source": [
    "## Pyaudio Set-up"
@@ -133,7 +157,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "a845356e",
+   "id": "75f99022",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -147,7 +171,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "0b910c99",
+   "id": "4da7d2ef",
   "metadata": {},
   "source": [
    "## Simple Example\n",
@@ -157,7 +181,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "9d3d2c10",
+   "id": "6fe77661",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -167,7 +191,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "3cb44a4a",
+   "id": "23f4da3e",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -207,7 +231,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "a3dda982",
+   "id": "fd243e8f",
   "metadata": {},
   "source": [
    "## Real Time Visualization\n",
@@ -220,7 +244,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "05ef4100",
+   "id": "d36980c2",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -230,7 +254,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "d1d4cdd6",
+   "id": "5607b616",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -287,7 +311,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "1e398009",
+   "id": "dc4f0108",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -311,7 +335,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.10.14"
  },
  "toc": {
   "base_numbering": 1,
--- a/hubconf.py
+++ b/hubconf.py
@@ -23,11 +23,14 @@ def versiontuple(v):
    return tuple(version_list)


-def silero_vad(onnx=False, force_onnx_cpu=False):
+def silero_vad(onnx=False, force_onnx_cpu=False, opset_version=16):
    """Silero Voice Activity Detector
    Returns a model with a set of utils
    Please see https://github.com/snakers4/silero-vad for usage examples
    """
+    available_ops = [15, 16]
+    if onnx and opset_version not in available_ops:
+        raise Exception(f'Available ONNX opset_version: {available_ops}')

    if not onnx:
        installed_version = torch.__version__
@@ -37,7 +40,11 @@ def silero_vad(onnx=False, force_onnx_cpu=False):

    model_dir = os.path.join(os.path.dirname(__file__), 'src', 'silero_vad', 'data')
    if onnx:
-        model = OnnxWrapper(os.path.join(model_dir, 'silero_vad.onnx'), force_onnx_cpu)
+        if opset_version == 16:
+            model_name = 'silero_vad.onnx'
+        else:
+            model_name = f'silero_vad_16k_op{opset_version}.onnx'
+        model = OnnxWrapper(os.path.join(model_dir, model_name), force_onnx_cpu)
    else:
        model = init_jit_model(os.path.join(model_dir, 'silero_vad.jit'))
    utils = (get_speech_timestamps,
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -3,7 +3,7 @@ requires = ["hatchling"]
 build-backend = "hatchling.build"
 [project]
 name = "silero-vad"
-version = "5.1"
+version = "6.2.0"
 authors = [
  {name="Silero Team", email="hello@silero.ai"},
 ]
@@ -21,10 +21,14 @@ classifiers = [
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
+    "Programming Language :: Python :: 3.15",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Topic :: Scientific/Engineering",
 ]
 dependencies = [
+  "packaging",
  "torch>=1.12.0",
  "torchaudio>=0.12.0",
  "onnxruntime>=1.16.1",
@@ -33,3 +37,10 @@ dependencies = [
 [project.urls]
 Homepage = "https://github.com/snakers4/silero-vad"
 Issues = "https://github.com/snakers4/silero-vad/issues"
+
+[project.optional-dependencies]
+test = [
+    "pytest",
+    "soundfile",
+    "torch<2.9",
+]
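
The `packaging` dependency added above is what `utils_vad.py` uses to gate audio I/O on `torchaudio` 2.9. A plain string comparison would not work for that gate, because `"2.10"` sorts before `"2.9"` character by character. The sketch below illustrates the point with the standard library only; `version_tuple` and `needs_torchcodec_fallback` are hypothetical helpers, not part of the package, and unlike `packaging.version` this stand-in ignores pre-release ordering.

```python
from itertools import takewhile
from typing import Tuple


def version_tuple(v: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string.

    A local suffix such as '+cu121' is dropped, and parsing stops at the
    first component that does not start with a digit.
    """
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_torchcodec_fallback(torchaudio_version: str) -> bool:
    """Mirror the '>= 2.9' gate from the diff using tuple comparison."""
    return version_tuple(torchaudio_version) >= (2, 9)


# String comparison gets the ordering wrong; tuple comparison does not.
print("2.10.0" < "2.9")
print(version_tuple("2.10.0") < version_tuple("2.9"))
print(needs_torchcodec_fallback("2.8.0+cu121"))
```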
--- a/src/silero_vad/__init__.py
+++ b/src/silero_vad/__init__.py
@@ -9,4 +9,5 @@ from silero_vad.utils_vad import (get_speech_timestamps,
                                  save_audio,
                                  read_audio,
                                  VADIterator,
-                                  collect_chunks)
+                                  collect_chunks,
+                                  drop_chunks)
--- a/src/silero_vad/data/silero_vad.jit
+++ b/src/silero_vad/data/silero_vad.jit
--- a/src/silero_vad/data/silero_vad.onnx
+++ b/src/silero_vad/data/silero_vad.onnx
--- a/src/silero_vad/data/silero_vad_16k_op15.onnx
+++ b/src/silero_vad/data/silero_vad_16k_op15.onnx
--- a/src/silero_vad/model.py
+++ b/src/silero_vad/model.py
@@ -2,8 +2,19 @@ from .utils_vad import init_jit_model, OnnxWrapper
 import torch
 torch.set_num_threads(1)

-def load_silero_vad(onnx=False):
-    model_name = 'silero_vad.onnx' if onnx else 'silero_vad.jit'
+
+def load_silero_vad(onnx=False, opset_version=16):
+    available_ops = [15, 16]
+    if onnx and opset_version not in available_ops:
+        raise Exception(f'Available ONNX opset_version: {available_ops}')
+
+    if onnx:
+        if opset_version == 16:
+            model_name = 'silero_vad.onnx'
+        else:
+            model_name = f'silero_vad_16k_op{opset_version}.onnx'
+    else:
+        model_name = 'silero_vad.jit'
    package_path = "silero_vad.data"

    try:
@@ -18,7 +29,7 @@ def load_silero_vad(onnx=False):
            model_file_path = str(impresources.files(package_path).joinpath(model_name))

    if onnx:
-        model = OnnxWrapper(model_file_path, force_onnx_cpu=True)
+        model = OnnxWrapper(str(model_file_path), force_onnx_cpu=True)
    else:
        model = init_jit_model(model_file_path)
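
Both `hubconf.py` and `model.py` now map the `onnx` flag and `opset_version` to a model file name in the same way. That mapping can be sketched as one small function. This is an illustration only: `resolve_model_name` is a hypothetical name, and it raises `ValueError` where the diff raises a bare `Exception`.

```python
AVAILABLE_OPSETS = (15, 16)


def resolve_model_name(onnx: bool = False, opset_version: int = 16) -> str:
    """Return the bundled model file name for the requested backend."""
    # As in the diff, the opset is validated only when ONNX is requested.
    if onnx and opset_version not in AVAILABLE_OPSETS:
        raise ValueError(f"Available ONNX opset_version: {list(AVAILABLE_OPSETS)}")
    if not onnx:
        return "silero_vad.jit"
    if opset_version == 16:
        return "silero_vad.onnx"
    return f"silero_vad_16k_op{opset_version}.onnx"


print(resolve_model_name())
print(resolve_model_name(onnx=True))
print(resolve_model_name(onnx=True, opset_version=15))
```

Because the opset 15 file name contains `16k`, `OnnxWrapper` restricts that model to the 16000 Hz sampling rate.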

--- a/src/silero_vad/utils_vad.py
+++ b/src/silero_vad/utils_vad.py
@@ -2,6 +2,7 @@ import torch
 import torchaudio
 from typing import Callable, List
 import warnings
+from packaging import version

 languages = ['ru', 'en', 'de', 'es']

@@ -23,7 +24,11 @@ class OnnxWrapper():
            self.session = onnxruntime.InferenceSession(path, sess_options=opts)

        self.reset_states()
-        self.sample_rates = [8000, 16000]
+        if '16k' in path:
+            warnings.warn('This model supports only 16000 sampling rate!')
+            self.sample_rates = [16000]
+        else:
+            self.sample_rates = [8000, 16000]

    def _validate_input(self, x, sr: int):
        if x.dim() == 1:
@@ -130,40 +135,60 @@ class Validator():
        return outs


-def read_audio(path: str,
-               sampling_rate: int = 16000):
-    list_backends = torchaudio.list_audio_backends()
+def read_audio(path: str, sampling_rate: int = 16000) -> torch.Tensor:
+    ta_ver = version.parse(torchaudio.__version__)
+    if ta_ver < version.parse("2.9"):
+        try:
+            effects = [['channels', '1'],['rate', str(sampling_rate)]]
+            wav, sr = torchaudio.sox_effects.apply_effects_file(path, effects=effects)
+        except Exception:
+            wav, sr = torchaudio.load(path)
+    else:
+        try:
+            wav, sr = torchaudio.load(path)
+        except Exception:
+            try:
+                from torchcodec.decoders import AudioDecoder
+                samples = AudioDecoder(path).get_all_samples()
+                wav = samples.data
+                sr = samples.sample_rate
+            except ImportError:
+                raise RuntimeError(
+                    f"torchaudio version {torchaudio.__version__} requires torchcodec for audio I/O. "
+                    + "Install torchcodec or pin torchaudio < 2.9"
+                )

-    assert len(list_backends) > 0, 'The list of available backends is empty, please install backend manually. \
-                                    \n Recommendations: \n \tSox (UNIX OS) \n \tSoundfile (Windows OS, UNIX OS) \n \tffmpeg (Windows OS, UNIX OS)'
+    if wav.ndim > 1 and wav.size(0) > 1:
+        wav = wav.mean(dim=0, keepdim=True)

-    try:
-        effects = [
-            ['channels', '1'],
-            ['rate', str(sampling_rate)]
-        ]
+    if sr != sampling_rate:
+        wav = torchaudio.transforms.Resample(sr, sampling_rate)(wav)

-        wav, sr = torchaudio.sox_effects.apply_effects_file(path, effects=effects)
-    except:
-        wav, sr = torchaudio.load(path)
-
-        if wav.size(0) > 1:
-            wav = wav.mean(dim=0, keepdim=True)
-
-        if sr != sampling_rate:
-            transform = torchaudio.transforms.Resample(orig_freq=sr,
-                                                       new_freq=sampling_rate)
-            wav = transform(wav)
-            sr = sampling_rate
-
-    assert sr == sampling_rate
    return wav.squeeze(0)


-def save_audio(path: str,
-               tensor: torch.Tensor,
-               sampling_rate: int = 16000):
-    torchaudio.save(path, tensor.unsqueeze(0), sampling_rate, bits_per_sample=16)
+def save_audio(path: str, tensor: torch.Tensor, sampling_rate: int = 16000):
+    tensor = tensor.detach().cpu()
+    if tensor.ndim == 1:
+        tensor = tensor.unsqueeze(0)
+
+    ta_ver = version.parse(torchaudio.__version__)
+
+    try:
+        torchaudio.save(path, tensor, sampling_rate, bits_per_sample=16)
+    except Exception:
+        if ta_ver >= version.parse("2.9"):
+            try:
+                from torchcodec.encoders import AudioEncoder
+                encoder = AudioEncoder(tensor, sample_rate=sampling_rate)
+                encoder.to_file(path)
+            except ImportError:
+                raise RuntimeError(
+                    f"torchaudio version {torchaudio.__version__} requires torchcodec for saving. "
+                    + "Install torchcodec or pin torchaudio < 2.9"
+                )
+        else:
+            raise


 def init_jit_model(model_path: str,
@@ -193,10 +218,13 @@ def get_speech_timestamps(audio: torch.Tensor,
                          min_silence_duration_ms: int = 100,
                          speech_pad_ms: int = 30,
                          return_seconds: bool = False,
+                          time_resolution: int = 1,
                          visualize_probs: bool = False,
                          progress_tracking_callback: Callable[[float], None] = None,
                          neg_threshold: float = None,
-                          window_size_samples: int = 512,):
+                          window_size_samples: int = 512,
+                          min_silence_at_max_speech: int = 98,
+                          use_max_poss_sil_at_max_speech: bool = True):

    """
    This method is used for splitting long audios into speech chunks using silero VAD
@@ -220,7 +248,7 @@ def get_speech_timestamps(audio: torch.Tensor,

    max_speech_duration_s: int (default -  inf)
        Maximum duration of speech chunks in seconds
-        Chunks longer than max_speech_duration_s will be split at the timestamp of the last silence that lasts more than 100ms (if any), to prevent agressive cutting.
+        Chunks longer than max_speech_duration_s will be split at the timestamp of the last silence that lasts more than 100ms (if any), to prevent aggressive cutting.
        Otherwise, they will be split aggressively just before max_speech_duration_s.

    min_silence_duration_ms: int (default - 100 milliseconds)
@@ -232,6 +260,9 @@ def get_speech_timestamps(audio: torch.Tensor,
    return_seconds: bool (default - False)
        whether to return timestamps in seconds (default - samples)

+    time_resolution: int (default - 1)
+        time resolution of speech coordinates when requested as seconds
+
    visualize_probs: bool (default - False)
        whether to draw the probability histogram or not

@@ -241,6 +272,12 @@ def get_speech_timestamps(audio: torch.Tensor,
    neg_threshold: float (default = threshold - 0.15)
        Negative threshold (noise or exit threshold). If model's current state is SPEECH, values BELOW this value are considered as NON-SPEECH.

+    min_silence_at_max_speech: int (default - 98ms)
+        Minimum silence duration in ms which is used to avoid abrupt cuts when max_speech_duration_s is reached
+
+    use_max_poss_sil_at_max_speech: bool (default - True)
+        Whether to use the maximum possible silence at max_speech_duration_s or not. If not, the last silence is used.
+
    window_size_samples: int (default - 512 samples)
        !!! DEPRECATED, DOES NOTHING !!!

@@ -249,7 +286,6 @@ def get_speech_timestamps(audio: torch.Tensor,
    speeches: list of dicts
        list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
    """
-
    if not torch.is_tensor(audio):
        try:
            audio = torch.Tensor(audio)
@@ -280,7 +316,7 @@ def get_speech_timestamps(audio: torch.Tensor,
    speech_pad_samples = sampling_rate * speech_pad_ms / 1000
    max_speech_samples = sampling_rate * max_speech_duration_s - window_size_samples - 2 * speech_pad_samples
    min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
-    min_silence_samples_at_max_speech = sampling_rate * 98 / 1000
+    min_silence_samples_at_max_speech = sampling_rate * min_silence_at_max_speech / 1000

    audio_length_samples = len(audio)

@@ -291,7 +327,7 @@ def get_speech_timestamps(audio: torch.Tensor,
            chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
        speech_prob = model(chunk, sampling_rate).item()
        speech_probs.append(speech_prob)
-        # caculate progress and seng it to callback function
+        # calculate progress and send it to callback function
        progress = current_start_sample + window_size_samples
        if progress > audio_length_samples:
            progress = audio_length_samples
@@ -304,45 +340,76 @@ def get_speech_timestamps(audio: torch.Tensor,
    current_speech = {}

    if neg_threshold is None:
-        neg_threshold = threshold - 0.15
+        neg_threshold = max(threshold - 0.15, 0.01)
    temp_end = 0  # to save potential segment end (and tolerate some silence)
    prev_end = next_start = 0  # to save potential segment limits in case of maximum segment size reached
+    possible_ends = []

    for i, speech_prob in enumerate(speech_probs):
+        cur_sample = window_size_samples * i
+
+        # If speech returns after a temp_end, record candidate silence if long enough and clear temp_end
        if (speech_prob >= threshold) and temp_end:
+            sil_dur = cur_sample - temp_end
+            if sil_dur > min_silence_samples_at_max_speech:
+                possible_ends.append((temp_end, sil_dur))
            temp_end = 0
            if next_start < prev_end:
-                next_start = window_size_samples * i
+                next_start = cur_sample

+        # Start of speech
        if (speech_prob >= threshold) and not triggered:
            triggered = True
-            current_speech['start'] = window_size_samples * i
+            current_speech['start'] = cur_sample
            continue

-        if triggered and (window_size_samples * i) - current_speech['start'] > max_speech_samples:
-            if prev_end:
+        # Max speech length reached: decide where to cut
+        if triggered and (cur_sample - current_speech['start'] > max_speech_samples):
+            if use_max_poss_sil_at_max_speech and possible_ends:
+                prev_end, dur = max(possible_ends, key=lambda x: x[1])  # use the longest possible silence segment in the current speech chunk
                current_speech['end'] = prev_end
                speeches.append(current_speech)
                current_speech = {}
-                if next_start < prev_end:  # previously reached silence (< neg_thres) and is still not speech (< thres)
-                    triggered = False
-                else:
-                    current_speech['start'] = next_start
-                prev_end = next_start = temp_end = 0
-            else:
-                current_speech['end'] = window_size_samples * i
-                speeches.append(current_speech)
-                current_speech = {}
-                prev_end = next_start = temp_end = 0
-                triggered = False
-                continue
+                next_start = prev_end + dur

+                if next_start < prev_end + cur_sample:  # previously reached silence (< neg_thres) and is still not speech (< thres)
+                    current_speech['start'] = next_start
+                else:
+                    triggered = False
+                prev_end = next_start = temp_end = 0
+                possible_ends = []
+            else:
+                # Legacy max-speech cut (use_max_poss_sil_at_max_speech=False): prefer last valid silence (prev_end) if available
+                if prev_end:
+                    current_speech['end'] = prev_end
+                    speeches.append(current_speech)
+                    current_speech = {}
+                    if next_start < prev_end:
+                        triggered = False
+                    else:
+                        current_speech['start'] = next_start
+                    prev_end = next_start = temp_end = 0
+                    possible_ends = []
+                else:
+                    # No prev_end -> fallback to cutting at current sample
+                    current_speech['end'] = cur_sample
+                    speeches.append(current_speech)
+                    current_speech = {}
+                    prev_end = next_start = temp_end = 0
+                    triggered = False
+                    possible_ends = []
+                    continue
+
+        # Silence detection while in speech
        if (speech_prob < neg_threshold) and triggered:
            if not temp_end:
-                temp_end = window_size_samples * i
-            if ((window_size_samples * i) - temp_end) > min_silence_samples_at_max_speech:  # condition to avoid cutting in very short silence
+                temp_end = cur_sample
+            sil_dur_now = cur_sample - temp_end
+
+            if not use_max_poss_sil_at_max_speech and sil_dur_now > min_silence_samples_at_max_speech:  # condition to avoid cutting in very short silence
                prev_end = temp_end
-            if (window_size_samples * i) - temp_end < min_silence_samples:
+
+            if sil_dur_now < min_silence_samples:
                continue
            else:
                current_speech['end'] = temp_end
@@ -351,6 +418,7 @@ def get_speech_timestamps(audio: torch.Tensor,
                current_speech = {}
                prev_end = next_start = temp_end = 0
                triggered = False
+                possible_ends = []
                continue

    if current_speech and (audio_length_samples - current_speech['start']) > min_speech_samples:
@@ -372,9 +440,10 @@ def get_speech_timestamps(audio: torch.Tensor,
            speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))

    if return_seconds:
+        audio_length_seconds = audio_length_samples / sampling_rate
        for speech_dict in speeches:
-            speech_dict['start'] = round(speech_dict['start'] / sampling_rate, 1)
-            speech_dict['end'] = round(speech_dict['end'] / sampling_rate, 1)
+            speech_dict['start'] = max(round(speech_dict['start'] / sampling_rate, time_resolution), 0)
+            speech_dict['end'] = min(round(speech_dict['end'] / sampling_rate, time_resolution), audio_length_seconds)
    elif step > 1:
        for speech_dict in speeches:
            speech_dict['start'] *= step
@@ -435,13 +504,16 @@ class VADIterator:
        self.current_sample = 0

    @torch.no_grad()
-    def __call__(self, x, return_seconds=False):
+    def __call__(self, x, return_seconds=False, time_resolution: int = 1):
        """
        x: torch.Tensor
            audio chunk (see examples in repo)

        return_seconds: bool (default - False)
            whether to return timestamps in seconds (default - samples)
+
+        time_resolution: int (default - 1)
+            time resolution of speech coordinates when requested as seconds
        """

        if not torch.is_tensor(x):
@@ -461,7 +533,7 @@ class VADIterator:
        if (speech_prob >= self.threshold) and not self.triggered:
            self.triggered = True
            speech_start = max(0, self.current_sample - self.speech_pad_samples - window_size_samples)
-            return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, 1)}
+            return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, time_resolution)}

        if (speech_prob < self.threshold - 0.15) and self.triggered:
            if not self.temp_end:
@@ -472,24 +544,112 @@ class VADIterator:
                speech_end = self.temp_end + self.speech_pad_samples - window_size_samples
                self.temp_end = 0
                self.triggered = False
-                return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sampling_rate, 1)}
+                return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sampling_rate, time_resolution)}

        return None


 def collect_chunks(tss: List[dict],
-                   wav: torch.Tensor):
-    chunks = []
-    for i in tss:
-        chunks.append(wav[i['start']: i['end']])
+                   wav: torch.Tensor,
+                   seconds: bool = False,
+                   sampling_rate: int = None) -> torch.Tensor:
+    """Collect audio chunks from a longer audio clip
+
+    This method extracts audio chunks from an audio clip, using a list of
+    provided coordinates, and concatenates them together. Coordinates can be
+    passed either as sample numbers or in seconds, in which case the audio
+    sampling rate is also needed.
+
+    Parameters
+    ----------
+    tss: List[dict]
+        Coordinate list of the clips to collect from the audio.
+    wav: torch.Tensor, one dimensional
+        One dimensional float torch.Tensor, containing the audio to clip.
+    seconds: bool (default - False)
+        Whether input coordinates are passed as seconds or samples.
+    sampling_rate: int (default - None)
+        Input audio sampling rate. Required if seconds is True.
+
+    Returns
+    -------
+    torch.Tensor, one dimensional
+        One dimensional float torch.Tensor of the concatenated clipped audio
+        chunks.
+
+    Raises
+    ------
+    ValueError
+        Raised if sampling_rate is not provided when seconds is True.
+
+    """
+    if seconds and not sampling_rate:
+        raise ValueError('sampling_rate must be provided when seconds is True')
+
+    chunks = list()
+    _tss = _seconds_to_samples_tss(tss, sampling_rate) if seconds else tss
+
+    for i in _tss:
+        chunks.append(wav[i['start']:i['end']])
+
    return torch.cat(chunks)


 def drop_chunks(tss: List[dict],
-                wav: torch.Tensor):
-    chunks = []
+                wav: torch.Tensor,
+                seconds: bool = False,
+                sampling_rate: int = None) -> torch.Tensor:
+    """Drop audio chunks from a longer audio clip
+
+    This method extracts audio chunks from an audio clip, using a list of
+    provided coordinates, and drops them. Coordinates can be passed either as
+    sample numbers or in seconds, in which case the audio sampling rate is also
+    needed.
+
+    Parameters
+    ----------
+    tss: List[dict]
+        Coordinate list of the clips to drop from the audio.
+    wav: torch.Tensor, one dimensional
+        One dimensional float torch.Tensor, containing the audio to clip.
+    seconds: bool (default - False)
+        Whether input coordinates are passed as seconds or samples.
+    sampling_rate: int (default - None)
+        Input audio sampling rate. Required if seconds is True.
+
+    Returns
+    -------
+    torch.Tensor, one dimensional
+        One dimensional float torch.Tensor of the input audio minus the dropped
+        chunks.
+
+    Raises
+    ------
+    ValueError
+        Raised if sampling_rate is not provided when seconds is True.
+
+    """
+    if seconds and not sampling_rate:
+        raise ValueError('sampling_rate must be provided when seconds is True')
+
+    chunks = list()
    cur_start = 0
-    for i in tss:
+
+    _tss = _seconds_to_samples_tss(tss, sampling_rate) if seconds else tss
+
+    for i in _tss:
        chunks.append((wav[cur_start: i['start']]))
        cur_start = i['end']
+
+    chunks.append(wav[cur_start:])
+
    return torch.cat(chunks)
+
+
+def _seconds_to_samples_tss(tss: List[dict], sampling_rate: int) -> List[dict]:
+    """Convert coordinates expressed in seconds to sample coordinates.
+    """
+    return [{
+        'start': round(crd['start'] * sampling_rate),
+        'end': round(crd['end'] * sampling_rate)
+    } for crd in tss]
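
`collect_chunks` and `drop_chunks` are complementary: one keeps the listed spans, the other keeps everything else, including the tail after the last span. A minimal sketch of both on plain Python lists, so it runs without torch. The helper names and the toy sampling rate of 2 are hypothetical, and this sketch multiplies by the sampling rate before rounding so that sub-second coordinates survive the conversion.

```python
from typing import Dict, List


def seconds_to_samples(tss: List[Dict[str, float]],
                       sampling_rate: int) -> List[Dict[str, int]]:
    """Convert second-based coordinates to sample indices."""
    return [{"start": round(c["start"] * sampling_rate),
             "end": round(c["end"] * sampling_rate)} for c in tss]


def collect(tss: List[Dict[str, int]], wav: List[int]) -> List[int]:
    """Concatenate the samples inside every listed span."""
    out: List[int] = []
    for c in tss:
        out.extend(wav[c["start"]:c["end"]])
    return out


def drop(tss: List[Dict[str, int]], wav: List[int]) -> List[int]:
    """Concatenate the samples outside the listed spans, tail included."""
    out: List[int] = []
    cur_start = 0
    for c in tss:
        out.extend(wav[cur_start:c["start"]])
        cur_start = c["end"]
    out.extend(wav[cur_start:])
    return out


wav = list(range(10))  # ten toy "samples"
spans = seconds_to_samples([{"start": 0.5, "end": 1.5},
                            {"start": 3.0, "end": 4.0}], sampling_rate=2)
print(spans)
print(collect(spans, wav))
print(drop(spans, wav))
```

For sorted, non-overlapping spans the two results together contain every input sample exactly once.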
--- a/tests/data/test.mp3
+++ b/tests/data/test.mp3
--- a/tests/data/test.opus
+++ b/tests/data/test.opus
--- a/tests/data/test.wav
+++ b/tests/data/test.wav
--- a/tests/test_basic.py
+++ b/tests/test_basic.py
@@ -0,0 +1,22 @@
+from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
+import torch
+torch.set_num_threads(1)
+
+def test_jit_model():
+    model = load_silero_vad(onnx=False)
+    for path in ["tests/data/test.wav", "tests/data/test.opus", "tests/data/test.mp3"]:
+        audio = read_audio(path, sampling_rate=16000)
+        speech_timestamps = get_speech_timestamps(audio, model, visualize_probs=False, return_seconds=True)
+        assert speech_timestamps is not None
+        out = model.audio_forward(audio, sr=16000)
+        assert out is not None
+
+def test_onnx_model():
+    model = load_silero_vad(onnx=True)
+    for path in ["tests/data/test.wav", "tests/data/test.opus", "tests/data/test.mp3"]:
+        audio = read_audio(path, sampling_rate=16000)
+        speech_timestamps = get_speech_timestamps(audio, model, visualize_probs=False, return_seconds=True)
+        assert speech_timestamps is not None
+
+        out = model.audio_forward(audio, sr=16000)
+        assert out is not None
--- a/tuning/utils.py
+++ b/tuning/utils.py
@@ -118,8 +118,6 @@ class SileroVadDataset(Dataset):

        assert len(gt) == len(wav) / self.num_samples

-        mask[gt == 0]
-
        return wav, gt, mask

    def get_ground_truth_annotated(self, annotation, audio_length_samples):
@@ -240,6 +238,7 @@ def train(config,

            loss = criterion(stacked, targets)
            loss = (loss * masks).mean()
+            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            losses.update(loss.item(), masks.numel())
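Note on the optimizer.zero_grad() hunk: in PyTorch, gradients accumulate across backward() calls until they are cleared, so a loop that never zeroes them steps with the sum of all past gradients. The toy below is a sketch under stated assumptions, not the training code: it uses a single scalar weight, a hand-written gradient for the loss (w - 3)^2, and a hypothetical `train` helper whose `grad` variable only mimics that accumulation behaviour.

```python
def train(steps: int, lr: float, zero_grad: bool) -> float:
    """Minimise (w - 3)^2 with plain gradient descent on one scalar weight."""
    w = 0.0
    grad = 0.0  # stands in for param.grad, which accumulates between backward() calls
    for _ in range(steps):
        if zero_grad:
            grad = 0.0             # optimizer.zero_grad()
        grad += 2.0 * (w - 3.0)    # loss.backward(): add d/dw of (w - 3)^2
        w -= lr * grad             # optimizer.step()
    return w


fresh = train(steps=50, lr=0.1, zero_grad=True)
stale = train(steps=50, lr=0.1, zero_grad=False)
print(f"with zero_grad:    w = {fresh:.4f}")
print(f"without zero_grad: w = {stale:.4f}")
```

With the gradient cleared on every step, w settles close to the minimum at 3; with stale gradients carried over, the update overshoots and w ends up further from it.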
Author	SHA1	Message	Date
Dimitrii Voronin	be95df9152	Merge pull request #719 from snakers4/adamnsandle Adamnsandle	2025-11-06 11:25:49 +03:00
adamnsandle	ec56fe50a5	fx workflow	2025-11-06 08:18:46 +00:00
adamnsandle	dea5980320	fx workflow	2025-11-06 08:04:02 +00:00
adamnsandle	90d9ce7695	fx workflow	2025-11-06 07:49:44 +00:00
adamnsandle	c56dbb11ac	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2025-11-06 07:36:38 +00:00
adamnsandle	9b686893ad	fx test workflow	2025-11-06 07:36:23 +00:00
Dimitrii Voronin	6979fbd535	Merge pull request #717 from snakers4/adamnsandle v6.2.0 release	2025-11-06 10:28:00 +03:00
adamnsandle	1cff663de5	fix version to 6.2.0	2025-11-06 07:27:07 +00:00
adamnsandle	bfdc019302	add v6.2 model	2025-11-06 07:23:43 +00:00
Alexander Veysov	c0c0ffa0c5	Merge pull request #714 from Purfview/patch-4 Fix type hint for min_silence_at_max_speech (float -> int)	2025-11-05 08:44:00 +03:00
Alexander Veysov	3f0c9ead54	Update pyproject.toml	2025-11-05 08:38:07 +03:00
Purfview	556a442942	Fix type hint for min_silence_at_max_speech (float -> int)	2025-11-04 08:30:01 +00:00
Dimitrii Voronin	9623ce72da	Merge pull request #710 from Purfview/patch-3 Fixes and refines - use_max_poss_sil_at_max_speech arg	2025-10-29 12:36:58 +03:00
Dimitrii Voronin	b6dd0599fc	Merge pull request #712 from snakers4/adamnsandle drop_chunks fix	2025-10-29 12:16:10 +03:00
adamnsandle	d8f88c9157	drop_chunks fix	2025-10-29 09:14:45 +00:00
Purfview	b15a216b47	Reword a comment	2025-10-24 10:30:34 +01:00
Purfview	2389039408	Fixes and refines - use_max_poss_sil_at_max_speech arg Removed redundant "if temp_end != 0:" check. Multiple "window_size_samples * i" - assigned to a variable. Restored the previous functionality (which was broken) when use_max_poss_sil_at_max_speech=False. @shashank14k was your https://github.com/snakers4/silero-vad/pull/664 PR still WIP when it was merged? Anyway, please test if use_max_poss_sil_at_max_speech=True behaviour is same, and "False" is same as before your PR.	2025-10-24 07:46:41 +01:00
Alexander Veysov	df22fcaec8	Merge pull request #708 from Purfview/patch-2 Removes redundant hop_size_samples variable	2025-10-23 15:58:00 +03:00
Purfview	81e8a48e25	Removes redundant hop_size_samples variable Remove redundant hop_size_samples variable	2025-10-23 05:23:18 +01:00
Alexander Veysov	a14a23faa7	Merge pull request #707 from Purfview/patch-1 Fixes few typos	2025-10-23 06:35:58 +03:00
Purfview	a30b5843c1	Fixes various typos	2025-10-23 04:02:13 +01:00
Dimitrii Voronin	a66c890188	Merge pull request #704 from snakers4/adamnsandle resolve torchaudio 2.9 utils	2025-10-17 15:50:20 +03:00
adamnsandle	77c91a91fa	resolve torchaudio 2.9 utils	2025-10-17 12:35:40 +00:00
Alexander Veysov	33093c6f1b	Update utils.py	2025-10-14 14:51:23 +03:00
Alexander Veysov	dc0b62e1e4	Merge pull request #699 from JiJiJiang/master fix bug in tuning/utils.py: add optimizer.zero_grad() before loss.bac…	2025-10-14 14:50:58 +03:00
Hongji Wang	64fb49e1c8	fix bug in tuning/utils.py: add optimizer.zero_grad() before loss.backward()	2025-10-13 20:50:29 +08:00
Alexander Veysov	55ba6e2825	Merge pull request #697 from VvvvvGH/java-example-v6 Update java example for v6	2025-10-11 11:41:15 +03:00
GH	b90f8c012f	Update SlieroVadOnnxModel.java	2025-10-11 16:21:57 +08:00
GH	25a778c798	Update SlieroVadDetector.java	2025-10-11 16:21:45 +08:00
GH	3d860e6ace	Update App.java	2025-10-11 16:21:32 +08:00
GH	f5ea01bfda	Update pom.xml	2025-10-11 16:21:03 +08:00
Alexander Veysov	dd651a54a5	Merge pull request #695 from mpariente/master Remove ipdb and raise error directly in get_speech_timestamps	2025-10-11 08:07:18 +03:00
Manuel Pariente	f1175c902f	Remove ipdb and raise error directly	2025-10-10 10:46:44 +02:00
Alexander Veysov	7819fd911b	Update README.md	2025-10-09 17:34:33 +03:00
Dimitrii Voronin	fba061dc55	Merge pull request #677 from snakers4/adamnsandle get rid of hop_size_ratio	2025-08-26 09:54:35 +03:00
adamnsandle	11631356a2	get rid of hop_size_ratio	2025-08-26 06:53:53 +00:00
Dimitrii Voronin	34dea51680	Merge pull request #664 from shashank14k/master Adding additional params to get_speech_timestamps	2025-08-26 09:50:44 +03:00
Dimitrii Voronin	51fd43130a	Update README.md	2025-08-25 19:30:20 +03:00
Dimitrii Voronin	3080062489	Update README.md	2025-08-25 18:07:06 +03:00
Dimitrii Voronin	f974f2d6bc	Merge pull request #676 from snakers4/adamnsandle Adamnsandle	2025-08-25 17:59:19 +03:00
adamnsandle	f1886d9088	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2025-08-25 14:57:11 +00:00
adamnsandle	4c00cd14be	add v6 models	2025-08-25 14:56:50 +00:00
Dimitrii Voronin	5d70880844	Merge pull request #675 from snakers4/adamnsandle Adamnsandle	2025-08-25 17:28:38 +03:00
adamnsandle	a16f3ed079	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2025-08-25 14:27:26 +00:00
adamnsandle	b0fbf4bec6	fx	2025-08-25 14:27:15 +00:00
Dimitrii Voronin	ab02267584	Merge pull request #674 from snakers4/adamnsandle Adamnsandle	2025-08-25 17:09:07 +03:00
adamnsandle	485a7d91b0	git push Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2025-08-25 14:08:15 +00:00
adamnsandle	1da76acfc3	fx	2025-08-25 14:07:32 +00:00
Dimitrii Voronin	3c70b587e8	Merge pull request #673 from snakers4/adamnsandle Adamnsandle	2025-08-25 16:56:19 +03:00
adamnsandle	7aff370d68	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2025-08-25 13:55:30 +00:00
adamnsandle	931eddfdab	fx	2025-08-25 13:55:24 +00:00
Dimitrii Voronin	6143b9a5d9	Merge pull request #672 from snakers4/adamnsandle fx	2025-08-25 16:46:24 +03:00
adamnsandle	8ca8cf7d9b	fx	2025-08-25 13:45:36 +00:00
Dimitrii Voronin	ad0fdbe4ac	Merge pull request #671 from snakers4/adamnsandle Adamnsandle	2025-08-25 16:40:10 +03:00
adamnsandle	06806eb70b	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2025-08-25 13:39:32 +00:00
adamnsandle	c90e1603c5	fx	2025-08-25 13:39:15 +00:00
Dimitrii Voronin	023d3a36f0	Merge pull request #670 from snakers4/adamnsandle fx	2025-08-25 16:25:39 +03:00
adamnsandle	aa2a66cf46	fx	2025-08-25 13:24:43 +00:00
Dimitrii Voronin	b1cd34aae2	Merge pull request #669 from snakers4/adamnsandle Adamnsandle	2025-08-25 16:17:17 +03:00
adamnsandle	50be3744fe	fix	2025-08-25 13:08:02 +00:00
adamnsandle	fce776f872	fix workflow	2025-08-25 12:59:58 +00:00
adamnsandle	fbddc91a5d	initial autotest commit	2025-08-25 12:54:47 +00:00
shashank14k	bbf22a0064	Added params for hop_size, and min_silence_at_max speech to cut at a possible silence when max_dur reached to avoid abrupt cuts	2025-07-25 20:51:40 +05:30
Alexander Veysov	94811cbe12	Merge pull request #656 from davidrs/patch-1 Surface drop_chunks in init	2025-06-11 07:45:36 +03:00
David Rust-Smith	22a2362b4c	Surface drop_chunks in init	2025-06-10 11:36:10 -07:00
Dimitrii Voronin	0dd45f0bcd	Merge pull request #626 from b3by/feature/process_chunks_in_seconds Use second coordinates for audio concatenation in collect_chunks and drop_chunks	2025-03-24 19:02:56 +03:00
Dimitrii Voronin	feba8cd5c4	Merge pull request #627 from b3by/feature/time_coordinates_resolution Specify time resolution when returning speech coordinates in seconds	2025-03-24 18:59:25 +03:00
Antonio Bevilacqua	6622e562e4	time resolution can be specified when coordinates are returned in seconds	2025-03-24 08:53:28 +01:00
Antonio Bevilacqua	d5625d5c38	added audio concatenation for collect_chunks and drop_chunks based on second coordinates	2025-03-21 13:06:59 +01:00
Alexander Veysov	cd92290a15	Merge pull request #605 from OJRYK/fix/cpp-vad-context Fix/cpp vad context	2025-02-17 11:01:04 +03:00
Ojuro Yokoyama	33a9d190fe	Update wav.h	2025-02-17 16:03:42 +09:00
Ojuro Yokoyama	7440bc4689	Update silero-vad-onnx.cpp I fixed bug of silero-vad-onnx.cpp	2025-02-17 16:02:24 +09:00
Alexander Veysov	10e7e8a8bc	Merge pull request #601 from kiwamizamurai/master Add CITATION.cff file for proper citation	2025-02-11 08:42:10 +03:00
きわみざむらい	5a5b662496	Create CITATION.cff	2025-02-11 08:54:16 +09:00
Alexander Veysov	9060f664f2	Merge pull request #591 from qwbarch/master Add haskell example	2024-12-26 19:05:13 +03:00
qwbarch	94271e9096	Add haskell example	2024-12-26 11:18:10 -05:00
Dimitrii Voronin	3f9fffc261	Merge pull request #581 from snakers4/adamnsandle fx negative ths bug	2024-11-25 16:55:38 +03:00
adamnsandle	eaf633ec9d	fx negative ths bug	2024-11-25 13:54:46 +00:00
Alexander Veysov	cff5eb2980	Merge pull request #578 from NathanJHLee/add-torch-cpp Add cpp source based on libtorch	2024-11-22 11:26:49 +03:00
Dimitrii Voronin	f356a8081a	Merge pull request #579 from snakers4/adamnsandle fx https://github.com/snakers4/silero-vad/issues/576	2024-11-22 11:18:26 +03:00
adamnsandle	782e30d28f	fx https://github.com/snakers4/silero-vad/issues/576	2024-11-22 08:17:25 +00:00
Nathan Lee	caee535cf6	ReadMe v4	2024-11-22 06:48:27 +00:00
Nathan Lee	8ab5be005f	ReadMe v3	2024-11-22 06:46:28 +00:00
Nathan Lee	9f67a54e87	ReadMe v2	2024-11-22 06:42:20 +00:00
Nathan Lee	c8df1dee3f	modified Readme	2024-11-22 06:35:16 +00:00
Nathan Lee	0189ebd8af	Changed some source.	2024-11-22 06:21:49 +00:00
Nathan Lee	05e380c1de	add c++ inference based on libtorch	2024-11-22 00:10:13 +00:00
Alexander Veysov	93b9782f28	Merge pull request #573 from snakers4/adamnsandle Adamnsandle	2024-11-13 12:32:55 +03:00
adamnsandle	d2ab7c254e	add just 16k model	2024-11-13 08:53:27 +00:00
adamnsandle	6217b08bbb	add other opsets	2024-11-12 08:25:06 +00:00
adamnsandle	d53ba1ea11	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2024-11-12 08:19:54 +00:00
Alexander Veysov	102e6d0962	Add downloads shield	2024-11-07 14:40:33 +03:00
Alexander Veysov	e531cd3462	Update README.md	2024-10-21 10:22:02 +03:00
Alexander Veysov	fd41da0b15	Merge pull request #553 from EarningsCall/master Improve documentation.	2024-10-12 18:25:46 +03:00
EarningsCall	9db72c35bd	Update README.md update again	2024-10-12 09:23:29 -05:00
EarningsCall	867a067bee	Update README.md I assume most people want seconds, so it's useful to show example to return seconds in README file.	2024-10-12 09:22:39 -05:00
Alexander Veysov	2c43391b17	Update README.md	2024-10-09 12:56:22 +03:00
Alexander Veysov	6478567951	Update pyproject.toml	2024-10-09 12:49:27 +03:00
adamnsandle	add6e3028e	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2024-10-09 09:48:51 +00:00
adamnsandle	e7025ed8c5	5.1.1 tag	2024-10-09 09:48:37 +00:00
Alexander Veysov	35d601adc6	Update pyproject.toml	2024-10-09 12:47:08 +03:00
Dimitrii Voronin	032ca21a70	Merge pull request #549 from snakers4/adamnsandle Adamnsandle	2024-10-09 12:32:09 +03:00
adamnsandle	001d57d6ff	fx dependencies	2024-10-09 09:26:39 +00:00
adamnsandle	6e6da04e7a	fix pyaudio streaming example	2024-10-09 08:49:39 +00:00
Alexander Veysov	9c1eff9169	Delete files/real_time_example.mp4	2024-10-09 10:10:03 +03:00
Alexander Veysov	36b759d053	Add files via upload	2024-10-09 10:02:04 +03:00
Dimitrii Voronin	1a7499607a	Merge pull request #543 from snakers4/adamnsandle Adamnsandle	2024-09-24 15:19:30 +03:00
Alexander Veysov	87451b059f	Update README.md	2024-09-24 15:16:18 +03:00
adamnsandle	d23867da10	fx parallel example	2024-09-24 12:03:07 +00:00
adamnsandle	2043282182	Merge branch 'master' of github.com:snakers4/silero-vad into adamnsandle	2024-09-24 12:02:00 +00:00
adamnsandle	fa8036ae1c	fx old examples	2024-09-24 12:01:47 +00:00