FastRTC POC
A simple POC of a fast, real-time voice chat application using FastAPI and FastRTC, by rohanprichard. I wanted to make one as an example with a more production-ready stack, rather than just Gradio.
Setup
- Set your API keys in an `.env` file based on the `.env.example` file

- Create a virtual environment and install the dependencies

  ```bash
  python3 -m venv env
  source env/bin/activate
  pip install -r requirements.txt
  ```

- Run the server

  ```bash
  ./run.sh
  ```

- Navigate into the frontend directory in another terminal

  ```bash
  cd frontend/fastrtc-demo
  ```

- Run the frontend

  ```bash
  npm install
  npm run dev
  ```

- Go to the URL and click the microphone icon to start chatting!

- Reset chats by clicking the trash button on the bottom right
Notes
You can skip installing the TTS and STT dependencies by removing `[tts, stt]` from the specifier in the `requirements.txt` file.
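As a rough illustration only (check the actual file for the exact specifier and extras; the `vad` extra below is an assumption), the relevant line might look something like:

```
fastrtc[vad, stt, tts]   # full install with speech-to-text and text-to-speech extras
# fastrtc[vad]           # lighter install without the TTS/STT extras
```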
- The STT is currently using the ElevenLabs API.
- The LLM is currently using the OpenAI API.
- The TTS is currently using the ElevenLabs API.
- The VAD is currently using the Silero VAD model.
- You may need to install ffmpeg if you get errors during STT.
The prompt lives in the `backend/server.py` file and can be modified as you like.
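For orientation, here is a minimal sketch of how an STT → LLM → TTS loop is typically wired with FastRTC on top of FastAPI. This is not the repo's actual `backend/server.py`; the `transcribe`, `chat`, and `synthesize` helpers below are hypothetical stand-ins for the ElevenLabs and OpenAI calls.

```python
# Hedged sketch only: transcribe/chat/synthesize are hypothetical placeholders,
# not the helpers used in this repo's backend/server.py.
from fastapi import FastAPI
from fastrtc import ReplyOnPause, Stream

SYSTEM_PROMPT = "You are a helpful voice assistant."  # the prompt you would edit

def transcribe(audio):
    sample_rate, frames = audio          # audio arrives as (sample_rate, numpy array)
    return "<transcript>"                # ElevenLabs STT call would go here

def chat(system_prompt, user_text):
    return "<assistant reply>"           # OpenAI chat completion call would go here

def synthesize(text):
    yield from ()                        # ElevenLabs TTS call would stream audio chunks here

def respond(audio):
    # ReplyOnPause invokes this once the Silero VAD decides the user has paused
    text = transcribe(audio)
    reply = chat(SYSTEM_PROMPT, text)
    yield from synthesize(reply)         # stream synthesized audio back to the browser

app = FastAPI()
stream = Stream(handler=ReplyOnPause(respond), modality="audio", mode="send-receive")
stream.mount(app)                        # exposes the WebRTC endpoints on the FastAPI app
```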
Audio Parameters
AlgoOptions
- audio_chunk_duration: Length of audio chunks in seconds. Smaller values allow for faster processing but may be less accurate.
- started_talking_threshold: If a chunk has more than this many seconds of speech, the system considers that the user has started talking.
- speech_threshold: After the user has started speaking, if a chunk has less than this many seconds of speech, the system considers that the user has paused.
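As a rough sketch of how these map to code (the values below are illustrative, not the defaults this POC uses):

```python
from fastrtc import AlgoOptions

algo_options = AlgoOptions(
    audio_chunk_duration=0.6,        # seconds of audio per analysed chunk
    started_talking_threshold=0.2,   # at least this much speech in a chunk => user started talking
    speech_threshold=0.1,            # less speech than this in a chunk => user has paused
)
```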
SileroVadOptions
- threshold: Speech probability threshold (0.0-1.0). Values above this are considered speech. Higher values are more strict.
- min_speech_duration_ms: Speech segments shorter than this (in milliseconds) are filtered out.
- min_silence_duration_ms: The system waits for this duration of silence (in milliseconds) before considering speech to be finished.
- speech_pad_ms: Padding added to both ends of detected speech segments to prevent cutting off words.
- max_speech_duration_s: Maximum allowed duration for a speech segment in seconds. Prevents indefinite listening.
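And similarly for the VAD options, plus how both objects are handed to `ReplyOnPause` (again illustrative values; `respond` and `algo_options` come from the earlier sketches):

```python
from fastrtc import ReplyOnPause, SileroVadOptions

vad_options = SileroVadOptions(
    threshold=0.5,                  # speech probability cut-off
    min_speech_duration_ms=250,     # drop detections shorter than this
    min_silence_duration_ms=800,    # silence needed before speech counts as finished
    speech_pad_ms=200,              # padding around detected speech
    max_speech_duration_s=30,       # hard cap on a single speech segment
)

handler = ReplyOnPause(
    respond,                        # handler function from the earlier sketch
    algo_options=algo_options,      # AlgoOptions from the previous snippet
    model_options=vad_options,
)
```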
Tuning Recommendations
- If the AI interrupts you too early:
  - Increase `min_silence_duration_ms`
  - Increase `speech_threshold`
  - Increase `speech_pad_ms`
- If the AI is slow to respond after you finish speaking:
  - Decrease `min_silence_duration_ms`
  - Decrease `speech_threshold`
- If the system fails to detect some speech:
  - Lower the `threshold` value
  - Decrease `started_talking_threshold`
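For instance, a more patient configuration along the lines of the first recommendation (keeping the assistant from cutting you off) might look like this; the numbers are illustrative starting points, not tested values:

```python
from fastrtc import AlgoOptions, SileroVadOptions

patient_algo = AlgoOptions(
    audio_chunk_duration=0.6,
    started_talking_threshold=0.2,
    speech_threshold=0.2,            # raised: require a clearer pause before replying
)

patient_vad = SileroVadOptions(
    threshold=0.5,
    min_speech_duration_ms=250,
    min_silence_duration_ms=1500,    # raised: wait longer before treating speech as finished
    speech_pad_ms=400,               # raised: keep more audio around detected speech
    max_speech_duration_s=30,
)
```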
Credits:
Credit for the UI components goes to Shadcn, Aceternity UI and Kokonut UI.