# FastRTC POC

A simple proof of concept (POC) for a fast, real-time voice chat application built with FastAPI and FastRTC, by [rohanprichard](https://github.com/rohanprichard). I wanted an example built on a more production-ready stack than Gradio alone.

## Setup

1. Set your API keys in a `.env` file based on the `.env.example` file.
2. Create a virtual environment and install the dependencies:

```bash
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

3. Run the server:

```bash
./run.sh
```

4. In another terminal, navigate into the frontend directory:

```bash
cd frontend/fastrtc-demo
```

5. Install the frontend dependencies and run the frontend:

```bash
npm install
npm run dev
```

6. Open the URL printed by the frontend dev server and click the microphone icon to start chatting!

7. To reset the chat, click the trash button at the bottom right.

## Notes

You can skip installing the TTS and STT requirements by removing the `[tts, stt]` extras from the package specifier in the `requirements.txt` file.

- The STT currently uses the ElevenLabs API.
- The LLM currently uses the OpenAI API.
- The TTS currently uses the ElevenLabs API.
- The VAD currently uses the Silero VAD model.
- If you get errors in STT, you may need to install ffmpeg.

The prompt is defined in the `backend/server.py` file; edit it as you like.

### Audio Parameters

#### AlgoOptions

- **audio_chunk_duration**: Length of audio chunks in seconds. Smaller values allow for faster processing but may be less accurate.
- **started_talking_threshold**: If a chunk has more than this many seconds of speech, the system considers that the user has started talking.
- **speech_threshold**: After the user has started speaking, if a chunk has less than this many seconds of speech, the system considers that the user has paused.
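
The two thresholds above can be read as a small state machine over audio chunks. The sketch below illustrates that rule only and is not FastRTC's own code: `AlgoOptionsSketch` and `detect_pause` are hypothetical names, the option values are illustrative, and the per-chunk speech durations are assumed to come from the VAD.

```python
from dataclasses import dataclass


@dataclass
class AlgoOptionsSketch:
    """Hypothetical stand-in for AlgoOptions; values are illustrative."""
    audio_chunk_duration: float = 0.6
    started_talking_threshold: float = 0.2
    speech_threshold: float = 0.1


def detect_pause(speech_seconds_per_chunk, options):
    """Return the index of the chunk where a pause is detected, or None.

    Each entry is the number of seconds of speech found in one chunk.
    """
    started_talking = False
    for index, speech_seconds in enumerate(speech_seconds_per_chunk):
        if not started_talking:
            # Wait for a chunk with enough speech to count as "started talking".
            if speech_seconds > options.started_talking_threshold:
                started_talking = True
        elif speech_seconds < options.speech_threshold:
            # After talking has started, a nearly silent chunk counts as a pause.
            return index
    return None


chunks = [0.0, 0.05, 0.5, 0.6, 0.4, 0.02]
print(detect_pause(chunks, AlgoOptionsSketch()))  # 5
```

In this sketch, raising `speech_threshold` makes a pause trigger sooner, because more chunks fall below it.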

#### SileroVadOptions

- **threshold**: Speech probability threshold (0.0-1.0). Values above this are considered speech. Higher values are more strict.
- **min_speech_duration_ms**: Speech segments shorter than this (in milliseconds) are filtered out.
- **min_silence_duration_ms**: The system waits for this duration of silence (in milliseconds) before considering speech to be finished.
- **speech_pad_ms**: Padding added to both ends of detected speech segments to prevent cutting off words.
- **max_speech_duration_s**: Maximum allowed duration for a speech segment in seconds. Prevents indefinite listening.
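
To make the interaction between these options concrete, here is a minimal sketch that turns per-frame speech probabilities into padded segments. It is an assumption-laden illustration, not Silero VAD's implementation: `VadOptionsSketch` and `speech_segments` are hypothetical names, the probabilities are made up, and `max_speech_duration_s` is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class VadOptionsSketch:
    """Hypothetical stand-in for SileroVadOptions; values are illustrative."""
    threshold: float = 0.5
    min_speech_duration_ms: int = 200
    min_silence_duration_ms: int = 300
    speech_pad_ms: int = 100


def speech_segments(probabilities, frame_ms, options):
    """Return padded (start_ms, end_ms) speech segments."""
    segments, start, silence_ms = [], None, 0
    for i, probability in enumerate(probabilities):
        t = i * frame_ms
        if probability > options.threshold:
            if start is None:
                start = t  # first speech frame opens a segment
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= options.min_silence_duration_ms:
                # Enough silence: close the segment where the silence began.
                segments.append((start, t + frame_ms - silence_ms))
                start, silence_ms = None, 0
    total_ms = len(probabilities) * frame_ms
    if start is not None:
        segments.append((start, total_ms - silence_ms))
    pad = options.speech_pad_ms
    # Drop segments that are too short, then pad the rest within bounds.
    kept = [(s, e) for s, e in segments if e - s >= options.min_speech_duration_ms]
    return [(max(0, s - pad), min(total_ms, e + pad)) for s, e in kept]


options = VadOptionsSketch()
probabilities = [0.1, 0.9, 0.9, 0.9, 0.2, 0.8, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1]
segments = speech_segments(probabilities, 100, options)
print(segments)  # [(0, 700)]
```

The 100 ms dip at frame 4 is bridged because it is shorter than `min_silence_duration_ms`, while the isolated 100 ms burst at frame 9 is dropped because it is shorter than `min_speech_duration_ms`.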

### Tuning Recommendations

- If the AI interrupts you too early:
  - Increase `min_silence_duration_ms`
  - Decrease `speech_threshold` (a chunk then needs less speech to count as continued talking)
  - Increase `speech_pad_ms`

- If the AI is slow to respond after you finish speaking:
  - Decrease `min_silence_duration_ms`
  - Increase `speech_threshold` (a pause is then detected more readily)

- If the system fails to detect some speech:
  - Lower the `threshold` value
  - Decrease `started_talking_threshold`
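
The trade-off behind `min_silence_duration_ms` can be sketched with a toy timeline. This is a simplified illustration under stated assumptions, not the library's logic: `end_of_speech_times` is a hypothetical helper, and each frame is assumed to be already classified as speech (1) or silence (0).

```python
def end_of_speech_times(frames, frame_ms, min_silence_duration_ms):
    """Return the times (ms) at which speech would be considered finished."""
    decisions = []
    silence_ms = 0
    in_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            in_speech = True
            silence_ms = 0
        elif in_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_duration_ms:
                # Silence has lasted long enough: decide the user is done.
                decisions.append((i + 1) * frame_ms)
                in_speech = False
                silence_ms = 0
    return decisions


# 100 ms frames: talk, pause for 500 ms mid-sentence, talk again, then stop.
frames = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(end_of_speech_times(frames, 100, 300))  # [600, 1300]
print(end_of_speech_times(frames, 100, 800))  # [1800]
```

With 300 ms the mid-sentence pause is treated as the end of the turn (an early interruption), while 800 ms keeps the sentence whole at the cost of a longer wait after the user stops.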

## Credits

Credit for the UI components goes to Shadcn, Aceternity UI and Kokonut UI.