FastRTC POC
A simple POC of a fast, real-time voice chat application using FastAPI and FastRTC, by rohanprichard. I wanted to make one as an example with a more production-ready stack, rather than just Gradio.
Setup
- Set your API keys in an `.env` file based on the `.env.example` file

- Create a virtual environment and install the dependencies

  ```bash
  python3 -m venv env
  source env/bin/activate
  pip install -r requirements.txt
  ```

- Run the server

  ```bash
  ./run.sh
  ```

- Navigate into the frontend directory in another terminal

  ```bash
  cd frontend/fastrtc-demo
  ```

- Run the frontend

  ```bash
  npm install
  npm run dev
  ```

- Go to the URL and click the microphone icon to start chatting!

- Reset chats by clicking the trash button on the bottom right
Notes
You can skip installing the TTS and STT dependencies by removing `[tts, stt]` from the specifier in the `requirements.txt` file.
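As a rough illustration only (check the actual file for the exact specifier and extras; the `vad` extra below is an assumption), the relevant line might look something like:

```
fastrtc[vad, stt, tts]   # full install with speech-to-text and text-to-speech extras
# fastrtc[vad]           # lighter install without the TTS/STT extras
```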
- The STT is currently using the ElevenLabs API.
- The LLM is currently using the OpenAI API.
- The TTS is currently using the ElevenLabs API.
- The VAD is currently using the Silero VAD model.
- You may need to install ffmpeg if you get errors during STT.
The prompt lives in the `backend/server.py` file and can be modified as you like.
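For orientation, here is a minimal sketch of how an STT → LLM → TTS loop is typically wired with FastRTC on top of FastAPI. This is not the repo's actual `backend/server.py`; the `transcribe`, `chat`, and `synthesize` helpers below are hypothetical stand-ins for the ElevenLabs and OpenAI calls.

```python
# Hedged sketch only: transcribe/chat/synthesize are hypothetical placeholders,
# not the helpers used in this repo's backend/server.py.
from fastapi import FastAPI
from fastrtc import ReplyOnPause, Stream

SYSTEM_PROMPT = "You are a helpful voice assistant."  # the prompt you would edit

def transcribe(audio):
    sample_rate, frames = audio          # audio arrives as (sample_rate, numpy array)
    return "<transcript>"                # ElevenLabs STT call would go here

def chat(system_prompt, user_text):
    return "<assistant reply>"           # OpenAI chat completion call would go here

def synthesize(text):
    yield from ()                        # ElevenLabs TTS call would stream audio chunks here

def respond(audio):
    # ReplyOnPause invokes this once the Silero VAD decides the user has paused
    text = transcribe(audio)
    reply = chat(SYSTEM_PROMPT, text)
    yield from synthesize(reply)         # stream synthesized audio back to the browser

app = FastAPI()
stream = Stream(handler=ReplyOnPause(respond), modality="audio", mode="send-receive")
stream.mount(app)                        # exposes the WebRTC endpoints on the FastAPI app
```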
Audio Parameters
AlgoOptions
- audio_chunk_duration: Length of audio chunks in seconds. Smaller values allow for faster processing but may be less accurate.
- started_talking_threshold: If a chunk has more than this many seconds of speech, the system considers that the user has started talking.
- speech_threshold: After the user has started speaking, if a chunk has less than this many seconds of speech, the system considers that the user has paused.
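As a rough sketch of how these map to code (the values below are illustrative, not the defaults this POC uses):

```python
from fastrtc import AlgoOptions

algo_options = AlgoOptions(
    audio_chunk_duration=0.6,        # seconds of audio per analysed chunk
    started_talking_threshold=0.2,   # at least this much speech in a chunk => user started talking
    speech_threshold=0.1,            # less speech than this in a chunk => user has paused
)
```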
SileroVadOptions
- threshold: Speech probability threshold (0.0-1.0). Values above this are considered speech. Higher values are more strict.
- min_speech_duration_ms: Speech segments shorter than this (in milliseconds) are filtered out.
- min_silence_duration_ms: The system waits for this duration of silence (in milliseconds) before considering speech to be finished.
- speech_pad_ms: Padding added to both ends of detected speech segments to prevent cutting off words.
- max_speech_duration_s: Maximum allowed duration for a speech segment in seconds. Prevents indefinite listening.
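And similarly for the VAD options, plus how both objects are handed to `ReplyOnPause` (again illustrative values; `respond` and `algo_options` come from the earlier sketches):

```python
from fastrtc import ReplyOnPause, SileroVadOptions

vad_options = SileroVadOptions(
    threshold=0.5,                  # speech probability cut-off
    min_speech_duration_ms=250,     # drop detections shorter than this
    min_silence_duration_ms=800,    # silence needed before speech counts as finished
    speech_pad_ms=200,              # padding around detected speech
    max_speech_duration_s=30,       # hard cap on a single speech segment
)

handler = ReplyOnPause(
    respond,                        # handler function from the earlier sketch
    algo_options=algo_options,      # AlgoOptions from the previous snippet
    model_options=vad_options,
)
```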
Tuning Recommendations
- If the AI interrupts you too early:
  - Increase `min_silence_duration_ms`
  - Increase `speech_threshold`
  - Increase `speech_pad_ms`
- If the AI is slow to respond after you finish speaking:
  - Decrease `min_silence_duration_ms`
  - Decrease `speech_threshold`
- If the system fails to detect some speech:
  - Lower the `threshold` value
  - Decrease `started_talking_threshold`
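For instance, a more patient configuration along the lines of the first recommendation (keeping the assistant from cutting you off) might look like this; the numbers are illustrative starting points, not tested values:

```python
from fastrtc import AlgoOptions, SileroVadOptions

patient_algo = AlgoOptions(
    audio_chunk_duration=0.6,
    started_talking_threshold=0.2,
    speech_threshold=0.2,            # raised: require a clearer pause before replying
)

patient_vad = SileroVadOptions(
    threshold=0.5,
    min_speech_duration_ms=250,
    min_silence_duration_ms=1500,    # raised: wait longer before treating speech as finished
    speech_pad_ms=400,               # raised: keep more audio around detected speech
    max_speech_duration_s=30,
)
```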
Credits:
Credit for the UI components goes to Shadcn, Aceternity UI and Kokonut UI.