
title: Talk to Smolagents
emoji: 💻
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
license: mit
short_description: FastRTC Voice Agent with smolagents
tags: webrtc, websocket, gradio
required secrets: HF_TOKEN, TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN

Voice LLM Agent with Image Generation

A voice-enabled AI assistant powered by FastRTC that can:

  1. Stream audio in real-time using WebRTC
  2. Listen continuously and respond when it detects a natural pause in your speech
  3. Generate images based on your requests
  4. Maintain conversation context across exchanges

This app combines FastRTC's real-time communication capabilities with the smolagents agent framework.

Key Features

  • Real-time Streaming: Uses FastRTC's WebRTC-based audio streaming
  • Voice Activation: Automatic detection of speech pauses to trigger responses (see the sketch below)
  • Multi-modal Interaction: Combines voice and image generation in a single interface
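
The voice activation comes from wrapping a reply handler in ReplyOnPause. Here is a minimal sketch, assuming the standard fastrtc echo pattern; the echo handler stands in for the app's real reply logic:

    import numpy as np
    from fastrtc import ReplyOnPause, Stream

    def echo(audio: tuple[int, np.ndarray]):
        """Runs only after ReplyOnPause decides the speaker has paused."""
        sample_rate, frames = audio
        yield (sample_rate, frames)  # play the finished utterance back to the caller

    # ReplyOnPause buffers incoming WebRTC audio and calls `echo` once speech stops,
    # which provides the hands-free turn-taking described above.
    stream = Stream(handler=ReplyOnPause(echo), modality="audio", mode="send-receive")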

Setup

  1. Install Python 3.9+ and create a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Create a .env file with the following (the sketch after this list shows how the app can read these values):

    HF_TOKEN=your_huggingface_api_key
    MODE=UI  # Use 'UI' for Gradio interface, leave blank for HTML interface
    
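Reading these values at startup might look like the following sketch, assuming python-dotenv is among the installed requirements (the exact loading code in app.py may differ):

    import os
    from dotenv import load_dotenv

    load_dotenv()  # pull HF_TOKEN and MODE from the local .env file

    hf_token = os.getenv("HF_TOKEN")  # Hugging Face API key used by the agent
    mode = os.getenv("MODE", "")      # "UI" selects the Gradio interface

    if not hf_token:
        raise RuntimeError("HF_TOKEN is not set; add it to your .env file")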

Running the App

MODE=UI python app.py

This launches a Gradio UI at http://localhost:7860 with:

  • FastRTC's built-in streaming audio components
  • A chat interface showing the conversation
  • An image display panel for generated images
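
The MODE switch might be implemented along these lines, assuming fastrtc's stream.ui (a Gradio Blocks app) and stream.mount helpers; the handler and the index.html route are placeholders:

    import os
    from fastapi import FastAPI
    from fastapi.responses import HTMLResponse
    from fastrtc import ReplyOnPause, Stream

    def respond(audio):
        yield audio  # placeholder handler; the real app runs STT -> agent -> TTS here

    stream = Stream(handler=ReplyOnPause(respond), modality="audio", mode="send-receive")

    if os.getenv("MODE") == "UI":
        stream.ui.launch(server_port=7860)  # Gradio interface at http://localhost:7860
    else:
        app = FastAPI()
        stream.mount(app)  # expose the WebRTC endpoints on the FastAPI app

        @app.get("/")
        async def index():
            return HTMLResponse(open("index.html").read())  # custom HTML front end
        # run with: uvicorn app:app --port 7860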

How to Use

  1. Click the microphone button to start streaming your voice.
  2. Speak naturally; the app automatically detects when you pause.
  3. Ask the agent to generate an image, for example:
    • "Create an image of a magical forest with glowing mushrooms."
    • "Generate a picture of a futuristic city with flying cars."
  4. View the generated image and hear the agent's response.

Technical Architecture

FastRTC Components

  • Stream: Core component that handles WebRTC connections and audio streaming
  • ReplyOnPause: Detects when the user stops speaking to trigger a response
  • get_stt_model/get_tts_model: Provide optimized speech-to-text and text-to-speech models
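
A sketch of how these pieces fit together, assuming the default models returned by get_stt_model/get_tts_model and the stream_tts_sync helper; the canned reply stands in for the agent call:

    import numpy as np
    from fastrtc import ReplyOnPause, Stream, get_stt_model, get_tts_model

    stt_model = get_stt_model()  # default speech-to-text model
    tts_model = get_tts_model()  # default text-to-speech model

    def talk(audio: tuple[int, np.ndarray]):
        prompt = stt_model.stt(audio)                 # transcribe the user's turn
        answer = f"I heard: {prompt}"                 # stand-in for the agent's reply
        yield from tts_model.stream_tts_sync(answer)  # stream synthesized audio back

    stream = Stream(handler=ReplyOnPause(talk), modality="audio", mode="send-receive")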

smolagents Components

  • CodeAgent: Intelligent agent that can use tools based on natural language inputs
  • Tool.from_space: Integration with Hugging Face Spaces for image generation
  • HfApiModel: Connection to a Hugging Face-hosted language model for interpreting requests
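
A sketch of the agent setup using those APIs; the Space id and tool name below are placeholders, not necessarily what app.py uses:

    from smolagents import CodeAgent, HfApiModel, Tool

    # Wrap an image-generation Space as a callable tool (placeholder Space id).
    image_tool = Tool.from_space(
        "black-forest-labs/FLUX.1-schnell",
        name="image_generator",
        description="Generate an image from a text prompt",
    )

    # HfApiModel routes requests to a hosted LLM through the Hugging Face Inference API.
    agent = CodeAgent(tools=[image_tool], model=HfApiModel())

    result = agent.run("Generate a picture of a futuristic city with flying cars.")
    print(result)  # typically an image path or object returned by the tool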

Integration Flow

  1. FastRTC streams and processes audio input in real-time
  2. Speech is converted to text and passed to the smolagents CodeAgent
  3. The agent processes the request and calls tools when needed
  4. Responses and generated images are streamed back through FastRTC
  5. The UI updates to show both text responses and generated images
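
Putting the flow together, a sketch of a single handler that covers all five steps, assuming fastrtc's AdditionalOutputs is used to push chat and image updates to the UI (component wiring and the Space id are illustrative):

    import numpy as np
    from fastrtc import (AdditionalOutputs, ReplyOnPause, Stream,
                         get_stt_model, get_tts_model)
    from smolagents import CodeAgent, HfApiModel, Tool

    stt_model, tts_model = get_stt_model(), get_tts_model()
    image_tool = Tool.from_space(
        "black-forest-labs/FLUX.1-schnell",  # placeholder Space id
        name="image_generator",
        description="Generate an image from a text prompt",
    )
    agent = CodeAgent(tools=[image_tool], model=HfApiModel())
    history: list[dict] = []  # conversation context carried across turns

    def handle_turn(audio: tuple[int, np.ndarray]):
        prompt = stt_model.stt(audio)                 # steps 1-2: speech in, text out
        answer = str(agent.run(prompt))               # step 3: agent may call the image tool
        history.append({"user": prompt, "assistant": answer})
        yield AdditionalOutputs(history, answer)      # step 5: push chat/image updates to the UI
        yield from tts_model.stream_tts_sync(answer)  # step 4: speak the reply over WebRTC

    # A full app also passes additional_outputs (chatbot, image components) and an
    # additional_outputs_handler to Stream so the UI knows how to render these updates.
    stream = Stream(handler=ReplyOnPause(handle_turn), modality="audio", mode="send-receive")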

Advanced Features

  • Conversation history is maintained across exchanges
  • Error handling ensures the app continues working even if agent processing fails
  • The application leverages FastRTC's streaming capabilities for efficient audio transmission
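
A sketch of the error-handling and history pattern, reusing the handler shape from the integration sketch above; the fallback message is illustrative:

    from fastrtc import get_stt_model, get_tts_model
    from smolagents import CodeAgent, HfApiModel

    stt_model, tts_model = get_stt_model(), get_tts_model()
    agent = CodeAgent(tools=[], model=HfApiModel())
    history: list[dict] = []  # lives outside the handler, so context survives across turns

    def safe_turn(audio):
        prompt = stt_model.stt(audio)
        try:
            answer = str(agent.run(prompt))  # agent calls can fail (rate limits, tool errors)
        except Exception:
            answer = "Sorry, something went wrong handling that request."  # keep the stream alive
        history.append({"user": prompt, "assistant": answer})
        yield from tts_model.stream_tts_sync(answer)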