mirror of
https://github.com/HumanAIGC-Engineering/gradio-webrtc.git
synced 2026-02-05 18:09:23 +08:00
[feat] update some feature
sync code of fastrtc, add text support through datachannel, fix safari connect problem support chat without camera or mic
This commit is contained in:
98
demo/talk_to_smolagents/README.md
Normal file
98
demo/talk_to_smolagents/README.md
Normal file
@@ -0,0 +1,98 @@
|
||||
---
|
||||
title: Talk to Smolagents
|
||||
emoji: 💻
|
||||
colorFrom: purple
|
||||
colorTo: red
|
||||
sdk: gradio
|
||||
sdk_version: 5.16.0
|
||||
app_file: app.py
|
||||
pinned: false
|
||||
license: mit
|
||||
short_description: FastRTC Voice Agent with smolagents
|
||||
tags: [webrtc, websocket, gradio, secret|HF_TOKEN, secret|TWILIO_ACCOUNT_SID, secret|TWILIO_AUTH_TOKEN]
|
||||
---
|
||||
|
||||
# Voice LLM Agent with Image Generation
|
||||
|
||||
A voice-enabled AI assistant powered by FastRTC that can:
|
||||
1. Stream audio in real-time using WebRTC
|
||||
2. Listen and respond with natural pauses in conversation
|
||||
3. Generate images based on your requests
|
||||
4. Maintain conversation context across exchanges
|
||||
|
||||
This app combines the real-time communication capabilities of FastRTC with the powerful agent framework of smolagents.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Real-time Streaming**: Uses FastRTC's WebRTC-based audio streaming
|
||||
- **Voice Activation**: Automatic detection of speech pauses to trigger responses
|
||||
- **Multi-modal Interaction**: Combines voice and image generation in a single interface
|
||||
|
||||
## Setup
|
||||
|
||||
1. Install Python 3.9+ and create a virtual environment:
|
||||
```bash
|
||||
python -m venv .venv
|
||||
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
||||
```
|
||||
|
||||
2. Install dependencies:
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
3. Create a `.env` file with the following:
|
||||
```
|
||||
HF_TOKEN=your_huggingface_api_key
|
||||
MODE=UI # Use 'UI' for Gradio interface, leave blank for HTML interface
|
||||
```
|
||||
|
||||
## Running the App
|
||||
|
||||
### With Gradio UI (Recommended)
|
||||
|
||||
```bash
|
||||
MODE=UI python app.py
|
||||
```
|
||||
|
||||
This launches a Gradio UI at http://localhost:7860 with:
|
||||
- FastRTC's built-in streaming audio components
|
||||
- A chat interface showing the conversation
|
||||
- An image display panel for generated images
|
||||
|
||||
## How to Use
|
||||
|
||||
1. Click the microphone button to start streaming your voice.
|
||||
2. Speak naturally - the app will automatically detect when you pause.
|
||||
3. Ask the agent to generate an image, for example:
|
||||
- "Create an image of a magical forest with glowing mushrooms."
|
||||
- "Generate a picture of a futuristic city with flying cars."
|
||||
4. View the generated image and hear the agent's response.
|
||||
|
||||
## Technical Architecture
|
||||
|
||||
### FastRTC Components
|
||||
|
||||
- **Stream**: Core component that handles WebRTC connections and audio streaming
|
||||
- **ReplyOnPause**: Detects when the user stops speaking to trigger a response
|
||||
- **get_stt_model/get_tts_model**: Provides optimized speech-to-text and text-to-speech models
|
||||
|
||||
### smolagents Components
|
||||
|
||||
- **CodeAgent**: Intelligent agent that can use tools based on natural language inputs
|
||||
- **Tool.from_space**: Integration with Hugging Face Spaces for image generation
|
||||
- **HfApiModel**: Connection to powerful language models for understanding requests
|
||||
|
||||
### Integration Flow
|
||||
|
||||
1. FastRTC streams and processes audio input in real-time
|
||||
2. Speech is converted to text and passed to the smolagents CodeAgent
|
||||
3. The agent processes the request and calls tools when needed
|
||||
4. Responses and generated images are streamed back through FastRTC
|
||||
5. The UI updates to show both text responses and generated images
|
||||
|
||||
## Advanced Features
|
||||
|
||||
- Conversation history is maintained across exchanges
|
||||
- Error handling ensures the app continues working even if agent processing fails
|
||||
- The application leverages FastRTC's streaming capabilities for efficient audio transmission
|
||||
Reference in New Issue
Block a user