[feat] update some feature

sync code of fastrtc, add text support through datachannel, fix safari connect problem support chat without camera or mic
2026-02-05 18:09:23 +08:00 · 2025-03-25 18:05:10 +08:00
parent e1fb40a8a8
commit aefb08150f
222 changed files with 28698 additions and 5889 deletions
--- a/demo/talk_to_smolagents/README.md
+++ b/demo/talk_to_smolagents/README.md
@@ -0,0 +1,98 @@
+---
+title: Talk to Smolagents
+emoji: 💻
+colorFrom: purple
+colorTo: red
+sdk: gradio
+sdk_version: 5.16.0
+app_file: app.py
+pinned: false
+license: mit
+short_description: FastRTC Voice Agent with smolagents
+tags: [webrtc, websocket, gradio, secret|HF_TOKEN, secret|TWILIO_ACCOUNT_SID, secret|TWILIO_AUTH_TOKEN]
+---
+
+# Voice LLM Agent with Image Generation
+
+A voice-enabled AI assistant powered by FastRTC that can:
+1. Stream audio in real-time using WebRTC
+2. Listen and respond with natural pauses in conversation
+3. Generate images based on your requests
+4. Maintain conversation context across exchanges
+
+This app combines the real-time communication capabilities of FastRTC with the powerful agent framework of smolagents.
+
+## Key Features
+
+- **Real-time Streaming**: Uses FastRTC's WebRTC-based audio streaming
+- **Voice Activation**: Automatic detection of speech pauses to trigger responses
+- **Multi-modal Interaction**: Combines voice and image generation in a single interface
+
+## Setup
+
+1. Install Python 3.9+ and create a virtual environment:
+   ```bash
+   python -m venv .venv
+   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+   ```
+
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. Create a `.env` file with the following:
+   ```
+   HF_TOKEN=your_huggingface_api_key
+   MODE=UI  # Use 'UI' for Gradio interface, leave blank for HTML interface
+   ```
+
+## Running the App
+
+### With Gradio UI (Recommended)
+
+```bash
+MODE=UI python app.py
+```
+
+This launches a Gradio UI at http://localhost:7860 with:
+- FastRTC's built-in streaming audio components
+- A chat interface showing the conversation
+- An image display panel for generated images
+
+## How to Use
+
+1. Click the microphone button to start streaming your voice.
+2. Speak naturally - the app will automatically detect when you pause.
+3. Ask the agent to generate an image, for example:
+   - "Create an image of a magical forest with glowing mushrooms."
+   - "Generate a picture of a futuristic city with flying cars."
+4. View the generated image and hear the agent's response.
+
+## Technical Architecture
+
+### FastRTC Components
+
+- **Stream**: Core component that handles WebRTC connections and audio streaming
+- **ReplyOnPause**: Detects when the user stops speaking to trigger a response
+- **get_stt_model/get_tts_model**: Provides optimized speech-to-text and text-to-speech models
+
+### smolagents Components
+
+- **CodeAgent**: Intelligent agent that can use tools based on natural language inputs
+- **Tool.from_space**: Integration with Hugging Face Spaces for image generation
+- **HfApiModel**: Connection to powerful language models for understanding requests
+
+### Integration Flow
+
+1. FastRTC streams and processes audio input in real-time
+2. Speech is converted to text and passed to the smolagents CodeAgent
+3. The agent processes the request and calls tools when needed
+4. Responses and generated images are streamed back through FastRTC
+5. The UI updates to show both text responses and generated images
+
+## Advanced Features
+
+- Conversation history is maintained across exchanges
+- Error handling ensures the app continues working even if agent processing fails
+- The application leverages FastRTC's streaming capabilities for efficient audio transmission