feat: windows infer & gradio (#312)

* fix: windows infer

* docs: update readme

* docs: update readme

* feat: v1.5 gradio for windows&linux

* fix: dependencies

* feat: windows infer & gradio

---------

Co-authored-by: NeRF-Factory <zzhizhou66@gmail.com>
Zhizhou Zhong
2025-04-12 23:22:22 +08:00
committed by GitHub
parent 36163fccbd
commit 67e7ee3c73
14 changed files with 613 additions and 245 deletions

README.md

@@ -146,50 +146,87 @@ We also hope you note that we have not verified, maintained, or updated third-pa
## Installation
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:
### Build environment
We recommend Python 3.10 and CUDA 11.7. Set up your environment as follows:
```shell
conda create -n MuseTalk python==3.10
conda activate MuseTalk
```
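As a quick optional check that the new environment is active with the expected interpreter (the exact patch version may differ):
```bash
# Confirm the Python version inside the MuseTalk environment
python --version   # expected: Python 3.10.x
```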
### Install PyTorch 2.0.1
Choose one of the following installation methods:
```shell
# Option 1: Using pip
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
# Option 2: Using conda
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
```
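To confirm that the CUDA-enabled wheel was installed rather than a CPU-only build, a one-line check like this is usually enough:
```bash
# Print the torch version and whether a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```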
### Install Dependencies
Install the remaining required packages:
```shell
pip install -r requirements.txt
```
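Optionally, a minimal import check (package names taken from the overview at the top of this section) can confirm that the key dependencies resolved:
```bash
# Quick import check for a couple of the packages installed above
python -c "import cv2, diffusers; print(cv2.__version__, diffusers.__version__)"
```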
### Install MMLab Packages
Install the MMLab ecosystem packages:
```bash
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"
```
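A short sanity check that the MMLab packages import cleanly and match the pinned versions:
```bash
# Should print mmcv 2.0.1, mmdet 3.1.0 and mmpose 1.1.0 if the pins above were used
python -c "import mmcv, mmdet, mmpose; print(mmcv.__version__, mmdet.__version__, mmpose.__version__)"
```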
### Setup FFmpeg
1. [Download](https://github.com/BtbN/FFmpeg-Builds/releases) the ffmpeg-static package
2. Configure FFmpeg based on your operating system:
For Linux:
```bash
export FFMPEG_PATH=/path/to/ffmpeg
# Example:
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
```
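Assuming `FFMPEG_PATH` points at the directory that contains the `ffmpeg` binary (as in the example above), you can verify it and optionally persist the setting; the shell profile used here is an assumption:
```bash
# Check the ffmpeg binary MuseTalk will use
"$FFMPEG_PATH/ffmpeg" -version
# Optionally persist the variable for future shells
echo "export FFMPEG_PATH=$FFMPEG_PATH" >> ~/.bashrc
```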
For Windows:
Add the `ffmpeg-xxx\bin` directory to your system's PATH environment variable. Verify the installation by running `ffmpeg -version` in the command prompt - it should display the ffmpeg version information.
### Download weights
You can download weights in two ways:
#### Option 1: Using Download Scripts
We provide two scripts for automatic downloading:
For Linux:
```bash
sh ./download_weights.sh
```
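If running the script is not convenient, the same weights can be pulled directly with the Hugging Face CLI; note that this fetches only the MuseTalk repository, while the other components listed under Option 2 still need to be downloaded separately (the mirror endpoint is optional):
```bash
pip install -U "huggingface_hub[cli]"
export HF_ENDPOINT=https://hf-mirror.com   # optional mirror
huggingface-cli download TMElyralab/MuseTalk --local-dir models/
```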
For Windows:
```batch
:: Run the script
download_weights.bat
```
#### Option 2: Manual Download
You can also download the weights manually from the following links:
1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk/tree/main)
2. Download the weights of other components:
- [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main)
- [whisper](https://huggingface.co/openai/whisper-tiny/tree/main)
- [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main)
- [syncnet](https://huggingface.co/ByteDance/LatentSync/tree/main)
- [face-parse-bisent](https://drive.google.com/file/d/154JgKpzCPW82qINcVieuPH3fZ2e0P812/view?pli=1)
- [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth)
Finally, these weights should be organized in `models` as follows:
```
@@ -207,7 +244,7 @@ Finally, these weights should be organized in `models` as follows:
├── face-parse-bisent
│ ├── 79999_iter.pth
│ └── resnet18-5c106cde.pth
├── sd-vae
│ ├── config.json
│ └── diffusion_pytorch_model.bin
└── whisper
@@ -221,21 +258,60 @@ Finally, these weights should be organized in `models` as follows:
### Inference
We provide inference scripts for both versions of MuseTalk:
#### Prerequisites
Before running inference, please ensure ffmpeg is installed and accessible:
```bash
# Check ffmpeg installation
ffmpeg -version
```
If ffmpeg is not found, please install it first:
- Windows: Download from [ffmpeg-static](https://github.com/BtbN/FFmpeg-Builds/releases) and add to PATH
- Linux: `sudo apt-get install ffmpeg`
#### Normal Inference
##### Linux Environment
```bash
# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 normal
# MuseTalk 1.0
sh inference.sh v1.0 normal
```
##### Windows Environment
Please ensure that you set the `ffmpeg_path` to match the actual location of your FFmpeg installation.
```bash
# MuseTalk 1.5 (Recommended)
python -m scripts.inference --inference_config configs\inference\test.yaml --result_dir results\test --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1
```
#### Real-time Inference
##### Linux Environment
```bash
# MuseTalk 1.5 (Recommended)
sh inference.sh v1.5 realtime
# MuseTalk 1.0
sh inference.sh v1.0 realtime
```
##### Windows Environment
```bash
# MuseTalk 1.5 (Recommended)
python -m scripts.realtime_inference --inference_config configs\inference\realtime.yaml --result_dir results\realtime --unet_model_path models\musetalkV15\unet.pth --unet_config models\musetalkV15\musetalk.json --version v15 --fps 25 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
# For MuseTalk 1.0, change:
# - models\musetalkV15 -> models\musetalk
# - unet.pth -> pytorch_model.bin
# - --version v15 -> --version v1
```
The configuration file `configs/inference/test.yaml` contains the inference settings, including:
- `video_path`: Path to the input video, image file, or directory of images
@@ -243,21 +319,6 @@ The configuration file `configs/inference/test.yaml` contains the inference sett
Note: For optimal results, we recommend using input videos with 25fps, which is the same fps used during model training. If your video has a lower frame rate, you can use frame interpolation or convert it to 25fps using ffmpeg.
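For example, a lower-frame-rate clip can be resampled to 25fps with ffmpeg along these lines (input and output names are placeholders):
```bash
ffmpeg -i input_video.mp4 -filter:v fps=25 input_video_25fps.mp4
```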
For real-time inference, the configuration file `configs/inference/realtime.yaml` includes:
- `preparation`: Set to `True` for new avatar preparation
- `video_path`: Path to the input video
- `bbox_shift`: Adjustable parameter for mouth region control
- `audio_clips`: List of audio clips for generation
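As an illustration only, an avatar entry built from these fields might look like the sketch below; the key names and paths here are assumptions, so please follow the schema of the shipped `configs/inference/realtime.yaml`:
```bash
# Hypothetical example entry written to a new config file
cat > configs/inference/my_realtime.yaml <<'EOF'
my_avatar:
  preparation: True            # set to True when processing a new avatar
  video_path: "data/video/my_avatar.mp4"
  bbox_shift: 5                # adjust to control the mouth region
  audio_clips:
    clip_1: "data/audio/sentence_1.wav"
    clip_2: "data/audio/sentence_2.wav"
EOF
```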
Important notes for real-time inference:
1. Set `preparation` to `True` when processing a new avatar
2. After preparation, the avatar will generate videos using audio clips from `audio_clips`
@@ -269,6 +330,18 @@ For faster generation without saving images, you can use:
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --skip_save_images
```
## Gradio Demo
We provide an intuitive web interface through Gradio for users to easily adjust input parameters. To optimize inference time, users can generate only the **first frame** to fine-tune the best lip-sync parameters, which helps reduce facial artifacts in the final output.
![para](assets/figs/gradio_2.png)
As a reference for minimum hardware requirements, we tested the system on a Windows machine with an NVIDIA GeForce RTX 3050 Ti Laptop GPU (4GB VRAM). In fp16 mode, generating an 8-second video takes approximately 5 minutes.
![speed](assets/figs/gradio.png)
Both Linux and Windows users can launch the demo using the following command. Please ensure that the `ffmpeg_path` parameter matches your actual FFmpeg installation path:
```bash
# You can remove --use_float16 for better quality, but it will increase VRAM usage and inference time
python app.py --use_float16 --ffmpeg_path ffmpeg-master-latest-win64-gpl-shared\bin
```
## Training
### Data Preparation