Add code for real-time inference

This commit is contained in:
czk32611
2024-04-18 12:05:22 +08:00
parent 955ca416ea
commit 0387c39a93
4 changed files with 373 additions and 5 deletions


@@ -11,7 +11,7 @@ Chao Zhan,
Wenjiang Zhou
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)
**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[gradio](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **Project (coming soon)** **Technical report (coming soon)**
**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[space](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **Project (coming soon)** **Technical report (coming soon)**
We introduce `MuseTalk`, a **real-time, high-quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual human solution.
@@ -28,12 +28,13 @@ We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+
# News
- [04/02/2024] Release MuseTalk project and pretrained models.
- [04/16/2024] Release Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to the HF team for their community grant).
- [04/17/2024] :mega: We release a pipeline that utilizes MuseTalk for real-time inference.
## Model
![Model Structure](assets/figs/musetalk_arc.jpg)
MuseTalk was trained in the latent space, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings were fused with the image embeddings by cross-attention.
Note that although we use a very similar architecture as Stable Diffusion, MuseTalk is distinct in that it is `Not` a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with `a single step`.
Note that although we use a very similar architecture as Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with a single step.
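For intuition, the generation step can be sketched as a single forward pass, as in the illustrative snippet below; `vae`, `audio_encoder` and `unet` are hypothetical stand-ins rather than the actual MuseTalk API, and details such as how the mouth region is masked are omitted.
```python
# Illustrative sketch of the single-step generation described above.
# `vae`, `audio_encoder` and `unet` are hypothetical stand-ins, not the MuseTalk API.
import torch

@torch.no_grad()
def lip_sync_frame(face_image: torch.Tensor, audio_window: torch.Tensor,
                   vae, audio_encoder, unet) -> torch.Tensor:
    # Frozen VAE maps the face image into the latent space the model was trained in.
    image_latent = vae.encode(face_image)

    # Frozen whisper-tiny-style encoder produces the audio embedding.
    audio_emb = audio_encoder(audio_window)

    # A single UNet pass fuses the audio embedding into the image latent via
    # cross-attention and inpaints the mouth region -- no iterative denoising loop.
    pred_latent = unet(image_latent, encoder_hidden_states=audio_emb)

    # Frozen VAE decoder maps the edited latent back to pixels.
    return vae.decode(pred_latent)
```
Under this view, each frame costs roughly one UNet pass plus one VAE decode, which is what makes real-time rates on a single GPU plausible.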
## Cases
### MuseV + MuseTalk make human photos alive
@@ -162,7 +163,7 @@ Note that although we use a very similar architecture as Stable Diffusion, MuseT
# TODO:
- [x] trained models and inference code.
- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
- [ ] code for real-time inference.
- [x] code for real-time inference.
- [ ] technical report.
- [ ] training code.
- [ ] a better model (may take longer).
@@ -262,9 +263,30 @@ python -m scripts.inference --inference_config configs/inference/test.yaml --bbo
As a complete solution to virtual human generation, we suggest first applying [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video or pose-to-video) by referring to [this](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is recommended to increase the frame rate. Then, you can use `MuseTalk` to generate a lip-synced video by referring to [this](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).
# Note
#### :new: Real-time inference
If you want to launch online video chats, we suggest generating the videos with MuseV and applying the necessary pre-processing, such as face detection and face parsing, in advance. During online chatting, only the UNet and the VAE decoder are involved, which makes MuseTalk real-time.
Here, we provide an inference script that performs this pre-processing (face detection, face parsing and VAE encoding) ahead of time, so that only the UNet and the VAE decoder run during inference.
```
python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml
```
`configs/inference/realtime.yaml` is the path to the real-time inference configuration file, which includes `preparation`, `video_path`, `bbox_shift` and `audio_clips`.
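A hypothetical sketch of what such a config could look like is shown below; the avatar name, paths and exact field layout are placeholders and may differ from the shipped `configs/inference/realtime.yaml`.
```yaml
# Hypothetical example; field layout and values are placeholders.
avatar_1:
  preparation: True                       # True on the first run to prepare the avatar materials
  video_path: "data/video/your_video.mp4" # source video for this avatar
  bbox_shift: 0                           # re-run preparation whenever this value changes
  audio_clips:                            # audio clips for the avatar to speak
    audio_0: "data/audio/yongen.wav"
```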
1. Set `preparation` to `True` in `realtime.yaml` to prepare the materials for a new `avatar`. (If the `bbox_shift` has changed, you also need to re-prepare the materials.)
1. After that, the `avatar` will use an audio clip selected from `audio_clips` to generate a video.
```
Inferring using: data/audio/yongen.wav
```
1. While MuseTalk is inferring, sub-threads can simultaneously stream the results to the users. The generation process can achieve up to 50fps on an NVIDIA Tesla V100 (a rough sketch of this producer/consumer split is shown after this list).
```
2%|██▍ | 3/141 [00:00<00:32, 4.30it/s] # inference process
Generating the 6-th frame with FPS: 48.58 # playing process
Generating the 7-th frame with FPS: 48.74
Generating the 8-th frame with FPS: 49.17
3%|███▎ | 4/141 [00:00<00:32, 4.21it/s]
```
1. Set `preparation` to `False` and run this script if you want to generate more videos using the same avatar.
If you want to generate multiple videos with the same avatar/video, reusing the prepared materials in this way can **SIGNIFICANTLY** expedite the generation process.
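As a rough illustration of the inference/streaming split mentioned in the list above, here is a generic producer/consumer sketch; it assumes hypothetical `unet`, `vae_decoder` and `send_frame` objects and is not the actual `scripts/realtime_inference.py` implementation.
```python
# Generic producer/consumer sketch of the real-time split described above.
# NOT the actual scripts/realtime_inference.py implementation; names are illustrative.
import queue
import threading

frame_queue: queue.Queue = queue.Queue(maxsize=64)

def inference_worker(unet, vae_decoder, prepared_latents, audio_embeddings):
    """Producer: only the UNet and the VAE decoder run per frame here, because
    face detection, parsing and VAE encoding were done during preparation."""
    for latent, audio_emb in zip(prepared_latents, audio_embeddings):
        pred_latent = unet(latent, encoder_hidden_states=audio_emb)
        frame_queue.put(vae_decoder(pred_latent))
    frame_queue.put(None)  # sentinel: inference finished

def streaming_worker(send_frame):
    """Consumer: streams finished frames to the user while inference continues."""
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        send_frame(frame)

# Usage with hypothetical objects: run both threads concurrently.
# threading.Thread(target=inference_worker, args=(unet, vae_decoder, latents, audio_embs)).start()
# threading.Thread(target=streaming_worker, args=(send_frame,)).start()
```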
# Acknowledgement