Mirror of https://github.com/TMElyralab/MuseTalk.git

Commit: Add codes for real time inference

README.md (32 changed lines)
@@ -11,7 +11,7 @@ Chao Zhan,
Wenjiang Zhou
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)

-**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[gradio](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **Project (comming soon)** **Technical report (comming soon)**
+**[github](https://github.com/TMElyralab/MuseTalk)** **[huggingface](https://huggingface.co/TMElyralab/MuseTalk)** **[space](https://huggingface.co/spaces/TMElyralab/MuseTalk)** **Project (coming soon)** **Technical report (coming soon)**

We introduce `MuseTalk`, a **real-time, high-quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., those generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual-human solution.

@@ -28,12 +28,13 @@ We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+
# News
- [04/02/2024] Release MuseTalk project and pretrained models.
- [04/16/2024] Release Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to the HF team for their community grant).
+- [04/17/2024] :mega: We release a pipeline that utilizes MuseTalk for real-time inference.

## Model

MuseTalk was trained in latent space, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings were fused with the image embeddings through cross-attention.

-Note that although we use a very similar architecture as Stable Diffusion, MuseTalk is distinct in that it is `Not` a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with `a single step`.
+Note that although we use an architecture very similar to that of Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.

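As an editorial aside on the model description above: the sketch below is a minimal, hypothetical rendering of that paragraph, not code from the repository. The `vae`, `audio_encoder` and `unet` objects, their diffusers/transformers-style methods, and the channel-wise latent concatenation are all assumptions. It only illustrates the stated structure: frozen encoders produce latents, audio embeddings enter the UNet through cross-attention, and a single UNet pass followed by a VAE decode replaces an iterative diffusion loop.

```
# Hypothetical sketch only -- not the MuseTalk implementation.
# Assumes pre-loaded frozen `vae` (AutoencoderKL-like), `audio_encoder`
# (whisper-tiny-like) and `unet` (UNet2DConditionModel-like) objects.
import torch

@torch.no_grad()
def lipsync_one_frame(ref_img, masked_img, audio_window, vae, audio_encoder, unet):
    """Single-step latent inpainting: one UNet forward pass, no denoising loop."""
    # Frozen VAE encodes the reference face and the mouth-masked face.
    ref_latent = vae.encode(ref_img).latent_dist.mode()
    masked_latent = vae.encode(masked_img).latent_dist.mode()

    # Frozen audio encoder turns the audio window into embeddings that the
    # UNet consumes through its cross-attention layers.
    audio_emb = audio_encoder(audio_window)

    # A single UNet pass predicts the full face latent directly.
    unet_in = torch.cat([masked_latent, ref_latent], dim=1)
    pred_latent = unet(unet_in, timestep=0, encoder_hidden_states=audio_emb).sample

    # Frozen VAE decoder maps the predicted latent back to an RGB frame.
    return vae.decode(pred_latent).sample
```

Because there is no denoising loop, the per-frame cost is one UNet call plus one VAE decode, which is what makes the 30fps+ figure on a V100 plausible.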
## Cases
### MuseV + MuseTalk make human photos alive!
@@ -162,7 +163,7 @@ Note that although we use a very similar architecture as Stable Diffusion, MuseT
# TODO:
- [x] trained models and inference codes.
- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
-- [ ] codes for real-time inference.
+- [x] codes for real-time inference.
- [ ] technical report.
- [ ] training codes.
- [ ] a better model (may take longer).

@@ -262,9 +263,30 @@ python -m scripts.inference --inference_config configs/inference/test.yaml --bbo

As a complete solution for virtual-human generation, we suggest that you first apply [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video or pose-to-video) by referring to [this](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase the frame rate. Then, you can use `MuseTalk` to generate a lip-synced video by referring to [this](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).

# Note
#### :new: Real-time inference

-If you want to launch online video chats, you are suggested to generate videos using MuseV and apply necessary pre-processing such as face detection and face parsing in advance. During online chatting, only UNet and the VAE decoder are involved, which makes MuseTalk real-time.
+Here, we provide the inference script. This script first applies the necessary pre-processing, such as face detection, face parsing and VAE encoding, in advance. During inference, only the UNet and the VAE decoder are involved, which makes MuseTalk real-time.
+```
+python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml
+```
+`configs/inference/realtime.yaml` is the path to the real-time inference configuration file, which includes `preparation`, `video_path`, `bbox_shift` and `audio_clips`.
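To make the configuration and the two-phase design above concrete, here is a hypothetical sketch. It is editorial and not code from the repository: the exact YAML layout and the helper functions `prepare_avatar` and `infer_realtime` are assumptions. It only shows the keys named above and the separation between the one-off preparation and the lightweight per-clip inference.

```
# Hypothetical sketch of the documented config keys and the two-phase flow.
# The YAML layout and the helper functions are assumptions, not repo APIs.
import yaml

def run_realtime(config_path="configs/inference/realtime.yaml"):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    # Keys named in the README: preparation, video_path, bbox_shift, audio_clips.
    if cfg["preparation"]:
        # Expensive, one-off stage: face detection, face parsing and VAE
        # encoding of the source video, cached as the avatar's materials.
        # Re-run this whenever bbox_shift changes.
        prepare_avatar(video_path=cfg["video_path"], bbox_shift=cfg["bbox_shift"])

    # Cheap per-clip stage: only the UNet forward pass and the VAE decoder run.
    for clip_name, audio_path in cfg["audio_clips"].items():
        print(f"Inferring using: {audio_path}")
        infer_realtime(audio_path)

def prepare_avatar(video_path, bbox_shift):
    ...  # placeholder for the cached pre-processing

def infer_realtime(audio_path):
    ...  # placeholder for the UNet + VAE-decoder generation loop
```

The point of the split is that everything that depends only on the source video is computed once, so later runs with `preparation` set to `False` skip straight to generation.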
+1. Set `preparation` to `True` in `realtime.yaml` to prepare the materials for a new `avatar`. (If `bbox_shift` has changed, you also need to re-prepare the materials.)
+2. After that, the `avatar` will use an audio clip selected from `audio_clips` to generate a video.
+```
+Inferring using: data/audio/yongen.wav
+```
+3. While MuseTalk is inferring, sub-threads can simultaneously stream the results to the users. The generation process can achieve up to 50fps on an NVIDIA Tesla V100 (see the threading sketch after this list).
+```
+2%|██▍ | 3/141 [00:00<00:32, 4.30it/s]      # inference process
+Generating the 6-th frame with FPS: 48.58   # playing process
+Generating the 7-th frame with FPS: 48.74
+Generating the 8-th frame with FPS: 49.17
+3%|███▎ | 4/141 [00:00<00:32, 4.21it/s]
+```
+4. Set `preparation` to `False` and run this script again if you want to generate more videos using the same avatar.

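The overlap described in step 3, where inference produces frames while sub-threads stream them, is a classic producer/consumer pattern. The sketch below is an editorial illustration rather than the repository's threading code; `generate_frame` and `stream_frame` are stand-ins for the real per-frame generation and playback/streaming work.

```
# Editorial producer/consumer sketch of "stream while inferring".
import queue
import threading
import time

NUM_FRAMES = 141                     # matches the 141-frame example in the log
frame_queue = queue.Queue(maxsize=64)

def generate_frame(idx):
    """Stand-in for one UNet forward pass + VAE decode."""
    time.sleep(0.02)                 # pretend the generator runs at ~50fps
    return idx

def stream_frame(frame):
    """Stand-in for pushing a frame to the player or network sink."""
    pass

def producer():
    """Inference thread: enqueue frames as soon as they are ready."""
    for idx in range(NUM_FRAMES):
        frame_queue.put((idx, generate_frame(idx)))
    frame_queue.put(None)            # sentinel: generation finished

def consumer():
    """Streaming/playing thread: consumes frames concurrently with inference."""
    start = time.time()
    while (item := frame_queue.get()) is not None:
        idx, frame = item
        stream_frame(frame)
        print(f"Generating the {idx + 1}-th frame with FPS: "
              f"{(idx + 1) / (time.time() - start):.2f}")

threading.Thread(target=producer, daemon=True).start()
consumer()
```

Because the playing thread only consumes frames that are already in the queue, its FPS readout is decoupled from the inference progress bar, which is why the two readouts in the log above advance at different rates.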
+If you want to generate multiple videos using the same avatar/video, you can also use this script to **SIGNIFICANTLY** expedite the generation process, since the expensive preparation step only needs to run once.

# Acknowledgement