Mirror of https://github.com/TMElyralab/MuseTalk.git, synced 2026-02-04 17:39:20 +08:00

This commit is contained in: v1.5

README.md (340)
@@ -1,15 +1,16 @@

# MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

</br>

Yue Zhang <sup>\*</sup>,

<strong>MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling</strong>

Yue Zhang<sup>\*</sup>,
Zhizhou Zhong<sup>\*</sup>,
Minhao Liu<sup>\*</sup>,
Zhaokang Chen,
Bin Wu<sup>†</sup>,
Yubin Zeng,
Chao Zhan,
Yingjie He,
Junxin Huang,
Yingjie He,
Wenjiang Zhou
(<sup>*</sup>Equal Contribution, <sup>†</sup>Corresponding Author, benbinwu@tencent.com)
@@ -19,7 +20,10 @@ Lyra Lab, Tencent Music Entertainment

We introduce `MuseTalk`, a **real-time, high-quality** lip-syncing model (30fps+ on an NVIDIA Tesla V100). MuseTalk can be applied to input videos, e.g., generated by [MuseV](https://github.com/TMElyralab/MuseV), as a complete virtual-human solution.

:new: Update: We are thrilled to announce that [MusePose](https://github.com/TMElyralab/MusePose/) has been released. MusePose is an image-to-video generation framework for virtual humans driven by control signals such as pose. Together with MuseV and MuseTalk, we hope the community will join us and march towards the vision of a virtual human generated end-to-end, with native full-body movement and interaction.

## 🔥 Updates
We're excited to unveil MuseTalk 1.5.
This version **(1)** integrates training with perceptual loss, GAN loss, and sync loss, significantly boosting overall performance, and **(2)** adopts a two-stage training strategy and a spatio-temporal data sampling approach to strike a balance between visual quality and lip-sync accuracy.
Learn more details [here](https://arxiv.org/abs/2410.10122).

# Overview
`MuseTalk` is a real-time, high-quality audio-driven lip-syncing model trained in the latent space of `ft-mse-vae`, which

@@ -28,23 +32,199 @@ We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+
1. supports audio in various languages, such as Chinese, English, and Japanese.
1. supports real-time inference at 30fps+ on an NVIDIA Tesla V100.
1. supports modification of the center point of the face region proposal, which **SIGNIFICANTLY** affects generation results.
1. checkpoint available trained on the HDTF dataset.
1. training codes (coming soon).
1. checkpoint available trained on the HDTF and private dataset.
# News
- [04/02/2024] Release MuseTalk project and pretrained models.
- [03/28/2025] :mega: We are thrilled to announce the release of our 1.5 version. This version is a significant improvement over the 1.0 version, with enhanced clarity, identity consistency, and precise lip-speech synchronization. We update the [technical report](https://arxiv.org/abs/2410.10122) with more details.
- [10/18/2024] We release the [technical report](https://arxiv.org/abs/2410.10122v2). Our report details a superior model to the open-source L1 loss version. It includes GAN and perceptual losses for improved clarity, and sync loss for enhanced performance.
- [04/17/2024] We release a pipeline that utilizes MuseTalk for real-time inference.
- [04/16/2024] Release Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk) on HuggingFace Spaces (thanks to HF team for their community grant)
- [04/17/2024] : We release a pipeline that utilizes MuseTalk for real-time inference.
- [10/18/2024] :mega: We release the [technical report](https://arxiv.org/abs/2410.10122). Our report details a superior model to the open-source L1 loss version. It includes GAN and perceptual losses for improved clarity, and sync loss for enhanced performance.
- [04/02/2024] Release MuseTalk project and pretrained models.
## Model




MuseTalk was trained in latent spaces, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings were fused with the image embeddings by cross-attention.

Note that although we use a very similar architecture to Stable Diffusion, MuseTalk is distinct in that it is **NOT** a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in a single step.
## Cases
### MuseV + MuseTalk make human photos alive!

<table>
<tr>
<td width="33%">

### Input Video
---
https://github.com/TMElyralab/MuseTalk/assets/163980830/37a3a666-7b90-4244-8d3a-058cb0e44107

---
https://github.com/user-attachments/assets/1ce3e850-90ac-4a31-a45f-8dfa4f2960ac

---
https://github.com/user-attachments/assets/fa3b13a1-ae26-4d1d-899e-87435f8d22b3

---
https://github.com/user-attachments/assets/15800692-39d1-4f4c-99f2-aef044dc3251

---
https://github.com/user-attachments/assets/a843f9c9-136d-4ed4-9303-4a7269787a60

---
https://github.com/user-attachments/assets/6eb4e70e-9e19-48e9-85a9-bbfa589c5fcb

</td>
<td width="33%">

### MuseTalk 1.0
---
https://github.com/user-attachments/assets/c04f3cd5-9f77-40e9-aafd-61978380d0ef

---
https://github.com/user-attachments/assets/2051a388-1cef-4c1d-b2a2-3c1ceee5dc99

---
https://github.com/user-attachments/assets/b5f56f71-5cdc-4e2e-a519-454242000d32

---
https://github.com/user-attachments/assets/a5843835-04ab-4c31-989f-0995cfc22f34

---
https://github.com/user-attachments/assets/3dc7f1d7-8747-4733-bbdd-97874af0c028

---
https://github.com/user-attachments/assets/3c78064e-faad-4637-83ae-28452a22b09a

</td>
<td width="33%">

### MuseTalk 1.5
---
https://github.com/user-attachments/assets/999a6f5b-61dd-48e1-b902-bb3f9cbc7247

---
https://github.com/user-attachments/assets/d26a5c9a-003c-489d-a043-c9a331456e75

---
https://github.com/user-attachments/assets/471290d7-b157-4cf6-8a6d-7e899afa302c

---
https://github.com/user-attachments/assets/1ee77c4c-8c70-4add-b6db-583a12faa7dc

---
https://github.com/user-attachments/assets/370510ea-624c-43b7-bbb0-ab5333e0fcc4

---
https://github.com/user-attachments/assets/b011ece9-a332-4bc1-b8b7-ef6e383d7bde

</td>
</tr>
</table>
# TODO:
- [x] trained models and inference codes.
- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
- [x] codes for real-time inference.
- [x] [technical report](https://arxiv.org/abs/2410.10122v2).
- [x] a better model with updated [technical report](https://arxiv.org/abs/2410.10122).
- [ ] training and dataloader code (expected completion on 04/04/2025).
# Getting Started
We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:

## Third party integration
Thanks to the third-party integrations, which make installation and use more convenient for everyone.
Please note that we have not verified, maintained, or updated these third-party integrations; please refer to this project for authoritative results.

### [ComfyUI](https://github.com/chaojie/ComfyUI-MuseTalk)

## Installation
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:

### Build environment

We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:

```shell
pip install -r requirements.txt
```
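
If you prefer an isolated environment, here is a minimal sketch (the conda workflow and the environment name `musetalk` are our own choices, not prescribed by the repo):

```bash
# Optional: create a clean environment before installing the requirements.
conda create -n musetalk python=3.10 -y
conda activate musetalk
pip install -r requirements.txt
```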

### mmlab packages
```bash
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"
```
### Download ffmpeg-static
Download a static ffmpeg build and set:
```
export FFMPEG_PATH=/path/to/ffmpeg
```
for example:
```
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
```
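
A hedged sketch of fetching a static build (the johnvansickle.com mirror and the extracted directory name are assumptions; any ffmpeg static build works):

```bash
# Download and unpack a static ffmpeg build, then point FFMPEG_PATH at it.
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar -xf ffmpeg-release-amd64-static.tar.xz
# The extracted directory name includes the release version, e.g. ffmpeg-X.Y-amd64-static.
export FFMPEG_PATH=$(ls -d "$PWD"/ffmpeg-*-amd64-static)
```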

### Download weights
You can download weights manually as follows:

1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk).

2. Download the weights of other components:
   - [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse)
   - [whisper](https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt)
   - [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main)
   - [face-parse-bisent](https://github.com/zllrunning/face-parsing.PyTorch)
   - [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth)
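
For convenience, a hedged sketch of scripting these downloads with `huggingface-cli` and `wget` (the file layout inside the Hugging Face repos is assumed to mirror the tree below; `79999_iter.pth` still has to be fetched manually from the face-parsing.PyTorch project):

```bash
# Sketch: fetch the checkpoints into ./models/.
pip install -U "huggingface_hub[cli]"

# MuseTalk weights (assumed to contain the musetalk/ and musetalkV15/ folders shown below)
huggingface-cli download TMElyralab/MuseTalk --local-dir models/

# sd-vae-ft-mse and DWPose
huggingface-cli download stabilityai/sd-vae-ft-mse config.json diffusion_pytorch_model.bin --local-dir models/sd-vae-ft-mse
huggingface-cli download yzd-v/DWPose dw-ll_ucoco_384.pth --local-dir models/dwpose

# whisper tiny and resnet18 (direct downloads)
wget -P models/whisper https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt
wget -P models/face-parse-bisent https://download.pytorch.org/models/resnet18-5c106cde.pth
```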

Finally, these weights should be organized in `models` as follows:
```
./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── musetalkV15
│   ├── musetalk.json
│   └── unet.pth
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae-ft-mse
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    └── tiny.pt
```
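
A quick sanity check (a sketch; the paths come from the tree above) before running inference:

```bash
# Verify that the expected checkpoints are in place.
for f in models/musetalkV15/unet.pth \
         models/musetalk/pytorch_model.bin \
         models/sd-vae-ft-mse/diffusion_pytorch_model.bin \
         models/whisper/tiny.pt \
         models/dwpose/dw-ll_ucoco_384.pth \
         models/face-parse-bisent/79999_iter.pth; do
  [ -f "$f" ] && echo "OK      $f" || echo "MISSING $f"
done
```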

## Quickstart

### Inference
We provide inference scripts for both versions of MuseTalk:

#### MuseTalk 1.5 (Recommended)
```bash
python3 -m scripts.inference_alpha --inference_config configs/inference/test.yaml --unet_model_path ./models/musetalkV15/unet.pth
```
This inference script supports both the MuseTalk 1.5 and 1.0 models:
- For MuseTalk 1.5: Use the command above with the V1.5 model path
- For MuseTalk 1.0: Use the same script but point to the V1.0 model path

`configs/inference/test.yaml` is the path to the inference configuration file, including `video_path` and `audio_path`.
The `video_path` should be a video file, an image file, or a directory of images.
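
For reference, a minimal sketch of such a config, written from the shell (the `task_0` key name and the sample paths are assumptions; compare with the bundled `configs/inference/test.yaml` before relying on them):

```bash
# Write a minimal inference config and run the 1.5 script against it.
cat > configs/inference/my_test.yaml <<'EOF'
task_0:
  video_path: "data/video/sample.mp4"   # a video file, an image file, or a directory of images
  audio_path: "data/audio/sample.wav"
EOF

python3 -m scripts.inference_alpha \
  --inference_config configs/inference/my_test.yaml \
  --unet_model_path ./models/musetalkV15/unet.pth
```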

#### MuseTalk 1.0
```bash
python3 -m scripts.inference --inference_config configs/inference/test.yaml
```
We recommend input video at `25fps`, the same fps used when training the model. If your video's frame rate is far below 25fps, apply frame interpolation or convert the video to 25fps with ffmpeg.
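
A minimal ffmpeg sketch for the conversion (file names are placeholders):

```bash
# Re-encode the video at 25 fps; the audio stream is copied unchanged.
ffmpeg -i input.mp4 -r 25 -c:a copy input_25fps.mp4
```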

<details close>

## TestCases For 1.0
<table class="center">
<tr style="font-weight: bolder;text-align:center;">
<td width="33%">Image</td>
@@ -130,132 +310,7 @@ Note that although we use a very similar architecture as Stable Diffusion, MuseT
</tr>
</table>

* The character of the last two rows, `Xinying Sun`, is a supermodel KOL. You can follow her on [douyin](https://www.douyin.com/user/MS4wLjABAAAAWDThbMPN_6Xmm_JgXexbOii1K-httbu2APdG8DvDyM8).
## Video dubbing
<table class="center">
<tr style="font-weight: bolder;text-align:center;">
<td width="70%">MuseTalk</td>
<td width="30%">Original videos</td>
</tr>
<tr>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/4d7c5fa1-3550-4d52-8ed2-52f158150f24 controls preload></video>
</td>
<td>
<a href="//www.bilibili.com/video/BV1wT411b7HU">Link</a>
</td>
</tr>
</table>

* For video dubbing, we applied a self-developed tool which can identify the talking person.

## Some interesting videos!
<table class="center">
<tr style="font-weight: bolder;text-align:center;">
<td width="50%">Image</td>
<td width="50%">MuseV + MuseTalk</td>
</tr>
<tr>
<td>
<img src=assets/demo/video1/video1.png width="95%">
</td>
<td>
<video src=https://github.com/TMElyralab/MuseTalk/assets/163980830/1f02f9c6-8b98-475e-86b8-82ebee82fe0d controls preload></video>
</td>
</tr>
</table>
# TODO:
- [x] trained models and inference codes.
- [x] Huggingface Gradio [demo](https://huggingface.co/spaces/TMElyralab/MuseTalk).
- [x] codes for real-time inference.
- [ ] technical report.
- [ ] training codes.
- [ ] a better model (may take longer).
# Getting Started
We provide a detailed tutorial about the installation and the basic usage of MuseTalk for new users:

## Third party integration
Thanks to the third-party integrations, which make installation and use more convenient for everyone.
Please note that we have not verified, maintained, or updated these third-party integrations; please refer to this project for authoritative results.

### [ComfyUI](https://github.com/chaojie/ComfyUI-MuseTalk)

## Installation
To prepare the Python environment and install additional packages such as opencv, diffusers, mmcv, etc., please follow the steps below:

### Build environment

We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:

```shell
pip install -r requirements.txt
```

### mmlab packages
```bash
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"
```

### Download ffmpeg-static
Download a static ffmpeg build and set:
```
export FFMPEG_PATH=/path/to/ffmpeg
```
for example:
```
export FFMPEG_PATH=/musetalk/ffmpeg-4.4-amd64-static
```

### Download weights
You can download weights manually as follows:

1. Download our trained [weights](https://huggingface.co/TMElyralab/MuseTalk).

2. Download the weights of other components:
   - [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse)
   - [whisper](https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt)
   - [dwpose](https://huggingface.co/yzd-v/DWPose/tree/main)
   - [face-parse-bisent](https://github.com/zllrunning/face-parsing.PyTorch)
   - [resnet18](https://download.pytorch.org/models/resnet18-5c106cde.pth)

Finally, these weights should be organized in `models` as follows:
```
./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae-ft-mse
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    └── tiny.pt
```
## Quickstart

### Inference
Here, we provide the inference script.
```
python -m scripts.inference --inference_config configs/inference/test.yaml
```
`configs/inference/test.yaml` is the path to the inference configuration file, including `video_path` and `audio_path`.
The `video_path` should be a video file, an image file, or a directory of images.

We recommend input video at `25fps`, the same fps used when training the model. If your video's frame rate is far below 25fps, apply frame interpolation or convert the video to 25fps with ffmpeg.
#### Use of bbox_shift to have adjustable results
#### Use of bbox_shift to have adjustable results (For 1.0)
:mag_right: We have found that the upper bound of the mask has an important impact on mouth openness. Thus, to control the mask region, we suggest using the `bbox_shift` parameter. Positive values (moving towards the lower half) increase mouth openness, while negative values (moving towards the upper half) decrease it.

You can start by running with the default configuration to obtain the adjustable value range, and then re-run the script within this range.

@@ -266,6 +321,9 @@ python -m scripts.inference --inference_config configs/inference/test.yaml --bbo

:pushpin: More technical details can be found in [bbox_shift](assets/BBOX_SHIFT.md).
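
For example, a hedged invocation (the shift value is illustrative; pick one from the range printed by the default run):

```bash
# Re-run inference with an explicit bbox_shift once the suggested range is known.
python -m scripts.inference \
  --inference_config configs/inference/test.yaml \
  --bbox_shift -7
```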

</details>


#### Combining MuseV and MuseTalk

As a complete solution to virtual human generation, we suggest that you first apply [MuseV](https://github.com/TMElyralab/MuseV) to generate a video (text-to-video, image-to-video, or pose-to-video) by referring to [this](https://github.com/TMElyralab/MuseV?tab=readme-ov-file#text2video). Frame interpolation is suggested to increase the frame rate. Then, you can use `MuseTalk` to generate a lip-sync video by referring to [this](https://github.com/TMElyralab/MuseTalk?tab=readme-ov-file#inference).

@@ -312,10 +370,10 @@ If you need higher resolution, you could apply super resolution models such as [
# Citation
```bib
@article{musetalk,
  title={MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting},
  author={Zhang, Yue and Liu, Minhao and Chen, Zhaokang and Wu, Bin and Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},
  title={MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling},
  author={Zhang, Yue and Zhong, Zhizhou and Liu, Minhao and Chen, Zhaokang and Wu, Bin and Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},
  journal={arxiv},
  year={2024}
  year={2025}
}
```

# Disclaimer/License