mirror of https://github.com/TMElyralab/MuseTalk.git
synced 2026-02-04 17:39:20 +08:00
Update README.md
@@ -31,7 +31,9 @@ We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+
## Model
MuseTalk was trained in latent space, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings were fused with the image embeddings by cross-attention.
Note that although we use a very similar architecture to Stable Diffusion, MuseTalk is distinct in that it is `not` a diffusion model. Instead, MuseTalk operates by inpainting in the latent space with `a single step`.
## Cases
### MuseV + MuseTalk make human photos alive!