From 696ec5aa03a72e530472bb87c8445f2f867f215a Mon Sep 17 00:00:00 2001
From: itechmusic <163980830+itechmusic@users.noreply.github.com>
Date: Tue, 16 Apr 2024 15:45:01 +0800
Subject: [PATCH] Update README.md

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 47a67b4..258f069 100644
--- a/README.md
+++ b/README.md
@@ -31,7 +31,9 @@ We introduce `MuseTalk`, a **real-time high quality** lip-syncing model (30fps+
 
 ## Model
 ![Model Structure](assets/figs/musetalk_arc.jpg)
-MuseTalk was trained in latent spaces, where the images were encoded by a freezed VAE. The audio was encoded by a freezed `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of the `stable-diffusion-v1-4`, where the audio embeddings were fused to the image embeddings by cross-attention.
+MuseTalk was trained in latent spaces, where the images were encoded by a frozen VAE. The audio was encoded by a frozen `whisper-tiny` model. The architecture of the generation network was borrowed from the UNet of `stable-diffusion-v1-4`, where the audio embeddings were fused with the image embeddings by cross-attention.
+
+Note that although MuseTalk uses an architecture very similar to Stable Diffusion, it is distinct in that it is `not` a diffusion model. Instead, MuseTalk operates by inpainting in the latent space in `a single step`.
 
 ## Cases
 ### MuseV + MuseTalk make human photos alive!
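
For readers of this PR, a minimal PyTorch sketch of the idea the added paragraph describes may help: audio embeddings are fused into image-latent features via cross-attention, and the completed latents are produced in a single forward pass rather than an iterative denoising loop. This is not MuseTalk's actual code; the module names, token counts, and feature dimensions below are illustrative assumptions.

```python
# Minimal, self-contained sketch (not the official MuseTalk implementation) of
# single-step, audio-conditioned latent inpainting. All names and sizes here
# are hypothetical stand-ins for the real VAE latents and whisper-tiny features.
import torch
import torch.nn as nn


class AudioCrossAttention(nn.Module):
    """Fuses audio tokens into image-latent tokens with cross-attention."""

    def __init__(self, latent_dim: int = 320, audio_dim: int = 384, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=audio_dim, vdim=audio_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latent_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # latent_tokens: (B, N_img, latent_dim); audio_tokens: (B, N_audio, audio_dim)
        fused, _ = self.attn(latent_tokens, audio_tokens, audio_tokens)
        return self.norm(latent_tokens + fused)


class SingleStepLatentInpainter(nn.Module):
    """One-step generator: masked-face latents in, completed latents out."""

    def __init__(self, latent_dim: int = 320, audio_dim: int = 384):
        super().__init__()
        self.proj_in = nn.Linear(latent_dim, latent_dim)
        self.fuse = AudioCrossAttention(latent_dim, audio_dim)
        self.proj_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, masked_latents: torch.Tensor, audio_embeddings: torch.Tensor) -> torch.Tensor:
        h = self.proj_in(masked_latents)
        h = self.fuse(h, audio_embeddings)
        # No timestep, noise schedule, or denoising loop: one pass predicts the latents.
        return self.proj_out(h)


# Toy usage: batch of 2, 1024 latent tokens per frame, 50 audio feature frames.
model = SingleStepLatentInpainter()
masked_latents = torch.randn(2, 1024, 320)   # stand-in for VAE-encoded masked frames
audio_embeddings = torch.randn(2, 50, 384)   # stand-in for whisper-tiny features
with torch.no_grad():
    pred_latents = model(masked_latents, audio_embeddings)
print(pred_latents.shape)  # torch.Size([2, 1024, 320])
```

The contrast with a diffusion model is visible in the forward pass: there is no timestep conditioning or noise schedule, so inference cost is a single network evaluation per frame, which is what makes the real-time (30fps+) claim plausible.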