StableTalk: Advancing Audio-to-Talking Face Generation with Stable Diffusion and Vision Transformer

Fatemeh Nazarieh; Josef Kittler; Muhammad Awais Rana; Diptesh Kanojia; Zhenhua Feng

doi:10.1007/978-3-031-78172-8_18

Book chapter

Pattern Recognition, pp.271-286

Lecture Notes in Computer Science, Springer Nature Switzerland

03/12/2024

DOI: https://doi.org/10.1007/978-3-031-78172-8_18

Abstract

Audio-to-talking face generation stands at the forefront of advancements in generative AI. It bridges the gap between audio and visual representations by generating synchronized and realistic talking faces. Despite recent progress, the lack of realism in animated faces, poorly synchronized audio and lip movements, and the computational burden remain key barriers to practical applications. To address these challenges, we introduce a novel approach, StableTalk, which leverages the emerging capabilities of Stable Diffusion models and Vision Transformers for talking face generation. We also integrate the Re-attention mechanism and an adversarial loss to improve the consistency of facial animations and their synchronization with a given audio input. More importantly, we notably improve the computational efficiency of our method by optimizing operations within the latent space and dynamically adjusting the focus on different parts of the visual content based on the provided conditions. Our experimental results demonstrate the superiority of StableTalk over existing approaches in image quality, audio-lip synchronization, and computational efficiency.

Keywords: Audio-to-Talking Face Generation; Denoising Diffusion Implicit Model; Latent Diffusion; Re-attention; Vision Transformer

Details

Title: StableTalk: Advancing Audio-to-Talking Face Generation with Stable Diffusion and Vision Transformer
Creators: Fatemeh Nazarieh - University of Surrey, School of Computer Science and Electronic Engineering
    Josef Kittler - University of Surrey, School of Computer Science and Electronic Engineering
    Muhammad Awais Rana - University of Surrey, School of Computer Science and Electronic Engineering
    Diptesh Kanojia - University of Surrey, School of Computer Science and Electronic Engineering
    Zhenhua Feng - University of Surrey, School of Computer Science and Electronic Engineering
Contributors: Apostolos Antonacopoulos (Editor)
    Subhasis Chaudhuri (Editor)
    Rama Chellappa (Editor)
    Cheng-Lin Liu (Editor)
    Saumik Bhattacharya (Editor)
    Umapada Pal (Editor)
Publication Details: Pattern Recognition, pp.271-286
Series: Lecture Notes in Computer Science
Publisher: Springer Nature Switzerland; Cham
Number of pages: 16
Publication Date: 03/12/2024
Identifiers: 99961148902346
Academic Unit: School of Computer Science and Electronic Engineering; Centre for Vision, Speech & Signal Processing (CVSSP)
Language: English
Resource Type: Book chapter
