DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

1Yonsei University, 2GIANTSTEP Inc.

 

*Equal Contribution. Corresponding author.

Abstract

Speech-driven 3D facial animation has garnered significant attention owing to its broad range of applications. Despite recent advances in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior that overcomes the limitations of VAEs and VQ-VAEs. Building on these strong priors, we develop DEEPTalk, a talking-head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync.
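
To make the DEE component more concrete, the snippet below is a minimal, hypothetical PyTorch sketch of probabilistic emotion embedding: each encoder head predicts a Gaussian (mean and log-variance), embeddings are drawn by reparameterized sampling, and matching speech-motion pairs are pulled together with an InfoNCE-style contrastive loss. All module and function names are illustrative assumptions; the paper's exact probabilistic contrastive objective differs in detail.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticHead(nn.Module):
    """Maps encoder features to the mean and log-variance of a Gaussian emotion embedding."""
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.mu = nn.Linear(feat_dim, embed_dim)
        self.logvar = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        return self.mu(feats), self.logvar(feats)

def sample_embeddings(mu, logvar, n_samples=7):
    """Reparameterized samples from N(mu, diag(exp(logvar))); returns shape (n_samples, B, D)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(n_samples, *mu.shape, device=mu.device)
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)

def probabilistic_contrastive_loss(audio_samples, exp_samples, temperature=0.07):
    """Symmetric InfoNCE over (mean-pooled) sampled audio and expression embeddings."""
    a = F.normalize(audio_samples.mean(0), dim=-1)       # (B, D)
    e = F.normalize(exp_samples.mean(0), dim=-1)         # (B, D)
    logits = a @ e.t() / temperature                     # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))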

Architecture

(a) \( E_{\text{audio}} \) and \( E_{\text{exp}} \) are trained to predict the mean and variance of a joint audio-facial emotion embedding space, DEE. (b) We train TH-VQVAE with separate codebooks, \( \mathcal{Z}^b \) and \( \mathcal{Z}^t \), for low- and high-frequency motion, respectively. (c) DEEPTalk first extracts face features, predicts top and bottom codebook indices, and uses the frozen TH-VQVAE decoders to decode the quantized motion features. To ensure emotion alignment between the input audio and the predicted facial expressions, we introduce an emotion consistency loss \( L_{\text{emo}} \) that utilizes DEE.
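
The snippet below is a rough sketch of how panel (c) could be wired together: a predictor produces logits over the top and bottom codebooks for every frame, the frozen TH-VQVAE decoders turn the selected codes into motion, and a frozen DEE scores how well the predicted motion's emotion matches the input audio. The interfaces (predictor, thvqvae, dee and their methods) are assumptions for illustration, not the released API, and the cosine-distance form of \( L_{\text{emo}} \) is one plausible choice rather than the exact loss used in the paper.

import torch.nn.functional as F

def generate_motion_and_emotion_loss(audio_feats, predictor, thvqvae, dee):
    # Non-autoregressive prediction: logits over codebook entries for every frame.
    top_logits, bottom_logits = predictor(audio_feats)   # (B, T_t, K), (B, T_b, K)

    # Hard code selection shown for clarity of the decode path; training would need a
    # differentiable selection (e.g. Gumbel-softmax or a straight-through estimator).
    top_idx = top_logits.argmax(-1)
    bottom_idx = bottom_logits.argmax(-1)

    # Look up the quantized features and decode with the frozen TH-VQVAE decoders.
    z_top = thvqvae.codebook_top(top_idx)                # (B, T_t, D)
    z_bottom = thvqvae.codebook_bottom(bottom_idx)       # (B, T_b, D)
    motion = thvqvae.decode(z_top, z_bottom)             # predicted facial motion

    # Emotion consistency: a frozen DEE embeds both the input audio and the predicted
    # motion; matching emotions should land close together in the joint space.
    mu_audio, _ = dee.encode_audio(audio_feats)
    mu_motion, _ = dee.encode_expression(motion)
    l_emo = 1.0 - F.cosine_similarity(mu_audio, mu_motion, dim=-1).mean()
    return motion, l_emo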

Demo Video

Related Links

We gratefully acknowledge the open-source projects that served as the foundation for our work:

Emotional Speech-Driven Animation with Content-Emotion Disentanglement introduced EMOCA v2 fine-tuned on MEAD, which allows us to reconstruct realistic talking faces.

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion first quantized the listener's facial motion along the temporal dimension, motivating our TH-VQVAE.

Improved Probabilistic Image-Text Representations designed a closed-form sampled distance for learning probabilistic representations; a small sketch of this distance follows below.
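
For reference, the closed-form sampled distance between two Gaussian embeddings with diagonal covariance is the expected squared Euclidean distance between their samples, which reduces to a simple closed form. The helper below is an illustrative re-derivation, not code taken from the linked repository.

import torch

def closed_form_sampled_distance(mu1, var1, mu2, var2):
    """E ||z1 - z2||^2 for independent z1 ~ N(mu1, diag(var1)) and z2 ~ N(mu2, diag(var2)),
    which equals ||mu1 - mu2||^2 + sum(var1 + var2)."""
    return ((mu1 - mu2) ** 2).sum(dim=-1) + (var1 + var2).sum(dim=-1)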

BibTeX

@misc{kim2024deeptalkdynamicemotionembedding,
      title={DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation}, 
      author={Jisoo Kim and Jungbin Cho and Joonho Park and Soonmin Hwang and Da Eun Kim and Geon Kim and Youngjae Yu},
      year={2024},
      eprint={2408.06010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.06010}, 
}