Emotion-controllable 3D Talking Face Generation with Hierarchical Disentanglement-guided VQ-VAE
DataCite Commons | Updated: 2026-01-13 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=1e783b35d41d4e8a8690a640ed89d19e
Resource description:
Abstract:
Objective: Speech-driven 3D talking face generation has matured considerably, particularly with recent models based on Vector Quantized Variational Autoencoders (VQ-VAE), which generate more vivid facial expressions by quantizing facial motion features. However, existing methods remain limited in facial reconstruction accuracy and in the stability of emotional control. First, the ability of VQ-VAE models to reconstruct facial detail needs further improvement. Second, the emotion guidance strategies in existing models are relatively simple and do not fully exploit multimodal information beyond speech, so the generated facial expressions lack realism.
Method: Inspired by the structures of Conditional Variational Autoencoders (CVAE) and VQ-VAE-2, this paper proposes a hierarchical disentanglement method within the VQ-VAE framework. Facial features are decoupled into high-level and low-level features. Identity vectors and descriptive text are introduced as external conditions to disentangle the high-level features. The disentangled high-level features then serve as an internal condition and, together with the two external conditions (identity vector and text), further disentangle the low-level features. This hierarchical mechanism aims to improve facial reconstruction quality and stabilize emotional expression. During generation, speech and text guide the face in different ways: the text is encoded by the CLIP text encoder, whose rich semantic information provides holistic expression control, while the speech is encoded by Wav2Vec 2.0 to precisely control lip movements and other subtle expressions. By integrating multimodal information at different levels, the model generates more accurate facial motions.
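The quantization step that VQ-VAE models rely on can be illustrated with a minimal sketch. This is a generic nearest-codebook lookup, not the hierarchical method proposed in the paper: the function names, the toy 2D codebook, and the toy frame features are all assumptions made for illustration.

```python
import math

def quantize(feature, codebook):
    """Return (index, code) for the codebook entry nearest to `feature` (L2 distance)."""
    index = min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))
    return index, codebook[index]

def quantize_sequence(features, codebook):
    """Map each per-frame feature vector to the index of its nearest code."""
    return [quantize(f, codebook)[0] for f in features]

# Hypothetical toy data: three 2D codes and three per-frame motion features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

indices = quantize_sequence(frames, codebook)
print(indices)
```

In a trained model the codebook is learned and the features come from an encoder; here both are hand-written so that the lookup itself is easy to follow.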
Results: Experiments on the 3DMEAD and TA-MEAD datasets comprised quantitative evaluation, visual analysis, and ablation studies. The proposed model was compared with FaceFormer, CodeTalker, FaceDiffuser, and ProbeTalk3D. In overall facial reconstruction accuracy, the proposed model achieved a 5.9% lower Mean Vertex Error (MVE) than ProbeTalk3D, which also incorporates emotional input, indicating that explicit emotion modeling improves geometric consistency and preserves facial structural accuracy, particularly for complex emotions. In Lip Vertex Error (LVE), which measures lip motion accuracy, the proposed model ranked second, 16.7% behind the top-performing FaceDiffuser, while still outperforming FaceFormer and ProbeTalk3D. The limited improvement may stem from the hierarchical disentanglement approach, which decouples emotional conditions but offers only restricted decoupling for the mouth region, affecting fine detail. In upper-face dynamics deviation (FDD), the proposed model was inferior to ProbeTalk3D but significantly surpassed traditional methods, reflecting its ability to express emotional intensity. More importantly, the model showed clear advantages in distribution-based quality metrics: it achieved the best Mean Estimation Error (MEE) and Coverage Error (CE), with MEE 4.1% lower than ProbeTalk3D and 58.4% lower than CodeTalker, and CE 8.2% lower than ProbeTalk3D and 61.6% lower than CodeTalker. This indicates that hierarchical disentanglement improves distribution concentration and coverage of real samples, bringing generated facial motions closer to real faces. However, the model had the lowest Diversity score, highlighting a trade-off: hierarchical disentanglement ensures semantic consistency but constrains variation, which benefits emotion accuracy yet may limit naturalness.
While traditional methods, which lack emotional input, differed by orders of magnitude in MVE and FDD, and ProbeTalk3D excelled in FDD and Diversity but lagged in accuracy, the proposed model balances high-precision reconstruction with superior distribution estimation through hierarchical disentanglement and multimodal modeling.
Conclusion: Experimental results demonstrate that the proposed hierarchical disentanglement-based method produces more accurate and stable facial reconstructions, and that it excels at maintaining facial geometric consistency and expressing complex emotions. The model's key innovation is its layered feature disentanglement within the VQ-VAE framework, which uses identity vectors and text as external conditions and high-level features as an internal condition, significantly improving reconstruction quality and emotional controllability. This offers practical value for human-computer interaction (HCI), digital humans, and animation. Quantitative evaluation shows a lower MVE than ProbeTalk3D, competitive LVE and FDD, and the best MEE and CE scores, confirming the model's ability to integrate multimodal information effectively when generating realistic expressions. Ablation studies validate the critical roles of both the hierarchical feature disentanglement and the synergy between internal and external conditions. However, limitations remain: generated lip movements are less pronounced than those in real speech, the current two-level feature disentanglement is relatively simple, and the diversity and subtlety of emotional expressions require further improvement. Future work will focus on improving lip-sync quality, exploring finer-grained disentanglement dimensions for differentiated guidance, and increasing expression diversity while preserving accuracy.
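The abstract reports MVE and LVE without defining them. The sketch below assumes one convention common in speech-driven 3D face work: per frame, take the largest L2 error over the selected vertices, then average over frames. Both that definition and the toy lip vertex indices are assumptions, not taken from this record. The helper `relative_reduction` shows how a statement such as "5.9% lower than a baseline" is computed.

```python
import math

def max_vertex_error(pred_frames, true_frames, vertex_ids=None):
    """Average over frames of the per-frame maximum L2 vertex error.

    With vertex_ids=None all vertices are used (assumed MVE-style);
    passing lip vertex indices gives an assumed LVE-style score.
    """
    per_frame = []
    for pred, true in zip(pred_frames, true_frames):
        ids = range(len(true)) if vertex_ids is None else vertex_ids
        per_frame.append(max(math.dist(pred[i], true[i]) for i in ids))
    return sum(per_frame) / len(per_frame)

def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Hypothetical toy meshes: 2 frames, 3 vertices each, in 3D.
true_frames = [[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
               [(0, 0, 0), (1, 0, 0), (0, 1, 0)]]
pred_frames = [[(0, 0, 0), (1, 0, 0), (0, 1, 2)],
               [(3, 4, 0), (1, 0, 0), (0, 1, 0)]]
lip_ids = [1, 2]  # assumed lip-region vertex indices

mve = max_vertex_error(pred_frames, true_frames)
lve = max_vertex_error(pred_frames, true_frames, lip_ids)
print(mve, lve)
```

The same function covers both scores because, under this assumption, they differ only in which vertices are included.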
Provider:
Science Data Bank
Created:
2026-01-13



