TY - GEN
T1 - FlowTalk
T2 - 7th ACM International Conference on Multimedia in Asia, MMAsia 2025 Workshops
AU - Deng, Kaijun
AU - Guo, Yuhang
AU - Shen, Linlin
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/12/8
Y1 - 2025/12/8
AB - Audio-driven talking head synthesis has achieved significant progress, yet existing methods face critical trade-offs among generation quality, inference efficiency, and cross-ethnic generalization. Diffusion-based approaches produce high-fidelity results but suffer from slow inference due to iterative denoising, while GAN-based methods achieve faster speed at the cost of reduced motion naturalness and limited generalization. To address these challenges, we propose FlowTalk, a novel framework that enables real-time high-fidelity talking head video synthesis. Our approach leverages Flow Matching technology to perform efficient motion modeling in a decoupled motion space rather than pixel space, achieving significant speedup while maintaining generation quality. Specifically, we adopt an off-the-shelf motion extractor to disentangle facial appearance from motion, and employ an OT-based flow matching model with a transformer architecture to predict identity-agnostic motion sequences conditioned on audio features. To improve cross-ethnic generalization, we train on a balanced combination of DH-FaceVid-1K and HDTF datasets with HuBert-CN as the audio encoder. Experimental results demonstrate that FlowTalk achieves over 100 FPS in motion-space inference with 32 ODE solver steps, approximately 5 times faster than diffusion-based baselines with 500 steps, while preserving comparable visual quality in lip synchronization, facial expressions, and head movements. This efficiency, further enhanced through TensorRT deployment, enables truly real-time generation. Our framework provides an effective and practical solution for real-time talking head generation applications.
KW - Flow Matching
KW - Real Time
KW - Talking Head Synthesis
UR - https://www.scopus.com/pages/publications/105024961632
U2 - 10.1145/3769748.3773363
DO - 10.1145/3769748.3773363
M3 - Conference contribution
AN - SCOPUS:105024961632
T3 - Workshop Proceedings of the 7th ACM International Conference on Multimedia in Asia, MMAsia 2025 Workshops
BT - Workshop Proceedings of the 7th ACM International Conference on Multimedia in Asia, MMAsia 2025 Workshops
A2 - Chua, Tat-Seng
A2 - Wong, Lai-Kuan
A2 - Chan, Chee Seng
A2 - Tang, Jinhui
A2 - Ngo, Chong-Wah
A2 - Schoeffmann, Klaus
A2 - Liu, Jiaying
A2 - Ho, Yo-Sung
PB - Association for Computing Machinery, Inc
Y2 - 9 December 2025 through 12 December 2025
ER -