FlowTalk: Real-Time Audio-Driven Talking Head Synthesis via Motion-Space Flow Matching

  • Kaijun Deng
  • Yuhang Guo
  • Linlin Shen

Research output: Chapter in Book/Conference proceeding, Conference contribution, peer-review

Abstract

Audio-driven talking head synthesis has achieved significant progress, yet existing methods face critical trade-offs among generation quality, inference efficiency, and cross-ethnic generalization. Diffusion-based approaches produce high-fidelity results but suffer from slow inference due to iterative denoising, while GAN-based methods achieve faster speed at the cost of reduced motion naturalness and limited generalization. To address these challenges, we propose FlowTalk, a novel framework that enables real-time high-fidelity talking head video synthesis. Our approach leverages Flow Matching technology to perform efficient motion modeling in a decoupled motion space rather than pixel space, achieving significant speedup while maintaining generation quality. Specifically, we adopt an off-the-shelf motion extractor to disentangle facial appearance from motion, and employ an OT-based flow matching model with a transformer architecture to predict identity-agnostic motion sequences conditioned on audio features. To improve cross-ethnic generalization, we train on a balanced combination of DH-FaceVid-1K and HDTF datasets with HuBert-CN as the audio encoder. Experimental results demonstrate that FlowTalk achieves over 100 FPS in motion-space inference with 32 ODE solver steps, approximately 5 times faster than diffusion-based baselines with 500 steps, while preserving comparable visual quality in lip synchronization, facial expressions, and head movements. This efficiency, further enhanced through TensorRT deployment, enables truly real-time generation. Our framework provides an effective and practical solution for real-time talking head generation applications.
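
The flow matching idea in the abstract can be illustrated with a minimal sketch. It assumes the common straight-line (OT) probability path, where a noise sample is linearly interpolated toward a target motion vector and the regression target is the constant velocity between them. Sampling then integrates the learned velocity field with a fixed number of ODE steps (32 here, matching the step count quoted above). The function names, the array shapes, and the closed-form `oracle_velocity` are hypothetical stand-ins: the paper's actual model is an audio-conditioned transformer, which is not reproduced here.

```python
import numpy as np

def ot_path(x0, x1, t):
    """Straight-line (OT) interpolant between noise x0 and data x1.

    Returns the point on the path at time t and the target velocity,
    which is constant along a straight line.
    """
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t

def flow_matching_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))

def euler_sample(velocity_fn, x0, num_steps=32):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy data (assumption): 4 frames of an 8-dimensional motion vector.
rng = np.random.default_rng(0)
target_motion = rng.normal(size=(4, 8))
noise = rng.normal(size=(4, 8))

def oracle_velocity(x, t):
    """Closed-form velocity toward a single known target.

    Stands in for a trained network; t never reaches 1 inside the solver,
    so the division is safe.
    """
    return (target_motion - x) / (1.0 - t)

sample = euler_sample(oracle_velocity, noise, num_steps=32)
```

In this toy setting the oracle field moves every sample along a straight line, so the Euler solver lands on `target_motion` regardless of step count. A trained network only approximates the field, which is why the number of solver steps becomes a quality versus speed trade-off in practice.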

Original language: English
Title of host publication: Workshop Proceedings of the 7th ACM International Conference on Multimedia in Asia, MMAsia 2025 Workshops
Editors: Tat-Seng Chua, Lai-Kuan Wong, Chee Seng Chan, Jinhui Tang, Chong-Wah Ngo, Klaus Schoeffmann, Jiaying Liu, Yo-Sung Ho
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400722479
DOIs
Publication status: Published - 8 Dec 2025
Externally published: Yes
Event: 7th ACM International Conference on Multimedia in Asia, MMAsia 2025 Workshops - Kuala Lumpur, Malaysia
Duration: 9 Dec 2025 to 12 Dec 2025

Publication series

Name: Workshop Proceedings of the 7th ACM International Conference on Multimedia in Asia, MMAsia 2025 Workshops

Conference

Conference: 7th ACM International Conference on Multimedia in Asia, MMAsia 2025 Workshops
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 9/12/25 to 12/12/25

Free Keywords

  • Flow Matching
  • Real Time
  • Talking Head Synthesis

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
