MergeTalk: Audio-Driven Talking Head Generation From Single Image With Feature Merge

Jian Gao, Chang Shu, Ximin Zheng, Zheng Lu, Nengsheng Bao

Research output: Journal Publication › Article › peer-review

Abstract

Audio-driven talking head generation has wide real-world applications but remains challenging due to problems such as audio-lip synchronization, head pose, identity preservation, and video quality. We propose a novel two-stage framework that uses explicit 3D face images, rendered from a 3D model driven by the audio input, as intermediate features. We devise two independent 3D motion parameter generation networks that produce expression and pose parameters for the popular 3DMM model, solving the audio-lip synchronization problem and yielding natural head poses without losing identity information. To improve the quality of the final talking head, e.g., by avoiding facial distortion and artifacts, we propose a novel face feature merge network that accurately extracts and fuses the background, identity information, and facial texture from the source image with the lip movements and head poses from the 3D face images, and generates the final videos with generative adversarial networks. Extensive experiments show that our framework outperforms SOTA methods in several aspects and has good generalization ability.
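To make the two-stage flow concrete, below is a minimal sketch of the pipeline the abstract describes: audio features drive two independent 3DMM parameter networks (expression and pose), a renderer turns those parameters into an intermediate 3D face image, and a feature-merge generator fuses it with the source image. All module names, dimensions, and the renderer/generator stubs are hypothetical illustrations, not the paper's released code.

```python
import torch
import torch.nn as nn

class MotionParamNet(nn.Module):
    """Audio feature window -> one block of 3DMM parameters
    (one instance for expression, a separate one for head pose)."""
    def __init__(self, audio_dim: int, param_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, param_dim),
        )

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        return self.net(audio_feat)

class RendererStub(nn.Module):
    """Stand-in for a 3DMM renderer: decodes the predicted parameters
    into a coarse intermediate 3D face image."""
    def __init__(self, expr_dim: int, pose_dim: int, img_size: int = 64):
        super().__init__()
        self.img_size = img_size
        self.fc = nn.Linear(expr_dim + pose_dim, 3 * img_size * img_size)

    def forward(self, expr: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        x = self.fc(torch.cat([expr, pose], dim=-1))
        return x.view(-1, 3, self.img_size, self.img_size)

class FeatureMergeGenerator(nn.Module):
    """Stand-in for the feature-merge GAN generator: fuses the source image
    (background, identity, texture) with the rendered 3D face (lips, pose)."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, source: torch.Tensor, face_3d: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([source, face_3d], dim=1))

# One per-frame forward pass through both stages:
audio = torch.randn(1, 80)          # per-frame audio feature (e.g., a mel window)
source = torch.randn(1, 3, 64, 64)  # single source image
expr_net, pose_net = MotionParamNet(80, 64), MotionParamNet(80, 6)
renderer, generator = RendererStub(64, 6), FeatureMergeGenerator()

expr, pose = expr_net(audio), pose_net(audio)    # stage 1: 3DMM parameters
frame = generator(source, renderer(expr, pose))  # stage 2: merge and synthesize
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```

Keeping the expression and pose branches as separate networks mirrors the paper's design choice: lip synchronization and head motion are driven by the audio independently, so neither objective interferes with the other, and identity is carried by the source image rather than by the audio branches.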

Original language: English
Pages (from-to): 1850-1854
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 31
DOIs
Publication status: Published - 2024

Keywords

  • 3DMM
  • GAN
  • Talking head generation
  • feature merge

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Applied Mathematics
