Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks

Yudong Li; Xianxu Hou; Zhe Zhao; Linlin Shen; Xuefeng Yang; Kimmo Yan

doi:10.1145/3503161.3548205

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

6 Citations (Scopus)

Abstract

Facial analysis is an important domain in computer vision and has received extensive research attention. For numerous downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them on face datasets collected for the particular task domain. In this work, we propose a single model, Talk2Face, to simultaneously tackle a large number of face generation and analysis tasks, e.g. text-guided face synthesis, face captioning and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters and objectives. Text and facial images are tokenized into sequences, and the annotation labels of faces for different tasks are converted to natural language for a unified representation. We collect a set of 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than state-of-the-art (SOTA) approaches. On age estimation and multi-attribute classification, our model performs competitively with models specifically designed and trained for these tasks. In practice, our model is much easier to deploy across different facial-analysis tasks. Code and dataset will be available at https://github.com/ydli-ai/Talk2Face.
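
The unified formulation described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its repository: the task names, the `[task]` prefix, the `<img_N>` token spelling, and the helper functions `age_to_text`, `attributes_to_text` and `build_pair` are all hypothetical. The sketch only shows the general idea of turning annotation labels into natural language and arranging every task as a (source, target) sequence pair.

```python
from typing import Dict, List, Tuple


def age_to_text(age: int) -> str:
    """Convert a numeric age label to a natural-language phrase (hypothetical wording)."""
    return f"this person is {age} years old"


def attributes_to_text(attrs: Dict[str, bool]) -> str:
    """Convert binary attribute labels to text, listing only the attributes that are present."""
    present = [name.replace("_", " ") for name, on in sorted(attrs.items()) if on]
    if not present:
        return "no listed attributes are present"
    return "attributes: " + ", ".join(present)


def build_pair(task: str, image_tokens: List[int], text: str) -> Tuple[str, str]:
    """Arrange one example as a (source, target) pair of token strings.

    Generation tasks map text to image tokens; analysis tasks map image
    tokens to text. The "[task]" prefix is an assumed stand-in for a
    task template, not the paper's actual template.
    """
    image_str = " ".join(f"<img_{t}>" for t in image_tokens)
    if task == "text_to_face":
        return f"[{task}] {text}", image_str
    return f"[{task}] {image_str}", text


if __name__ == "__main__":
    # Image tokens would come from an image tokenizer; these values are placeholders.
    tokens = [5, 17]
    print(build_pair("age_estimation", tokens, age_to_text(31)))
    print(build_pair("text_to_face", tokens, "a smiling woman"))
```

Because every task reduces to the same pair format, a single sequence model with one training objective could in principle be trained on all of them, which is the property the abstract emphasizes.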

Original language	English
Title of host publication	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	4594-4604
Number of pages	11
ISBN (Electronic)	9781450392037
DOIs	https://doi.org/10.1145/3503161.3548205
Publication status	Published - 10 Oct 2022
Externally published	Yes
Event	30th ACM International Conference on Multimedia, MM 2022 - Lisboa, Portugal Duration: 10 Oct 2022 → 14 Oct 2022

Publication series

Name	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

Conference

Conference	30th ACM International Conference on Multimedia, MM 2022
Country/Territory	Portugal
City	Lisboa
Period	10/10/22 → 14/10/22

Keywords

cross-modal generation
face captioning
text-to-face synthesis

ASJC Scopus subject areas

Artificial Intelligence
Computer Graphics and Computer-Aided Design
Human-Computer Interaction
Software

Access to Document

10.1145/3503161.3548205

Cite this

Li, Y., Hou, X., Zhao, Z., Shen, L., Yang, X., & Yan, K. (2022). Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia (pp. 4594-4604). (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3503161.3548205

@inproceedings{3158eb9dc3984a089228c3b16a678734,
title = "Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks",
abstract = "Facial analysis is an important domain in computer vision and has received extensive research attention. For numerous downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them on face datasets collected for the particular task domain. In this work, we propose a single model, Talk2Face, to simultaneously tackle a large number of face generation and analysis tasks, e.g. text-guided face synthesis, face captioning and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters and objectives. Text and facial images are tokenized into sequences, and the annotation labels of faces for different tasks are converted to natural language for a unified representation. We collect a set of 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than state-of-the-art (SOTA) approaches. On age estimation and multi-attribute classification, our model performs competitively with models specifically designed and trained for these tasks. In practice, our model is much easier to deploy across different facial-analysis tasks. Code and dataset will be available at https://github.com/ydli-ai/Talk2Face.",
keywords = "cross-modal generation, face captioning, text-to-face synthesis",
author = "Yudong Li and Xianxu Hou and Zhe Zhao and Linlin Shen and Xuefeng Yang and Kimmo Yan",
note = "Publisher Copyright: {\textcopyright} 2022 ACM.; 30th ACM International Conference on Multimedia, MM 2022 ; Conference date: 10-10-2022 Through 14-10-2022",
year = "2022",
month = oct,
day = "10",
doi = "10.1145/3503161.3548205",
language = "English",
series = "MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia",
publisher = "Association for Computing Machinery, Inc",
pages = "4594--4604",
booktitle = "MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia",
}

Li, Y, Hou, X, Zhao, Z, Shen, L, Yang, X & Yan, K 2022, Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks. in MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 4594-4604, 30th ACM International Conference on Multimedia, MM 2022, Lisboa, Portugal, 10/10/22. https://doi.org/10.1145/3503161.3548205

Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks. / Li, Yudong; Hou, Xianxu; Zhao, Zhe et al.
MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2022. p. 4594-4604 (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia).

TY  - GEN
T1  - Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks
T2  - 30th ACM International Conference on Multimedia, MM 2022
AU  - Li, Yudong
AU  - Hou, Xianxu
AU  - Zhao, Zhe
AU  - Shen, Linlin
AU  - Yang, Xuefeng
AU  - Yan, Kimmo
PY  - 2022/10/10
Y1  - 2022/10/10
AB  - Facial analysis is an important domain in computer vision and has received extensive research attention. For numerous downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them on face datasets collected for the particular task domain. In this work, we propose a single model, Talk2Face, to simultaneously tackle a large number of face generation and analysis tasks, e.g. text-guided face synthesis, face captioning and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters and objectives. Text and facial images are tokenized into sequences, and the annotation labels of faces for different tasks are converted to natural language for a unified representation. We collect a set of 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than state-of-the-art (SOTA) approaches. On age estimation and multi-attribute classification, our model performs competitively with models specifically designed and trained for these tasks. In practice, our model is much easier to deploy across different facial-analysis tasks. Code and dataset will be available at https://github.com/ydli-ai/Talk2Face.
KW  - cross-modal generation
KW  - face captioning
KW  - text-to-face synthesis
UR  - http://www.scopus.com/inward/record.url?scp=85144803045&partnerID=8YFLogxK
U2  - 10.1145/3503161.3548205
DO  - 10.1145/3503161.3548205
M3  - Conference contribution
AN  - SCOPUS:85144803045
T3  - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP  - 4594
EP  - 4604
BT  - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB  - Association for Computing Machinery, Inc
Y2  - 10 October 2022 through 14 October 2022
ER  - 

Li Y, Hou X, Zhao Z, Shen L, Yang X, Yan K. Talk2Face: A Unified Sequence-based Framework for Diverse Face Generation and Analysis Tasks. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Inc. 2022. p. 4594-4604. (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia). doi: 10.1145/3503161.3548205
