FLIP-80M: 80 Million Visual-Linguistic Pairs for Facial Language-Image Pre-Training

Yudong Li, Xianxu Hou, Zheng Dezhi, Linlin Shen, Zhe Zhao

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

While significant progress has been made in multi-modal learning driven by large-scale image-text datasets, there is still a noticeable gap in the availability of such datasets within the facial domain. To facilitate and advance the field of facial representation learning, we present FLIP-80M, a large-scale visual-linguistic dataset comprising over 80 million face images paired with text descriptions. FLIP-80M is constructed by leveraging LAION-5B, a large, openly available image-text-pair dataset, together with a mixed-method approach that filters face-related pairs from both visual and linguistic perspectives. Our curation process involves face detection, face caption classification, text de-noising, and synthesis-based image augmentation. As a result, FLIP-80M stands as the largest face-text dataset to date. To evaluate the potential of our dataset, we fine-tune the CLIP model on FLIP-80M to create FLIP (Facial Language-Image Pretraining) and assess its representation capabilities across various downstream tasks. Our experiments demonstrate that FLIP achieves state-of-the-art results on a range of face analysis tasks, including face parsing, face alignment, and face attribute classification. The dataset and models are available at https://github.com/ydli-ai/FLIP.
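The dual visual/linguistic filtering described above can be illustrated with a minimal sketch. All names here are hypothetical: the paper's actual pipeline uses a trained face detector and a face-caption classifier, which are approximated below by a precomputed flag and a keyword match.

```python
# Hypothetical sketch of filtering face-related image-text pairs
# (simplified stand-ins for the paper's detector and caption classifier).

FACE_TERMS = {"face", "portrait", "man", "woman", "smiling", "selfie"}

def caption_mentions_face(caption: str) -> bool:
    """Stand-in for a face-caption classifier: simple keyword match."""
    tokens = {t.strip(".,!?").lower() for t in caption.split()}
    return bool(tokens & FACE_TERMS)

def image_contains_face(image: dict) -> bool:
    """Placeholder for a real face detector; assumes a precomputed flag."""
    return image.get("has_face", False)

def filter_face_pairs(pairs: list) -> list:
    """Keep only pairs that pass both the visual and linguistic checks."""
    return [p for p in pairs
            if image_contains_face(p["image"])
            and caption_mentions_face(p["caption"])]

pairs = [
    {"image": {"has_face": True},  "caption": "A smiling woman at the beach"},
    {"image": {"has_face": False}, "caption": "A red car on a highway"},
    {"image": {"has_face": True},  "caption": "Close-up of a cat"},
]
kept = filter_face_pairs(pairs)
print(len(kept))  # only the first pair passes both checks
```

A production pipeline would replace both stand-ins with learned models and add the de-noising and augmentation stages the abstract mentions; this sketch only shows the intersection-of-filters idea.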

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 58-67
Number of pages: 10
ISBN (Electronic): 9798400706868
DOIs
Publication status: Published - 28 Oct 2024
Externally published: Yes
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • clip model
  • dataset
  • facial representation
  • facial-linguistic

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
  • Software
