FLIP-80M: 80 Million Visual-Linguistic Pairs for Facial Language-Image Pre-Training

Yudong Li, Xianxu Hou, Zheng Dezhi, Linlin Shen, Zhe Zhao

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

While significant progress has been made in multi-modal learning driven by large-scale image-text datasets, there is still a noticeable gap in the availability of such datasets within the facial domain. To facilitate and advance the field of facial representation learning, we present FLIP-80M, a large-scale visual-linguistic dataset comprising over 80 million face images paired with text descriptions. FLIP-80M is constructed by leveraging LAION-5B, a large, openly available image-text-pair dataset, together with a mixed-method approach that filters face-related pairs from both visual and linguistic perspectives. Our curation process involves face detection, face caption classification, text de-noising, and synthesis-based image augmentation. As a result, FLIP-80M stands as the largest face-text dataset to date. To evaluate the potential of our dataset, we fine-tune the CLIP model on FLIP-80M to create FLIP (Facial Language-Image Pretraining) and assess its representation capabilities across various downstream tasks. Our experiments demonstrate that FLIP achieves state-of-the-art results on a range of face analysis tasks, including face parsing, face alignment, and face attribute classification. The dataset and models are available at https://github.com/ydli-ai/FLIP.
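The dual visual/linguistic filtering described above can be illustrated with a minimal sketch. All names here are hypothetical: the paper's actual pipeline uses a trained face detector and a face-caption classifier, which are approximated below by a precomputed flag and a keyword match.

```python
# Hypothetical sketch of filtering face-related image-text pairs
# (simplified stand-ins for the paper's detector and caption classifier).

FACE_TERMS = {"face", "portrait", "man", "woman", "smiling", "selfie"}

def caption_mentions_face(caption: str) -> bool:
    """Stand-in for a face-caption classifier: simple keyword match."""
    tokens = {t.strip(".,!?").lower() for t in caption.split()}
    return bool(tokens & FACE_TERMS)

def image_contains_face(image: dict) -> bool:
    """Placeholder for a real face detector; assumes a precomputed flag."""
    return image.get("has_face", False)

def filter_face_pairs(pairs: list) -> list:
    """Keep only pairs that pass both the visual and linguistic checks."""
    return [p for p in pairs
            if image_contains_face(p["image"])
            and caption_mentions_face(p["caption"])]

pairs = [
    {"image": {"has_face": True},  "caption": "A smiling woman at the beach"},
    {"image": {"has_face": False}, "caption": "A red car on a highway"},
    {"image": {"has_face": True},  "caption": "Close-up of a cat"},
]
kept = filter_face_pairs(pairs)
print(len(kept))  # only the first pair passes both checks
```

A production pipeline would replace both stand-ins with learned models and add the de-noising and augmentation stages the abstract mentions; this sketch only shows the intersection-of-filters idea.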

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 58-67
Number of pages: 10
ISBN (Electronic): 9798400706868
DOIs
Publication status: Published - 28 Oct 2024
Externally published: Yes
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • clip model
  • dataset
  • facial representation
  • facial-linguistic

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
  • Software
