TY - GEN
T1 - FLIP-80M
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Li, Yudong
AU - Hou, Xianxu
AU - Zheng, Dezhi
AU - Shen, Linlin
AU - Zhao, Zhe
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
AB - While significant progress has been made in multi-modal learning driven by large-scale image-text datasets, there is still a noticeable gap in the availability of such datasets within the facial domain. To facilitate and advance the field of facial representation learning, we present FLIP-80M, a large-scale visual-linguistic dataset comprising over 80 million face images paired with text descriptions. FLIP-80M is constructed by leveraging the large openly available image-text-pair dataset LAION-5B and a mixed-method approach to filter face-related pairs from both visual and linguistic perspectives. Our curation process involves face detection, face caption classification, text de-noising, and synthesis-based image augmentation. As a result, FLIP-80M stands as the largest face-text dataset to date. To evaluate the potential of our dataset, we fine-tune the CLIP model using the proposed FLIP-80M to create FLIP (Facial Language-Image Pretraining) and assess its representation capabilities across various downstream tasks. Our experiments demonstrate that our FLIP model achieves state-of-the-art results in a range of face analysis tasks, including face parsing, face alignment, and face attribute classification. The dataset and models are available at https://github.com/ydli-ai/FLIP.
KW - clip model
KW - dataset
KW - facial representation
KW - facial-linguistic
UR - http://www.scopus.com/inward/record.url?scp=85209814716&partnerID=8YFLogxK
U2 - 10.1145/3664647.3681287
DO - 10.1145/3664647.3681287
M3 - Conference contribution
AN - SCOPUS:85209814716
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 58
EP - 67
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -