TY - GEN
T1 - CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition
AU - Zhang, Junliang
AU - Liu, Xu
AU - Liang, Yu
AU - Xian, Xiaole
AU - Xie, Weicheng
AU - Shen, Linlin
AU - Song, Siyang
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Due to insufficient semantic supervision in existing works on dynamic facial expression recognition (DFER), videos with similar facial changes but different expressions are easily confused. Because textual information can provide semantic supervision, the contrastive language-image pre-training (CLIP) model offers a new direction for DFER. However, CLIP is pre-trained on image-text pairs and therefore struggles to capture temporal features in the video domain. We therefore propose a novel vision-language model that captures and aggregates dynamic expression features under semantic supervision via an Inter-Frame Interaction Transformer (Inter-FIT) and Multi-Scale Temporal Aggregation (MSTA). Furthermore, although prompt learning is often used with CLIP to enhance semantic supervision, previous studies have focused only on textual prompts, ignoring the role of visual prompts in strengthening the relationship between the two modalities. We therefore design a Bidirectional Enhanced Prompt (BiEhPro) to learn this relationship between textual and visual cues and thereby enhance semantic supervision. Extensive experiments and ablation studies on three benchmark datasets, i.e., DFEW, FERV39K, and MAFW, validate the effectiveness of our modules and algorithm. Code is publicly available at https://github.com/JunLiangZ/CLIP-Guided-DFER.
UR - http://www.scopus.com/inward/record.url?scp=85211327482&partnerID=8YFLogxK
U2 - 10.1109/IJCB62174.2024.10744485
DO - 10.1109/IJCB62174.2024.10744485
M3 - Conference contribution
AN - SCOPUS:85211327482
T3 - Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024
BT - Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE International Joint Conference on Biometrics, IJCB 2024
Y2 - 15 September 2024 through 18 September 2024
ER -