CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition

Junliang Zhang; Xu Liu; Yu Liang; Xiaole Xian; Weicheng Xie; Linlin Shen; Siyang Song

doi:10.1109/IJCB62174.2024.10744485

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Existing work on dynamic facial expression recognition (DFER) provides insufficient semantic supervision, so videos with similar facial changes but different expressions are easily confused. Because it offers textual information for semantic supervision, the contrastive language-image pretraining (CLIP) model provides a new direction for DFER. However, CLIP is pre-trained on image-text pairs and has difficulty capturing temporal features in the video domain. We therefore propose a novel visual language model that captures and aggregates dynamic features of expressions under semantic supervision via an Inter-Frame Interaction Transformer (Inter-FIT) and Multi-Scale Temporal Aggregation (MSTA). Furthermore, although prompt learning is often used with CLIP to enhance semantic supervision, previous studies have focused only on textual prompts, ignoring the importance of visual prompts in facilitating the relationship between the two. We therefore design a Bidirectional Enhanced Prompt (BiEhPro) to facilitate learning of this relationship between text and visual cues and thereby enhance semantic supervision. Extensive experiments and ablation studies on three benchmark datasets, DFEW, FERV39K, and MAFW, validate the effectiveness of our modules and algorithm. Code is publicly available at https://github.com/JunLiangZ/CLIP-Guided-DFER.

Original language	English
Title of host publication	Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9798350364132
DOIs	https://doi.org/10.1109/IJCB62174.2024.10744485
Publication status	Published - 2024
Externally published	Yes
Event	18th IEEE International Joint Conference on Biometrics, IJCB 2024 - Buffalo, United States; Duration: 15 Sept 2024 → 18 Sept 2024

Publication series

Name	Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024

Conference

Conference	18th IEEE International Joint Conference on Biometrics, IJCB 2024
Country/Territory	United States
City	Buffalo
Period	15/09/24 → 18/09/24

ASJC Scopus subject areas

Artificial Intelligence
Computer Vision and Pattern Recognition
Biomedical Engineering
Instrumentation

Access to Document

10.1109/IJCB62174.2024.10744485

Cite this

Zhang, J., Liu, X., Liang, Y., Xian, X., Xie, W., Shen, L., & Song, S. (2024). CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition. In Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024 (Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IJCB62174.2024.10744485

Zhang, Junliang ; Liu, Xu ; Liang, Yu et al. / CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition. Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024. Institute of Electrical and Electronics Engineers Inc., 2024. (Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024).

@inproceedings{1e947a7b28bb48129744b181ec0c483b,
title = "CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition",
abstract = "Due to the insufficient semantic information supervision in existing works for dynamic facial expression recognition (DFER), videos with similar facial changes but different expressions may be easily confused. Thanks to the potential textual information for semantic supervision, contrastive language-image pretraining (CLIP) model provides a new direction for DFER. However, pre-trained CLIP based on image-text pairs has difficulty in capturing temporal features in the video domain. Therefore, we propose a novel visual language model that captures and aggregates dynamic features of expressions in semantic supervision via Inter-Frame Interaction Transformer (Inter-FIT) and Multi-Scale Temporal Aggregation (MSTA). Furthermore, though prompt learning is often used in CLIP to enhance semantic supervision, previous studies have only focused on the role of textual prompts, ignoring the importance of visual prompts in facilitating the relationality between the two. Therefore, we designed a Bidirectional Enhanced Prompt (BiEhPro) to facilitate the learning of this relationality between text and visual cues in enhancing semantic supervision. Extensive experiments and ablation studies on three benchmark datasets, i.e., DFEW, FERV39K, and MAFW, validate the effectiveness of our modules and algorithm. Code is publicly available at https://github.com/JunLiangZ/CLIP-Guided-DFER.",
author = "Junliang Zhang and Xu Liu and Yu Liang and Xiaole Xian and Weicheng Xie and Linlin Shen and Siyang Song",
note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 18th IEEE International Joint Conference on Biometrics, IJCB 2024 ; Conference date: 15-09-2024 Through 18-09-2024",
year = "2024",
doi = "10.1109/IJCB62174.2024.10744485",
language = "English",
series = "Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024",
address = "United States",
}

Zhang, J, Liu, X, Liang, Y, Xian, X, Xie, W, Shen, L & Song, S 2024, CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition. in Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024. Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024, Institute of Electrical and Electronics Engineers Inc., 18th IEEE International Joint Conference on Biometrics, IJCB 2024, Buffalo, United States, 15/09/24. https://doi.org/10.1109/IJCB62174.2024.10744485

CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition. / Zhang, Junliang; Liu, Xu; Liang, Yu et al.
Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024. Institute of Electrical and Electronics Engineers Inc., 2024. (Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024).

TY - GEN
T1 - CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition
AU - Zhang, Junliang
AU - Liu, Xu
AU - Liang, Yu
AU - Xian, Xiaole
AU - Xie, Weicheng
AU - Shen, Linlin
AU - Song, Siyang
PY - 2024
Y1 - 2024
AB - Due to the insufficient semantic information supervision in existing works for dynamic facial expression recognition (DFER), videos with similar facial changes but different expressions may be easily confused. Thanks to the potential textual information for semantic supervision, contrastive language-image pretraining (CLIP) model provides a new direction for DFER. However, pre-trained CLIP based on image-text pairs has difficulty in capturing temporal features in the video domain. Therefore, we propose a novel visual language model that captures and aggregates dynamic features of expressions in semantic supervision via Inter-Frame Interaction Transformer (Inter-FIT) and Multi-Scale Temporal Aggregation (MSTA). Furthermore, though prompt learning is often used in CLIP to enhance semantic supervision, previous studies have only focused on the role of textual prompts, ignoring the importance of visual prompts in facilitating the relationality between the two. Therefore, we designed a Bidirectional Enhanced Prompt (BiEhPro) to facilitate the learning of this relationality between text and visual cues in enhancing semantic supervision. Extensive experiments and ablation studies on three benchmark datasets, i.e., DFEW, FERV39K, and MAFW, validate the effectiveness of our modules and algorithm. Code is publicly available at https://github.com/JunLiangZ/CLIP-Guided-DFER.
UR - http://www.scopus.com/inward/record.url?scp=85211327482&partnerID=8YFLogxK
U2 - 10.1109/IJCB62174.2024.10744485
DO - 10.1109/IJCB62174.2024.10744485
M3 - Conference contribution
AN - SCOPUS:85211327482
T3 - Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024
BT - Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE International Joint Conference on Biometrics, IJCB 2024
Y2 - 15 September 2024 through 18 September 2024
ER -

Zhang J, Liu X, Liang Y, Xian X, Xie W, Shen L et al. CLIP-Guided Bidirectional Prompt and Semantic Supervision for Dynamic Facial Expression Recognition. In Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024. Institute of Electrical and Electronics Engineers Inc. 2024. (Proceedings - 2024 IEEE International Joint Conference on Biometrics, IJCB 2024). doi: 10.1109/IJCB62174.2024.10744485
