TY - GEN
T1 - Video Question Answering Using Clip-Guided Visual-Text Attention
AU - Ye, Shuhong
AU - Kong, Weikai
AU - Yao, Chenglin
AU - Ren, Jianfeng
AU - Jiang, Xudong
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that utilizes Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using a TimeSformer and text features using a BERT model from the target application domain, and utilize CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose a Cross-domain Learning module to extract attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
AB - Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that utilizes Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using a TimeSformer and text features using a BERT model from the target application domain, and utilize CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose a Cross-domain Learning module to extract attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
KW - CLIP
KW - Cross-domain Learning
KW - Cross-modal Learning
KW - Video Question Answering
UR - http://www.scopus.com/inward/record.url?scp=85180733981&partnerID=8YFLogxK
U2 - 10.1109/ICIP49359.2023.10222286
DO - 10.1109/ICIP49359.2023.10222286
M3 - Conference contribution
AN - SCOPUS:85180733981
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 81
EP - 85
BT - 2023 IEEE International Conference on Image Processing, ICIP 2023 - Proceedings
PB - IEEE Computer Society
T2 - 30th IEEE International Conference on Image Processing, ICIP 2023
Y2 - 8 October 2023 through 11 October 2023
ER -