Video Question Answering Using CLIP-Guided Visual-Text Attention

Shuhong Ye, Weikai Kong, Chenglin Yao, Jianfeng Ren, Xudong Jiang

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that utilizes Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using a TimeSformer and text features using a BERT from the target application domain, and utilize CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose a cross-domain learning scheme to extract the attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
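
The abstract describes the architecture only at a high level. The following PyTorch sketch illustrates one plausible reading of the CLIP-guided visual-text attention: general-domain CLIP features act as queries that attend over target-domain TimeSformer/BERT features, and the two guided streams are fused to predict an answer. All layer sizes, module names, and the classification-over-answer-vocabulary head are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class ClipGuidedCrossAttention(nn.Module):
    """Illustrative cross-domain attention block: CLIP-derived general-domain
    features query target-domain (TimeSformer or BERT) token features.
    Dimensions and layer choices are assumptions, not the paper's configuration."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, clip_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats:   (B, Lc, dim) general-domain CLIP visual or text tokens
        # target_feats: (B, Lt, dim) target-domain TimeSformer/BERT tokens
        attended, _ = self.attn(query=clip_feats, key=target_feats, value=target_feats)
        x = self.norm(clip_feats + attended)   # residual connection + layer norm
        return self.norm(x + self.ffn(x))      # feed-forward refinement


class VideoQAFusionHead(nn.Module):
    """Minimal fusion head: pool the CLIP-guided visual and text streams,
    concatenate them, and classify over a fixed answer vocabulary
    (open-ended VideoQA treated as classification -- an assumption)."""

    def __init__(self, dim: int = 512, num_answers: int = 1000):
        super().__init__()
        self.vis_block = ClipGuidedCrossAttention(dim)
        self.txt_block = ClipGuidedCrossAttention(dim)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, clip_vis, clip_txt, video_feats, question_feats):
        v = self.vis_block(clip_vis, video_feats).mean(dim=1)      # pool over tokens
        t = self.txt_block(clip_txt, question_feats).mean(dim=1)
        return self.classifier(torch.cat([v, t], dim=-1))          # answer logits


if __name__ == "__main__":
    B, dim = 2, 512
    model = VideoQAFusionHead(dim=dim, num_answers=1000)
    logits = model(
        clip_vis=torch.randn(B, 8, dim),         # CLIP frame embeddings
        clip_txt=torch.randn(B, 16, dim),        # CLIP question-token embeddings
        video_feats=torch.randn(B, 64, dim),     # TimeSformer tokens
        question_feats=torch.randn(B, 16, dim),  # BERT question tokens
    )
    print(logits.shape)  # torch.Size([2, 1000])
```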

Original language: English
Title of host publication: 2023 IEEE International Conference on Image Processing, ICIP 2023 - Proceedings
Publisher: IEEE Computer Society
Pages: 81-85
Number of pages: 5
ISBN (Electronic): 9781728198354
DOIs
Publication status: Published - 2023
Event: 30th IEEE International Conference on Image Processing, ICIP 2023 - Kuala Lumpur, Malaysia
Duration: 8 Oct 2023 - 11 Oct 2023

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880

Conference

Conference: 30th IEEE International Conference on Image Processing, ICIP 2023
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 8/10/23 - 11/10/23

Keywords

  • CLIP
  • Cross-domain Learning
  • Cross-modal Learning
  • Video Question Answering

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Signal Processing
