Video Question Answering Using CLIP-Guided Visual-Text Attention

Shuhong Ye, Weikai Kong, Chenglin Yao, Jianfeng Ren, Xudong Jiang

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that utilizes Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using a TimeSformer and text features using a BERT from the target application domain, and utilize CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose a cross-domain learning scheme to extract the attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
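
The abstract describes the architecture only at a high level. The following PyTorch sketch illustrates one plausible reading of the CLIP-guided visual-text attention: general-domain CLIP features act as queries that attend over target-domain TimeSformer/BERT features, and the two guided streams are fused to predict an answer. All layer sizes, module names, and the classification-over-answer-vocabulary head are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class ClipGuidedCrossAttention(nn.Module):
    """Illustrative cross-domain attention block: CLIP-derived general-domain
    features query target-domain (TimeSformer or BERT) token features.
    Dimensions and layer choices are assumptions, not the paper's configuration."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, clip_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats:   (B, Lc, dim) general-domain CLIP visual or text tokens
        # target_feats: (B, Lt, dim) target-domain TimeSformer/BERT tokens
        attended, _ = self.attn(query=clip_feats, key=target_feats, value=target_feats)
        x = self.norm(clip_feats + attended)   # residual connection + layer norm
        return self.norm(x + self.ffn(x))      # feed-forward refinement


class VideoQAFusionHead(nn.Module):
    """Minimal fusion head: pool the CLIP-guided visual and text streams,
    concatenate them, and classify over a fixed answer vocabulary
    (open-ended VideoQA treated as classification -- an assumption)."""

    def __init__(self, dim: int = 512, num_answers: int = 1000):
        super().__init__()
        self.vis_block = ClipGuidedCrossAttention(dim)
        self.txt_block = ClipGuidedCrossAttention(dim)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, clip_vis, clip_txt, video_feats, question_feats):
        v = self.vis_block(clip_vis, video_feats).mean(dim=1)      # pool over tokens
        t = self.txt_block(clip_txt, question_feats).mean(dim=1)
        return self.classifier(torch.cat([v, t], dim=-1))          # answer logits


if __name__ == "__main__":
    B, dim = 2, 512
    model = VideoQAFusionHead(dim=dim, num_answers=1000)
    logits = model(
        clip_vis=torch.randn(B, 8, dim),         # CLIP frame embeddings
        clip_txt=torch.randn(B, 16, dim),        # CLIP question-token embeddings
        video_feats=torch.randn(B, 64, dim),     # TimeSformer tokens
        question_feats=torch.randn(B, 16, dim),  # BERT question tokens
    )
    print(logits.shape)  # torch.Size([2, 1000])
```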

Original language: English
Title of host publication: 2023 IEEE International Conference on Image Processing, ICIP 2023 - Proceedings
Publisher: IEEE Computer Society
Pages: 81-85
Number of pages: 5
ISBN (Electronic): 9781728198354
DOIs
Publication status: Published - 2023
Event: 30th IEEE International Conference on Image Processing, ICIP 2023 - Kuala Lumpur, Malaysia
Duration: 8 Oct 2023 - 11 Oct 2023

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880

Conference

Conference: 30th IEEE International Conference on Image Processing, ICIP 2023
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 8/10/23 - 11/10/23

Keywords

  • CLIP
  • Cross-domain Learning
  • Cross-modal Learning
  • Video Question Answering

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Signal Processing
