Visual-linguistic Cross-domain Feature Learning with Group Attention and Gamma-correct Gated Fusion for Extracting Commonsense Knowledge

Jialu Zhang, Xinyi Wang, Chenglin Yao, Jianfeng Ren, Xudong Jiang

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Acquiring commonsense knowledge about entity-pairs from images is crucial across diverse applications. Distantly supervised learning has made significant advancements by automatically retrieving images containing entity pairs and summarizing commonsense knowledge from the bag of images. However, the retrieved images may not always cover all possible relations, and the informative features across the bag of images are often overlooked. To address these challenges, a Multi-modal Cross-domain Feature Learning framework is proposed to incorporate the general domain knowledge from a large vision-text foundation model, ViT-GPT2, to handle unseen relations and exploit complementary information from multiple sources. Then, a Group Attention module is designed to exploit the attentive information from other instances of the same bag to boost the informative features of individual instances. Finally, a Gamma-corrected Gated Fusion is designed to select a subset of informative instances for a comprehensive summarization of commonsense entity relations. Extensive experimental results demonstrate the superiority of the proposed method over state-of-the-art models for extracting commonsense knowledge.
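The abstract's Gamma-corrected Gated Fusion can be pictured as gating each instance in a bag, gamma-correcting the gate scores, and fusing only the most informative instances. The sketch below is an illustrative guess at such a mechanism, not the paper's published formulation; the function name, gating weights, and top-k selection are all assumptions.

```python
import numpy as np

def gamma_gated_fusion(feats, gate_w, gamma=0.5, top_k=2):
    """Hypothetical sketch of gamma-corrected gated fusion.

    feats: (n, d) instance features from one bag; gate_w: (d,) gating weights.
    The exact equations are not given in the abstract; this is illustrative only.
    """
    scores = 1.0 / (1.0 + np.exp(-feats @ gate_w))   # sigmoid gate per instance
    corrected = scores ** gamma                      # gamma < 1 lifts mid-range gates
    idx = np.argsort(corrected)[::-1][:top_k]        # keep the top-k informative instances
    weights = corrected[idx] / corrected[idx].sum()  # renormalize over the selected subset
    return weights @ feats[idx]                      # weighted sum = bag-level summary

# Toy bag of three instance features (e.g. pooled image embeddings)
feats = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
fused = gamma_gated_fusion(feats, gate_w=np.array([1.0, 1.0]))
```

With gamma below 1, moderately gated instances still contribute to the fused summary, while the top-k cut-off discards clearly uninformative retrievals.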

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 4650-4659
Number of pages: 10
ISBN (Electronic): 9798400706868
Publication status: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • commonsense knowledge extraction
  • cross-instance attention
  • cross-modal learning
  • gamma-corrected gated fusion
  • large vision-language model

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
  • Software
