Dynamic interactive learning network for audio-visual event localization

Jincai Chen, Han Liang, Ruili Wang, Jiangfeng Zeng, Ping Lu

Research output: Journal Publication › Article › peer-review

1 Citation (Scopus)

Abstract

Audio-visual event (AVE) localization aims to detect whether an event exists in each video segment and to predict its category. Only when an event is both audible and visible can it be recognized as an AVE. However, the information from the auditory and visual modalities is sometimes asymmetrical within a video sequence, leading to incorrect predictions. To address this challenge, we introduce a dynamic interactive learning network that dynamically explores the intra- and inter-modal relationships conditioned on the other modality for better AVE localization. Specifically, our approach incorporates a dynamic intra- and inter-modal fusion attention module, which enables each modality to attend more to the regions the other modality deems informative and less to the regions the other modality regards as noise. In addition, we introduce an audio-visual difference loss to reduce the distance between the auditory and visual representations. Extensive experimental results on the AVE dataset demonstrate the superior performance of our proposed method. The source code will be available at https://github.com/hanliang/DILN.
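The abstract describes two technical components: a cross-modal fusion attention in which each modality reweights the other's segments, and a difference loss that pulls the audio and visual representations together. The following is a minimal illustrative sketch of these ideas, not the authors' implementation; the module name, dimensions, and the L2 form of the difference loss are assumptions.

```python
# Hypothetical sketch (not the paper's code): inter-modal attention fusion plus a
# simple normalized L2 "difference loss". All names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusionAttention(nn.Module):
    """Each modality attends over the other modality's segment features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(dim, dim)
        self.visual_proj = nn.Linear(dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, num_segments, dim)
        q_a = self.audio_proj(audio)    # audio queries
        q_v = self.visual_proj(visual)  # visual queries
        scale = audio.size(-1) ** 0.5
        # Inter-modal attention: audio attends to visual segments and vice versa,
        # so each modality emphasizes regions the other modality finds informative.
        attn_av = torch.softmax(q_a @ visual.transpose(1, 2) / scale, dim=-1)
        attn_va = torch.softmax(q_v @ audio.transpose(1, 2) / scale, dim=-1)
        audio_fused = audio + attn_av @ visual
        visual_fused = visual + attn_va @ audio
        return audio_fused, visual_fused

def audio_visual_difference_loss(audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
    """Encourage audio and visual segment representations to stay close (assumed L2 form)."""
    return F.mse_loss(F.normalize(audio, dim=-1), F.normalize(visual, dim=-1))
```

In this sketch the difference loss would be added to the event classification loss with a weighting coefficient; the actual formulation and weighting used in the paper may differ.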

Original language: English
Pages (from-to): 30431-30442
Number of pages: 12
Journal: Applied Intelligence
Volume: 53
Issue number: 24
DOIs
Publication status: Published - Dec 2023
Externally published: Yes

Keywords

  • Attention mechanism
  • Audio-visual event localization
  • Difference loss
  • Dynamic fusion

ASJC Scopus subject areas

  • Artificial Intelligence
