Audio-Visual Event Localization using Multi-task Hybrid Attention Networks for Smart Healthcare Systems

Han Liang, Jincai Chen, Fazl Ullah Khan, Gautam Srivastava, Jiangfeng Zeng

Research output: Journal Publication, Article, peer-review


Human perception relies heavily on two primary senses, vision and hearing, which are closely interconnected and complement each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. The primary objective of AVEL is to identify whether an event is present in each video segment and to predict its category. This task is useful in domains such as healthcare monitoring and surveillance. Generally speaking, audio-visual co-learning provides more comprehensive information than single-modal learning, as it allows a more holistic perception of the ambient environment, in line with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can lead to inconsistent event semantics across modalities, potentially causing incorrect predictions. To tackle these challenges, we propose a multi-task hybrid attention network (MHAN) to acquire high-quality representations of multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, each consisting of a uni-modal attention block and a parallel cross-modal attention block, which leverage complementary and hidden multimodal information for better representation. Furthermore, we use a uni-modal visual task as auxiliary supervision, within a multi-task learning strategy, to improve performance on the multimodal task. Extensive experiments on the AVE dataset show that the proposed model outperforms state-of-the-art results.
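The abstract names the building blocks (a uni-modal attention block, a parallel cross-modal attention block, and an auxiliary visual task) without giving their equations. The following is a minimal NumPy sketch of how such pieces could fit together, not the authors' implementation. It assumes plain scaled dot-product attention, residual fusion of the two blocks, and a weighted sum of two cross-entropy losses. All function names, the `aux_weight` parameter, and the toy sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention; query is (T_q, d), key and value are (T_k, d)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value


def hybrid_attention(audio, visual):
    """Hypothetical stand-in for one hybrid attention module."""
    # Uni-modal block: each modality attends over its own segments.
    audio_self = attention(audio, audio, audio)
    visual_self = attention(visual, visual, visual)
    # Parallel cross-modal block: both directions read the same inputs,
    # so neither direction depends on the other's output.
    audio_cross = attention(audio_self, visual_self, visual_self)
    visual_cross = attention(visual_self, audio_self, audio_self)
    # Residual fusion of the two blocks (an assumption of this sketch).
    return audio_self + audio_cross, visual_self + visual_cross


def cross_entropy(logits, labels):
    """Mean cross-entropy over segments; labels are integer class ids."""
    probs = softmax(logits, axis=-1)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def multitask_loss(av_logits, visual_logits, labels, aux_weight=0.5):
    """Main audio-visual loss plus a weighted auxiliary visual-only loss."""
    main = cross_entropy(av_logits, labels)
    auxiliary = cross_entropy(visual_logits, labels)
    return main + aux_weight * auxiliary


# Toy sizes chosen for illustration only: segments, feature width, classes.
rng = np.random.default_rng(0)
T, d, num_classes = 10, 16, 5
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))
labels = rng.integers(0, num_classes, size=T)

audio_out, visual_out = hybrid_attention(audio, visual)

# Random linear heads stand in for trained classifiers.
w_av = rng.standard_normal((2 * d, num_classes))
w_v = rng.standard_normal((d, num_classes))
av_logits = np.concatenate([audio_out, visual_out], axis=1) @ w_av
visual_logits = visual_out @ w_v

loss = multitask_loss(av_logits, visual_logits, labels)
```

In this sketch, setting `aux_weight` to 0 reduces the objective to the multimodal loss alone, which makes the contribution of the auxiliary visual task easy to isolate.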
Original language: English
Journal: ACM Transactions on Internet Technology
Publication status: Published online - 16 Mar 2024


  • parallel attention
  • hybrid attention
  • multi-task learning
  • healthcare monitoring

