TY - JOUR
T1 - Audio-Visual Event Localization using Multi-task Hybrid Attention Networks for Smart Healthcare Systems
AU - Liang, Han
AU - Chen, Jincai
AU - Khan, Fazl Ullah
AU - Srivastava, Gautam
AU - Zeng, Jiangfeng
PY - 2024/3/16
Y1 - 2024/3/16
N2 - Human perception relies heavily on two primary senses, vision and hearing, which are closely interconnected and capable of complementing each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. AVEL is a popular task within the realm of multimodal learning, whose primary objective is to identify the presence of events within each video segment and predict their respective categories. This task holds significant utility in domains such as healthcare monitoring and surveillance, among others. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape than single-modal learning, as it allows for a more holistic perception of ambient information, in line with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce inconsistencies in event semantics, potentially leading to incorrect predictions. To tackle these challenges, we propose a multi-task hybrid attention network (MHAN) to acquire high-quality representations of multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, each consisting of a uni-modal attention block and a parallel cross-modal attention block, which leverage complementary and hidden multimodal information for better representations. Furthermore, we advocate the use of a uni-modal visual task as auxiliary supervision to enhance the performance of multimodal tasks through a multi-task learning strategy. Extensive experiments on the AVE dataset demonstrate that our proposed model outperforms state-of-the-art results.
KW - parallel attention
KW - hybrid attention
KW - multi-task learning
KW - healthcare monitoring
U2 - 10.1145/3653018
DO - 10.1145/3653018
M3 - Article
SN - 1533-5399
JO - ACM Transactions on Internet Technology
JF - ACM Transactions on Internet Technology
ER -