Efficient Audiovisual Fusion for Active Speaker Detection

Fiseha B. Tesema; Jason Gu; Wei Song; Hong Wu; Shiqiang Zhu; Zheyuan Lin

doi:10.1109/ACCESS.2023.3267668

Efficient Audiovisual Fusion for Active Speaker Detection

Fiseha B. Tesema, Jason Gu, Wei Song, Hong Wu, Shiqiang Zhu, Zheyuan Lin

Research output: Journal Publication › Article › peer-review

Abstract

Active speaker detection (ASD) refers to detecting the speaking person among visible human instances in a video. Existing methods widely employed a similar audiovisual fusion approach, the concatenation. Although such a fusion approach is often argued to help enhance performance, it must be noted that neither feature modalities play an equal role. It forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that since the concatenation doubles the fused feature dimension that feeds from the audio and video module, it creates a higher computational overhead for the backend network. To address these problems, this work hypothesizes that instead of leveraging deterministic fusion operation, employing an efficient fusion technique may assist the network in learning efficiently and improve detection accuracy. This work proposes an efficient audiovisual fusion (AVF) with fewer feature dimensions that captures the correlations between facial regions and sound signals, focusing more on the discriminative facial features and associating them with the corresponding audio features. Furthermore, previous ASD works focus only on improving ASD performance by creating a large computational overhead using complex techniques such as adding sophisticated postprocessing, applying smoothing techniques on the classifier to refine the network outputs at multiple stages, or assembling the multiple network outputs. This work proposed a simple yet effective end-to-end ASD using the newly proposed feature fusion approach, the AVF. The proposed framework attained a mAP of 84.384% on the validation set of the most challenging audiovisual speaker detection benchmark, the AVA-ActiveSpeaker. With this, this work outperformed previous works that did not apply the postprocessing tasks and attained competitive detection accuracy compared to other works that employed different postprocessing tasks. The proposed model also learns better on the unsynchronized raw AVA-ActiveSpeaker dataset. The ablation experiments under different image scale settings and noisy signals show the AFV's effectiveness and robustness than the concatenation operation.

Original language	English
Pages (from-to)	45140-45153
Number of pages	14
Journal	IEEE Access
Volume	11
DOIs	https://doi.org/10.1109/ACCESS.2023.3267668
Publication status	Published - 2023
Externally published	Yes

Keywords

Active speaker detection
audiovisual fusion
deep learning
human-computer interaction

ASJC Scopus subject areas

General Computer Science
General Materials Science
General Engineering

Access to Document

10.1109/ACCESS.2023.3267668

Cite this

@article{98fad7262e2444e499884b791876c920,

title = "Efficient Audiovisual Fusion for Active Speaker Detection",

abstract = "Active speaker detection (ASD) refers to detecting the speaking person among visible human instances in a video. Existing methods widely employed a similar audiovisual fusion approach, the concatenation. Although such a fusion approach is often argued to help enhance performance, it must be noted that neither feature modalities play an equal role. It forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that since the concatenation doubles the fused feature dimension that feeds from the audio and video module, it creates a higher computational overhead for the backend network. To address these problems, this work hypothesizes that instead of leveraging deterministic fusion operation, employing an efficient fusion technique may assist the network in learning efficiently and improve detection accuracy. This work proposes an efficient audiovisual fusion (AVF) with fewer feature dimensions that captures the correlations between facial regions and sound signals, focusing more on the discriminative facial features and associating them with the corresponding audio features. Furthermore, previous ASD works focus only on improving ASD performance by creating a large computational overhead using complex techniques such as adding sophisticated postprocessing, applying smoothing techniques on the classifier to refine the network outputs at multiple stages, or assembling the multiple network outputs. This work proposed a simple yet effective end-to-end ASD using the newly proposed feature fusion approach, the AVF. The proposed framework attained a mAP of 84.384% on the validation set of the most challenging audiovisual speaker detection benchmark, the AVA-ActiveSpeaker. With this, this work outperformed previous works that did not apply the postprocessing tasks and attained competitive detection accuracy compared to other works that employed different postprocessing tasks. The proposed model also learns better on the unsynchronized raw AVA-ActiveSpeaker dataset. The ablation experiments under different image scale settings and noisy signals show the AFV's effectiveness and robustness than the concatenation operation.",

keywords = "Active speaker detection, audiovisual fusion, deep learning, human-computer interaction",

author = "Tesema, {Fiseha B.} and Jason Gu and Wei Song and Hong Wu and Shiqiang Zhu and Zheyuan Lin",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2023",

doi = "10.1109/ACCESS.2023.3267668",

language = "English",

volume = "11",

pages = "45140--45153",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Efficient Audiovisual Fusion for Active Speaker Detection

AU - Tesema, Fiseha B.

AU - Gu, Jason

AU - Song, Wei

AU - Wu, Hong

AU - Zhu, Shiqiang

AU - Lin, Zheyuan

PY - 2023

Y1 - 2023

N2 - Active speaker detection (ASD) refers to detecting the speaking person among visible human instances in a video. Existing methods widely employed a similar audiovisual fusion approach, the concatenation. Although such a fusion approach is often argued to help enhance performance, it must be noted that neither feature modalities play an equal role. It forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that since the concatenation doubles the fused feature dimension that feeds from the audio and video module, it creates a higher computational overhead for the backend network. To address these problems, this work hypothesizes that instead of leveraging deterministic fusion operation, employing an efficient fusion technique may assist the network in learning efficiently and improve detection accuracy. This work proposes an efficient audiovisual fusion (AVF) with fewer feature dimensions that captures the correlations between facial regions and sound signals, focusing more on the discriminative facial features and associating them with the corresponding audio features. Furthermore, previous ASD works focus only on improving ASD performance by creating a large computational overhead using complex techniques such as adding sophisticated postprocessing, applying smoothing techniques on the classifier to refine the network outputs at multiple stages, or assembling the multiple network outputs. This work proposed a simple yet effective end-to-end ASD using the newly proposed feature fusion approach, the AVF. The proposed framework attained a mAP of 84.384% on the validation set of the most challenging audiovisual speaker detection benchmark, the AVA-ActiveSpeaker. With this, this work outperformed previous works that did not apply the postprocessing tasks and attained competitive detection accuracy compared to other works that employed different postprocessing tasks. The proposed model also learns better on the unsynchronized raw AVA-ActiveSpeaker dataset. The ablation experiments under different image scale settings and noisy signals show the AFV's effectiveness and robustness than the concatenation operation.

AB - Active speaker detection (ASD) refers to detecting the speaking person among visible human instances in a video. Existing methods widely employed a similar audiovisual fusion approach, the concatenation. Although such a fusion approach is often argued to help enhance performance, it must be noted that neither feature modalities play an equal role. It forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that since the concatenation doubles the fused feature dimension that feeds from the audio and video module, it creates a higher computational overhead for the backend network. To address these problems, this work hypothesizes that instead of leveraging deterministic fusion operation, employing an efficient fusion technique may assist the network in learning efficiently and improve detection accuracy. This work proposes an efficient audiovisual fusion (AVF) with fewer feature dimensions that captures the correlations between facial regions and sound signals, focusing more on the discriminative facial features and associating them with the corresponding audio features. Furthermore, previous ASD works focus only on improving ASD performance by creating a large computational overhead using complex techniques such as adding sophisticated postprocessing, applying smoothing techniques on the classifier to refine the network outputs at multiple stages, or assembling the multiple network outputs. This work proposed a simple yet effective end-to-end ASD using the newly proposed feature fusion approach, the AVF. The proposed framework attained a mAP of 84.384% on the validation set of the most challenging audiovisual speaker detection benchmark, the AVA-ActiveSpeaker. With this, this work outperformed previous works that did not apply the postprocessing tasks and attained competitive detection accuracy compared to other works that employed different postprocessing tasks. The proposed model also learns better on the unsynchronized raw AVA-ActiveSpeaker dataset. The ablation experiments under different image scale settings and noisy signals show the AFV's effectiveness and robustness than the concatenation operation.

KW - Active speaker detection

KW - audiovisual fusion

KW - deep learning

KW - human-computer interaction

UR - http://www.scopus.com/inward/record.url?scp=85153523104&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2023.3267668

DO - 10.1109/ACCESS.2023.3267668

M3 - Article

AN - SCOPUS:85153523104

SN - 2169-3536

VL - 11

SP - 45140

EP - 45153

JO - IEEE Access

JF - IEEE Access

ER -

Efficient Audiovisual Fusion for Active Speaker Detection

Abstract

Keywords

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this