Real-time Architecture for Audio-Visual Active Speaker Detection

  • Min Huang
  • Wen Wang
  • Zheyuan Lin
  • Fiseha B. Tesema
  • Shanshan Ji
  • Jason Gu
  • Minhong Wan
  • Wei Song
  • Te Li
  • Shiqiang Zhu

Research output: Chapter in Book/Conference proceeding, Conference contribution, peer-review

Abstract

Continuously measuring the speaking state of users interacting with a robot in a human-robot interaction (HRI) system improves metrics of interaction quality. Meanwhile, mainstream active speaker detection (ASD) algorithms emphasize achieving a high frame-level AUC on the AVA-Active Speaker dataset and pay less attention to real-time performance in robotic systems. In this paper, we propose a model named FSDNet that maintains a high AUC score on the AVA-Active Speaker dataset while reducing time cost: it increases the AUC score by 0.1% compared with the state of the art and needs only 75% of the running time. Furthermore, we put forward an architecture with a time-related prediction function to make our algorithm more effective and generative in interactive robotic systems. The code is released at https://github.com/huangmin9966/FSDNet-RealTimeArch.

Original language: English
Title of host publication: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1377-1382
Number of pages: 6
ISBN (Electronic): 9781665481090
Publication status: Published - 2022
Externally published: Yes
Event: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022 - Jinghong, China
Duration: 5 Dec 2022 to 9 Dec 2022

Publication series

Name: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022

Conference

Conference: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022
Country/Territory: China
City: Jinghong
Period: 5/12/22 to 9/12/22

ASJC Scopus subject areas

  • Artificial Intelligence
  • Aerospace Engineering
  • Automotive Engineering
  • Control and Optimization
  • Modelling and Simulation
