Multi-cue based four-stream 3D ResNets for video-based action recognition

Lei Wang; Xiaoguang Yuan; Ming Zong; Yujun Ma; Wanting Ji; Mingzhe Liu; Ruili Wang

doi:10.1016/j.ins.2021.07.079

Research output: Journal Publication › Article › peer-review

25 Citations (Scopus)

Abstract

Action recognition is an important computer vision task with many applications. This paper proposes a Multi-cue based Four-stream 3D ResNets (MF3D) model for action recognition. The MF3D model contains four streams: a video saliency stream, an appearance stream, a motion stream, and an audio stream, which capture the video saliency, appearance, motion, and audio cues, respectively. In addition, three different connections are injected between the streams to transfer cues from one stream to another and obtain more effective spatiotemporal features. Experiments on the Kinetics and Kinetics-Sounds datasets verify that the MF3D model is effective and outperforms existing models.
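
The abstract does not say which streams the three connections link, how cues are transferred, or how the stream outputs are combined. As a rough illustration only, the sketch below assumes an additive cross-stream connection and score averaging as late fusion. The helpers `inject` and `fuse`, the `weight` parameter, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math

# Stream names as listed in the abstract.
STREAMS = ("saliency", "appearance", "motion", "audio")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inject(target, source, weight=1.0):
    """Hypothetical cross-stream connection (an assumption, not the paper's):
    add weighted source-stream features to the target-stream features."""
    if len(target) != len(source):
        raise ValueError("feature vectors must have the same length")
    return [t + weight * s for t, s in zip(target, source)]


def fuse(stream_logits):
    """Hypothetical late fusion (an assumption, not the paper's):
    average the per-stream softmax scores class by class."""
    probs = [softmax(stream_logits[name]) for name in STREAMS]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]


if __name__ == "__main__":
    # Toy per-stream class logits for a 3-class problem (made-up values).
    logits = {
        "saliency": [0.2, 1.0, -0.5],
        "appearance": [0.1, 1.5, 0.0],
        "motion": [0.0, 0.8, 0.3],
        "audio": [1.2, 0.4, -1.0],
    }
    # Example connection: pass the motion cue into the appearance stream.
    logits["appearance"] = inject(logits["appearance"], logits["motion"], weight=0.5)
    scores = fuse(logits)
    print("predicted class index:", scores.index(max(scores)))
```

In a real model the connections would act on intermediate 3D feature maps inside the networks, not on final class logits. The sketch only shows the data flow between streams.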

Original language	English
Pages (from-to)	654-665
Number of pages	12
Journal	Information Sciences
Volume	575
DOIs	https://doi.org/10.1016/j.ins.2021.07.079
Publication status	Published - Oct 2021
Externally published	Yes

Keywords

3D ResNets
Action recognition
Audio cue
Multi-cue
Video saliency cue

ASJC Scopus subject areas

Software
Control and Systems Engineering
Theoretical Computer Science
Computer Science Applications
Information Systems and Management
Artificial Intelligence

Access to Document

10.1016/j.ins.2021.07.079

Cite this

@article{08fdef38cde74e0b98091a740c0d237d,
title = "Multi-cue based four-stream 3D ResNets for video-based action recognition",
abstract = "Action recognition is one of the important computer vision tasks, which has many applications. This paper proposes a Multi-cue based Four-stream 3D ResNets (MF3D) model for action recognition. The proposed MF3D model contains four streams: a video saliency stream, an appearance stream, a motion stream and an audio stream. Four cues (i.e. the appearance cue, the motion cue, the video saliency cue and audio cue) are captured by the four streams of our proposed MF3D model. In addition, three different connections between different streams are injected, which can transfer different cues between different streams to obtain more effective spatiotemporal features. Experiments are conducted on the Kinetics and Kinetics-Sounds datasets, and the results verify that our MF3D model is effective and outperforms current existing models.",
keywords = "3D ResNets, Action recognition, Audio cue, Multi-cue, Video saliency cue",
author = "Lei Wang and Xiaoguang Yuan and Ming Zong and Yujun Ma and Wanting Ji and Mingzhe Liu and Ruili Wang",
note = "Publisher Copyright: {\textcopyright} 2021 Elsevier Inc.",
year = "2021",
month = oct,
doi = "10.1016/j.ins.2021.07.079",
language = "English",
volume = "575",
pages = "654--665",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",
}

TY - JOUR
T1 - Multi-cue based four-stream 3D ResNets for video-based action recognition
AU - Wang, Lei
AU - Yuan, Xiaoguang
AU - Zong, Ming
AU - Ma, Yujun
AU - Ji, Wanting
AU - Liu, Mingzhe
AU - Wang, Ruili
PY - 2021/10
Y1 - 2021/10
N2 - Action recognition is one of the important computer vision tasks, which has many applications. This paper proposes a Multi-cue based Four-stream 3D ResNets (MF3D) model for action recognition. The proposed MF3D model contains four streams: a video saliency stream, an appearance stream, a motion stream and an audio stream. Four cues (i.e. the appearance cue, the motion cue, the video saliency cue and audio cue) are captured by the four streams of our proposed MF3D model. In addition, three different connections between different streams are injected, which can transfer different cues between different streams to obtain more effective spatiotemporal features. Experiments are conducted on the Kinetics and Kinetics-Sounds datasets, and the results verify that our MF3D model is effective and outperforms current existing models.
AB - Action recognition is one of the important computer vision tasks, which has many applications. This paper proposes a Multi-cue based Four-stream 3D ResNets (MF3D) model for action recognition. The proposed MF3D model contains four streams: a video saliency stream, an appearance stream, a motion stream and an audio stream. Four cues (i.e. the appearance cue, the motion cue, the video saliency cue and audio cue) are captured by the four streams of our proposed MF3D model. In addition, three different connections between different streams are injected, which can transfer different cues between different streams to obtain more effective spatiotemporal features. Experiments are conducted on the Kinetics and Kinetics-Sounds datasets, and the results verify that our MF3D model is effective and outperforms current existing models.
KW - 3D ResNets
KW - Action recognition
KW - Audio cue
KW - Multi-cue
KW - Video saliency cue
UR - http://www.scopus.com/inward/record.url?scp=85111662912&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2021.07.079
DO - 10.1016/j.ins.2021.07.079
M3 - Article
AN - SCOPUS:85111662912
SN - 0020-0255
VL - 575
SP - 654
EP - 665
JO - Information Sciences
JF - Information Sciences
ER -
