Abstract
Convolutional neural networks (CNNs) are a natural structure for video modelling and have been successfully applied to action recognition. Existing 3D CNN-based action recognition methods mainly perform 3D convolutions on individual cues (e.g. appearance and motion cues) and rely on the design of subsequent networks to fuse these cues. In this paper, we propose a novel multi-cue 3D convolutional neural network (M3D) that directly integrates three individual cues: an appearance cue, a direct motion cue, and a salient motion cue. Unlike existing methods, the proposed M3D model performs 3D convolutions on multiple cues jointly rather than on a single cue, and thus obtains more discriminative and robust features by treating the three cues as a whole. Further, we propose a residual multi-cue 3D convolution model (R-M3D) to improve representation ability and obtain more representative video features. Experimental results verify the effectiveness of the proposed M3D model, and the proposed R-M3D model (pre-trained on the Kinetics dataset) achieves competitive performance compared with state-of-the-art models on the UCF101 and HMDB51 datasets.
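The abstract does not include implementation details, but the core idea (convolving several cues jointly in one 3D convolution, plus a residual variant) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration in PyTorch, not the authors' code: the cue channel counts (RGB appearance, two-channel direct motion such as optical flow, one-channel salient motion) and layer sizes are illustrative choices.

```python
# Minimal sketch of a multi-cue 3D convolution block and a residual variant.
# Assumptions (not from the paper): PyTorch, (N, C, T, H, W) video tensors,
# cue channel counts of 3 (appearance), 2 (direct motion), 1 (salient motion).
import torch
import torch.nn as nn


class MultiCue3DBlock(nn.Module):
    """Concatenates the three cues on the channel axis and applies a single
    3D convolution, so the cues are convolved jointly rather than fused by
    a later network stage."""

    def __init__(self, cue_channels=(3, 2, 1), out_channels=64):
        super().__init__()
        self.conv = nn.Conv3d(sum(cue_channels), out_channels,
                              kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, appearance, direct_motion, salient_motion):
        x = torch.cat([appearance, direct_motion, salient_motion], dim=1)
        return self.relu(self.bn(self.conv(x)))


class ResidualMultiCue3DBlock(nn.Module):
    """Residual variant in the spirit of R-M3D: a 1x1x1 projection of the
    fused cues is added to the multi-cue convolution output."""

    def __init__(self, cue_channels=(3, 2, 1), out_channels=64):
        super().__init__()
        self.body = MultiCue3DBlock(cue_channels, out_channels)
        self.shortcut = nn.Conv3d(sum(cue_channels), out_channels, kernel_size=1)

    def forward(self, appearance, direct_motion, salient_motion):
        fused = torch.cat([appearance, direct_motion, salient_motion], dim=1)
        out = self.body(appearance, direct_motion, salient_motion)
        return torch.relu(out + self.shortcut(fused))


# Usage with dummy clips: RGB frames, an optical-flow-like motion map,
# and a single-channel salient-motion map of the same spatio-temporal size.
n, t, h, w = 2, 16, 112, 112
rgb = torch.randn(n, 3, t, h, w)
flow = torch.randn(n, 2, t, h, w)
salient = torch.randn(n, 1, t, h, w)
out = ResidualMultiCue3DBlock()(rgb, flow, salient)
print(out.shape)  # torch.Size([2, 64, 16, 112, 112])
```

The key design point reflected here is that fusion happens inside the convolution (all cues share one kernel bank) instead of in a separate fusion network applied to per-cue features.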
Original language | English |
---|---|
Pages (from-to) | 5167-5181 |
Number of pages | 15 |
Journal | Neural Computing and Applications |
Volume | 33 |
Issue number | 10 |
DOIs | |
Publication status | Published - May 2021 |
Externally published | Yes |
Keywords
- 3D convolution
- Action recognition
- Multi-cue
- Residual
- Salient motion cue
ASJC Scopus subject areas
- Software
- Artificial Intelligence