Abstract
Convolutional neural networks (CNNs) are a natural structure for video modelling and have been successfully applied to action recognition. Existing 3D CNN-based action recognition methods mainly perform 3D convolutions on individual cues (e.g. appearance and motion cues) and rely on the design of subsequent networks to fuse these cues. In this paper, we propose a novel multi-cue 3D convolutional neural network (M3D) that directly integrates three individual cues: an appearance cue, a direct motion cue, and a salient motion cue. Unlike existing methods, the proposed M3D model performs 3D convolutions on multiple cues jointly rather than on a single cue, and thus obtains more discriminative and robust features by treating the three cues as a whole. Further, we propose a residual multi-cue 3D convolution model (R-M3D) to improve representation ability and obtain more representative video features. Experimental results verify the effectiveness of the proposed M3D model, and the proposed R-M3D model (pre-trained on the Kinetics dataset) achieves competitive performance compared with state-of-the-art models on the UCF101 and HMDB51 datasets.
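The abstract does not include implementation details, but the core idea (convolving several cues jointly in one 3D convolution, plus a residual variant) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration in PyTorch, not the authors' code: the cue channel counts (RGB appearance, two-channel direct motion such as optical flow, one-channel salient motion) and layer sizes are illustrative choices.

```python
# Minimal sketch of a multi-cue 3D convolution block and a residual variant.
# Assumptions (not from the paper): PyTorch, (N, C, T, H, W) video tensors,
# cue channel counts of 3 (appearance), 2 (direct motion), 1 (salient motion).
import torch
import torch.nn as nn


class MultiCue3DBlock(nn.Module):
    """Concatenates the three cues on the channel axis and applies a single
    3D convolution, so the cues are convolved jointly rather than fused by
    a later network stage."""

    def __init__(self, cue_channels=(3, 2, 1), out_channels=64):
        super().__init__()
        self.conv = nn.Conv3d(sum(cue_channels), out_channels,
                              kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, appearance, direct_motion, salient_motion):
        x = torch.cat([appearance, direct_motion, salient_motion], dim=1)
        return self.relu(self.bn(self.conv(x)))


class ResidualMultiCue3DBlock(nn.Module):
    """Residual variant in the spirit of R-M3D: a 1x1x1 projection of the
    fused cues is added to the multi-cue convolution output."""

    def __init__(self, cue_channels=(3, 2, 1), out_channels=64):
        super().__init__()
        self.body = MultiCue3DBlock(cue_channels, out_channels)
        self.shortcut = nn.Conv3d(sum(cue_channels), out_channels, kernel_size=1)

    def forward(self, appearance, direct_motion, salient_motion):
        fused = torch.cat([appearance, direct_motion, salient_motion], dim=1)
        out = self.body(appearance, direct_motion, salient_motion)
        return torch.relu(out + self.shortcut(fused))


# Usage with dummy clips: RGB frames, an optical-flow-like motion map,
# and a single-channel salient-motion map of the same spatio-temporal size.
n, t, h, w = 2, 16, 112, 112
rgb = torch.randn(n, 3, t, h, w)
flow = torch.randn(n, 2, t, h, w)
salient = torch.randn(n, 1, t, h, w)
out = ResidualMultiCue3DBlock()(rgb, flow, salient)
print(out.shape)  # torch.Size([2, 64, 16, 112, 112])
```

The key design point reflected here is that fusion happens inside the convolution (all cues share one kernel bank) instead of in a separate fusion network applied to per-cue features.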
Original language | English |
---|---|
Pages (from-to) | 5167-5181 |
Number of pages | 15 |
Journal | Neural Computing and Applications |
Volume | 33 |
Issue number | 10 |
DOIs | |
Publication status | Published - May 2021 |
Externally published | Yes |
Keywords
- 3D convolution
- Action recognition
- Multi-cue
- Residual
- Salient motion cue
ASJC Scopus subject areas
- Software
- Artificial Intelligence