Convolutional transformer network for fine-grained action recognition

Yujun Ma; Ruili Wang; Ming Zong; Wanting Ji; Yi Wang; Baoliu Ye

doi:10.1016/j.neucom.2023.127027

Research output: Journal Publication › Article › peer-review

14 Citations (Scopus)

Abstract

Fine-grained action recognition is one of the critical problems in video processing, which aims to recognize similar actions of subtle interactions between humans and objects. Inspired by the remarkable performance of the Transformer in natural language processing, Transformer has been applied to the fine-grained action recognition task. However, Transformer needs abundant training data and extra supervision to achieve comparable results with convolutional neural networks (CNNs). To address these issues, we propose a Convolutional Transformer Network (CTN), which integrates the merits of CNN (e.g., sharing weights, capturing low-level features in videos and locality) and the benefits of Transformer (e.g., dynamic attention and learning long-range dependencies). In this paper, we propose two modifications to the original Transformer: (i) We propose a video-to-tokens module that can extract tokens from extracted spatial-temporal features in videos by 3D convolutions instead of the direct token embedding from raw input video clips; (ii) We completely replace the linear mapping in multi-head self-attention layer with depth-wise convolutional mapping, which applies a depth-wise separable convolution operation on embedded token maps. With these two modifications, our approach can extract effective spatial-temporal features from videos and process the long sequences of tokens encountered in videos. Experimental results demonstrate that our proposed CTN can achieve state-of-the-art accuracy on two fine-grained action recognition datasets (i.e., Epic-Kitchens and Diving 48) with a small computational increase.

Original language	English
Article number	127027
Journal	Neurocomputing
Volume	569
DOIs	https://doi.org/10.1016/j.neucom.2023.127027
Publication status	Published - 7 Feb 2024
Externally published	Yes

Keywords

3D convolutions
Fine-grained action recognition
Spatial-temporal features
Transformer

ASJC Scopus subject areas

Computer Science Applications
Cognitive Neuroscience
Artificial Intelligence

Access to Document

10.1016/j.neucom.2023.127027

Cite this

@article{e66a041d49484d169b43621102fb8cd2,

title = "Convolutional transformer network for fine-grained action recognition",

abstract = "Fine-grained action recognition is one of the critical problems in video processing, which aims to recognize similar actions of subtle interactions between humans and objects. Inspired by the remarkable performance of the Transformer in natural language processing, Transformer has been applied to the fine-grained action recognition task. However, Transformer needs abundant training data and extra supervision to achieve comparable results with convolutional neural networks (CNNs). To address these issues, we propose a Convolutional Transformer Network (CTN), which integrates the merits of CNN (e.g., sharing weights, capturing low-level features in videos and locality) and the benefits of Transformer (e.g., dynamic attention and learning long-range dependencies). In this paper, we propose two modifications to the original Transformer: (i) We propose a video-to-tokens module that can extract tokens from extracted spatial-temporal features in videos by 3D convolutions instead of the direct token embedding from raw input video clips; (ii) We completely replace the linear mapping in multi-head self-attention layer with depth-wise convolutional mapping, which applies a depth-wise separable convolution operation on embedded token maps. With these two modifications, our approach can extract effective spatial-temporal features from videos and process the long sequences of tokens encountered in videos. Experimental results demonstrate that our proposed CTN can achieve state-of-the-art accuracy on two fine-grained action recognition datasets (i.e., Epic-Kitchens and Diving 48) with a small computational increase.",

keywords = "3D convolutions, Fine-grained action recognition, Spatial-temporal features, Transformer",

author = "Yujun Ma and Ruili Wang and Ming Zong and Wanting Ji and Yi Wang and Baoliu Ye",

note = "Publisher Copyright: {\textcopyright} 2023 Elsevier B.V.",

year = "2024",

month = feb,

day = "7",

doi = "10.1016/j.neucom.2023.127027",

language = "English",

volume = "569",

journal = "Neurocomputing",

issn = "0925-2312",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Convolutional transformer network for fine-grained action recognition

AU - Ma, Yujun

AU - Wang, Ruili

AU - Zong, Ming

AU - Ji, Wanting

AU - Wang, Yi

AU - Ye, Baoliu

PY - 2024/2/7

Y1 - 2024/2/7

N2 - Fine-grained action recognition is one of the critical problems in video processing, which aims to recognize similar actions of subtle interactions between humans and objects. Inspired by the remarkable performance of the Transformer in natural language processing, Transformer has been applied to the fine-grained action recognition task. However, Transformer needs abundant training data and extra supervision to achieve comparable results with convolutional neural networks (CNNs). To address these issues, we propose a Convolutional Transformer Network (CTN), which integrates the merits of CNN (e.g., sharing weights, capturing low-level features in videos and locality) and the benefits of Transformer (e.g., dynamic attention and learning long-range dependencies). In this paper, we propose two modifications to the original Transformer: (i) We propose a video-to-tokens module that can extract tokens from extracted spatial-temporal features in videos by 3D convolutions instead of the direct token embedding from raw input video clips; (ii) We completely replace the linear mapping in multi-head self-attention layer with depth-wise convolutional mapping, which applies a depth-wise separable convolution operation on embedded token maps. With these two modifications, our approach can extract effective spatial-temporal features from videos and process the long sequences of tokens encountered in videos. Experimental results demonstrate that our proposed CTN can achieve state-of-the-art accuracy on two fine-grained action recognition datasets (i.e., Epic-Kitchens and Diving 48) with a small computational increase.

AB - Fine-grained action recognition is one of the critical problems in video processing, which aims to recognize similar actions of subtle interactions between humans and objects. Inspired by the remarkable performance of the Transformer in natural language processing, Transformer has been applied to the fine-grained action recognition task. However, Transformer needs abundant training data and extra supervision to achieve comparable results with convolutional neural networks (CNNs). To address these issues, we propose a Convolutional Transformer Network (CTN), which integrates the merits of CNN (e.g., sharing weights, capturing low-level features in videos and locality) and the benefits of Transformer (e.g., dynamic attention and learning long-range dependencies). In this paper, we propose two modifications to the original Transformer: (i) We propose a video-to-tokens module that can extract tokens from extracted spatial-temporal features in videos by 3D convolutions instead of the direct token embedding from raw input video clips; (ii) We completely replace the linear mapping in multi-head self-attention layer with depth-wise convolutional mapping, which applies a depth-wise separable convolution operation on embedded token maps. With these two modifications, our approach can extract effective spatial-temporal features from videos and process the long sequences of tokens encountered in videos. Experimental results demonstrate that our proposed CTN can achieve state-of-the-art accuracy on two fine-grained action recognition datasets (i.e., Epic-Kitchens and Diving 48) with a small computational increase.

KW - 3D convolutions

KW - Fine-grained action recognition

KW - Spatial-temporal features

KW - Transformer

UR - http://www.scopus.com/inward/record.url?scp=85178495982&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2023.127027

DO - 10.1016/j.neucom.2023.127027

M3 - Article

AN - SCOPUS:85178495982

SN - 0925-2312

VL - 569

JO - Neurocomputing

JF - Neurocomputing

M1 - 127027

ER -
