Relative-position embedding based spatially and temporally decoupled Transformer for action recognition

Yujun Ma, Ruili Wang

Research output: Journal Publication › Article › peer-review

16 Citations (Scopus)

Abstract

Human action recognition aims to classify the actions performed in a video. Recently, the Vision Transformer (ViT) has been applied to action recognition. However, ViT is poorly suited to high-resolution input videos because of its computational cost: it splits frames into fixed-size patches, embeds them as tokens with absolute-position information, and uses a pure Transformer encoder to model the relationships among these tokens. To address this issue, we propose a relative-position embedding based spatially and temporally decoupled Transformer (RPE-STDT) for action recognition, which captures spatial–temporal information with stacked self-attention layers. The proposed RPE-STDT model consists of two separate series of Transformer encoders. The first series, the spatial Transformer encoders, models interactions between tokens extracted from the same temporal index. The second series, the temporal Transformer encoders, models interactions across the time dimension with a subsampling strategy. Furthermore, we replace the absolute-position embeddings in the Vision Transformer encoders with the proposed relative-position embeddings, which capture the order of the embedded tokens while reducing computational cost. Finally, we conduct thorough ablation studies. Our RPE-STDT achieves state-of-the-art results on multiple action recognition datasets, exceeding prior convolution and Transformer-based networks.
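To make the cost argument in the abstract concrete, the following is a rough sketch, not the authors' implementation. It counts query-key pairs for joint attention over a whole clip versus a spatial-then-temporal split, and builds the index table that a learned relative-position bias would typically use. The function names, the `temporal_stride` parameter, and the choice to let the temporal stage attend across frames at a fixed spatial location are all illustrative assumptions; the abstract does not specify these details.

```python
def joint_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs when every token attends to every token in the clip."""
    total = num_frames * tokens_per_frame
    return total * total


def decoupled_attention_pairs(num_frames, tokens_per_frame, temporal_stride=1):
    """Query-key pairs for a spatial stage followed by a temporal stage.

    Assumption (not from the paper): the temporal stage attends across
    frames at a fixed spatial location, and subsampling simply keeps
    every `temporal_stride`-th frame.
    """
    # Spatial stage: tokens attend only within their own frame.
    spatial = num_frames * tokens_per_frame ** 2
    # Temporal stage: attention across the frames kept after subsampling.
    kept_frames = len(range(0, num_frames, temporal_stride))
    temporal = tokens_per_frame * kept_frames ** 2
    return spatial + temporal


def relative_position_index(length):
    """Map each (query, key) pair to a row of a (2 * length - 1)-entry table.

    The index depends only on the offset key - query, so the same learned
    entry is shared by all pairs that are the same distance apart.
    """
    return [[(k - q) + (length - 1) for k in range(length)]
            for q in range(length)]


if __name__ == "__main__":
    frames, tokens = 8, 196  # e.g. 8 frames of 14 x 14 patches (hypothetical)
    print("joint:    ", joint_attention_pairs(frames, tokens))
    print("decoupled:", decoupled_attention_pairs(frames, tokens))
    print("stride 2: ", decoupled_attention_pairs(frames, tokens, 2))
    print("rel. index for 3 tokens:", relative_position_index(3))
```

Under these assumptions, the decoupled count grows with the square of the per-frame token count and the square of the frame count separately, rather than with the square of their product, which is the intuition behind decoupling the spatial and temporal encoders.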

Original language: English
Article number: 109905
Journal: Pattern Recognition
Volume: 145
DOIs
Publication status: Published - Jan 2024
Externally published: Yes

Keywords

  • Relative-position embedding
  • Spatial–temporal features
  • Subsampling
  • Transformer

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
