k-NN attention-based video vision transformer for action recognition

Weirong Sun; Yujun Ma; Ruili Wang

doi:10.1016/j.neucom.2024.127256

Research output: Journal Publication › Article › peer-review

27 Citations (Scopus)

Abstract

Action recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT) has achieved remarkable performance on action recognition by modeling long sequences of tokens over the spatial and temporal indices of a video. The fully connected self-attention layer is the fundamental building block of the vanilla Transformer. However, this redundant architecture ignores the locality of video frame patches, so it attends to non-informative tokens and can increase computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We replace the original self-attention in the Video Vision Transformer (ViViT) with k-NN attention, which can optimize the training process and ignore irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results show that the proposed k-ViViT achieves higher accuracy than several state-of-the-art models on these action recognition datasets.
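The abstract does not spell out how k-NN attention is computed. As a rough illustration only, the sketch below assumes it means that each query keeps its k highest-scoring keys and applies softmax over just those, so the remaining tokens receive zero weight. The function name `knn_attention` and its list-based interface are hypothetical and are not taken from the paper or from any library.

```python
import math


def knn_attention(queries, keys, values, k):
    """Top-k attention sketch: each query attends only to its k best-matching keys.

    Assumption (not from the paper): selection is by scaled dot-product score,
    and softmax is taken over the selected keys only.
    """
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # Indices of the k largest scores; every other token is dropped.
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax restricted to the selected keys (max subtracted for stability).
        peak = max(scores[j] for j in top)
        weights = {j: math.exp(scores[j] - peak) for j in top}
        total = sum(weights.values())
        # Weighted sum of the selected value vectors.
        dim = len(values[0])
        outputs.append(
            [sum(weights[j] / total * values[j][c] for j in top) for c in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    ks = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(knn_attention(q, ks, vs, k=1))
    print(knn_attention(q, ks, vs, k=3))
```

When k equals the number of keys, this sketch reduces to ordinary softmax self-attention; smaller k discards the lowest-scoring tokens for each query, which is the behavior the abstract describes as ignoring irrelevant or noisy tokens.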

Original language	English
Article number	127256
Journal	Neurocomputing
Volume	574
DOIs	https://doi.org/10.1016/j.neucom.2024.127256
Publication status	Published - 14 Mar 2024
Externally published	Yes

Keywords

Action recognition
Attention mechanism
Transformer
Vision transformer

ASJC Scopus subject areas

Computer Science Applications
Cognitive Neuroscience
Artificial Intelligence

Cite this

@article{fa6efaf7e0544707a9e7b74162be215e,

title = "k-NN attention-based video vision transformer for action recognition",

abstract = "Action Recognition aims to understand human behavior and predict a label for each action. Recently, Vision Transformer (ViT) has achieved remarkable performance on action recognition, which models the long sequences token over spatial and temporal index in a video. The fully-connected self-attention layer is the fundamental key in the vanilla Transformer. However, the redundant architecture of the vision Transformer model ignores the locality of video frame patches, which involves non-informative tokens and potentially leads to increased computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We adopt k-NN attention to Video Vision Transformer (ViViT) instead of original self-attention, which can optimize the training process and neglect the irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results illustrate that the proposed k-ViViT achieves superior accuracy compared to several state-of-the-art models on these action recognition datasets.",

keywords = "Action recognition, Attention mechanism, Transformer, Vision transformer",

author = "Weirong Sun and Yujun Ma and Ruili Wang",

note = "Publisher Copyright: {\textcopyright} 2024 The Author(s)",

year = "2024",

month = mar,

day = "14",

doi = "10.1016/j.neucom.2024.127256",

language = "English",

volume = "574",

journal = "Neurocomputing",

issn = "0925-2312",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - k-NN attention-based video vision transformer for action recognition

AU - Sun, Weirong

AU - Ma, Yujun

AU - Wang, Ruili

PY - 2024/3/14

Y1 - 2024/3/14

N2 - Action Recognition aims to understand human behavior and predict a label for each action. Recently, Vision Transformer (ViT) has achieved remarkable performance on action recognition, which models the long sequences token over spatial and temporal index in a video. The fully-connected self-attention layer is the fundamental key in the vanilla Transformer. However, the redundant architecture of the vision Transformer model ignores the locality of video frame patches, which involves non-informative tokens and potentially leads to increased computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We adopt k-NN attention to Video Vision Transformer (ViViT) instead of original self-attention, which can optimize the training process and neglect the irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results illustrate that the proposed k-ViViT achieves superior accuracy compared to several state-of-the-art models on these action recognition datasets.

KW - Action recognition

KW - Attention mechanism

KW - Transformer

KW - Vision transformer

UR - http://www.scopus.com/inward/record.url?scp=85183165396&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2024.127256

DO - 10.1016/j.neucom.2024.127256

M3 - Article

AN - SCOPUS:85183165396

SN - 0925-2312

VL - 574

JO - Neurocomputing

JF - Neurocomputing

M1 - 127256

ER -
