Spatial-temporal interaction learning based two-stream network for action recognition

Tianyu Liu; Yujun Ma; Wenhan Yang; Wanting Ji; Ruili Wang; Ping Jiang

doi:10.1016/j.ins.2022.05.092

Research output: Journal Publication › Article › peer-review

53 Citations (Scopus)

Abstract

Two-stream convolutional neural networks have been widely applied to action recognition. However, two-stream networks are usually adopted to capture spatial information and temporal information separately, which normally ignore the strong complementarity and correlation between spatial and temporal information in videos. To solve this problem, we propose a Spatial-Temporal Interaction Learning Two-stream network (STILT) for action recognition. Our proposed two-stream (i.e., a spatial stream and a temporal stream) network has a spatial–temporal interaction learning module, which uses an alternating co-attention mechanism between two streams to learn the correlation between spatial features and temporal features. The spatial–temporal interaction learning module allows the two streams to guide each other and then generates optimized spatial attention features and temporal attention features. Thus, the proposed network can establish the interactive connection between two streams, which efficiently exploits the attended spatial and temporal features to improve recognition accuracy. Experiments on three widely used datasets (i.e., UCF101, HMDB51 and Kinetics) show that the proposed network outperforms the state-of-the-art models in action recognition.
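The alternating co-attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' STILT implementation: the record gives no architectural details, so the dot-product scoring, the mean-pooled temporal summary, the toy feature vectors, and all function names below are assumptions chosen only to show how one stream can guide attention over the other in alternation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(features, guide):
    # Score each feature vector against the guide vector (assumed dot-product
    # scoring), then return the attention-weighted sum and the weights.
    weights = softmax([dot(f, guide) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights

def alternating_co_attention(spatial, temporal):
    # Step 1: summarize the temporal stream (assumed mean pooling).
    dim = len(temporal[0])
    temporal_summary = [sum(t[d] for t in temporal) / len(temporal) for d in range(dim)]
    # Step 2: the temporal summary guides attention over spatial features.
    spatial_att, spatial_w = attend(spatial, temporal_summary)
    # Step 3: the attended spatial feature guides attention over temporal features.
    temporal_att, temporal_w = attend(temporal, spatial_att)
    return spatial_att, temporal_att, spatial_w, temporal_w

# Hypothetical toy features: three spatial vectors and two temporal vectors.
spatial = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
temporal = [[2.0, 0.0], [0.0, 0.5]]
s_att, t_att, s_w, t_w = alternating_co_attention(spatial, temporal)
```

In this toy setup each stream's attended feature depends on the other stream, which is the kind of mutual guidance the abstract describes; a real model would use learned projections over CNN feature maps rather than raw dot products.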

Original language	English
Pages (from-to)	864-876
Number of pages	13
Journal	Information Sciences
Volume	606
DOIs	https://doi.org/10.1016/j.ins.2022.05.092
Publication status	Published - Aug 2022
Externally published	Yes

Keywords

Action recognition
Spatial-temporal
Two-stream CNNs

ASJC Scopus subject areas

Software
Control and Systems Engineering
Theoretical Computer Science
Computer Science Applications
Information Systems and Management
Artificial Intelligence

Access to Document

10.1016/j.ins.2022.05.092

Cite this

@article{f3f9c34b39474d16a756218c90e8be36,
  title = "Spatial-temporal interaction learning based two-stream network for action recognition",
  abstract = "Two-stream convolutional neural networks have been widely applied to action recognition. However, two-stream networks are usually adopted to capture spatial information and temporal information separately, which normally ignore the strong complementarity and correlation between spatial and temporal information in videos. To solve this problem, we propose a Spatial-Temporal Interaction Learning Two-stream network (STILT) for action recognition. Our proposed two-stream (i.e., a spatial stream and a temporal stream) network has a spatial–temporal interaction learning module, which uses an alternating co-attention mechanism between two streams to learn the correlation between spatial features and temporal features. The spatial–temporal interaction learning module allows the two streams to guide each other and then generates optimized spatial attention features and temporal attention features. Thus, the proposed network can establish the interactive connection between two streams, which efficiently exploits the attended spatial and temporal features to improve recognition accuracy. Experiments on three widely used datasets (i.e., UCF101, HMDB51 and Kinetics) show that the proposed network outperforms the state-of-the-art models in action recognition.",
  keywords = "Action recognition, Spatial-temporal, Two-stream CNNs",
  author = "Tianyu Liu and Yujun Ma and Wenhan Yang and Wanting Ji and Ruili Wang and Ping Jiang",
  note = "Publisher Copyright: {\textcopyright} 2022 Elsevier Inc.",
  year = "2022",
  month = aug,
  doi = "10.1016/j.ins.2022.05.092",
  language = "English",
  volume = "606",
  pages = "864--876",
  journal = "Information Sciences",
  issn = "0020-0255",
  publisher = "Elsevier Inc.",
}

TY  - JOUR
T1  - Spatial-temporal interaction learning based two-stream network for action recognition
AU  - Liu, Tianyu
AU  - Ma, Yujun
AU  - Yang, Wenhan
AU  - Ji, Wanting
AU  - Wang, Ruili
AU  - Jiang, Ping
PY  - 2022/8
Y1  - 2022/8
N2  - Two-stream convolutional neural networks have been widely applied to action recognition. However, two-stream networks are usually adopted to capture spatial information and temporal information separately, which normally ignore the strong complementarity and correlation between spatial and temporal information in videos. To solve this problem, we propose a Spatial-Temporal Interaction Learning Two-stream network (STILT) for action recognition. Our proposed two-stream (i.e., a spatial stream and a temporal stream) network has a spatial–temporal interaction learning module, which uses an alternating co-attention mechanism between two streams to learn the correlation between spatial features and temporal features. The spatial–temporal interaction learning module allows the two streams to guide each other and then generates optimized spatial attention features and temporal attention features. Thus, the proposed network can establish the interactive connection between two streams, which efficiently exploits the attended spatial and temporal features to improve recognition accuracy. Experiments on three widely used datasets (i.e., UCF101, HMDB51 and Kinetics) show that the proposed network outperforms the state-of-the-art models in action recognition.
AB  - Two-stream convolutional neural networks have been widely applied to action recognition. However, two-stream networks are usually adopted to capture spatial information and temporal information separately, which normally ignore the strong complementarity and correlation between spatial and temporal information in videos. To solve this problem, we propose a Spatial-Temporal Interaction Learning Two-stream network (STILT) for action recognition. Our proposed two-stream (i.e., a spatial stream and a temporal stream) network has a spatial–temporal interaction learning module, which uses an alternating co-attention mechanism between two streams to learn the correlation between spatial features and temporal features. The spatial–temporal interaction learning module allows the two streams to guide each other and then generates optimized spatial attention features and temporal attention features. Thus, the proposed network can establish the interactive connection between two streams, which efficiently exploits the attended spatial and temporal features to improve recognition accuracy. Experiments on three widely used datasets (i.e., UCF101, HMDB51 and Kinetics) show that the proposed network outperforms the state-of-the-art models in action recognition.
KW  - Action recognition
KW  - Spatial-temporal
KW  - Two-stream CNNs
UR  - http://www.scopus.com/inward/record.url?scp=85131359941&partnerID=8YFLogxK
U2  - 10.1016/j.ins.2022.05.092
DO  - 10.1016/j.ins.2022.05.092
M3  - Article
AN  - SCOPUS:85131359941
SN  - 0020-0255
VL  - 606
SP  - 864
EP  - 876
JO  - Information Sciences
JF  - Information Sciences
ER  -
