TY - JOUR
T1 - Spatial and temporal saliency based four-stream network with multi-task learning for action recognition
AU - Zong, Ming
AU - Wang, Ruili
AU - Ma, Yujun
AU - Ji, Wanting
N1 - Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2023/1
Y1 - 2023/1
N2 - Action recognition is a challenging video understanding task for two reasons: (i) complex video backgrounds impair the recognition of the desired actions, and (ii) spatial and temporal information must be effectively fused. In this paper, we propose a novel spatial and temporal saliency based four-stream network with multi-task learning. The proposed model comprises four streams: an appearance stream (i.e., a spatial stream), a motion stream (i.e., a temporal stream), a novel spatial saliency stream and a novel temporal saliency stream. The spatial stream captures global spatial information from videos using sampled RGB video frames as input. The temporal stream captures the global motion information of each pixel using sampled stacked optical flow frames as input. The novel spatial saliency stream acquires spatial saliency information from spatial saliency frames, and the novel temporal saliency stream acquires temporal saliency information from temporal saliency frames. In addition, on top of the four streams, a multi-task learning based LSTM is adopted, which shares complementary knowledge among the different CNN features extracted from the different stacked frames. The multi-task learning based LSTM captures long-term dependency relationships between consecutive frames over temporal evolution, taking full advantage of both CNNs and LSTMs. We conduct experiments on three popular video action recognition datasets, namely the UCF101 dataset, the HMDB51 dataset and the large-scale Kinetics dataset, to verify the effectiveness of the proposed network. The results demonstrate that the proposed network outperforms state-of-the-art methods on these action recognition datasets.
AB - Action recognition is a challenging video understanding task for two reasons: (i) complex video backgrounds impair the recognition of the desired actions, and (ii) spatial and temporal information must be effectively fused. In this paper, we propose a novel spatial and temporal saliency based four-stream network with multi-task learning. The proposed model comprises four streams: an appearance stream (i.e., a spatial stream), a motion stream (i.e., a temporal stream), a novel spatial saliency stream and a novel temporal saliency stream. The spatial stream captures global spatial information from videos using sampled RGB video frames as input. The temporal stream captures the global motion information of each pixel using sampled stacked optical flow frames as input. The novel spatial saliency stream acquires spatial saliency information from spatial saliency frames, and the novel temporal saliency stream acquires temporal saliency information from temporal saliency frames. In addition, on top of the four streams, a multi-task learning based LSTM is adopted, which shares complementary knowledge among the different CNN features extracted from the different stacked frames. The multi-task learning based LSTM captures long-term dependency relationships between consecutive frames over temporal evolution, taking full advantage of both CNNs and LSTMs. We conduct experiments on three popular video action recognition datasets, namely the UCF101 dataset, the HMDB51 dataset and the large-scale Kinetics dataset, to verify the effectiveness of the proposed network. The results demonstrate that the proposed network outperforms state-of-the-art methods on these action recognition datasets.
KW - Action recognition
KW - Multi-task learning
KW - Spatial saliency
KW - Temporal saliency
UR - http://www.scopus.com/inward/record.url?scp=85143698696&partnerID=8YFLogxK
U2 - 10.1016/j.asoc.2022.109884
DO - 10.1016/j.asoc.2022.109884
M3 - Article
AN - SCOPUS:85143698696
SN - 1568-4946
VL - 132
JO - Applied Soft Computing Journal
JF - Applied Soft Computing Journal
M1 - 109884
ER -