Feature fusion based deep spatiotemporal model for violence detection in videos

Mujtaba Asad; Zuopeng Yang; Zubair Khan; Jie Yang; Xiangjian He

doi:10.1007/978-3-030-36708-4_33

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

12 Citations (Scopus)

Abstract

Detecting violent behavior in surveillance videos is essential for public monitoring and security. However, doing so manually requires constant human observation and attention, which is a challenging task. Autonomous detection of violent activities is therefore essential for continuous, uninterrupted video surveillance systems. This paper proposes a novel method for detecting violent activities in videos using fused spatial feature maps, based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) units. The spatial features are extracted by a CNN, and a multi-level spatial feature fusion method combines the spatial feature maps from two equally spaced sequential input video frames to incorporate motion characteristics. Additional residual layer blocks further learn these fused spatial features to increase the classification accuracy of the network. The combined spatial features of the input frames are then fed to LSTM units to learn the global temporal information. The output of the network classifies the input video frame as violent or non-violent. Experimental results on three standard benchmark datasets (Hockey Fight, Crowd Violence, and BEHAVE) show that the proposed algorithm recognizes violent actions better in different scenarios and improves on the performance of state-of-the-art methods.
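
As a rough illustration of the frame pairing and feature fusion step described in the abstract, the sketch below pairs frames a fixed distance apart and fuses their feature maps into one sequence for a recurrent model. The abstract does not state the fusion operator, the frame gap, or the feature shapes, so channel concatenation, the `gap` parameter, and the names `frame_pairs`, `fuse`, and `fused_sequence` are assumptions made for illustration only, not the authors' implementation. The CNN, the residual blocks, and the LSTM are left out.

```python
from typing import List, Sequence, Tuple

# A feature map is modelled as channels x positions (flattened spatial grid).
FeatureMap = List[List[float]]


def frame_pairs(num_frames: int, gap: int) -> List[Tuple[int, int]]:
    """Return index pairs (t, t + gap): two frames a fixed distance apart."""
    if gap < 1:
        raise ValueError("gap must be at least 1")
    return [(t, t + gap) for t in range(num_frames - gap)]


def fuse(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Fuse two feature maps by channel concatenation (an assumed operator)."""
    if not a or not b:
        raise ValueError("feature maps must have at least one channel")
    if any(len(ch) != len(a[0]) for ch in a + b):
        raise ValueError("all channels must share the same spatial size")
    # Copy each channel so the fused map does not alias its inputs.
    return [list(ch) for ch in a] + [list(ch) for ch in b]


def fused_sequence(features: Sequence[FeatureMap], gap: int) -> List[FeatureMap]:
    """Build one fused map per frame pair, in temporal order."""
    return [fuse(features[i], features[j])
            for i, j in frame_pairs(len(features), gap)]
```

For a clip of five frames with two channels each and `gap=2`, `fused_sequence` returns three fused maps with four channels each. In a full model each fused map would pass through further layers before the recurrent stage; concatenation is only one plausible choice, and element-wise addition or subtraction would fit the same interface.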

Original language	English
Title of host publication	Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings
Editors	Tom Gedeon, Kok Wai Wong, Minho Lee
Publisher	Springer
Pages	405-417
Number of pages	13
ISBN (Print)	9783030367077
DOIs	https://doi.org/10.1007/978-3-030-36708-4_33
Publication status	Published - 2019
Externally published	Yes
Event	26th International Conference on Neural Information Processing, ICONIP 2019 - Sydney, Australia Duration: 12 Dec 2019 → 15 Dec 2019

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	11953 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	26th International Conference on Neural Information Processing, ICONIP 2019
Country/Territory	Australia
City	Sydney
Period	12/12/19 → 15/12/19

Keywords

Autonomous video
CNN
LSTM
Surveillance spatiotemporal features
Violence detection

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Cite this

Asad, M., Yang, Z., Khan, Z., Yang, J., & He, X. (2019). Feature fusion based deep spatiotemporal model for violence detection in videos. In T. Gedeon, K. W. Wong, & M. Lee (Eds.), Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings (pp. 405-417). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11953 LNCS). Springer. https://doi.org/10.1007/978-3-030-36708-4_33

Asad, Mujtaba ; Yang, Zuopeng ; Khan, Zubair et al. / Feature fusion based deep spatiotemporal model for violence detection in videos. Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings. editor / Tom Gedeon ; Kok Wai Wong ; Minho Lee. Springer, 2019. pp. 405-417 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{c639e5598756464daa88b259ba95adfe,
title = "Feature fusion based deep spatiotemporal model for violence detection in videos",
abstract = "Detecting violent behavior in surveillance videos is essential for public monitoring and security. However, doing so manually requires constant human observation and attention, which is a challenging task. Autonomous detection of violent activities is therefore essential for continuous, uninterrupted video surveillance systems. This paper proposes a novel method for detecting violent activities in videos using fused spatial feature maps, based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) units. The spatial features are extracted by a CNN, and a multi-level spatial feature fusion method combines the spatial feature maps from two equally spaced sequential input video frames to incorporate motion characteristics. Additional residual layer blocks further learn these fused spatial features to increase the classification accuracy of the network. The combined spatial features of the input frames are then fed to LSTM units to learn the global temporal information. The output of the network classifies the input video frame as violent or non-violent. Experimental results on three standard benchmark datasets (Hockey Fight, Crowd Violence, and BEHAVE) show that the proposed algorithm recognizes violent actions better in different scenarios and improves on the performance of state-of-the-art methods.",
keywords = "Autonomous video, CNN, LSTM, Surveillance spatiotemporal features, Violence detection",
author = "Mujtaba Asad and Zuopeng Yang and Zubair Khan and Jie Yang and Xiangjian He",
note = "Publisher Copyright: {\textcopyright} Springer Nature Switzerland AG 2019.; 26th International Conference on Neural Information Processing, ICONIP 2019 ; Conference date: 12-12-2019 Through 15-12-2019",
year = "2019",
doi = "10.1007/978-3-030-36708-4_33",
language = "English",
isbn = "9783030367077",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer",
pages = "405--417",
editor = "Tom Gedeon and Wong, {Kok Wai} and Minho Lee",
booktitle = "Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings",
}

Asad, M, Yang, Z, Khan, Z, Yang, J & He, X 2019, Feature fusion based deep spatiotemporal model for violence detection in videos. in T Gedeon, KW Wong & M Lee (eds), Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11953 LNCS, Springer, pp. 405-417, 26th International Conference on Neural Information Processing, ICONIP 2019, Sydney, Australia, 12/12/19. https://doi.org/10.1007/978-3-030-36708-4_33

Feature fusion based deep spatiotemporal model for violence detection in videos. / Asad, Mujtaba; Yang, Zuopeng; Khan, Zubair et al.
Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings. ed. / Tom Gedeon; Kok Wai Wong; Minho Lee. Springer, 2019. p. 405-417 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11953 LNCS).

TY - GEN
T1 - Feature fusion based deep spatiotemporal model for violence detection in videos
AU - Asad, Mujtaba
AU - Yang, Zuopeng
AU - Khan, Zubair
AU - Yang, Jie
AU - He, Xiangjian
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
AB - Detecting violent behavior in surveillance videos is essential for public monitoring and security. However, doing so manually requires constant human observation and attention, which is a challenging task. Autonomous detection of violent activities is therefore essential for continuous, uninterrupted video surveillance systems. This paper proposes a novel method for detecting violent activities in videos using fused spatial feature maps, based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) units. The spatial features are extracted by a CNN, and a multi-level spatial feature fusion method combines the spatial feature maps from two equally spaced sequential input video frames to incorporate motion characteristics. Additional residual layer blocks further learn these fused spatial features to increase the classification accuracy of the network. The combined spatial features of the input frames are then fed to LSTM units to learn the global temporal information. The output of the network classifies the input video frame as violent or non-violent. Experimental results on three standard benchmark datasets (Hockey Fight, Crowd Violence, and BEHAVE) show that the proposed algorithm recognizes violent actions better in different scenarios and improves on the performance of state-of-the-art methods.
KW - Autonomous video
KW - CNN
KW - LSTM
KW - Surveillance spatiotemporal features
KW - Violence detection
UR - http://www.scopus.com/inward/record.url?scp=85077508007&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-36708-4_33
DO - 10.1007/978-3-030-36708-4_33
M3 - Conference contribution
AN - SCOPUS:85077508007
SN - 9783030367077
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 405
EP - 417
BT - Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings
A2 - Gedeon, Tom
A2 - Wong, Kok Wai
A2 - Lee, Minho
PB - Springer
T2 - 26th International Conference on Neural Information Processing, ICONIP 2019
Y2 - 12 December 2019 through 15 December 2019
ER -

Asad M, Yang Z, Khan Z, Yang J, He X. Feature fusion based deep spatiotemporal model for violence detection in videos. In Gedeon T, Wong KW, Lee M, editors, Neural Information Processing - 26th International Conference, ICONIP 2019, Proceedings. Springer. 2019. p. 405-417. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-36708-4_33
