Multi-frame feature-fusion-based model for violence detection

Mujtaba Asad; Jie Yang; Jiang He; Pourya Shamsolmoali; Xiangjian He

doi:10.1007/s00371-020-01878-6

Research output: Journal Publication › Article › peer-review

57 Citations (Scopus)

Abstract

Human behavior detection is essential for public safety and monitoring. However, in human-based surveillance systems, it requires continuous human attention and observation, which is a difficult task. Detection of violent human behavior using autonomous surveillance systems is of critical importance for uninterrupted video surveillance. In this paper, we propose a novel method to detect fights or violent actions based on learning both the spatial and temporal features from equally spaced sequential frames of a video. Multi-level features for two sequential frames, extracted from the convolutional neural network’s top and bottom layers, are combined using the proposed feature fusion method to take into account the motion information. We also propose a Wide-Dense Residual Block to learn these combined spatial features from the two input frames. These learned features are then concatenated and fed to long short-term memory units for capturing temporal dependencies. The feature fusion method and the use of additional wide-dense residual blocks enable the network to learn combined features from the input frames effectively and yield better accuracy. Experimental results on four publicly available datasets (HockeyFight, Movies, ViolentFlow and BEHAVE) show the superior performance of the proposed model in comparison with the state-of-the-art methods.
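
The abstract mentions "equally spaced sequential frames" processed two at a time. The following is only a minimal sketch of what that input preparation might look like; the function names `equally_spaced_indices` and `sequential_pairs`, the rounding rule, and the overlapping pairing scheme are assumptions for illustration and are not taken from the paper.

```python
def equally_spaced_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over a clip.

    Assumption: the first and last frames are always included.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples > num_frames:
        raise ValueError("cannot sample more frames than the clip contains")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def sequential_pairs(indices):
    """Group sampled indices into overlapping (previous, next) pairs.

    Assumption: each pair would be the two-frame input whose features
    are fused before the temporal (LSTM) stage.
    """
    return list(zip(indices[:-1], indices[1:]))


if __name__ == "__main__":
    idx = equally_spaced_indices(101, 5)
    print(idx)
    print(sequential_pairs(idx))
```

In a full pipeline, each pair would be passed through the CNN and fusion stages and the resulting per-pair features would form the sequence given to the LSTM; those stages are not sketched here because the abstract does not specify them.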

Original language	English
Pages (from-to)	1415-1431
Number of pages	17
Journal	Visual Computer
Volume	37
Issue number	6
DOIs	https://doi.org/10.1007/s00371-020-01878-6
Publication status	Published - Jun 2021
Externally published	Yes

Keywords

Autonomous Video Surveillance
CNN-LSTM
Feature fusion
Spatio-temporal features
Violence detection

ASJC Scopus subject areas

Software
Computer Vision and Pattern Recognition
Computer Graphics and Computer-Aided Design

Cite this

@article{2d2c39c97ba14467933d60e894369e39,
  title = "Multi-frame feature-fusion-based model for violence detection",
  abstract = "Human behavior detection is essential for public safety and monitoring. However, in human-based surveillance systems, it requires continuous human attention and observation, which is a difficult task. Detection of violent human behavior using autonomous surveillance systems is of critical importance for uninterrupted video surveillance. In this paper, we propose a novel method to detect fights or violent actions based on learning both the spatial and temporal features from equally spaced sequential frames of a video. Multi-level features for two sequential frames, extracted from the convolutional neural network{\textquoteright}s top and bottom layers, are combined using the proposed feature fusion method to take into account the motion information. We also proposed Wide-Dense Residual Block to learn these combined spatial features from the two input frames. These learned features are then concatenated and fed to long short-term memory units for capturing temporal dependencies. The feature fusion method and use of additional wide-dense residual blocks enable the network to learn combined features from the input frames effectively and yields better accuracy results. Experimental results evaluated on four publicly available datasets: HockeyFight, Movies, ViolentFlow and BEHAVE show the superior performance of the proposed model in comparison with the state-of-the-art methods.",
  keywords = "Autonomous Video Surveillance, CNN-LSTM, Feature fusion, Spatio-temporal features, Violence detection",
  author = "Mujtaba Asad and Jie Yang and Jiang He and Pourya Shamsolmoali and Xiangjian He",
  note = "Publisher Copyright: {\textcopyright} 2020, Springer-Verlag GmbH Germany, part of Springer Nature.",
  year = "2021",
  month = jun,
  doi = "10.1007/s00371-020-01878-6",
  language = "English",
  volume = "37",
  pages = "1415--1431",
  journal = "Visual Computer",
  issn = "0178-2789",
  publisher = "Springer Verlag",
  number = "6",
}

TY - JOUR
T1 - Multi-frame feature-fusion-based model for violence detection
AU - Asad, Mujtaba
AU - Yang, Jie
AU - He, Jiang
AU - Shamsolmoali, Pourya
AU - He, Xiangjian
PY - 2021/6
Y1 - 2021/6

N2 - Human behavior detection is essential for public safety and monitoring. However, in human-based surveillance systems, it requires continuous human attention and observation, which is a difficult task. Detection of violent human behavior using autonomous surveillance systems is of critical importance for uninterrupted video surveillance. In this paper, we propose a novel method to detect fights or violent actions based on learning both the spatial and temporal features from equally spaced sequential frames of a video. Multi-level features for two sequential frames, extracted from the convolutional neural network’s top and bottom layers, are combined using the proposed feature fusion method to take into account the motion information. We also proposed Wide-Dense Residual Block to learn these combined spatial features from the two input frames. These learned features are then concatenated and fed to long short-term memory units for capturing temporal dependencies. The feature fusion method and use of additional wide-dense residual blocks enable the network to learn combined features from the input frames effectively and yields better accuracy results. Experimental results evaluated on four publicly available datasets: HockeyFight, Movies, ViolentFlow and BEHAVE show the superior performance of the proposed model in comparison with the state-of-the-art methods.

KW - Autonomous Video Surveillance
KW - CNN-LSTM
KW - Feature fusion
KW - Spatio-temporal features
KW - Violence detection
UR - http://www.scopus.com/inward/record.url?scp=85087064730&partnerID=8YFLogxK
U2 - 10.1007/s00371-020-01878-6
DO - 10.1007/s00371-020-01878-6
M3 - Article
AN - SCOPUS:85087064730
SN - 0178-2789
VL - 37
SP - 1415
EP - 1431
JO - Visual Computer
JF - Visual Computer
IS - 6
ER -
