Abstract
Human behavior detection is essential for public safety and monitoring. However, surveillance systems that rely on human operators require continuous attention and observation, which is difficult to sustain. Detecting violent human behavior with autonomous surveillance systems is therefore critical for uninterrupted video surveillance. In this paper, we propose a novel method for detecting fights or violent actions by learning both spatial and temporal features from equally spaced sequential frames of a video. Multi-level features for two sequential frames, extracted from the top and bottom layers of a convolutional neural network, are combined using the proposed feature fusion method to take motion information into account. We also propose a Wide-Dense Residual Block to learn these combined spatial features from the two input frames. The learned features are then concatenated and fed to long short-term memory units to capture temporal dependencies. The feature fusion method and the additional wide-dense residual blocks enable the network to learn combined features from the input frames effectively and yield better accuracy. Experimental results on four publicly available datasets (HockeyFight, Movies, ViolentFlow and BEHAVE) show the superior performance of the proposed model in comparison with state-of-the-art methods.
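As an illustration of the input stage described in the abstract, the following is a minimal sketch of how equally spaced frames might be selected from a clip and grouped into two-frame inputs. The abstract does not state how many frames are sampled or whether the pairs overlap, so the function names (`equally_spaced_indices`, `sequential_pairs`), the sample count, and the overlapping-pair scheme are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def equally_spaced_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return num_samples frame indices spread evenly over a clip.

    Hypothetical helper: the first and last frames are always included
    when num_samples >= 2 (an assumption, not taken from the paper).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples > num_frames:
        raise ValueError("cannot sample more frames than the clip contains")
    if num_samples == 1:
        return [0]
    # Fractional step between consecutive samples, rounded to a frame index.
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def sequential_pairs(indices: List[int]) -> List[Tuple[int, int]]:
    """Group sampled indices into overlapping (previous, next) pairs.

    Each pair stands in for the two sequential frames whose features
    would be fused; overlapping pairs are an assumption.
    """
    return list(zip(indices, indices[1:]))


if __name__ == "__main__":
    sampled = equally_spaced_indices(num_frames=41, num_samples=5)
    print(sampled)                    # [0, 10, 20, 30, 40]
    print(sequential_pairs(sampled))  # [(0, 10), (10, 20), (20, 30), (30, 40)]
```

In a full pipeline, each pair would be passed through the convolutional network and fusion stage, and the resulting sequence of per-pair features would be fed to the LSTM units; those stages are omitted here because the abstract does not give their details.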
Original language | English |
---|---|
Pages (from-to) | 1415-1431 |
Number of pages | 17 |
Journal | Visual Computer |
Volume | 37 |
Issue number | 6 |
Publication status | Published - Jun 2021 |
Externally published | Yes |
Keywords
- Autonomous Video Surveillance
- CNN-LSTM
- Feature fusion
- Spatio-temporal features
- Violence detection
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition
- Computer Graphics and Computer-Aided Design