Alexandria Engineering Journal (Mar 2024)
Shots segmentation-based optimized dual-stream framework for robust human activity recognition in surveillance video
Abstract
Nowadays, for controlling crime, surveillance cameras are typically installed in all public places to ensure urban safety and security. However, automating Human Activity Recognition (HAR) using computer vision techniques faces several challenges such as lowlighting, complex spatiotemporal features, clutter backgrounds, and inefficient utilization of surveillance system resources. Existing attempts in HAR designed straightforward networks by analyzing either spatial or motion patterns resulting in limited performance while the dual streams methods are entirely based on Convolutional Neural Networks (CNN) that are inadequate to learning the long-range temporal information for HAR. To overcome the above-mentioned challenges, this paper proposes an optimized dual stream framework for HAR which mainly consists of three steps. First, a shots segmentation module is introduced in the proposed framework to efficiently utilize the surveillance system resources by enhancing the lowlight video stream and then it detects salient video frames that consist of human. This module is trained on our own challenging Lowlight Human Surveillance Dataset (LHSD) which consists of both normal and different levels of lowlighting data to recognize humans in complex uncertain environments. Next, to learn HAR from both contextual and motion information, a dual stream approach is used in the feature extraction. In the first stream, it freezes the learned weights of the backbone Vision Transformer (ViT) B-16 model to select the discriminative contextual information. In the second stream, ViT features are then fused with the intermediate encoder layers of FlowNet2 model for optical flow to extract a robust motion feature vector. Finally, a two stream Parallel Bidirectional Long Short-Term Memory (PBiLSTM) is proposed for sequence learning to capture the global semantics of activities, followed by Dual Stream Multi-Head Attention (DSMHA) with a late fusion strategy to optimize the huge features vector for accurate HAR. To assess the strength of the proposed framework, extensive empirical results are conducted on real-world surveillance scenarios and various benchmark HAR datasets that achieve 78.6285%, 96.0151%, and 98.875% accuracies on HMDB51, UCF101, and YouTube Action, respectively. Our results show that the proposed strategy outperforms State-of-the-Art (SOTA) methods. The proposed framework gives superior performance in HAR, providing accurate and reliable recognition of human activities in surveillance systems.