Scientific Reports (Oct 2024)
Multimodal and multiscale feature fusion for weakly supervised video anomaly detection
Abstract
Weakly supervised video anomaly detection aims to detect anomalous events using only video-level labels. In the absence of boundary information for anomalous segments, most existing methods rely on multiple instance learning, in which the predictions for unlabeled video snippets are guided by the classification of labeled untrimmed videos. However, these methods do not account for issues such as video blur and visual occlusion, which can hinder accurate anomaly detection. To address these issues, we propose a novel weakly supervised video anomaly detection method that fuses multimodal and multiscale features. First, RGB and optical-flow snippets are fed into a pre-trained I3D network to extract appearance and motion features. Then, we introduce an Attention De-redundancy (AD) module, which employs an attention mechanism to filter out task-irrelevant redundancy in these appearance and motion features. Next, to mitigate the effects of video blur and visual occlusion, we propose a Multi-scale Feature Learning module that captures long-term and short-term temporal dependencies among video snippets, providing global and local guidance for blurred or occluded snippets. Finally, to effectively exploit the discriminative features of the two modalities, we propose an Adaptive Feature Fusion module that adaptively fuses appearance and motion features according to their respective feature weights. Extensive experiments demonstrate that our method outperforms mainstream unsupervised and weakly supervised methods in terms of AUC, achieving 97.00% and 85.31% on the ShanghaiTech and UCF-Crime benchmark datasets, respectively.
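To make the fusion step concrete, the following is a minimal sketch of the adaptive-weighting idea described above, written in PyTorch. The class name AdaptiveFusion, the gating design, and the feature dimension are illustrative assumptions, not the paper's actual implementation; the abstract only states that appearance and motion features are fused according to learned feature weights.

```python
# Hedged sketch: input-dependent weighting of appearance (RGB) and motion
# (optical-flow) snippet features. Module name, gating network, and
# feat_dim=1024 (typical of I3D features) are assumptions for illustration.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse two modality feature streams with learned, per-snippet weights."""
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        # Score both modalities from their concatenated features;
        # softmax makes the two weights sum to 1 for each snippet.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb, flow: (batch, num_snippets, feat_dim)
        w = self.gate(torch.cat([rgb, flow], dim=-1))  # (B, T, 2)
        return w[..., 0:1] * rgb + w[..., 1:2] * flow  # weighted sum

# Usage: fuse features for 32 snippets of one video.
fusion = AdaptiveFusion(feat_dim=1024)
rgb = torch.randn(1, 32, 1024)
flow = torch.randn(1, 32, 1024)
fused = fusion(rgb, flow)  # shape: (1, 32, 1024)
```

The softmax gate is one simple way to realize "fusion based on respective feature weights"; the published method may compute weights differently (e.g., per-channel rather than per-modality).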
Keywords