IEEE Access (Jan 2024)
Caption-Guided Interpretable Video Anomaly Detection Based on Memory Similarity
Abstract
Most video anomaly detection approaches rely on non-semantic features, which are not interpretable and prevent identification of the causes of anomalies. We therefore propose a caption-guided interpretable video anomaly detection framework that explains its predictions through video captions (semantic features). The framework uses non-semantic features to fit the dataset and semantic features to provide the model with common-sense knowledge and interpretability. It automatically stores representative anomaly prototypes and guides the model according to an input's similarity to these prototypes. Specifically, video content is represented as a video memory that combines video features (non-semantic) with caption information (semantic). The proposed method generates and updates a memory space during training and predicts anomaly scores from the similarities between the input video and the stored memories. The stored captions serve as descriptions of representative anomalous actions. The proposed module integrates easily with existing methods. Extensive experiments on public benchmark datasets demonstrate the interpretability and reliable detection performance of the proposed method.
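To make the memory-similarity scoring concrete, the following is a minimal sketch, assuming cosine similarity over L2-normalized feature vectors and a simple list-backed memory of (feature, caption) prototypes; the names `MemoryBank`, `add`, and `score` are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

class MemoryBank:
    """Stores prototype video features with their associated captions."""
    def __init__(self):
        self.features = []   # L2-normalized prototype feature vectors
        self.captions = []   # human-readable caption per prototype

    def add(self, feature, caption):
        """Store one anomaly prototype (feature vector + caption)."""
        v = np.asarray(feature, dtype=np.float32)
        self.features.append(v / (np.linalg.norm(v) + 1e-8))
        self.captions.append(caption)

    def score(self, feature):
        """Anomaly score = highest cosine similarity to any stored
        prototype; the matching caption explains the prediction."""
        v = np.asarray(feature, dtype=np.float32)
        v = v / (np.linalg.norm(v) + 1e-8)
        sims = np.stack(self.features) @ v  # cosine similarities
        idx = int(np.argmax(sims))
        return float(sims[idx]), self.captions[idx]

# Usage: store representative anomaly prototypes, then query a clip.
bank = MemoryBank()
bank.add(np.random.rand(512), "a person throws an object at a car")
bank.add(np.random.rand(512), "two people fight on the sidewalk")
score, explanation = bank.score(np.random.rand(512))
print(f"anomaly score {score:.3f}: {explanation}")
```

Returning the caption alongside the score is what makes the prediction interpretable: the most similar stored memory doubles as a natural-language explanation of the detected anomaly.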
Keywords