IEEE Access (Jan 2024)
HCMT: A Novel Hierarchical Cross-Modal Transformer for Recognition of Abnormal Behavior
Abstract
Enhancing video recognition systems with advanced abnormal behavior recognition technologies is crucial for school safety and campus security. Traditional methods rely primarily on visual data and often fail to recognize complex behaviors against intricate backgrounds. Similarly, traditional audio processing techniques struggle to capture transient anomalies because of their limited capacity to handle complex sounds. This study addresses these limitations by integrating audio and visual data, compensating for the shortcomings of visual-only approaches in recognizing subtle behaviors. It introduces a novel Hierarchical Cross-Modal Transformer (HCMT) that combines multiple hierarchical visual and audio branches. By fusing the two modalities at several hierarchical levels, HCMT captures low-level features that single late-stage fusion methods often overlook, and thereby learns global features more effectively. The audio branch uses the newly developed Audio Temporal Spectrogram Transformer (ATST), which employs a global sparse uniform sampling technique to capture the transient nature of audio-based abnormalities, enhancing the robustness of behavior recognition. HCMT demonstrated a Top-1 accuracy of 79.45% and a Top-5 accuracy of 98.44% on the challenging Campus Abnormal Behavior Recognition Hard (CABRH8) dataset, which comprises eight hard-to-distinguish abnormal human behaviors; the ATST branch improved Top-1 accuracy by 7.45% over the visual-only baseline. HCMT also recorded Top-1 and Top-5 accuracies of 84.93% and 97.63% on the CABR50 dataset, outperforming prior models that relied solely on visual data and underscoring the adaptability of the approach. The model requires 992 GFLOPs and runs at 28 frames per second (FPS). Its generalizability was further confirmed on additional datasets, including UCF-101, where it achieved strong results. The code and models will be made publicly available at https://github.com/LiuHaiChuan0/2021-Deep-learning/tree/main/HCMT.
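To illustrate the global sparse uniform sampling idea mentioned in the abstract, the sketch below shows one common way to sample a fixed number of frames from a log-mel spectrogram so that the whole clip is covered, in the spirit of TSN-style segment sampling. This is a minimal sketch under stated assumptions: the function name `sparse_uniform_sample`, the frame-level sampling granularity, and the segment count are hypothetical choices for illustration, not taken from the released HCMT/ATST code.

```python
import numpy as np

def sparse_uniform_sample(spectrogram: np.ndarray, num_segments: int = 8,
                          rng: np.random.Generator = None) -> np.ndarray:
    """Globally sample one spectrogram frame per uniform temporal segment.

    spectrogram: (T, F) array of T time frames with F mel bins.
    Returns a (num_segments, F) array spanning the full clip, so short
    transient events anywhere in the recording have a chance of being kept.
    """
    t = spectrogram.shape[0]
    # Split the full time axis into num_segments equal segments.
    bounds = np.linspace(0, t, num_segments + 1, dtype=int)
    if rng is None:
        # Deterministic variant (e.g., inference): centre frame of each segment.
        idx = (bounds[:-1] + bounds[1:]) // 2
    else:
        # Stochastic variant (e.g., training): one random frame per segment.
        idx = np.array([rng.integers(lo, max(lo + 1, hi))
                        for lo, hi in zip(bounds[:-1], bounds[1:])])
    return spectrogram[idx]

# Example: a 10 s clip at ~100 spectrogram frames/s with 128 mel bins.
spec = np.random.randn(1000, 128)
frames = sparse_uniform_sample(spec, num_segments=8, rng=np.random.default_rng(0))
print(frames.shape)  # (8, 128)
```

Sampling one frame per segment keeps the input sparse while guaranteeing coverage of the entire clip, which is why this style of sampling suits transient audio anomalies better than sampling a single dense window.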
Keywords