Jisuanji kexue yu tansuo (Dec 2024)

Human Uncivilized Behavior Detection Method Integrating Non-uniform Sampling and Feature Enhancement

  • YE Hao, WANG Longye, ZENG Xiaoli, XIAO Yue

DOI
https://doi.org/10.3778/j.issn.1673-9418.2401064
Journal volume & issue
Vol. 18, no. 12
pp. 3219 – 3234

Abstract

To address the misdetection of similar behaviors and the low accuracy on behaviors involving only part of the body in spatio-temporal action detection of abnormal human behavior, this paper proposes a method that integrates non-uniform sampling and feature enhancement, built on a self-made uncivilized behavior spatio-temporal action detection dataset (UBSAD). First, the method uses the video swin transformer (VST) as the backbone network in the spatio-temporal feature extraction stage to capture long-term temporal dependencies in videos and to strengthen the network’s global information learning capability. In addition, a ringed residual VST block replaces the standard VST block in the final stage of the backbone network, enlarging the difference between the target area and the background area; combined with the multi-head self-attention mechanism, this strengthens feature extraction in the target area. Furthermore, in the video frame collection stage, a non-uniform sampling method adjusts the input data distribution according to task requirements, so that the model obtains action change information in a hierarchical manner, which improves the network’s attention to the detailed features of similar behaviors. Finally, a new cascaded pooling three-dimensional spatial pyramid feature enhancement module that incorporates shallow features is embedded after the feature extraction network. This module improves feature applicability at various scales, reduces the loss of detailed motion information during feature extraction, and reduces interference from background information. Experimental results show that the method achieves mAP of 71.93% on the UBSAD dataset and 83.09% on the public dataset UCF101-24. These results are 7.39 and 1.22 percentage points higher, respectively, than those obtained with the baseline network VST as the spatio-temporal feature extraction model, demonstrating the method’s effectiveness in accurately detecting behavior.
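
The abstract does not specify how the non-uniform sampling selects frames, so the following is only an illustrative sketch of the general idea, not the authors' scheme. It assumes that the frames of interest lie near the middle of a clip and uses a hypothetical power-law warp (the function name and the `gamma` parameter are inventions for this example) to sample densely near the centre and sparsely toward the ends.

```python
import math


def nonuniform_frame_indices(num_frames, num_samples, gamma=2.0):
    """Return frame indices that are dense near the clip centre.

    Illustrative assumption only: evenly spaced positions in [-1, 1] are
    warped by a signed power law, so gamma > 1 pulls samples toward the
    centre and gamma == 1 gives ordinary uniform sampling.
    """
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    half = (num_frames - 1) / 2.0
    if num_samples == 1:
        return [int(round(half))]

    indices = []
    for k in range(num_samples):
        # Evenly spaced position in [-1, 1].
        u = -1.0 + 2.0 * k / (num_samples - 1)
        # Signed power-law warp: compresses positions toward 0 when gamma > 1.
        warped = math.copysign(abs(u) ** gamma, u)
        # Map the warped position back onto the frame axis.
        indices.append(int(round(half + warped * half)))
    return indices


if __name__ == "__main__":
    print(nonuniform_frame_indices(33, 9, gamma=2.0))
    print(nonuniform_frame_indices(33, 9, gamma=1.0))
```

For a 33-frame clip and 9 samples, `gamma=2.0` yields gaps of 7, 5, 3, 1, 1, 3, 5, 7 frames between consecutive samples, whereas `gamma=1.0` falls back to a constant gap of 4. A task-specific scheme would replace the fixed centre with whatever part of the clip carries the most action change.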

Keywords