Human Uncivilized Behavior Detection Method Integrating Non-uniform Sampling and Feature Enhancement

YE Hao, WANG Longye, ZENG Xiaoli, XIAO Yue

doi:10.3778/j.issn.1673-9418.2401064

Jisuanji kexue yu tansuo (Dec 2024)

Affiliations

YE Hao, WANG Longye, ZENG Xiaoli, XIAO Yue: 1. School of Electronics and Information Engineering, Southwest Petroleum University, Chengdu 610500, China 2. School of Information Science and Technology, Tibet University, Lhasa 850000, China

DOI: https://doi.org/10.3778/j.issn.1673-9418.2401064
Journal volume & issue: Vol. 18, no. 12
pp. 3219 – 3234

Abstract

To address the misdetection of similar behaviors and the low accuracy on behaviors involving only part of the body in spatio-temporal action detection of abnormal human behavior, this paper proposes a method that integrates non-uniform sampling and feature enhancement, built on a self-constructed uncivilized behavior spatio-temporal action detection dataset (UBSAD). First, the method uses the video swin transformer (VST) as the backbone network in the spatio-temporal feature extraction stage to capture long-term temporal dependencies in videos and strengthen the network's global information learning capability. In addition, a ringed residual VST block replaces the standard VST block in the final stage of the backbone, enlarging the difference between the target area and the background area; combined with the multi-head self-attention mechanism, this strengthens feature extraction for the target area. Furthermore, in the video frame collection stage, a non-uniform sampling method adjusts the input data distribution according to task requirements, so that the model obtains action change information hierarchically and pays more attention to the detailed features of similar behaviors. Finally, a new cascaded pooling three-dimensional spatial pyramid feature enhancement module that incorporates shallow features is embedded after the feature extraction network. It improves feature applicability across scales, reduces the loss of detailed motion information during feature extraction, and reduces interference from background information. Experimental results show that the method achieves mAP of 71.93% on UBSAD and 83.09% on the public dataset UCF101-24, which are 7.39 and 1.22 percentage points higher, respectively, than using the baseline VST as the spatio-temporal feature extraction model, demonstrating the method's effectiveness in accurately detecting behavior.
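
The abstract does not spell out the sampling rule, so the following is only an illustrative sketch of what non-uniform frame sampling around a key frame could look like, not the authors' scheme. It assumes frames are taken densely near the annotated key frame and sparsely farther away; the function name `nonuniform_indices` and all of its parameters are hypothetical.

```python
def nonuniform_indices(num_frames, key_frame,
                       dense_half=4, sparse_half=4,
                       dense_stride=1, sparse_stride=4):
    """Return frame indices that are dense near key_frame and sparse farther away.

    Hypothetical illustration only: the stride values and the two-level
    layout are assumptions, not taken from the paper.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= key_frame < num_frames:
        raise ValueError("key_frame must lie inside the video")

    offsets = [0]
    # Dense level: small stride on both sides of the key frame.
    for i in range(1, dense_half + 1):
        offsets += [-i * dense_stride, i * dense_stride]
    # Sparse level: larger stride, starting where the dense level ends.
    edge = dense_half * dense_stride
    for i in range(1, sparse_half + 1):
        step = edge + i * sparse_stride
        offsets += [-step, step]

    # Clamp to the valid range so the clip length stays fixed;
    # frames at the video border are repeated when the window overruns.
    last = num_frames - 1
    return [min(max(key_frame + o, 0), last) for o in sorted(offsets)]


if __name__ == "__main__":
    print(nonuniform_indices(num_frames=100, key_frame=50))
```

With the defaults, each clip has 17 frames: nine consecutive frames centered on the key frame, plus four widely spaced frames on each side that supply longer-range context.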

Keywords

spatio-temporal action detection; ringed residual video swin transformer; non-uniform sampling; cascaded pooling three-dimensional spatial pyramid

Published in Jisuanji kexue yu tansuo

ISSN: 1673-9418 (Print)
Publisher: Journal of Computer Engineering and Applications Beijing Co., Ltd., Science Press
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://fcst.ceaj.org
