Jisuanji Kexue yu Tansuo [Journal of Frontiers of Computer Science and Technology] (Dec 2024)
Point Cloud Action Recognition Method Based on Masked Self-Supervised Learning
Abstract
Point cloud action recognition enables precise 3D motion monitoring and recognition, with broad application prospects in intelligent interaction, intelligent security, and healthcare. Existing methods typically train models on large amounts of annotated point cloud data. However, point cloud videos contain a large number of 3D coordinates, so precise annotation is very expensive; they are also highly redundant, with information distributed unevenly across the video, which makes annotation harder still. To address these issues and improve point cloud action recognition performance, a masked self-supervised action recognition method, MSTD-Transformer, is proposed, which captures the spatiotemporal structure of point cloud videos without manual annotation. Specifically, the point cloud video is divided into point tubes, and adaptive video-level masks are generated according to importance. The model then learns the appearance and motion features of point cloud videos through two self-supervised streams: point cloud reconstruction and motion prediction. To better capture motion information, MSTD-Transformer extracts dynamic attention from the displacement of point cloud keypoints and embeds it into a Transformer, using a dual-branch structure for differential learning so that motion information and global structure are captured separately. Experimental results on the standard MSRAction-3D dataset show that the proposed method achieves an accuracy of 96.17% for action recognition on 24-frame point cloud videos, 2.09 percentage points higher than the best existing method, confirming the effectiveness of the masking strategy and dynamic attention.
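The importance-based masking of point tubes described above can be illustrated with a minimal sketch. It is not the MSTD-Transformer implementation: the abstract does not specify how importance is computed or how masks are sampled. The sketch assumes, purely for illustration, that a tube's importance is its mean centroid displacement between consecutive frames, and that masked tubes are drawn by weighted sampling without replacement. The names `tube_importance` and `adaptive_mask`, and the `mask_ratio` and `eps` parameters, are hypothetical.

```python
import math
import random


def tube_importance(tube):
    """Hypothetical importance score (an assumption, not the paper's definition):
    mean displacement of the tube centroid between consecutive frames.

    `tube` is a list of frames; each frame is a list of (x, y, z) points.
    """
    centroids = [
        tuple(sum(p[i] for p in frame) / len(frame) for i in range(3))
        for frame in tube
    ]
    steps = [math.dist(a, b) for a, b in zip(centroids, centroids[1:])]
    return sum(steps) / len(steps) if steps else 0.0


def adaptive_mask(scores, mask_ratio, rng, eps=1e-6):
    """Choose round(mask_ratio * len(scores)) tube indices to mask,
    favouring tubes with higher scores.

    Uses weighted sampling without replacement: each tube gets a key
    u ** (1 / weight) with u uniform in [0, 1), and the largest keys win.
    """
    num_masked = round(mask_ratio * len(scores))
    keys = [rng.random() ** (1.0 / (s + eps)) for s in scores]
    ranked = sorted(range(len(scores)), key=lambda i: keys[i], reverse=True)
    return sorted(ranked[:num_masked])


# Toy video: two static tubes and two tubes moving along the x axis.
static = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]] * 3
moving = [[(float(t), 0.0, 0.0), (float(t) + 1.0, 0.0, 0.0)] for t in range(3)]
tubes = [static, static, moving, moving]

scores = [tube_importance(t) for t in tubes]
masked = adaptive_mask(scores, mask_ratio=0.5, rng=random.Random(0))
```

In this toy input the moving tubes receive higher scores than the static ones, so they are far more likely to be selected for masking; whether the actual method masks high-importance or low-importance tubes preferentially is not stated in the abstract.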
Keywords