IEEE Access (Jan 2024)
Temporal-Channel Attention and Convolution Fusion for Skeleton-Based Human Action Recognition
Abstract
Human Action Recognition (HAR) based on skeleton sequences has attracted much attention because skeletal data are robust and insensitive to background. Convolutional neural networks (CNNs) are widely used for spatio-temporal representation learning in skeleton-based HAR. However, long-term spatio-temporal modeling and attention to action-category-specific features have not been fully exploited. To explore the potential of CNNs for skeleton-based HAR, a novel CNN architecture with temporal-channel attention and convolution fusion is proposed. Specifically, the architecture is composed of two novel modules: the Temporal-Channel Attention (TCA) module and the Multiscale Temporal Convolution Fusion (MTCF) module. The TCA module generates a temporal-channel attention matrix over the visual channels and temporal features, encouraging the CNN to focus on learning the critical category-associated feature representations. The MTCF module adapts grouped residual connections along the channel dimension to flexibly extend the temporal receptive field of the convolutions without introducing additional parameters. Through reverse stacking, the MTCF module creates bidirectional information interaction across channels, compensating for the receptive-field and information imbalance between subgroups in different branches. The proposed method was evaluated on three benchmark datasets: NTU RGB-D, NTU RGB-D120, and FineGYM. The results show that the proposed TCA-MTCF method improves the ability of CNNs to model long-term temporal features of skeleton sequences, achieving state-of-the-art performance for HAR.
Keywords