IEEE Access (Jan 2024)

Temporal-Channel Attention and Convolution Fusion for Skeleton-Based Human Action Recognition

  • Chengwu Liang,
  • Jie Yang,
  • Ruolin Du,
  • Wei Hu,
  • Ning Hou

DOI
https://doi.org/10.1109/ACCESS.2024.3389499
Journal volume & issue
Vol. 12
pp. 64937–64948

Abstract

Human Action Recognition (HAR) based on skeleton sequences has attracted much attention because skeletal data are robust and insensitive to background clutter. Convolutional neural networks (CNNs) for spatio-temporal representation learning have been widely used in skeleton-based HAR. However, long-term spatio-temporal modeling and action category-specific feature attention have not been fully exploited. To explore the current potential of CNNs for skeleton-based HAR, a novel CNN architecture with temporal-channel attention and convolution fusion is proposed. Specifically, the architecture is composed of two novel modules: the Temporal-Channel Attention (TCA) module and the Multiscale Temporal Convolution Fusion (MTCF) module. The TCA module generates a temporal-channel attention matrix over the visual channels and temporal features, encouraging the CNN to focus on learning the critical, category-associated feature representations. Along the channel dimension, the MTCF module adopts grouped residual connections to flexibly extend the temporal receptive field of the convolutions without introducing additional parameters. Through reverse stacking, the MTCF module creates a bidirectional information interaction across channels, compensating for the receptive-field and information imbalance between subgroups from different branches. The proposed method was evaluated on three benchmark datasets: NTU RGB+D, NTU RGB+D 120, and FineGYM. The results show that the proposed TCA-MTCF method improves the CNN's ability to model long-term temporal features of skeleton sequences, achieving state-of-the-art performance for HAR.
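To make the two modules concrete, below is a minimal PyTorch sketch based only on the abstract's description. The tensor layout (N, C, T, V: batch, channels, frames, joints), class names, reduction ratio, and group count are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of TCA and MTCF as described in the abstract;
# not the authors' code. Features have shape (N, C, T, V).
import torch
import torch.nn as nn

class TemporalChannelAttention(nn.Module):
    """Produces a per-channel, per-frame (C x T) attention map and reweights x."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        desc = x.mean(dim=-1, keepdim=True)      # pool over joints V -> (N, C, T, 1)
        attn = torch.sigmoid(self.mlp(desc))     # temporal-channel attention matrix
        return x * attn                          # broadcast back over V

class MultiscaleTemporalConvFusion(nn.Module):
    """Grouped residual temporal convolutions (Res2Net-style cascade), run once in
    forward channel order and once in reverse with the same convs, then summed,
    so receptive fields grow toward both ends of the channel axis without
    adding parameters."""
    def __init__(self, channels: int, groups: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        width = channels // groups
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, (kernel_size, 1), padding=(kernel_size // 2, 0))
            for _ in range(groups - 1)
        )

    def _stack(self, splits, reverse: bool) -> torch.Tensor:
        order = range(self.groups - 1, -1, -1) if reverse else range(self.groups)
        out, prev = [], None
        for i, idx in enumerate(order):
            s = splits[idx] if prev is None else splits[idx] + prev
            # First subgroup passes through; each later one sees a wider
            # temporal receptive field via the cascaded residual connections.
            prev = s if i == 0 else self.convs[i - 1](s)
            out.append(prev)
        if reverse:
            out = out[::-1]                      # restore channel order
        return torch.cat(out, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = list(torch.chunk(x, self.groups, dim=1))
        return self._stack(splits, reverse=False) + self._stack(splits, reverse=True)

x = torch.randn(2, 64, 32, 25)   # e.g. 25 skeleton joints over 32 frames
y = MultiscaleTemporalConvFusion(64)(TemporalChannelAttention(64)(x))
print(y.shape)                   # torch.Size([2, 64, 32, 25])
```

In this reading, the reverse pass reuses the forward pass's convolutions, which is one way to realize the abstract's claim of bidirectional inter-channel interaction "without introducing additional parameters"; the actual fusion rule in the paper may differ.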

Keywords