SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

Fei Wang; Guorui Wang; Yunwen Huang; Hao Chu

doi:10.1109/ACCESS.2019.2953113

IEEE Access (Jan 2019)

Affiliations

Fei Wang: Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China
Guorui Wang: College of Information Science and Engineering, Northeastern University, Shenyang, China
Yunwen Huang: College of Information Science and Engineering, Northeastern University, Shenyang, China
Hao Chu: Faculty of Robot Science and Engineering, Northeastern University, Shenyang, China

DOI: https://doi.org/10.1109/ACCESS.2019.2953113
Journal volume & issue: Vol. 7, pp. 164876–164886

Abstract

State-of-the-art action recognition methods face three challenges: (1) how to model the spatial transformations of an action, since actions undergo geometric variation over time in videos; (2) how to learn semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class, which hurts final performance; and (3) most existing models are too slow to be applied in real-world scenarios. To address these three challenges, we propose a novel CNN-based action recognition method called SAST, which consists of three modules and efficiently learns semantic action-aware spatial-temporal features. First, to learn action-aware spatial features (spatial transformations), we design a weight-shared 2D Deformable Convolutional network, named 2DDC, whose deformable convolutions have receptive fields that adapt to the complex geometric structure of actions. Second, we propose a lightweight Temporal Attention model, called TA, to learn action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames and build the final video-level representation. To improve efficiency, we use only raw RGB frames as input to our model, rather than both optical flow and RGB. Experimental results on four challenging video recognition datasets, Kinetics-400, Something-Something-V1, UCF101 and HMDB51, demonstrate that the proposed method not only achieves comparable performance but is also 10× to 50× faster than most state-of-the-art action recognition methods.
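
The abstract does not specify how the TA module is built, so the following is only a minimal sketch of the general idea of temporal attention: scoring each frame's feature vector and re-weighting frames so that action-relevant ones dominate. The function names, the linear scoring vector `w`, and the toy feature values are all hypothetical and are not taken from the paper; a generic softmax attention is swapped in for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, w):
    """Re-weight per-frame features by a softmax over per-frame scores.

    frame_features: list of T feature vectors, each of length D
    w: hypothetical scoring vector of length D (an assumption, not the
       paper's TA design)
    Returns (weights, weighted_features).
    """
    # One scalar relevance score per frame (dot product with w).
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frame_features]
    # Normalise scores across time so the weights sum to 1.
    weights = softmax(scores)
    # Scale each frame's features by its attention weight.
    weighted = [[a * x for x in f] for a, f in zip(weights, frame_features)]
    return weights, weighted


# Toy input: three frames with 2-D features; the middle frame scores highest.
features = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
w = [1.0, 1.0]
weights, weighted = temporal_attention(features, w)
```

In this toy setup the middle frame receives the largest weight, which mirrors the stated goal of suppressing frames that are irrelevant to the labeled action; in a full pipeline the re-weighted frame features would then be passed on to the 3D network.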

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639