IEEE Access (Jan 2020)
An Improved Attention-Based Spatiotemporal-Stream Model for Action Recognition in Videos
Abstract
Action recognition is an important yet challenging task in computer vision. The attention mechanism tells not only where but also when to focus, and it plays a key role in extracting discriminative spatial and temporal features for this task. In this paper, we propose an improved spatiotemporal attention model based on the two-stream structure to recognize different actions in videos. Specifically, we first extract intra-frame spatial features and inter-frame optical flow features from each video. We then implement an effective attention module that sequentially infers attention maps along three separate dimensions: channel, spatial, and temporal. After adaptively refining the features with these attention maps, we apply a temporal pooling step to squeeze the temporal dimension. The refined spatial and temporal features are then fed into a spatial LSTM and a temporal LSTM, respectively. Finally, we fuse the spatial feature, the temporal feature, and the two-stream fusion feature to classify the actions in videos. Additionally, we collect and construct a new Ping-Pong action dataset from YouTube for a subsequent human-robot interaction task; it contains 2400 labeled videos across 4 categories. We compare the proposed method with other action recognition algorithms and validate its feasibility and effectiveness on the Ping-Pong action dataset and the HMDB51 dataset.
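To make the sequential channel-spatial-temporal attention and temporal pooling steps concrete, the following is a minimal NumPy sketch. The pooling and gating choices (global average pooling followed by a sigmoid gate at each stage, then temporal mean pooling) are illustrative assumptions, not the paper's exact module:

```python
import numpy as np

def sequential_attention(feat):
    """Refine per-frame feature maps with channel, then spatial, then
    temporal attention, and squeeze the time axis by mean pooling.

    feat: array of shape (T, C, H, W) -- T frames of C-channel maps.
    Returns: array of shape (C, H, W) after temporal pooling.
    The specific squeeze/gate operations are simplifying assumptions.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Channel attention: average over time and space, gate each channel.
    ch = sigmoid(feat.mean(axis=(0, 2, 3)))          # shape (C,)
    feat = feat * ch[None, :, None, None]

    # Spatial attention: average over channels, gate each location.
    sp = sigmoid(feat.mean(axis=1, keepdims=True))   # shape (T, 1, H, W)
    feat = feat * sp

    # Temporal attention: average over channels and space, gate each frame.
    tm = sigmoid(feat.mean(axis=(1, 2, 3)))          # shape (T,)
    feat = feat * tm[:, None, None, None]

    # Temporal pooling squeezes the time dimension before the LSTM stage.
    return feat.mean(axis=0)                         # shape (C, H, W)

# Example: 8 frames of 16-channel 7x7 feature maps from one stream.
pooled = sequential_attention(np.random.rand(8, 16, 7, 7))
```

In the full model this refinement would be applied to both the spatial (RGB) and temporal (optical-flow) streams before their respective LSTMs.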
Keywords