IET Image Processing (Sep 2020)

Dynamic gesture recognition based on feature fusion network and variant ConvLSTM

  • Yuqing Peng,
  • Huifang Tao,
  • Wei Li,
  • Hongtao Yuan,
  • Tiejun Li

DOI
https://doi.org/10.1049/iet-ipr.2019.1248
Journal volume & issue
Vol. 14, no. 11
pp. 2480–2486

Abstract

Gesture is a natural form of human communication, and it is of great significance in human–computer interaction. In dynamic gesture recognition methods based on deep learning, the key is to obtain comprehensive gesture feature information. To address the inadequate extraction of spatiotemporal features and the loss of feature information in current dynamic gesture recognition, a new gesture recognition architecture is proposed, which combines a feature fusion network with a variant convolutional long short‐term memory (ConvLSTM). The architecture extracts spatiotemporal feature information from local, global and deep aspects, and applies feature fusion to alleviate the loss of feature information. Firstly, local spatiotemporal feature information is extracted from the video sequence by a 3D residual network based on channel feature fusion. Then the authors use the variant ConvLSTM to learn the global spatiotemporal information of dynamic gestures, and introduce an attention mechanism to modify the gate structure of the ConvLSTM. Finally, a multi‐feature fusion depthwise separable network is used to learn higher‐level features, including depth feature information. The proposed approach obtains very competitive performance on the Jester dataset, with a classification accuracy of 95.59%, and achieves state‐of‐the‐art performance with 99.65% accuracy on the SKIG (Sheffield Kinect Gesture) dataset.
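The abstract outlines a three-stage pipeline: a 3D residual network for local spatiotemporal features, a variant ConvLSTM whose gates are modulated by attention for global temporal modelling, and a depthwise separable network for higher-level features. The following is a minimal PyTorch sketch of that pipeline structure, not the authors' implementation: the layer sizes, the residual/fusion topology, and the exact way attention enters the ConvLSTM gates are assumptions made only for illustration (class names such as `GestureNetSketch` are hypothetical).

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# Layer sizes, fusion topology and the attention-gated ConvLSTM variant are assumptions.
import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    """Local spatiotemporal features via 3D convolutions with a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                     # x: (B, C, T, H, W)
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)             # channel-level fusion via residual addition

class AttnConvLSTMCell(nn.Module):
    """ConvLSTM cell with a spatial attention map re-weighting the input gate (assumed variant)."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=3, padding=1)
        self.attn = nn.Conv2d(in_ch + hid_ch, 1, kernel_size=1)

    def forward(self, x, state):               # x: (B, C, H, W)
        h, c = state
        z = torch.cat([x, h], dim=1)
        i, f, o, g = torch.chunk(self.gates(z), 4, dim=1)
        a = torch.sigmoid(self.attn(z))         # attention map modulates the input gate
        i, f, o = torch.sigmoid(i) * a, torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

class DepthwiseSeparableHead(nn.Module):
    """Higher-level features via a depthwise separable convolution, then classification."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, h):                       # h: (B, C, H, W)
        h = torch.relu(self.pointwise(self.depthwise(h)))
        return self.fc(h.mean(dim=(2, 3)))      # global average pool -> class logits

class GestureNetSketch(nn.Module):
    def __init__(self, channels=16, hid=32, num_classes=27):   # 27 gesture classes in Jester
        super().__init__()
        self.stem = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.local = Residual3DBlock(channels)
        self.cell = AttnConvLSTMCell(channels, hid)
        self.head = DepthwiseSeparableHead(hid, num_classes)

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        feats = self.local(self.stem(clip))
        B, C, T, H, W = feats.shape
        h = feats.new_zeros(B, self.cell.hid_ch, H, W)
        c = feats.new_zeros(B, self.cell.hid_ch, H, W)
        for t in range(T):                      # recurrent pass over the time axis
            h, c = self.cell(feats[:, :, t], (h, c))
        return self.head(h)

if __name__ == "__main__":
    logits = GestureNetSketch()(torch.randn(2, 3, 8, 32, 32))
    print(logits.shape)                         # torch.Size([2, 27])
```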

Keywords