IEEE Access (Jan 2023)

Dynamic Gesture Recognition Based on Three-Stream Coordinate Attention Network and Knowledge Distillation

  • Shanshan Wan,
  • Lan Yang,
  • Keliang Ding,
  • Dongwei Qiu

DOI
https://doi.org/10.1109/ACCESS.2023.3278100
Journal volume & issue
Vol. 11
pp. 50547 – 50559

Abstract

Gesture recognition has long been an important research direction in computer vision. Dynamic gesture recognition must cope with complex backgrounds and many interfering factors, and deep-learning-based gesture recognition models usually suffer from high computational cost and poor real-time performance. In addition, deep learning models are limited to recognizing the categories present in the training set, and their performance depends largely on the amount of labeled data. To address these problems, this paper presents a dynamic gesture recognition method named 3SCKI, based on a three-stream coordinate attention (CA) network, knowledge distillation, and image-text contrastive learning. Specifically, 1) CA is used for feature fusion so that the model focuses on the target gestures and background interference is reduced; 2) the traditional knowledge distillation loss is improved to reduce computation and improve real-time performance: a guidance function is added so that the student network learns only from the classification probabilities of samples that the teacher network identifies correctly; and 3) a multi-granularity context prompt template ensemble method is proposed to construct MG-CLIP, an improved CLIP visual-language model that aligns textual and visual concepts from the image level to the object level to the part level. Gesture classification is performed by contrastive learning between image features and text features, enabling the model to recognize image categories that did not appear during training. The proposed method is evaluated on the ChaLearn LAP large-scale isolated gesture dataset (IsoGD). The results show that the proposed method achieves a recognition rate of 65.87% on the IsoGD validation set. For single-modality data, 3SCKI obtains state-of-the-art recognition accuracy on RGB, Depth, and Optical Flow data (61.22%, 58.84%, and 50.30%, respectively, on the IsoGD validation set).
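The "guidance" idea in point 2) — the student distills only from samples the teacher classifies correctly — can be sketched as a masked temperature-softened KL loss. This is a minimal NumPy illustration of the general mechanism, not the authors' implementation; the function name, temperature value, and masking details are assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax over the class axis.
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def guided_distillation_loss(student_logits, teacher_logits, labels, T=4.0):
    """Hypothetical sketch of guided distillation: the KL(teacher || student)
    term is computed only on samples the teacher classifies correctly."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    correct = teacher_logits.argmax(axis=1) == labels  # guidance mask
    if not correct.any():
        return 0.0  # teacher wrong everywhere: no distillation signal
    kl = np.sum(
        p_t[correct] * (np.log(p_t[correct] + 1e-12) - np.log(p_s[correct] + 1e-12)),
        axis=1,
    )
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
    return float((T * T) * kl.mean())
```

In practice this term would be combined with a standard cross-entropy loss on the ground-truth labels; masking out teacher mistakes keeps the student from imitating confidently wrong teacher distributions.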
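The CLIP-style classification in point 3) scores an image embedding against text embeddings built from several prompt templates per class, which is how a model can assign categories unseen during training. The sketch below shows the generic zero-shot scoring step with prompt ensembling; the array shapes, function names, and logit scale are illustrative assumptions, not the paper's MG-CLIP architecture.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_feat, text_feats_per_class, logit_scale=100.0):
    """Hypothetical CLIP-style zero-shot scoring.

    image_feat: (dim,) image embedding.
    text_feats_per_class: (num_classes, num_prompts, dim) text embeddings
        from several prompt templates per class (multi-granularity prompts).
    Returns the index of the best-matching class.
    """
    img = l2_normalize(image_feat)
    # Prompt ensembling: average each class's template embeddings, renormalize.
    class_embs = l2_normalize(text_feats_per_class.mean(axis=1))
    logits = logit_scale * class_embs @ img  # cosine similarity per class
    return int(logits.argmax())
```

Because classification reduces to nearest-text-embedding lookup, adding a new gesture category only requires encoding its prompts, with no retraining of the visual encoder.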

Keywords