Alexandria Engineering Journal (May 2025)

Yo3RL-Net: A fusion of two-phase end-to-end deep net framework for hand detection and gesture recognition

  • Xiang Wu,
  • Yuanhao Ma,
  • Shijie Zhang,
  • Tianfei Chen,
  • He Jiang

Journal volume & issue
Vol. 121
pp. 77–89

Abstract


In intelligent industrial production, gesture-based human–computer interaction faces obstacles such as occlusion of the hand by other objects, intense illumination, and the difficulty of identifying small, distant targets. To tackle these problems, we propose Yo3RL-Net, an approach built on the YOLO algorithm and a 3D ResNet-LSTM, designed specifically for hand detection and gesture recognition. In the hand detection phase, the YOLOv8 algorithm is improved by incorporating dynamic snake convolutions, CPCA attention modules, and a detection layer designed for small targets, and by replacing the loss function with Inner-CIoU. These enhancements substantially improve the extraction of hand features from video frames. For the gesture recognition phase, we propose a method that combines a three-dimensional residual network with a long short-term memory (LSTM) network; a minimal sketch of this combination is given below. An average recognition rate of 93.78% was achieved in experiments on a set of seven dynamic gestures. The results show that the proposed model improves recognition accuracy by 11.5% and 5.45% compared with models employing only a 3D CNN and a 3D CNN-Bi-LSTM, respectively. Consequently, the model provides a superior user experience in gesture interaction tasks.
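
The sketch below illustrates the general idea of coupling a 3D residual network with an LSTM for clip-level gesture recognition, as described in the abstract. It is not the authors' implementation: the torchvision r3d_18 backbone, the 512-dimensional clip features, the 256-unit LSTM, the clip length, and the input resolution are all assumptions made for illustration; only the number of gesture classes (seven) comes from the paper.

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # assumed backbone, not specified in the paper


class ResNet3DLSTM(nn.Module):
    """Hypothetical 3D ResNet + LSTM pipeline for dynamic gesture recognition."""

    def __init__(self, num_classes: int = 7, hidden_size: int = 256):
        super().__init__()
        # 3D ResNet backbone; its classifier head is replaced to expose 512-d clip features.
        self.backbone = r3d_18(weights=None)
        self.backbone.fc = nn.Identity()
        # LSTM aggregates the per-clip spatiotemporal features over the full video.
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, 3, frames_per_clip, H, W)
        b, n, c, t, h, w = clips.shape
        feats = self.backbone(clips.reshape(b * n, c, t, h, w))  # (b*n, 512)
        feats = feats.reshape(b, n, -1)                          # (b, n, 512)
        out, _ = self.lstm(feats)
        return self.classifier(out[:, -1])                       # logits for the gesture classes


if __name__ == "__main__":
    model = ResNet3DLSTM()
    dummy = torch.randn(2, 4, 3, 16, 112, 112)  # 2 videos, 4 clips of 16 frames each (assumed sizes)
    print(model(dummy).shape)  # torch.Size([2, 7])

In this arrangement the 3D ResNet captures short-range spatiotemporal appearance within each clip, while the LSTM models the longer-range temporal ordering across clips before the final classification.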

Keywords