IEEE Access (Jan 2024)
Efficient Multimodal Fusion for Hand Pose Estimation With Hourglass Network
Abstract
Hand pose estimation is vital for applications such as virtual reality (VR), augmented reality (AR), gesture recognition, human-computer interaction (HCI), and robotics. Achieving accurate, real-time hand pose estimation is challenging due to the high degree of articulation of the human hand and the variability in hand shapes and sizes. While multimodal data offers advantages, developing a fast and resource-efficient hand pose estimation system remains difficult. Current state-of-the-art methods often require powerful graphics processing units (GPUs) for high performance, limiting deployment on edge platforms with limited computational resources. There is a critical need for higher efficiency without compromising accuracy, especially in real-world settings such as mobile devices and embedded systems. Real-time performance is also essential in practice, since systems must respond immediately to user interactions; yet most current methods struggle to reach real-time speeds even on powerful GPUs, let alone on resource-constrained devices. To address these challenges, we propose an efficient hand pose estimation system that leverages both red-green-blue (RGB) and depth data (RGB-D) through a unified fusion strategy. Our method combines appearance and geometric information early in the processing pipeline, significantly reducing computational complexity while maintaining real-time performance on resource-constrained devices. Experimental results show that the proposed model runs at over 110 frames per second (fps) on a desktop GPU and at 30 fps on the NVIDIA Jetson Xavier NX edge platform, 4 to 5 times faster than existing methods, while achieving competitive accuracy.
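To make the early-fusion idea concrete, the following is a minimal PyTorch sketch of RGB-D fusion feeding a single hourglass backbone that regresses per-joint heatmaps. All names, channel widths, and the hourglass depth (EarlyFusionHandNet, channels=64, depth=3, 21 joints) are illustrative assumptions for exposition, not the paper's exact architecture.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Simple residual block used at every hourglass stage (assumed design).
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

class Hourglass(nn.Module):
    # One recursive hourglass: downsample, recurse, upsample, add skip branch.
    def __init__(self, depth, channels):
        super().__init__()
        self.skip = ResidualBlock(channels)
        self.down = ResidualBlock(channels)
        self.inner = (Hourglass(depth - 1, channels) if depth > 1
                      else ResidualBlock(channels))
        self.up = ResidualBlock(channels)

    def forward(self, x):
        skip = self.skip(x)
        y = nn.functional.max_pool2d(x, 2)
        y = self.up(self.inner(self.down(y)))
        y = nn.functional.interpolate(y, scale_factor=2, mode='nearest')
        return y + skip

class EarlyFusionHandNet(nn.Module):
    # Early fusion: depth joins RGB as a 4th input channel before any
    # convolution, so one shared backbone processes both modalities.
    def __init__(self, num_joints=21, channels=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(4, channels, 7, stride=2, padding=3),  # 4 = RGB + depth
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.hourglass = Hourglass(depth=3, channels=channels)
        self.head = nn.Conv2d(channels, num_joints, 1)  # per-joint heatmaps

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)  # (N, 4, H, W): fuse at the input
        x = self.hourglass(self.stem(x))    # (N, C, H/2, W/2)
        return self.head(x)                 # (N, J, H/2, W/2) heatmaps

# Example: one 256x256 RGB-D frame mapped to 21 joint heatmaps.
net = EarlyFusionHandNet()
rgb = torch.randn(1, 3, 256, 256)
depth = torch.randn(1, 1, 256, 256)
print(net(rgb, depth).shape)  # torch.Size([1, 21, 128, 128])

Fusing at the input in this way is what keeps the cost low: a late-fusion design would run two full backbones (one per modality), whereas here the extra modality adds only one input channel to the first convolution.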
Keywords