IEEE Access (Jan 2024)

Multi-Modal Hand-Object Pose Estimation With Adaptive Fusion and Interaction Learning

  • Dinh-Cuong Hoang,
  • Phan Xuan Tan,
  • Anh-Nhat Nguyen,
  • Duy-Quang Vu,
  • Van-Duc Vu,
  • Thu-Uyen Nguyen,
  • Ngoc-Anh Hoang,
  • Khanh-Toan Phan,
  • Duc-Thanh Tran,
  • Van-Thiep Nguyen,
  • Quang-Tri Duong,
  • Ngoc-Trung Ho,
  • Cong-Trinh Tran,
  • Van-Hiep Duong,
  • Phuc-Quan Ngo

DOI: https://doi.org/10.1109/ACCESS.2024.3388870
Journal volume & issue: Vol. 12, pp. 54339–54351

Abstract

Hand-object configuration recovery is an important task in computer vision. Estimating the pose and shape of both hands and objects during interaction has many applications, particularly in augmented reality, virtual reality, and imitation-based robot learning. The problem is especially challenging when the hand is interacting with objects in the environment, as this setting features both severe occlusions and non-trivial shape deformations. Whereas existing works treat the estimation of hand configurations (i.e., pose and shape parameters) in isolation from the recovery of parameters for the object acted upon, we posit that the two problems are related and can be solved more accurately together. We introduce an approach that jointly learns hand and object features from color and depth (RGB-D) images. Our approach fuses appearance and geometric features adaptively, accentuating or suppressing the features that are most informative for the downstream task of hand-object configuration recovery. We combine a deep Hough voting strategy built on these adaptive features with a graph convolutional network (GCN) that learns the relationships between hand and object shapes during interaction. Experimental results demonstrate that our proposed approach consistently outperforms state-of-the-art methods on popular benchmark datasets.
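
Since the abstract only summarizes the architecture, the following is a minimal PyTorch sketch of how two of its ingredients could look: (1) adaptive fusion that gates appearance versus geometric features per channel, and (2) a single graph-convolution step propagating features between hand and object nodes. The module names, feature dimensions, and toy hand-object graph are illustrative assumptions, not the authors' released code.

    # Minimal sketch; all names/dimensions are assumptions, not the paper's code.
    import torch
    import torch.nn as nn

    class AdaptiveFusion(nn.Module):
        """Channel-wise gated fusion of RGB (appearance) and depth (geometry) features."""
        def __init__(self, dim: int):
            super().__init__()
            # Gate predicts per-channel weights from the concatenated features.
            self.gate = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(inplace=True),
                nn.Linear(dim, dim), nn.Sigmoid(),
            )

        def forward(self, app: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
            g = self.gate(torch.cat([app, geo], dim=-1))  # (N, dim), values in [0, 1]
            # g -> 1 accentuates appearance; g -> 0 suppresses it in favor of geometry.
            return g * app + (1.0 - g) * geo

    class InteractionGCN(nn.Module):
        """One graph-convolution layer over a joint hand-object graph."""
        def __init__(self, dim: int):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # x: (N_nodes, dim) node features; adj: (N_nodes, N_nodes) adjacency.
            # Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (Kipf & Welling style).
            a_hat = adj + torch.eye(adj.size(0))
            d_inv_sqrt = a_hat.sum(dim=-1).pow(-0.5)
            norm = d_inv_sqrt.unsqueeze(-1) * a_hat * d_inv_sqrt.unsqueeze(0)
            return torch.relu(norm @ self.proj(x))

    # Usage: fuse per-node features, then one round of hand-object message
    # passing over a toy graph of 21 hand-joint nodes and 8 object-corner nodes.
    dim, n_nodes = 128, 21 + 8
    fused = AdaptiveFusion(dim)(torch.randn(n_nodes, dim), torch.randn(n_nodes, dim))
    adj = (torch.rand(n_nodes, n_nodes) > 0.7).float()
    adj = ((adj + adj.T) > 0).float()  # make the toy adjacency symmetric
    out = InteractionGCN(dim)(fused, adj)
    print(out.shape)  # torch.Size([29, 128])

The gate lets the network weight each channel by modality per point (e.g., leaning on geometry where color is occluded), and the normalized graph convolution lets hand and object nodes exchange evidence, which is one plausible reading of the interaction learning the abstract describes.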

Keywords