IEEE Access (Jan 2024)

A Unified Framework for Depth-Assisted Monocular Object Pose Estimation

  • Dinh-Cuong Hoang,
  • Phan Xuan Tan,
  • Thu-Uyen Nguyen,
  • Hai-Nam Pham,
  • Chi-Minh Nguyen,
  • Son-Anh Bui,
  • Quang-Tri Duong,
  • van-Duc Vu,
  • van-Thiep Nguyen,
  • van-Hiep Duong,
  • Ngoc-Anh Hoang,
  • Khanh-Toan Phan,
  • Duc-Thanh Tran,
  • Ngoc-Trung Ho,
  • Cong-Trinh Tran

DOI
https://doi.org/10.1109/ACCESS.2024.3443148
Journal volume & issue
Vol. 12
pp. 111723 – 111740

Abstract

Monocular Depth Estimation (MDE) and Object Pose Estimation (OPE) are important tasks in visual scene understanding. Traditionally, these challenges have been addressed independently, with separate deep neural networks designed for each task. However, we contend that MDE, which provides information about object distances from the camera, and OPE, which focuses on determining precise object position and orientation, are inherently connected. Combining these tasks in a unified approach facilitates the integration of spatial context, offering a more comprehensive understanding of object distribution in three-dimensional space. Consequently, this work addresses both challenges simultaneously, treating them as a multi-task learning problem. Our proposed solution is encapsulated in a Unified Framework for Depth-Assisted Monocular Object Pose Estimation. Taking Red-Green-Blue (RGB) images as input, our framework estimates the poses of multiple object instances alongside an instance-level depth map. During training, we utilize both depth and color images, but during inference, the model relies exclusively on color images. To enhance the depth-aware features crucial for robust object pose estimation, we introduce a depth estimation branch supervised by depth images. These features are further refined by a cross-task attention module, which significantly improves feature discriminability and robustness in object pose estimation. Through extensive experiments, our approach demonstrates competitive performance compared to state-of-the-art methods in object pose estimation. Moreover, our method operates in real time, underscoring its efficiency and practical applicability in various scenarios. This unified framework not only advances the state of the art in monocular depth estimation and object pose estimation but also underscores the potential of multi-task learning for enhancing the understanding of complex visual scenes.
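The cross-task attention described in the abstract can be sketched as a standard scaled dot-product attention in which pose-branch features act as queries and depth-branch features supply keys and values. The sketch below is a minimal NumPy illustration under our own assumptions (function and variable names are hypothetical; the paper's actual module may use multi-head attention, learned projections, and spatial feature maps rather than flat token vectors):

```python
import numpy as np

def cross_task_attention(pose_feat, depth_feat):
    """Hypothetical sketch: pose-branch tokens (queries) attend to
    depth-branch tokens (keys/values) to inject depth-aware context.

    pose_feat:  (N_p, d) pose-branch feature vectors
    depth_feat: (N_d, d) depth-branch feature vectors
    returns:    (N_p, d) depth-refined pose features
    """
    d = pose_feat.shape[-1]
    # Similarity between every pose token and every depth token.
    scores = pose_feat @ depth_feat.T / np.sqrt(d)      # (N_p, N_d)
    # Softmax over depth tokens (shifted for numerical stability).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Aggregate depth context and fuse with a residual connection.
    refined = weights @ depth_feat                      # (N_p, d)
    return pose_feat + refined

# Tiny example: 4 pose tokens attend to 6 depth tokens, both 8-dim.
rng = np.random.default_rng(0)
pose = rng.standard_normal((4, 8))
depth = rng.standard_normal((6, 8))
out = cross_task_attention(pose, depth)
```

The residual fusion means the pose branch keeps its original features even when the depth context is uninformative, which is a common design choice for this kind of auxiliary-task refinement.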

Keywords