Algorithms (Apr 2025)
Multiscale Feature Fusion with Self-Attention for Efficient 6D Pose Estimation
Abstract
Six-dimensional (6D) pose estimation remains a significant challenge in computer vision, particularly for objects in complex environments. To overcome the limitations of existing methods in occluded and low-texture scenarios, a lightweight multiscale feature fusion network was proposed. In this network, a self-attention mechanism was integrated into the multiscale point cloud feature extraction module, enhancing the representation of local features and mitigating the information loss caused by occlusion. A lightweight image feature extraction module was also introduced to reduce computational complexity while maintaining high pose estimation precision. Ablation experiments on the LineMOD dataset validated the effectiveness of the two modules: the proposed network achieved 98.5% accuracy with 19.49 million parameters and a processing speed of 31.8 frames per second (FPS). Comparative experiments on the LineMOD, Yale-CMU-Berkeley (YCB)-Video, and Occlusion LineMOD datasets demonstrated the superior performance of the proposed method. Specifically, the average nearest point distance (ADD-S) metric improved over DenseFusion by 4.2 percentage points on LineMOD and by 0.6 percentage points on YCB-Video, and it reached 63.4% on the Occlusion LineMOD dataset. In addition, inference speed comparisons showed that the proposed method outperformed most RGB-D-based methods. These results confirm that the proposed method is both robust and efficient in handling occlusions and low-texture objects while retaining a lightweight network design.
Keywords