Algorithms (Apr 2025)
Multiscale Feature Fusion with Self-Attention for Efficient 6D Pose Estimation
Abstract
Six-dimensional (6D) pose estimation remains a significant challenge in computer vision, particularly for objects in complex environments. To overcome the limitations of existing methods in occluded and low-texture scenarios, a lightweight multiscale feature fusion network was proposed. In this network, a self-attention mechanism was integrated into the multiscale point cloud feature extraction module, enhancing the representation of local features and mitigating the information loss caused by occlusion. A lightweight image feature extraction module was also introduced to reduce computational complexity while maintaining high pose estimation precision. Ablation experiments on the LineMOD dataset validated the effectiveness of the two modules: the proposed network achieved 98.5% accuracy with 19.49 million parameters and a processing speed of 31.8 frames per second (FPS). Comparative experiments on the LineMOD, Yale-CMU-Berkeley (YCB)-Video, and Occlusion LineMOD datasets demonstrated the superior performance of the proposed method. Specifically, the average nearest point distance (ADD-S) metric improved over DenseFusion by 4.2 percentage points on LineMOD and by 0.6 percentage points on YCB-Video, and it reached 63.4% on the Occlusion LineMOD dataset. In addition, inference speed comparisons showed that the proposed method outperformed most RGB-D-based methods. These results confirm that the proposed method is both robust and efficient in handling occlusions and low-texture objects while retaining a lightweight network design.
Keywords