IEEE Access (Jan 2024)

A Global Pose and Relative Pose Fusion Network for Monocular Visual Odometry

  • Bo Su
  • Tianxiang Zang

DOI
https://doi.org/10.1109/ACCESS.2024.3439529
Journal volume & issue
Vol. 12
pp. 108863–108875

Abstract

Visual odometry (VO) systems typically rely on stacked convolutional layers and Long Short-Term Memory (LSTM) units to capture long-range dependencies within sequences. However, convolutional neural networks are comparatively weak at modeling global context, which makes them ill-suited to the trajectory drift that accumulates during long-term navigation, and their inherent locality further limits performance. We therefore propose a transformer- and cross-frame-based visual odometry method, named TSCVO, to address these challenges. TSCVO consists of two main components: 1) a relative pose subnetwork based on a k-NN TimeSformer model, which uses divided space-time attention to estimate relative poses between adjacent frames; it substitutes k-NN attention for the original fully connected self-attention in TimeSformer to ignore irrelevant tokens and reduce computational complexity; and 2) a global pose subnetwork based on cross-frame attention, which uses a cross-frame interaction mechanism to learn long-range dependencies and internal correlations within image sequences. Additionally, we adopt a robust loss as the foundational loss function, supplemented by a geometric consistency loss that further constrains structural similarity between adjacent frames and improves prediction accuracy. Evaluation results on the KITTI and EuRoC datasets show that TSCVO outperforms other learning-based methods.
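To make the relative pose subnetwork's attention mechanism concrete, below is a minimal PyTorch sketch of divided space-time attention with a k-NN restriction, as the abstract describes: each query keeps only its top-k highest-scoring keys and masks the rest before the softmax, so irrelevant tokens contribute nothing. The tensor layout, function names, and residual connections are illustrative assumptions, not the authors' implementation.

    import torch

    def knn_attention(q, k, v, top_k):
        # Scaled dot-product attention that keeps only the top-k keys per query.
        scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        kth = scores.topk(top_k, dim=-1).values[..., -1:]  # k-th largest score per query
        attn = scores.masked_fill(scores < kth, float('-inf')).softmax(dim=-1)
        return attn @ v

    def divided_space_time(x, num_frames, top_k):
        # x: (batch, frames * patches, dim), patch tokens laid out frame-major.
        b, n, d = x.shape
        p = n // num_frames
        # Temporal step: each spatial location attends across frames.
        xt = x.reshape(b, num_frames, p, d).permute(0, 2, 1, 3).reshape(b * p, num_frames, d)
        xt = knn_attention(xt, xt, xt, min(top_k, num_frames)) + xt
        # Spatial step: patches within each frame attend to one another.
        xs = xt.reshape(b, p, num_frames, d).permute(0, 2, 1, 3).reshape(b * num_frames, p, d)
        xs = knn_attention(xs, xs, xs, min(top_k, p)) + xs
        return xs.reshape(b, num_frames, p, d).reshape(b, n, d)

Restricting each query to its k nearest keys suppresses attention noise from unrelated tokens; note that this naive sketch still forms the full score matrix, so the computational saving the abstract mentions would require a sparse implementation.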
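The combined objective can likewise be sketched. The exact robust base loss is garbled in this listing, so smooth L1 (Huber) stands in for it below, and a simple difference against a pose-warped neighboring frame stands in for the geometric consistency term; the warping step and the weight lam are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def robust_base_loss(pred_pose, gt_pose):
        # Huber / smooth-L1 used here as a placeholder for the paper's robust loss.
        return F.smooth_l1_loss(pred_pose, gt_pose)

    def geometric_consistency_loss(frame_t, frame_t1_warped):
        # Penalize structural disagreement between a frame and its neighbor
        # warped into the same viewpoint by the predicted relative pose
        # (hypothetical formulation; the warp itself is assumed, not shown).
        return (frame_t - frame_t1_warped).abs().mean()

    def total_loss(pred_pose, gt_pose, frame_t, frame_t1_warped, lam=0.1):
        # lam weights the consistency term against the base term (assumed value).
        return robust_base_loss(pred_pose, gt_pose) + lam * geometric_consistency_loss(frame_t, frame_t1_warped)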
