IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

A Scene Graph Encoding and Matching Network for UAV Visual Localization

  • Ran Duan,
  • Long Chen,
  • Zhaojin Li,
  • Zeyu Chen,
  • Bo Wu

DOI
https://doi.org/10.1109/JSTARS.2024.3396168
Journal volume & issue
Vol. 17
pp. 9890 – 9902

Abstract

This article tackles the visual localization of unmanned aerial vehicles (UAVs) when multisource and cross-view images are involved. We present a lightweight end-to-end scene graph encoding and matching network that finds the best matches for the airborne camera views from the reference image maps. The scene graph addresses the challenge of encoding the semantic scene by aggregating the image convolutional features into global and structured semiglobal descriptors. The principal contributions of this article are as follows. First, we develop a new network architecture that embeds a nonlocal block and a modified vector of locally aggregated descriptors network (NetVLAD) into a backbone convolutional neural network. The main component of the modified NetVLAD is a cluster similarity masking graph (CSMG) encoder, which replaces the feature-cluster residual computation in NetVLAD with cluster consensus feature aggregation and structure-aware scene graph extraction. In addition, the nonlocal block extracts a discriminative global feature descriptor for each image. Second, we develop a new triplet loss for the network training procedure to learn features at different semantic levels: the proposed global descriptor and the CSMG encoder are trained together according to a weighted sum of cosine triplet losses. Third, the global descriptor from the nonlocal block and the semiglobal descriptor from the CSMG encoder work hierarchically for coarse-to-fine image retrieval, achieving real-time efficiency and favorable accuracy in searching and matching images from the reference image map. We train and test the model on two challenging benchmark datasets, and we also test the pretrained model on a dataset collected by a fixed-wing UAV to further evaluate its generalizability. The benchmark evaluations and ablation experiments show that the developed method outperforms state-of-the-art methods and achieves superior performance in the real-time matching of UAV images and reference image maps for UAV visual localization. Open-source code is available on GitHub.
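To make the training and retrieval ideas in the abstract concrete, the sketch below illustrates (i) a weighted sum of cosine triplet losses over a global and a semiglobal descriptor and (ii) coarse-to-fine retrieval, where the global descriptor shortlists reference images and the semiglobal descriptor re-ranks the shortlist. This is a minimal, hypothetical illustration, not the authors' released code: all tensor shapes, the loss weights, and the function names are assumptions.

```python
# Hypothetical sketch of the two ideas summarized in the abstract (PyTorch).
# Descriptor sizes, loss weights, and names are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet loss in cosine-similarity form: push sim(a, p) above sim(a, n) by a margin."""
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(sim_neg - sim_pos + margin).mean()


def combined_loss(global_desc, semi_desc, margin=0.1, w_global=1.0, w_semi=1.0):
    """Weighted sum of cosine triplet losses on the global and semiglobal descriptors.

    global_desc / semi_desc: dicts with 'a', 'p', 'n' tensors (anchor, positive, negative).
    """
    l_global = cosine_triplet_loss(global_desc["a"], global_desc["p"], global_desc["n"], margin)
    l_semi = cosine_triplet_loss(semi_desc["a"], semi_desc["p"], semi_desc["n"], margin)
    return w_global * l_global + w_semi * l_semi


def coarse_to_fine_retrieval(query_g, query_s, ref_g, ref_s, top_k=10):
    """Coarse retrieval with the global descriptor, then re-ranking with the semiglobal one."""
    # Coarse stage: cosine similarity between the query's global descriptor and every reference.
    coarse_sim = F.cosine_similarity(query_g.unsqueeze(0), ref_g, dim=-1)
    candidates = coarse_sim.topk(min(top_k, ref_g.shape[0])).indices
    # Fine stage: re-rank the shortlist with the (CSMG-style) semiglobal descriptor.
    fine_sim = F.cosine_similarity(query_s.flatten().unsqueeze(0),
                                   ref_s[candidates].flatten(1), dim=-1)
    return candidates[fine_sim.argmax()]


if __name__ == "__main__":
    torch.manual_seed(0)
    d_g, n_clusters, d_c, n_ref = 256, 8, 64, 100   # assumed descriptor sizes and map size
    ref_g = torch.randn(n_ref, d_g)                 # global descriptors of the reference map
    ref_s = torch.randn(n_ref, n_clusters, d_c)     # semiglobal (per-cluster) descriptors
    query_g, query_s = torch.randn(d_g), torch.randn(n_clusters, d_c)
    best = coarse_to_fine_retrieval(query_g, query_s, ref_g, ref_s)
    print("best-matching reference index:", best.item())

    # Training-style loss on a toy batch of anchor/positive/negative descriptors.
    g = {k: torch.randn(4, d_g) for k in ("a", "p", "n")}
    s = {k: torch.randn(4, n_clusters * d_c) for k in ("a", "p", "n")}
    print("combined triplet loss:", combined_loss(g, s).item())
```

The two-stage retrieval mirrors the hierarchy described in the abstract: the compact global descriptor keeps the coarse search cheap over the whole reference map, while the structured semiglobal descriptor is only compared against the short candidate list.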

Keywords