Scientific Reports (Sep 2024)
DINO-Mix enhancing visual place recognition with foundational vision model and feature mixing
Abstract
Using visual place recognition (VPR) technology to ascertain the geographical location of publicly available images is a pressing problem. Although most current VPR methods achieve favorable results under ideal conditions, their performance in complex environments characterized by lighting variations, seasonal changes, and occlusions is generally unsatisfactory, so obtaining efficient and robust image feature descriptors in such environments remains a challenge. In this study, we used the DINOv2 model as the backbone, trimming and fine-tuning it to extract robust image features, and employed a feature-mixing module to aggregate these features into global descriptors that are robust and generalizable, enabling high-precision VPR. We experimentally demonstrate that the proposed DINO-Mix outperforms current state-of-the-art (SOTA) methods. On test sets with lighting variations, seasonal changes, and occlusions (Tokyo24/7, Nordland, and SF-XL-Testv1), the proposed architecture achieved Top-1 accuracy rates of 91.75%, 80.18%, and 82%, respectively, an average accuracy improvement of 5.14%. In addition, we compared it with other SOTA methods on representative image retrieval case studies, and our architecture outperformed its competitors in terms of VPR performance. Furthermore, we visualized the attention maps of DINO-Mix and other methods to provide a more intuitive understanding of their respective strengths; these visualizations serve as compelling evidence of the superiority of the DINO-Mix framework in this domain.
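To make the described pipeline concrete, the sketch below outlines the general idea of combining a foundation-model backbone with a feature-mixing aggregator to produce an L2-normalized global descriptor. It is a minimal, illustrative sketch rather than the authors' released implementation: the class names FeatureMixer and DinoMixSketch, all layer sizes, the assumed backbone interface (images mapped to patch tokens of shape (B, N, D)), and the torch.hub reference in the comments are assumptions introduced for illustration only.

# Minimal sketch of a DINO-Mix-style pipeline (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMixer(nn.Module):
    # MLP applied across the token axis with a residual connection,
    # mixing information between patch tokens (hypothetical sizes).
    def __init__(self, num_tokens: int, hidden_ratio: int = 1):
        super().__init__()
        self.mix = nn.Sequential(
            nn.LayerNorm(num_tokens),
            nn.Linear(num_tokens, num_tokens * hidden_ratio),
            nn.ReLU(),
            nn.Linear(num_tokens * hidden_ratio, num_tokens),
        )

    def forward(self, x):          # x: (B, D, N), i.e. channels x tokens
        return x + self.mix(x)

class DinoMixSketch(nn.Module):
    # Backbone patch tokens -> stacked feature mixers -> compact global descriptor.
    def __init__(self, backbone, num_tokens=256, in_dim=768,
                 out_tokens=4, out_dim=256, depth=4):
        super().__init__()
        self.backbone = backbone                              # e.g. a trimmed ViT backbone
        self.mixers = nn.Sequential(
            *[FeatureMixer(num_tokens) for _ in range(depth)])
        self.token_proj = nn.Linear(num_tokens, out_tokens)   # N -> r tokens
        self.channel_proj = nn.Linear(in_dim, out_dim)        # D -> d channels

    def forward(self, images):
        # Assumed backbone interface: images -> patch tokens of shape (B, N, D).
        tokens = self.backbone(images)                        # (B, N, D)
        x = tokens.permute(0, 2, 1)                           # (B, D, N)
        x = self.mixers(x)                                    # mix across the token dimension
        x = self.token_proj(x)                                # (B, D, r)
        x = self.channel_proj(x.permute(0, 2, 1))             # (B, r, d)
        return F.normalize(x.flatten(1), p=2, dim=-1)         # (B, r * d) global descriptor

if __name__ == "__main__":
    # Stand-in backbone for the sketch: any module mapping images to (B, N, D) tokens works.
    # In practice this role would be played by a trimmed, partially fine-tuned DINOv2 ViT,
    # e.g. loaded via torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").
    fake_backbone = lambda imgs: torch.randn(imgs.shape[0], 256, 768)
    model = DinoMixSketch(fake_backbone)
    print(model(torch.randn(2, 3, 224, 224)).shape)           # torch.Size([2, 1024])

The resulting descriptors can be compared with cosine similarity (or, equivalently after L2 normalization, Euclidean distance) for nearest-neighbor place retrieval.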
Keywords