Remote Sensing (Jun 2025)
ICT-Net: A Framework for Multi-Domain Cross-View Geo-Localization with Multi-Source Remote Sensing Fusion
Abstract
Traditional cross-view geo-localization methods built on a single neural network rely primarily on polar coordinate transformations and have limited capacity to model global correlations. To address the resulting weak feature correlation and poor scene adaptation, we present ICT-Net (Integrated CNN-Transformer Network), a framework that combines convolutional neural networks (CNNs) with Transformer architectures. The approach exploits the complementary strengths of CNNs in capturing local geometric details and of Transformers in establishing long-range dependencies, enabling joint perception of local and global visual patterns. Furthermore, taking advantage of the Transformer’s flexible input handling, we develop an attention-guided non-uniform cropping strategy that dynamically removes redundant image patches with minimal impact on localization accuracy, thereby improving computational efficiency. To facilitate practical deployment, we propose a deep embedding clustering algorithm optimized for rapid parsing of geo-localization information. Extensive experiments show that ICT-Net achieves state-of-the-art localization accuracy on the CVUSA benchmark, improving the top-1 recall rate by 8.6% over previous methods. Additional validation on a challenging real-world dataset collected at Beihang University (BUAA) further confirms the framework’s effectiveness and practical applicability in complex urban environments, in particular showing 23% higher robustness to vegetation variations.
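The idea behind attention-guided cropping can be illustrated with a minimal sketch: given one attention score per image patch, keep only the highest-scoring patches and pass the reduced sequence to the Transformer. The function name `prune_patches`, the `keep_ratio` parameter, and the simple top-k selection rule are assumptions made for illustration; the abstract does not specify how ICT-Net scores or selects patches, so this should not be read as the authors' implementation.

```python
from typing import List, Sequence, Tuple


def prune_patches(
    patches: Sequence[Sequence[float]],
    attention: Sequence[float],
    keep_ratio: float = 0.5,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep the most-attended patches, preserving their original order.

    Hypothetical helper: top-k selection by attention score is an assumed
    stand-in for the non-uniform cropping described in the abstract.
    """
    if len(patches) != len(attention):
        raise ValueError("need exactly one attention score per patch")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of patches to retain (at least one).
    n_keep = max(1, int(len(patches) * keep_ratio))

    # Rank patch indices by attention score, highest first.
    ranked = sorted(range(len(patches)), key=lambda i: attention[i], reverse=True)

    # Restore spatial order so positional information stays meaningful.
    kept = sorted(ranked[:n_keep])
    return [patches[i] for i in kept], kept


# Usage: four toy patches (one feature each) with made-up attention scores.
toy_patches = [[0.0], [1.0], [2.0], [3.0]]
toy_attention = [0.10, 0.70, 0.05, 0.15]
kept_patches, kept_indices = prune_patches(toy_patches, toy_attention, keep_ratio=0.5)
```

Because a Transformer accepts variable-length token sequences, the shortened patch list can be processed without padding, which is where the computational saving would come from under these assumptions.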
Keywords