IEEE Access (Jan 2024)
Infrared and Visible Image Fusion via General Feature Embedding From CLIP and DINOv2
Abstract
Jointly optimizing multi-modal image fusion and subsequent high-level tasks is attracting increasing research attention, since the two tasks can mutually promote each other. However, owing to the feature gap between the two tasks, complicated network structures and training strategies must be redesigned for each specific dataset. To address these issues, this paper proposes an infrared and visible image fusion method via general feature embedding from frozen CLIP and DINOv2 models. The core idea is to inject the general semantic features from the CLIP model into the fusion network, with a DINOv2-based segmenter serving as a constraint. Specifically, a feature merging module and injection strategies are designed to generate semantic features that are compatible with the fusion features while remaining aligned with the DINOv2 features. Leveraging the generalization ability of these foundation models, the proposed network can be optimized jointly with the high-level task, which promotes the training process. Comprehensive experiments on four public datasets demonstrate the effectiveness of our method.
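To make the described pipeline concrete, the following is a minimal PyTorch sketch of the overall idea: frozen foundation-model features (a CLIP image embedding for semantics; a frozen DINOv2-based segmenter would additionally supervise the output) injected into a small fusion network through a merging module. All module names, dimensions, and the additive injection scheme here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: module names (FeatureMerge, FusionNet), feature
# dimensions, and the additive injection are assumptions, not the paper's code.
import torch
import torch.nn as nn

class FeatureMerge(nn.Module):
    """Hypothetical merging module: projects a frozen CLIP embedding so it is
    compatible with the fusion features (the paper's 'feature merging module')."""
    def __init__(self, clip_dim=512, fuse_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(clip_dim, fuse_dim), nn.GELU())

    def forward(self, clip_feat, fuse_feat):
        # Broadcast the projected semantic vector over all spatial positions.
        sem = self.proj(clip_feat)[:, :, None, None]  # (B, C, 1, 1)
        return fuse_feat + sem                        # simple additive injection

class FusionNet(nn.Module):
    """Toy fusion backbone: encodes IR and visible inputs, injects semantics,
    and decodes a single fused image."""
    def __init__(self, fuse_dim=64):
        super().__init__()
        self.enc_ir = nn.Conv2d(1, fuse_dim, 3, padding=1)
        self.enc_vis = nn.Conv2d(1, fuse_dim, 3, padding=1)
        self.merge = FeatureMerge(fuse_dim=fuse_dim)
        self.dec = nn.Conv2d(fuse_dim, 1, 3, padding=1)

    def forward(self, ir, vis, clip_feat):
        feat = self.enc_ir(ir) + self.enc_vis(vis)  # naive feature fusion
        feat = self.merge(clip_feat, feat)          # semantic injection
        return torch.sigmoid(self.dec(feat))

# Usage with random stand-ins for an IR/visible pair; in practice clip_feat
# would come from a frozen CLIP image encoder, and the fused output would be
# further constrained by a frozen DINOv2-based segmentation head.
ir, vis = torch.rand(2, 1, 224, 224), torch.rand(2, 1, 224, 224)
clip_feat = torch.rand(2, 512)  # stand-in for a frozen CLIP embedding
fused = FusionNet()(ir, vis, clip_feat)
print(fused.shape)  # torch.Size([2, 1, 224, 224])
```

Because CLIP and DINOv2 stay frozen, only the lightweight merging module and fusion backbone receive gradients, which is what allows the approach to reuse general features without dataset-specific retraining of the foundation models.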
Keywords