IEEE Access (Jan 2024)
Consistent Augmentation Learning for Generalizing CLIP to Unseen Domains
Abstract
Domain generalization (DG) is a challenging transfer-learning task that aims to learn invariant knowledge from limited source domains, thereby improving generalization to out-of-distribution data in unseen domains. Recent advances in vision-language models (VLMs) have notably improved the ability of deep models to generalize to unseen target domains. In particular, CLIP has demonstrated promising zero-shot transferability through image-text matching, making it a strong tool for domain generalization. In this work, we explore generic methods for leveraging CLIP in DG image classification. Specifically, we propose Consistent Augmentation Learning (CAL), a simple and effective extension of CLIP within the domain generalization framework. During training, CAL introduces a new fine-tuning method, Consistent Augmentation Fine-Tuning (CAFT), which enforces feature consistency across different augmented views of the same sample. During inference, CAL introduces a new test-time augmentation strategy, Entropy-guided Test-time Augmentation (ETTA), which improves the prediction confidence and robustness of the fine-tuned CLIP model by exploiting information extracted from test images. Extensive experiments show that CAL successfully extracts domain-invariant features, greatly enhancing the generalization capability of CLIP and achieving state-of-the-art performance on three challenging benchmarks: DomainBed domain generalization, ImageNet classification, and Base-to-New generalization.
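To make the two components concrete, the sketch below shows one plausible PyTorch realization, assuming an OpenAI-style CLIP interface (encode_image, logit_scale) and precomputed, normalized class text embeddings. The cosine-similarity consistency penalty in the CAFT-style loss, the entropy-based view selection in the ETTA-style prediction rule, and all hyperparameters (n_views, lam, n_aug, keep) are illustrative assumptions; the abstract does not specify the exact formulation.

import torch
import torch.nn.functional as F

def caft_loss(model, images, labels, text_feats, augment, n_views=2, lam=1.0):
    # Consistent Augmentation Fine-Tuning (sketch): encode several
    # augmented views of the same batch of images.
    views = [F.normalize(model.encode_image(augment(images)), dim=-1)
             for _ in range(n_views)]

    # CLIP-style classification loss on each view: image features are
    # matched against the normalized class text embeddings.
    logit_scale = model.logit_scale.exp()
    cls_loss = sum(F.cross_entropy(logit_scale * v @ text_feats.t(), labels)
                   for v in views) / n_views

    # Consistency term: pull features of different views of the same
    # sample together (assumed here to be a cosine-similarity penalty).
    cons_loss = 0.0
    for i in range(n_views):
        for j in range(i + 1, n_views):
            cons_loss = cons_loss + (1 - (views[i] * views[j]).sum(-1)).mean()

    return cls_loss + lam * cons_loss

@torch.no_grad()
def etta_predict(model, image, text_feats, augment, n_aug=32, keep=0.1):
    # Entropy-guided Test-Time Augmentation (sketch): generate several
    # augmented views of one test image of shape (C, H, W).
    views = torch.stack([augment(image) for _ in range(n_aug)])   # (N, C, H, W)
    feats = F.normalize(model.encode_image(views), dim=-1)
    logits = model.logit_scale.exp() * feats @ text_feats.t()
    probs = logits.softmax(dim=-1)                                # (N, K)

    # Keep the fraction of views with the lowest prediction entropy,
    # i.e. the most confident views, and average their probabilities.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)     # (N,)
    k = max(1, int(keep * n_aug))
    idx = entropy.topk(k, largest=False).indices
    return probs[idx].mean(0)                                     # (K,) class probs

In this sketch, fine-tuning would call caft_loss per batch and backpropagate as usual, while etta_predict replaces the single-view forward pass at test time; both pieces leave the CLIP text encoder and its class embeddings untouched.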
Keywords