IEEE Access (Jan 2023)
DAKRS: Domain Adaptive Knowledge-Based Retrieval System for Natural Language-Based Vehicle Retrieval
Abstract
Given Natural Language (NL) text descriptions, NL-based vehicle retrieval aims to extract target vehicles from a multi-view multi-camera traffic video pool. Solutions to the problem have been challenged by not only inherent distinctions between textual and visual domains, but also by the complexities of the high-dimensionality of visual data, the diverse range of textual descriptions, a major lack of high-volume datasets in this relatively new field, alongside prominently large domain gaps between training and test sets. To deal with these issues, existing approaches have advocated computationally expensive models to separately extract the subspaces of language and vision before blending them into the same shared representation space. Through our proposed Domain Adaptive Knowledge-based Retrieval System (DAKRS), we show that by taking advantage of multi-modal information in a pretrained model, we can better focus on training robust representations in the shared space of limited labels, rather than on robust extraction of uni-modal representations that comes with increased computational burdens. Our contributions are threefold: (i) An efficient extension of Contrastive Language-Image Pre-training (CLIP)’s transfer learning into a baseline text-to-image multi-modular vehicle retrieval framework; (ii) A data enhancement method to create pseudo-vehicle tracks from the traffic video pool by leveraging the robustness of baseline retrieval model combined with background subtraction; and (iii) A Semi-Supervised Domain Adaptation (SSDA) scheme to engineer pseudo-labels for adapting model parameters to the target domain. Experimental results are benchmarked on Cityflow-NL to obtain 63.20% MRR with 150.0 M of parameters, illustrating our competitive effectiveness and efficiency against state-of-the-arts, without ensembling.
Keywords