Jisuanji kexue yu tansuo (Mar 2022)
Survey of Research on Deep Learning Image-Text Cross-Modal Retrieval
Abstract
With the rapid development of deep neural networks, multimodal learning techniques have attracted wide attention. Cross-modal retrieval is an important branch of multimodal learning. Its fundamental purpose is to reveal the relations between samples of different modalities by retrieving samples of different modalities that share the same semantics. In recent years, cross-modal retrieval has gradually become a frontier and hot spot of academic research, and it is an important direction for the future development of information retrieval. This paper focuses on the latest developments in deep learning-based cross-modal retrieval and systematically reviews the development trends of real-valued representation-based and binary representation-based learning methods. Of these, real-valued representation-based methods are used to improve semantic relevance and retrieval accuracy, while binary representation-based methods are used to improve the efficiency of image-text cross-modal retrieval and reduce storage space. In addition, the commonly used open datasets in the field of image-text cross-modal retrieval are summarized, and the performance of various algorithms on different datasets is compared. In particular, this paper summarizes and analyzes specific applications of cross-modal retrieval techniques in the fields of public security, media, and medicine. Finally, in light of state-of-the-art technologies, development trends and future research directions are discussed.
Keywords