IEEE Access (Jan 2024)
Architectural Synergies in Bi-Modal and Bi-Contrastive Learning
Abstract
The integration of vision and language is an increasingly central theme in artificial intelligence research, driven by advances in pre-trained models. Traditionally, such models have been developed independently, using objectives such as contrastive learning and image captioning to improve understanding and generation, respectively. This paper introduces the Zero-shot Unified Image-Text (ZsU-IT) framework, which combines these pre-training objectives within a single unified encoder-decoder architecture. ZsU-IT comprises dedicated image and text encoders together with a bi-modal decoder that serves as both an encoder and a decoder across tasks. This dual role enables effective knowledge transfer between the visual and linguistic modalities, improving the system’s adaptability and efficiency in tasks such as image-to-text and text-to-image translation. Extensive experiments show that ZsU-IT outperforms existing models across multiple applications, including image and text retrieval, image captioning, Visual Question Answering (VQA), and Stanford Natural Language Inference - Visual Entailment (SNLI-VE), with particularly notable gains on challenging datasets such as medical texts and CT images. In zero-shot settings, ZsU-IT generalizes exceptionally well. The framework not only sets a new benchmark for the fusion of vision and language technologies but also opens new opportunities for research and practical deployment, marking a step forward in the use of integrated multimodal data for complex problem solving in artificial intelligence.
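
To make the encoder-decoder sharing described above concrete, the sketch below gives one plausible reading of the architecture in PyTorch: separate image and text encoders feed a single bi-modal transformer block that is reused bidirectionally for cross-modal understanding (the contrastive branch pools unimodal features) and causally for caption generation. All module names, dimensions, and layer choices here are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of a ZsU-IT-style architecture (illustrative only; the
    # paper's exact layers, dimensions, and objectives are assumptions here).
    import torch
    import torch.nn as nn

    class BiModalDecoder(nn.Module):
        """Shared transformer block that can either encode text (bidirectional
        self-attention) or decode text (causal self-attention), in both cases
        cross-attending to image features."""
        def __init__(self, d_model=256, nhead=4, num_layers=2):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.blocks = nn.TransformerDecoder(layer, num_layers)

        def forward(self, text_feats, image_feats, causal=False):
            mask = None
            if causal:
                # Upper-triangular mask so each position attends only to the past.
                L = text_feats.size(1)
                mask = torch.triu(
                    torch.ones(L, L, dtype=torch.bool, device=text_feats.device),
                    diagonal=1)
            return self.blocks(tgt=text_feats, memory=image_feats, tgt_mask=mask)

    class ZsUIT(nn.Module):
        def __init__(self, vocab_size=30522, d_model=256):
            super().__init__()
            # Unimodal encoders (stand-ins for e.g. a ViT and a text transformer).
            self.image_encoder = nn.Sequential(nn.Linear(768, d_model), nn.GELU())
            self.text_embed = nn.Embedding(vocab_size, d_model)
            enc_layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
            self.text_encoder = nn.TransformerEncoder(enc_layer, 2)
            # Shared bi-modal decoder reused for understanding and generation.
            self.bimodal = BiModalDecoder(d_model)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def contrastive_features(self, image_patches, text_ids):
            # Contrastive branch: pooled, normalized unimodal features.
            img = self.image_encoder(image_patches).mean(dim=1)
            txt = self.text_encoder(self.text_embed(text_ids)).mean(dim=1)
            return (nn.functional.normalize(img, dim=-1),
                    nn.functional.normalize(txt, dim=-1))

        def caption_logits(self, image_patches, text_ids):
            # Captioning branch: causal decoding over text, grounded in the image.
            img = self.image_encoder(image_patches)
            txt = self.text_embed(text_ids)
            hidden = self.bimodal(txt, img, causal=True)
            return self.lm_head(hidden)

    # Toy usage: 2 images (49 patch features of dim 768) and 2 captions of 12 tokens.
    model = ZsUIT()
    images = torch.randn(2, 49, 768)
    captions = torch.randint(0, 30522, (2, 12))
    img_f, txt_f = model.contrastive_features(images, captions)
    logits = model.caption_logits(images, captions)
    print(img_f.shape, txt_f.shape, logits.shape)  # (2,256) (2,256) (2,12,30522)

In this sketch, sharing the bi-modal decoder between the bidirectional (understanding) and causal (generation) modes is what stands in for the knowledge transfer between modalities claimed in the abstract; how the actual framework switches modes is not specified here and would follow the paper's full method section.
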
Keywords