Architectural Synergies in Bi-Modal and Bi-Contrastive Learning

Yujia Gu; Brian Liu; Tianlong Zhang; Xinye Sha; Shiyong Chen

doi:10.1109/ACCESS.2024.3457586

IEEE Access (Jan 2024)

Affiliations

Yujia Gu: California State University at Long Beach, Long Beach, CA, USA
Brian Liu: Stuyvesant High School, New York, NY, USA
Tianlong Zhang: University of Pittsburgh, Pittsburgh, PA, USA
Xinye Sha: Columbia University, New York, NY, USA
Shiyong Chen: Beihang University, Beijing, China

DOI: https://doi.org/10.1109/ACCESS.2024.3457586
Journal volume & issue: Vol. 12
pp. 187128 – 187140

Abstract

The integration of visual and linguistic elements within artificial intelligence research is increasingly emphasized, spurred by advancements in pre-trained model technologies. Traditionally, such models have been developed independently, using methods like contrastive learning and image captioning to boost their analytical and creative outputs. This paper introduces the Zero-shot Unified Image-Text (ZsU-IT) framework, which synthesizes these pre-training objectives into a cohesive encoder-decoder structure. ZsU-IT combines distinct components for image and text processing with a bi-modal decoder that handles both encoding and decoding across tasks. This dual functionality promotes knowledge transfer between the visual and linguistic modalities, improving the system’s adaptability and efficiency in tasks such as image-to-text translation and vice versa. Empirical studies show that ZsU-IT outperforms prevailing models across multiple applications, including image and text retrieval, image captioning, Visual Question Answering (VQA), and visual entailment on the Stanford Natural Language Inference - Visual Entailment (SNLI-VE) benchmark. The gains are particularly notable in complex settings involving specialized datasets such as medical texts and CT images. ZsU-IT also excels in zero-shot environments, displaying strong generalization capabilities. The framework sets a new benchmark in the fusion of vision and language technologies and opens opportunities for both ongoing research and practical implementations, marking a step forward in applying integrated multimodal data to complex problem-solving in artificial intelligence.
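
The abstract does not state how the contrastive and captioning objectives are combined, so the following is only a generic sketch of one common way to do it: a symmetric image-text contrastive loss added to a token-level captioning cross-entropy. All function names, the temperature value, and the weights `w_con` and `w_cap` are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is a matched pair.

    The temperature default is an assumed value, not one from the paper.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # image-to-text direction: each image should pick its own caption
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # text-to-image direction: each caption should pick its own image
    cols = [[sim[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def captioning_loss(token_logits, target_ids):
    """Mean cross-entropy of the target token at each decoding step."""
    steps = zip(token_logits, target_ids)
    return -sum(log_softmax(l)[t] for l, t in steps) / len(target_ids)

def joint_loss(image_embs, text_embs, token_logits, target_ids,
               w_con=1.0, w_cap=1.0):
    """Weighted sum of the two objectives (weights are assumptions)."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In this sketch, matched image and text embeddings yield a contrastive term near zero, while mismatched pairs are penalized heavily; the captioning term depends only on the decoder's token logits.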

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
