AI Open (Jan 2024)

CPT: Colorful Prompt Tuning for pre-trained vision-language models

  • Yuan Yao,
  • Ao Zhang,
  • Zhengyan Zhang,
  • Zhiyuan Liu,
  • Tat-Seng Chua,
  • Maosong Sun

Journal volume & issue
Vol. 5, pp. 30–38

Abstract


Vision-Language Pre-training (VLP) models have shown promising capabilities in grounding natural language in image data, facilitating a broad range of cross-modal tasks. However, we note that there is a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VLP models for downstream tasks. To address this challenge, we present Color-based Prompt Tuning (CPT), a novel paradigm for tuning VLP models that reformulates visual grounding as a fill-in-the-blank problem with color-based co-referential markers in both image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VLP models. Comprehensive experimental results show that CPT achieves state-of-the-art performance on zero- and few-shot visual grounding (e.g., 75.1% zero-shot accuracy on RefCOCO), outperforming fine-tuned and other prompt-tuned models by a large margin. Moreover, CPT can be easily extended to achieve promising zero- and few-shot performance on other vision-language tasks, such as visual relation detection, visual commonsense reasoning, and visual question answering. We make the data and code publicly available at https://github.com/thunlp/CPT.
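To make the abstract's reformulation concrete, the following is a minimal sketch of CPT-style cross-modal prompting: candidate regions are overlaid with distinctly colored, semi-transparent blocks, and the query is phrased as a fill-in-the-blank sentence whose blank a masked language model fills with a color word. The color set, the prompt template, and the scorer `score_color` are illustrative assumptions, not the authors' exact implementation; see the linked repository for the real code.

```python
from PIL import Image, ImageDraw

# Illustrative color set (hypothetical values): each candidate region gets
# one color, and the same color word is its co-referential marker in text.
COLORS = {"red": (240, 0, 30), "green": (0, 240, 30), "blue": (30, 0, 240)}

def mark_regions(image: Image.Image, boxes, alpha=0.5):
    """Overlay a semi-transparent colored block on each candidate region,
    so the visual input carries the color-based markers."""
    base = image.convert("RGBA")
    layer = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(layer)
    for box, (name, rgb) in zip(boxes, COLORS.items()):
        draw.rectangle(box, fill=rgb + (int(255 * alpha),))
    return Image.alpha_composite(base, layer), list(COLORS)[: len(boxes)]

def build_prompt(expression: str) -> str:
    """Reformulate grounding as fill-in-the-blank: the VLP model is asked
    to fill the blank with the color word of the matching region."""
    return f"{expression} is in [MASK] color."

# Usage sketch: score_color(image, prompt, word) is an assumed helper that
# returns the masked-LM probability of `word` at the [MASK] position; the
# region whose color word scores highest is the grounding result.
# marked_img, color_names = mark_regions(img, proposal_boxes)
# prompt = build_prompt("the horse watched by the woman")
# best = max(color_names, key=lambda c: score_color(marked_img, prompt, c))
```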
