CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation

Shi-Cheng Guo; Shang-Kun Liu; Jing-Yu Wang; Wei-Min Zheng; Cheng-Yu Jiang

doi:10.3390/e25091353

Entropy (Sep 2023)

Affiliations

Shi-Cheng Guo: College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Shang-Kun Liu: College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Jing-Yu Wang: College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Wei-Min Zheng: College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Cheng-Yu Jiang: College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China

DOI: https://doi.org/10.3390/e25091353
Journal volume & issue: Vol. 25, no. 9, p. 1353

Abstract

Recent research has shown that visual–text pretrained models perform well in traditional vision tasks. CLIP, the most influential of these works, has garnered significant attention from researchers. Thanks to its excellent visual representation capabilities, many recent studies have used CLIP for pixel-level tasks. We explore the potential of CLIP in the field of few-shot segmentation. The current mainstream approach is to use support and query features to generate class prototypes and then match the prototype features against image features. We propose a new method that uses CLIP to extract text features for a specific class. These text features are then used as training samples in the model's training process. Adding text features enables the model to extract features that contain richer semantic information, making it easier to capture potential class information. To better match the query image features, we also propose a new prototype generation method that incorporates multi-modal fusion features of text and images in the prototype generation process. Adaptive query prototypes are generated by combining foreground and background information from the images with the multi-modal support prototype, allowing better matching of image features and improved segmentation accuracy. We provide a new perspective on the task of few-shot segmentation in multi-modal scenarios. Experiments demonstrate that our proposed method achieves excellent results on two common datasets, PASCAL-5ⁱ and COCO-20ⁱ.
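
The prototype-matching idea summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the fusion rule or the matching function, so the convex combination in `fuse`, the weight `alpha`, the cosine-similarity matching, and the hand-written 2-D feature vectors (standing in for real backbone and CLIP text features) are all assumptions made for illustration.

```python
import math

def masked_average_pool(features, mask):
    """Average the feature vectors at positions where the mask is 1."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fuse(visual_proto, text_feat, alpha=0.5):
    """Hypothetical fusion rule: convex combination of image and text features."""
    return [alpha * v + (1 - alpha) * t for v, t in zip(visual_proto, text_feat)]

def segment(query_features, fg_proto, bg_proto):
    """Label a pixel 1 when it is more similar to the foreground prototype."""
    return [
        [1 if cosine(vec, fg_proto) > cosine(vec, bg_proto) else 0 for vec in row]
        for row in query_features
    ]

# Toy 2x2 support feature map (H x W x C) with its binary foreground mask.
support = [[[1.0, 0.0], [0.9, 0.1]],
           [[0.0, 1.0], [0.1, 0.9]]]
support_mask = [[1, 1], [0, 0]]
text_feat = [1.0, 0.0]  # stand-in for a CLIP text embedding of the class name

# Foreground prototype: masked average of support features, fused with text.
fg = fuse(masked_average_pool(support, support_mask), text_feat)
# Background prototype: masked average over the inverted mask.
bg = masked_average_pool(support, [[1 - m for m in row] for row in support_mask])

query = [[[0.8, 0.2], [0.2, 0.8]]]  # 1x2 query feature map
pred = segment(query, fg, bg)
print(pred)
```

In this toy setting the first query pixel lies close to the fused foreground prototype and the second lies close to the background prototype, so the predicted mask separates them. A real system would replace the hand-written vectors with learned feature maps and would likely use a learned fusion module rather than a fixed `alpha`.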

Published in Entropy

ISSN: 1099-4300 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Astronomy: Astrophysics; Science: Physics
Website: http://www.mdpi.com/journal/entropy
