IEEE Access (Jan 2024)

ViTFSL-Baseline: A Simple Baseline of Vision Transformer Network for Few-Shot Image Classification

  • Guangpeng Wang,
  • Yongxiong Wang,
  • Zhiqun Pan,
  • Xiaoming Wang,
  • Jiapeng Zhang,
  • Jiayun Pan

DOI
https://doi.org/10.1109/ACCESS.2024.3356187
Journal volume & issue
Vol. 12
pp. 11836–11849

Abstract

Few-shot image classification, whose goal is to generalize to unseen tasks with scarce labeled data, has developed rapidly in recent years. However, traditional few-shot learning methods built on CNNs may lose non-local features and long-range dependencies of the image, which leads to poor generalization of the trained model. Exploiting the self-attention mechanism of the Transformer, researchers have recently tried to use vision transformers to improve few-shot learning. However, these methods are complicated and consume substantial computing resources, and there is no baseline against which to measure their effectiveness. We propose a new method called ViTFSL-baseline. We take advantage of the vision transformer and train our model on the whole training set without episodic training. Meanwhile, we design a new nearest-neighbor classifier for few-shot image classification. Furthermore, to reduce intra-class variation, we introduce centroid calibration into the classifier after feature extraction by the backbone. Experiments on popular benchmarks show that our method is simple and effective for few-shot image classification. Our approach can serve as a baseline upon vision transformers for few-shot learning.
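To make the pipeline described in the abstract concrete, below is a minimal sketch of a nearest-centroid few-shot classifier applied to features from a pretrained backbone. The abstract does not specify the exact form of the centroid calibration; the rule used here (centering all embeddings by the task-wide mean before building class centroids, then L2-normalizing) is an illustrative assumption, and all function and variable names are hypothetical.

```python
# Sketch of nearest-centroid few-shot classification with a simple
# centroid calibration step. The calibration rule (task-mean centering
# plus L2 normalization) is an assumption for illustration only; the
# paper's exact calibration is not given in the abstract.
import numpy as np

def classify_episode(support_feats, support_labels, query_feats, n_way):
    """Classify one N-way K-shot episode.

    support_feats: (N*K, D) embeddings from a frozen backbone (e.g., a ViT)
    support_labels: (N*K,) integer class labels in [0, n_way)
    query_feats: (Q, D) query embeddings
    Returns predicted labels of shape (Q,).
    """
    # Calibration (assumed): subtract the task-wide mean to reduce
    # intra-class spread, then L2-normalize each embedding.
    task_mean = support_feats.mean(axis=0, keepdims=True)

    def calibrate(x):
        x = x - task_mean
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    s = calibrate(support_feats)
    q = calibrate(query_feats)

    # Class centroids: mean of each class's calibrated support embeddings.
    centroids = np.stack(
        [s[support_labels == c].mean(axis=0) for c in range(n_way)]
    )

    # Nearest-neighbor decision over centroids via cosine similarity.
    sims = q @ centroids.T  # (Q, n_way)
    return sims.argmax(axis=1)
```

Because the backbone is trained once on the whole base training set rather than episodically, inference on a new task reduces to this cheap centroid computation and similarity lookup, which is what makes the approach suitable as a baseline.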

Keywords