IEEE Access (Jan 2022)

Improving Vision Transformers to Learn Small-Size Dataset From Scratch

  • Seunghoon Lee,
  • Seunghyun Lee,
  • Byung Cheol Song

DOI: https://doi.org/10.1109/ACCESS.2022.3224044
Journal volume & issue: Vol. 10, pp. 123212–123224

Abstract

This paper proposes several techniques that help a Vision Transformer (ViT) learn small-size datasets from scratch successfully. ViT, which applied the transformer structure to the image classification task, has recently outperformed convolutional neural networks. However, the high performance of ViT results from pre-training on large-size datasets, and this dependence on large datasets stems from its low locality inductive bias. In addition, conventional ViT cannot effectively attend to the target class because its rather high, constant temperature factor produces redundant attention. To improve the locality inductive bias of ViT, this paper proposes a novel tokenization (Shifted Patch Tokenization: SPT) using shifted patches and a position encoding (CoordConv Position Encoding: CPE) using $1 \times 1$ CoordConv. Also, to improve the poor attention, we propose a new self-attention mechanism (Locality Self-Attention: LSA) based on a learnable temperature and self-relation masking. SPT, CPE, and LSA are intuitive techniques, yet they successfully improve the performance of ViT even on small-size datasets. We qualitatively show that each technique attends to more important areas and contributes to a flatter loss landscape. Moreover, the proposed techniques are generic add-on modules applicable to various ViT backbones. Our experiments show that, when learning Tiny-ImageNet from scratch, the proposed scheme based on SPT, CPE, and LSA increases the accuracy of ViT backbones by +3.66 on average and by up to +5.7. Also, the performance improvements of ViT backbones in ImageNet-1K classification, learning on COCO from scratch, and transfer learning on classification datasets verify that the proposed method generalizes well.
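The abstract describes LSA as standard self-attention modified by a learnable temperature and self-relation masking. Below is a minimal PyTorch sketch of that idea, not the authors' official code: the fixed $\sqrt{d_k}$ scale is replaced by a learnable temperature parameter, and the diagonal of the similarity matrix (each token's relation to itself) is masked before the softmax. Module and parameter names (`LocalitySelfAttention`, `temperature`) are illustrative choices, and the initialization of the temperature to $\sqrt{d_k}$ is an assumption.

```python
# Minimal sketch of Locality Self-Attention (LSA) as described in the abstract:
# scaled dot-product attention with (1) a learnable temperature instead of a
# fixed sqrt(d_k) scale and (2) self-relation (diagonal) masking.
import math
import torch
import torch.nn as nn

class LocalitySelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable temperature; initialized to sqrt(d_k) (assumption).
        self.temperature = nn.Parameter(torch.tensor(math.sqrt(self.head_dim)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # Token-to-token similarities divided by the learnable temperature.
        attn = (q @ k.transpose(-2, -1)) / self.temperature

        # Self-relation masking: suppress the diagonal so softmax mass is
        # redistributed to relations between *different* tokens.
        diag = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(diag, float('-inf'))

        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

As a usage example, `LocalitySelfAttention(dim=192, num_heads=3)(torch.randn(2, 65, 192))` processes a batch of 65 tokens (e.g., 64 patch tokens plus a class token) and returns a tensor of the same shape; in a ViT backbone this module would simply replace the standard multi-head self-attention block.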

Keywords