IEEE Access (Jan 2025)
IMViT: Adjacency Matrix-Based Lightweight Plain Vision Transformer
Abstract
Transformers are becoming the dominant deep learning backbones for both computer vision and natural language processing. While extensive experiments demonstrate their outstanding capacity at large scale, small transformers remain inferior to convolutional neural networks on various downstream tasks because they lack the inductive biases that benefit image understanding. Hierarchical vision transformers form a large family of designs aimed at better efficiency in computer vision, but capturing global dependencies within them often requires complex architectures. This paper proposes a non-hierarchical transformer network that captures both long-range and short-range dependencies, as hierarchical transformers do, while maintaining strong performance at small model sizes. First, we discard the framework of progressively reduced feature maps and instead design two separate stages, i.e., a multi-scale feature preparatory stage and a multi-scale feature perception stage: the first stage uses a lightweight multi-branch structure to extract multi-scale features, and the second leverages a non-hierarchical network to learn semantic information for downstream tasks. Second, we design a multi-receptive attention and interaction mechanism that perceives global and local correlations of the image in every transformer block, enabling effective feature learning in small networks. Extensive experiments show that the proposed lightweight IMViT-B outperforms DeiT III: trained for 300 epochs, IMViT-B achieves a top-1 accuracy of 82.8% on ImageNet-1K with only 26M parameters, surpassing DeiT III-S (800 epochs) by 1.4% at a similar parameter count and computation cost. Code is available at https://github.com/LQchen1/IMViT.
Keywords