Jisuanji kexue yu tansuo (Feb 2024)

Survey on Visual Transformer for Image Classification

  • PENG Bin, BAI Jing, LI Wenjing, ZHENG Hu, MA Xiangyu

DOI
https://doi.org/10.3778/j.issn.1673-9418.2310092
Journal volume & issue
Vol. 18, no. 2
pp. 320–344

Abstract

The Transformer is a deep learning model built on the self-attention mechanism that has shown tremendous potential in computer vision. In image classification, the key challenge lies in capturing both the local and the global features of an input image efficiently and accurately. Traditional approaches rely on convolutional neural networks, which extract local features in the lower layers and expand the receptive field through stacked convolutional layers to obtain global features. However, this strategy aggregates information over relatively short distances, making it difficult to model long-range dependencies. In contrast, the self-attention mechanism of the Transformer directly compares features across all spatial positions, capturing dependencies at both the local and the global level and exhibiting stronger global modeling capability. A thorough examination of the challenges the Transformer faces in image classification tasks is therefore warranted. Taking the Vision Transformer as an example, this paper first reviews the core principles and architecture of the Transformer in detail. It then focuses on image classification, summarizing the key issues and recent advances in visual Transformer research concerning performance enhancement, computational cost, and training optimization. Applications of the Transformer in specific domains such as medical, remote sensing, and agricultural images are also surveyed, highlighting its versatility and generality. Finally, a comprehensive analysis of research progress on visual Transformers for image classification is presented, offering insights into future directions for their development.
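To make the contrast with convolution concrete, the following is a minimal NumPy sketch of standard single-head scaled dot-product self-attention over a sequence of patch tokens, as used in the Vision Transformer. It is an illustration of the general mechanism, not code from the surveyed paper; the token count, embedding width, and weight matrices are all assumed for the example. The point to notice is that the attention matrix compares every position with every other position in a single step, so long-range dependencies do not require stacking layers to grow a receptive field.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (illustrative).

    tokens: (n, d) patch embeddings. Each output token is a weighted
    sum over all n positions, so dependencies of any spatial range
    are modeled in one step, unlike a local convolution.
    """
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d_k = Q.shape[-1]
    # (n, n) attention matrix: pairwise comparison of all positions.
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V

# Toy setup (assumed sizes): a 64x64 image split into 16x16 patches
# yields n = 16 tokens; d = 32 is an arbitrary embedding width.
rng = np.random.default_rng(0)
n, d = 16, 32
tokens = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (16, 32): one updated embedding per patch
```

By comparison, a convolution with a 3×3 kernel mixes only a 3×3 neighborhood per layer; the all-pairs attention matrix above is what gives the Transformer its global modeling capability, at the cost of computation quadratic in the number of tokens.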

Keywords