IEEE Access (Jan 2024)

FSwin Transformer: Feature-Space Window Attention Vision Transformer for Image Classification

  • Dayeon Yoo,
  • Jeesu Kim,
  • Jinwoo Yoo

DOI
https://doi.org/10.1109/ACCESS.2024.3394539
Journal volume & issue
Vol. 12
pp. 72598–72606

Abstract

The vision transformer (ViT) with global self-attention has quadratic computational complexity in the image size. To address this issue, window-based self-attention ViTs limit the attention area to local windows, reducing the computational cost. However, they cannot effectively capture relationships between windows. The Swin Transformer, a representative window-based self-attention ViT, introduces shifted-window multi-head self-attention (SW-MSA) to capture cross-window information. However, SW-MSA groups tokens that are spatially close in the image into one window and thus cannot capture relationships between distant tokens. Therefore, this paper introduces a feature-space window attention transformer (FSwin Transformer) that places distant but similar tokens in the same window. The proposed FSwin Transformer clusters similar tokens in feature space and conducts self-attention within each cluster. This approach helps the model understand the global context of the image by recovering interactions between long-distance tokens that cannot be captured when windows are defined in image space. In addition, we incorporate a feature-space refinement method with channel and spatial attention to emphasize informative regions and suppress non-essential ones. The refined feature map improves the representation power of the model, resulting in better classification performance. Consequently, on ImageNet-1K classification, the FSwin Transformer outperforms existing Transformer-based backbones, including the Swin Transformer.
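To make the feature-space windowing idea concrete, the sketch below groups tokens by similarity rather than spatial position before applying windowed self-attention. It is a minimal illustration, not the authors' implementation: the grouping here sorts tokens by a learned scalar score and partitions the sorted sequence into equal-sized windows, whereas the paper describes feature-space clustering; the class and parameter names (FeatureSpaceWindowAttention, window_size, score) are hypothetical.

# Hypothetical sketch of feature-space window attention (not the authors' code).
# Tokens are grouped by similarity in feature space (here: sorted by a learned
# scalar score) instead of by spatial position, then self-attention is applied
# within each equal-sized group and results are scattered back to the original order.
import torch
import torch.nn as nn


class FeatureSpaceWindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.window_size = window_size
        self.score = nn.Linear(dim, 1)  # projects each token to a similarity score
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N divisible by window_size (pad beforehand otherwise)
        B, N, C = x.shape
        order = self.score(x).squeeze(-1).argsort(dim=1)  # (B, N) feature-space ordering
        x_sorted = torch.gather(x, 1, order.unsqueeze(-1).expand(-1, -1, C))

        # Partition the reordered tokens into windows of mutually similar tokens.
        w = x_sorted.reshape(B * N // self.window_size, self.window_size, C)
        w, _ = self.attn(w, w, w)  # self-attention within each feature-space window
        x_sorted = w.reshape(B, N, C)

        # Scatter tokens back to their original spatial positions.
        out = torch.empty_like(x_sorted)
        out.scatter_(1, order.unsqueeze(-1).expand(-1, -1, C), x_sorted)
        return out


# Example: 3136 tokens (a 56x56 grid) of dimension 96, grouped into windows of 49 similar tokens.
tokens = torch.randn(2, 56 * 56, 96)
layer = FeatureSpaceWindowAttention(dim=96, num_heads=3, window_size=49)
print(layer(tokens).shape)  # torch.Size([2, 3136, 96])

In this toy version the per-window cost is the same as in image-space window attention; the only change is which tokens share a window, which is what lets distant but similar tokens interact.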

Keywords