IEEE Access (Jan 2023)

Improved Image Classification With Token Fusion

  • Keong-Hun Choi,
  • Jin-Woo Kim,
  • Yao Wang,
  • Jong-Eun Ha

DOI
https://doi.org/10.1109/ACCESS.2023.3291597
Journal volume & issue
Vol. 11
pp. 67460 – 67467

Abstract

In this paper, we propose a method to improve image classification performance by fusing CNN and transformer structures. A CNN extracts information about local areas of an image well, but its ability to extract global information is limited. A transformer, on the other hand, has an advantage in extracting global information, but it requires much more memory than a CNN. We apply a CNN to an image and treat the feature vector at each pixel of the resulting feature map as a token. At the same time, the image is divided into patches, and each patch is treated as a token, as in a transformer. The CNN tokens and the transformer tokens have advantages in extracting local and global information, respectively. We hypothesize that combining these two types of tokens yields improved characteristics, and we show this through experiments. We propose three methods for fusing tokens with different characteristics: (1) late token fusion with a parallel structure, (2) early token fusion, and (3) layer-by-layer token fusion. The proposed method shows the best classification performance in experiments using ImageNet-1K.
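
The two kinds of tokens described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor sizes, the random feature map standing in for a CNN output, the random projection matrices standing in for learned layers, and the use of concatenation as the fusion step are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def cnn_feature_map_to_tokens(feature_map):
    """Turn a (C, H, W) feature map into (H*W, C) tokens, one per spatial position."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).T


def image_to_patch_tokens(image, patch_size):
    """Split a (C, H, W) image into non-overlapping patches and flatten each patch."""
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(c, gh, patch_size, gw, patch_size)
    patches = patches.transpose(1, 3, 0, 2, 4)  # (gh, gw, c, p, p)
    return patches.reshape(gh * gw, c * patch_size * patch_size)


def fuse_tokens(cnn_tokens, patch_tokens, dim, rng):
    """Project both token sets to a shared width and concatenate along the token axis.

    Assumption: random matrices stand in for learned linear projections, and
    concatenation stands in for whatever fusion operator the paper uses.
    """
    w_cnn = rng.standard_normal((cnn_tokens.shape[1], dim))
    w_patch = rng.standard_normal((patch_tokens.shape[1], dim))
    return np.concatenate([cnn_tokens @ w_cnn, patch_tokens @ w_patch], axis=0)


rng = np.random.default_rng(0)
image = rng.standard_normal((3, 32, 32))       # hypothetical input image
feature_map = rng.standard_normal((64, 8, 8))  # placeholder for a CNN output

cnn_tokens = cnn_feature_map_to_tokens(feature_map)             # shape (64, 64)
patch_tokens = image_to_patch_tokens(image, patch_size=8)       # shape (16, 192)
fused = fuse_tokens(cnn_tokens, patch_tokens, dim=48, rng=rng)  # shape (80, 48)
```

In this sketch, feeding `fused` to a transformer encoder would loosely correspond to early token fusion. Under the same assumptions, late fusion would process the two token sets in parallel branches and combine them at the end, and layer-by-layer fusion would repeat a combining step at each layer.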

Keywords