IEEE Access (Jan 2023)

CA-CMT: Coordinate Attention for Optimizing CMT Networks

  • Ji-Hyeon Bang
  • Sung-Wook Park
  • Jun-Yeong Kim
  • Jun Park
  • Jun-Ho Huh
  • Se-Hoon Jung
  • Chun-Bo Sim

DOI
https://doi.org/10.1109/ACCESS.2023.3297206
Journal volume & issue
Vol. 11
pp. 76691–76702

Abstract

Vision Transformer (ViT) has been proposed as a new image recognition method in the field of computer vision. ViT applies the Transformer structure, which has shown excellent performance in natural language processing, to image recognition. Unlike existing Convolutional Neural Network (CNN) models, ViT can achieve State-Of-The-Art (SOTA) image recognition without building inductive biases into the model, demonstrating that the Transformer is a useful structure in computer vision. However, ViT requires very large datasets such as ImageNet-21K and Joint Foto Tree (JFT) for training, and training takes a long time. Moreover, positional information is lost because images are input in patch units. Many models have been proposed to address these issues. In this paper, a new model is proposed by restructuring the Convolutional neural networks Meet vision Transformers (CMT) model with the Coordinate Attention block, a CNN-based attention module, to improve on the problems of the Vision Transformer family of models. The proposed model combines the Transformer, which excels at modeling long-range dependencies, with the CNN, which excels at extracting local features, to achieve higher performance than existing models. We also compared the performance of the proposed model with that of existing models on relatively small datasets, namely Canadian Institute For Advanced Research-10 (CIFAR-10), Self-Taught Learning-10 (STL-10), and Tiny-ImageNet, to make the evaluation easier for researchers to access and reproduce. Despite being restructured from the smallest variant, CMT-Tiny, the proposed model showed better overall accuracy than the CMT-Tiny, CMT-XS, CMT-S, and CMT-B models on the CIFAR-10, STL-10, and Tiny-ImageNet datasets. On CIFAR-10 it reached an accuracy of 90.21%, higher than all existing CMT models except CMT-S (90.6%), and recorded the lowest loss value, 0.3967. The proposed model is expected to be utilized as a backbone for Object Detection and Segmentation in the future.
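For readers unfamiliar with the attention module named in the abstract, the sketch below outlines a Coordinate Attention block (Hou et al., 2021) in PyTorch. It is a minimal illustration, not the authors' released code: the class name, the reduction ratio of 32, and the Hardswish activation are assumptions based on the original Coordinate Attention paper, and how the block is wired into the CMT stages is not shown here.

    import torch
    import torch.nn as nn

    class CoordinateAttention(nn.Module):
        """Coordinate Attention: factorizes global pooling into two 1-D
        pools along height and width, so the resulting attention maps
        retain positional information that plain channel attention
        (e.g., Squeeze-and-Excitation) discards."""

        def __init__(self, channels, reduction=32):
            super().__init__()
            mid = max(8, channels // reduction)
            self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B, C, H, 1)
            self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B, C, 1, W)
            self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
            self.bn = nn.BatchNorm2d(mid)
            self.act = nn.Hardswish()
            self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
            self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

        def forward(self, x):
            b, c, h, w = x.shape
            x_h = self.pool_h(x)                          # (B, C, H, 1)
            x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
            # Encode both directions jointly, then split back.
            y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
            y_h, y_w = torch.split(y, [h, w], dim=2)
            a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
            # Broadcast the two directional attention maps over the input.
            return x * a_h * a_w

    # Example: apply the block to a 64-channel feature map.
    x = torch.randn(2, 64, 32, 32)
    att = CoordinateAttention(64)
    print(att(x).shape)  # torch.Size([2, 64, 32, 32])

The two 1-D pools are the relevant design point for this paper: each attention map keeps exact coordinates along one spatial axis, which speaks directly to the abstract's concern that patch-based ViT inputs lose location information.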

Keywords