IEEE Access (Jan 2023)
CA-CMT: Coordinate Attention for Optimizing CMT Networks
Abstract
Vision Transformer (ViT) has been proposed as a new image recognition method in computer vision. ViT applies the Transformer architecture, which performs very well in natural language processing, to image recognition. Unlike existing Convolutional Neural Network (CNN) models, ViT achieves State-Of-The-Art (SOTA) image recognition without building inductive biases into the model, demonstrating that the Transformer is a useful structure for computer vision. However, ViT requires large datasets such as ImageNet-21K and Joint Foto Tree (JFT) for training, and training takes a long time. In addition, location information is lost because images are input in patch units. Many models have been proposed to address these issues. In this paper, we propose a new model that restructures the Convolutional neural networks Meet vision Transformers (CMT) model by applying the Coordinate Attention block, an attention module designed for CNNs, to address these problems of the Vision Transformer family of models. The proposed model combines the Transformer, which excels at capturing long-range dependencies, with the CNN, which excels at extracting local features, to achieve higher performance than existing models. To make the evaluation accessible to other researchers, we compared the proposed model with existing models on relatively small datasets: Canadian Institute For Advanced Research-10 (CIFAR-10), Self-Taught Learning-10 (STL-10), and Tiny-ImageNet. Despite being restructured from the smallest CMT-Tiny model, the proposed model was generally more accurate than the CMT-Tiny, CMT-XS, CMT-S, and CMT-B models on the CIFAR-10, STL-10, and Tiny-ImageNet datasets. On CIFAR-10, it reached an accuracy of 90.21%, higher than every existing CMT model except CMT-S (90.6%), and it had the lowest loss value, 0.3967.
The proposed model is expected to serve as a backbone for object detection and segmentation in the future.
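As a rough illustration of how a Coordinate Attention block can retain location information, the following minimal sketch pools a feature map separately along its rows and its columns and uses the two results as position-dependent gates. It is not the proposed CA-CMT model: the function name `coordinate_attention`, the single-channel input, and the toy feature map are assumptions made for illustration, and the learned 1x1 convolutions, normalization, and channel reduction of the full block are replaced here by an identity mapping.

```python
import math


def sigmoid(v):
    """Logistic function used to turn pooled values into gates in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x):
    """Simplified, single-channel coordinate attention (illustrative only).

    x is a feature map given as a list of H rows, each holding W floats.
    """
    h, w = len(x), len(x[0])

    # Directional average pooling: one descriptor per row (height axis)
    # and one descriptor per column (width axis).
    row_mean = [sum(row) / w for row in x]
    col_mean = [sum(x[i][j] for i in range(h)) / h for j in range(w)]

    # Assumption: the learned transforms of the full block are omitted,
    # so each gate is simply the sigmoid of the pooled value.
    gate_h = [sigmoid(v) for v in row_mean]
    gate_w = [sigmoid(v) for v in col_mean]

    # Each position (i, j) is rescaled by the gate of its row and its column.
    return [[x[i][j] * gate_h[i] * gate_w[j] for j in range(w)]
            for i in range(h)]


# Hypothetical 3x3 feature map with a single strong activation in the centre.
feature_map = [
    [0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
]
out = coordinate_attention(feature_map)
```

Because the gates are indexed by row and by column, the weight applied at position (i, j) depends on where that position lies in the map. This is the kind of positional cue that is weakened when an image is split into patches and flattened.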
Keywords