IEEE Access (Jan 2024)
Fusing Brilliance: Evaluating the Encoder-Decoder Hybrids With CNN and Swin Transformer for Medical Segmentation
Abstract
U-Net has become a standard model for medical image segmentation, alleviating the challenges posed by the costly acquisition and labeling of medical data. The convolutional layer, a fundamental component of U-Net, is renowned for its inductive bias and its efficiency at extracting local features. Building on the success of the Vision Transformer (ViT), the Swin Transformer has emerged as a promising alternative, particularly adept at learning global features through Shifted-Window Attention. Whereas convolution extracts features by filtering each local neighborhood with a fixed kernel, attention dynamically weights inputs according to their relevance. However, adopting Transformers is challenging because of their quadratic computational cost and their need for large amounts of training data. Prior research has shown that appropriately blending different modules can improve performance. In this study, we depart from conventional approaches by categorizing errors in the output rather than relying solely on averaged metrics. Our objective is to establish a methodology that examines how the Transformer operates differently from convolution, analyzing the advantages and disadvantages of each module with respect to the characteristics of the data. To the best of our knowledge, this is the first exploration of combining the Swin Transformer and convolution in both the encoder and decoder stages. Through a comprehensive comparative analysis, we introduce ConjugateUnet, a novel model that balances the strengths and weaknesses of these components. Our proposed approach achieves substantial improvements in the Dice Similarity Coefficient (DSC), rivaling state-of-the-art 2D medical image segmentation models while maintaining a simpler structure.
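The contrast between the two operations can be made concrete with a minimal PyTorch sketch (not taken from the paper; the tensor sizes and module choices are illustrative assumptions): the convolution applies the same fixed kernel to every local neighborhood, while attention recomputes its weights from the input itself over all spatial positions, which is also where the quadratic cost in the number of tokens arises.

```python
# Minimal sketch contrasting a fixed-kernel convolution with content-dependent
# attention weighting. Sizes and modules are illustrative, not ConjugateUnet's.
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 16, 32, 32)           # (batch, channels, height, width)

# Convolution: the same 3x3 kernel filters every local neighborhood,
# independent of the input content (static weights, local inductive bias).
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
local_features = conv(x)

# Attention: weights are computed from the input, so every token can
# aggregate information from every other token (dynamic, global weighting).
tokens = x.flatten(2).transpose(1, 2)     # (batch, 32*32 tokens, 16 channels)
q = k = v = tokens
scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (N x N): quadratic in tokens
global_features = F.softmax(scores, dim=-1) @ v
```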
Keywords