IEEE Transactions on Neural Systems and Rehabilitation Engineering (Jan 2024)
USCT-UNet: Rethinking the Semantic Gap in U-Net Network From U-Shaped Skip Connections With Multichannel Fusion Transformer
Abstract
Medical image segmentation is a crucial component of computer-aided clinical diagnosis, and state-of-the-art models are often variants of U-Net. Despite their success, the skip connections in these models introduce an unnecessary semantic gap between the encoder and decoder, which hinders their ability to achieve the high precision required for clinical applications. Awareness of this semantic gap and its detrimental effects has grown over time. However, a quantitative understanding of how the gap compromises accuracy and reliability is still lacking, which underscores the need for effective mitigation strategies. In response, we present the first quantitative evaluation of the semantic gap between corresponding layers of U-Net and identify two key characteristics: 1) the direct skip connection (DSC) exhibits a semantic gap that degrades model performance; 2) the magnitude of the semantic gap varies across layers. Based on these findings, we re-examine the issue from the perspective of skip connections. We introduce a Multichannel Fusion Transformer (MCFT) and propose a novel USCT-UNet architecture, which replaces DSC with U-shaped skip connections (USC), allocates different numbers of MCFT blocks to each layer according to the magnitude of its semantic gap, and employs a spatial channel cross-attention (SCCA) module to fuse features from the decoder and the USC. We evaluate USCT-UNet on four challenging datasets, and the results demonstrate that it effectively eliminates the semantic gap. Compared with DSC, our USC and SCCA strategies achieve maximum improvements of 4.79% in the Dice coefficient and 5.70% in mean intersection over union (MIoU), and a maximum reduction of 3.26 in Hausdorff distance.
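The abstract reports results in Dice, MIoU, and Hausdorff distance without defining them. As a reference point, here is a minimal pure-Python sketch of the standard definitions on small 2-D label masks. The function names are hypothetical. MIoU is averaged here over classes present in either mask, and the Hausdorff distance is computed brute force, in pixel units, over all pixels of the target class. The paper's exact evaluation protocol (for example, a 95th-percentile Hausdorff variant, physical voxel spacing, or per-image averaging) is not stated in the abstract and may differ.

```python
import math
from typing import List, Set, Tuple

Mask = List[List[int]]  # 2-D label map; each entry is a class index


def _pixels(mask: Mask, label: int) -> Set[Tuple[int, int]]:
    """Coordinates (row, col) of every pixel carrying the given label."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v == label}


def dice(pred: Mask, target: Mask, label: int = 1) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|); defined as 1.0 when both sets are empty."""
    p, t = _pixels(pred, label), _pixels(target, label)
    if not p and not t:
        return 1.0
    return 2 * len(p & t) / (len(p) + len(t))


def iou(pred: Mask, target: Mask, label: int = 1) -> float:
    """IoU = |P ∩ T| / |P ∪ T|; defined as 1.0 when the union is empty."""
    p, t = _pixels(pred, label), _pixels(target, label)
    union = p | t
    if not union:
        return 1.0
    return len(p & t) / len(union)


def mean_iou(pred: Mask, target: Mask, num_classes: int) -> float:
    """Average IoU over classes 0..num_classes-1 that occur in either mask (assumed convention)."""
    scores = [
        iou(pred, target, k)
        for k in range(num_classes)
        if _pixels(pred, k) | _pixels(target, k)
    ]
    return sum(scores) / len(scores)


def hausdorff(pred: Mask, target: Mask, label: int = 1) -> float:
    """Symmetric Hausdorff distance (Euclidean, in pixels) between the two pixel sets."""
    p, t = _pixels(pred, label), _pixels(target, label)
    if not p or not t:
        return math.inf  # undefined when one mask has no pixels of this label

    def directed(a: Set[Tuple[int, int]], b: Set[Tuple[int, int]]) -> float:
        # For each point in a, distance to its nearest point in b; take the worst case.
        return max(min(math.dist(x, y) for y in b) for x in a)

    return max(directed(p, t), directed(t, p))


if __name__ == "__main__":
    pred = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
    target = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
    print(round(dice(pred, target), 4))         # 0.8571
    print(round(iou(pred, target), 4))          # 0.75
    print(round(mean_iou(pred, target, 2), 4))  # 0.7917
    print(hausdorff(pred, target))              # 1.0
```

Dice and IoU are higher-is-better overlap scores, whereas Hausdorff distance is a lower-is-better boundary error. That is why an improvement in Hausdorff distance corresponds to a reduction in its value.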
Keywords