IEEE Access (Jan 2023)

RUnT: A Network Combining Residual U-Net and Transformer for Vertebral Edge Feature Fusion Constrained Spine CT Image Segmentation

  • Hao Xu,
  • Xinxin Cui,
  • Chaofan Li,
  • Zhenyu Tian,
  • Jing Liu,
  • Jianlan Yang

DOI
https://doi.org/10.1109/ACCESS.2023.3281468
Journal volume & issue
Vol. 11
pp. 55692 – 55705

Abstract

Scoliosis, spinal deformity, and vertebral spondylolisthesis are spinal disorders with high incidence that seriously affect people's lives and health. CT is an important tool for detecting and diagnosing spinal disorders, and it provides a large amount of pathologically valid information for clinical tasks such as spine pathology assessment and computer-assisted surgical intervention. Because the spine has a long span, follows a complex biological curve, and shows high similarity among its multiple segments in the sagittal plane of CT images, fast and accurate spine segmentation has become an important research direction in computer-aided diagnosis. We propose RUnT, a network that combines a residual U-Net feature extraction network with a Vision Transformer structure for fast and efficient automatic segmentation of multiple vertebrae. Deep vertebral features are first extracted by the residual U-Net, which mitigates vanishing gradients while improving the accuracy of vertebral contour segmentation. The multi-scale feature maps extracted by the residual structure, which contain rich superficial vertebral information, are then fed to the edge segmentation module. We design a vertebral contour feature extraction structure that combines deconvolution and convolution operations on deep features at three different scales to refine the segmentation boundaries and keep the segmentation of each vertebra consistent. Finally, the Transformer-based global information extraction module is combined with the local feature extraction module, so that the global location information of the vertebrae is blended with local features through self-attention feature maps of the multi-scale volume. Mixing edge features with semantic features reduces the semantic confusion that arises from the high similarity between vertebrae when the decoder extracts vertebral features.
The proposed model is evaluated on the public CTSpine1K and VerSe 20 datasets. The results show that it achieves state-of-the-art segmentation performance, with average DSC scores of 88.4% and 81.5% on CTSpine1K and VerSe 20, respectively, while reducing the average HD95 from 4.86 to 3.88.
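
For readers unfamiliar with the DSC metric reported above, the following is a minimal sketch of how a per-label Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), and its average over vertebra labels might be computed with NumPy. The function names `dice_score` and `mean_dice`, the toy label maps, and the choice to skip labels absent from both maps are illustrative assumptions; the abstract does not state the authors' exact averaging protocol.

```python
import numpy as np


def dice_score(pred, target, label):
    """Dice similarity coefficient for one label in two integer label maps."""
    p = pred == label
    t = target == label
    denom = p.sum() + t.sum()
    if denom == 0:
        # Assumption: a label absent from both maps is ignored when averaging.
        return float("nan")
    return 2.0 * np.logical_and(p, t).sum() / denom


def mean_dice(pred, target, labels):
    """Average the per-label Dice scores, skipping labels with no voxels."""
    scores = [dice_score(pred, target, label) for label in labels]
    return float(np.nanmean(scores))


# Toy 2D label maps (0 = background, 1 and 2 = two hypothetical vertebrae).
pred = np.array([[0, 1, 1],
                 [0, 2, 2],
                 [0, 0, 2]])
target = np.array([[0, 1, 1],
                   [0, 1, 2],
                   [0, 0, 2]])

print(mean_dice(pred, target, labels=[1, 2]))
```

In this toy case each label overlaps on 2 voxels out of 5 counted across both maps, so each per-label score is 4/5. The same functions apply unchanged to 3D CT label volumes, since the comparisons and sums are shape-agnostic.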

Keywords