Heliyon (May 2024)

Enhancing skin lesion segmentation with a fusion of convolutional neural networks and transformer models

  • Zhijian Xu,
  • Xingyue Guo,
  • Juan Wang

Journal volume & issue
Vol. 10, no. 10
p. e31395

Abstract


Accurate segmentation is crucial for diagnosing and analyzing skin lesions. However, automatic segmentation of skin lesions is extremely challenging because of their variable sizes, uneven color distributions, irregular shapes, hair occlusions, and blurred boundaries. Owing to the limited receptive fields of convolutional networks, shallow convolutions cannot extract the global features of images and thus yield limited segmentation performance. Because medical image datasets are small in scale, excessively deep networks risk overfitting and increased computational complexity. Although transformer networks focus on extracting global information, they cannot capture sufficient local information or accurately segment fine lesion details. In this study, we designed a dual-branch encoder that combines a convolutional neural network (CNN) and a transformer. The CNN branch of the encoder comprises four layers, which learn the local features of images through layer-wise downsampling. The transformer branch also comprises four layers, which learn global image information through attention mechanisms. The feature fusion module in the network integrates local features and global information, emphasizes important channel features through a channel attention mechanism, and filters out irrelevant feature expressions. Finally, skip connections exchange information between the encoder and decoder to recover details lost during downsampling, thereby enhancing segmentation accuracy. The data used in this study come from four public datasets and include images of melanoma, basal cell carcinoma, fibroma, and benign nevus. Because of the limited size of the image data, we augmented them with random horizontal flipping, random vertical flipping, random brightness enhancement, random contrast enhancement, and rotation. Segmentation accuracy was evaluated using the intersection over union (IoU) and Dice coefficient, reaching 87.7 % and 93.21 % on ISIC 2016, 82.05 % and 89.19 % on ISIC 2017, 86.81 % and 92.72 % on ISIC 2018, and 92.79 % and 96.21 % on PH2, respectively (code: https://github.com/hyjane/CCT-Net).
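
To make the fusion step concrete, the following is a minimal PyTorch sketch of concatenation followed by squeeze-and-excitation-style channel attention. It illustrates the general technique the abstract describes, not the authors' released implementation (see the GitHub link above); the module name, channel count, and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Fuse a CNN feature map and a transformer feature map of the same
    spatial size, then re-weight channels with SE-style attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # 1x1 conv merges the concatenated CNN + transformer channels
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Squeeze (global average pool) and excite (gated channel weights)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, cnn_feat: torch.Tensor, trans_feat: torch.Tensor) -> torch.Tensor:
        x = self.merge(torch.cat([cnn_feat, trans_feat], dim=1))
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # emphasize informative channels, damp the rest

# Example: fuse 64-channel local and global features at one encoder stage
fusion = ChannelAttentionFusion(channels=64)
local_feat = torch.randn(2, 64, 56, 56)   # from the CNN branch
global_feat = torch.randn(2, 64, 56, 56)  # from the transformer branch
fused = fusion(local_feat, global_feat)   # -> (2, 64, 56, 56)
```

The sigmoid-gated channel weights let the network emphasize channels carrying useful local or global evidence while suppressing the rest, which matches the filtering role of the fusion module described above.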
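The augmentations listed in the abstract could be assembled as below with torchvision; the probabilities and parameter ranges here are illustrative assumptions, not the paper's settings.

```python
from torchvision import transforms

# Image-level augmentation pipeline (illustrative parameters)
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    # jitter covers the random brightness and contrast enhancement
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(degrees=30),
])
```

Note that for segmentation, the geometric transforms (flips, rotation) must be applied identically to the image and its ground-truth mask, so in practice paired functional transforms are used rather than an image-only pipeline like this one.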
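The two reported metrics can be computed from binary masks as in the sketch below; the function name and epsilon value are assumptions for illustration.

```python
import numpy as np

def iou_and_dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Intersection over union and Dice coefficient for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + target.sum() + eps)
    return iou, dice

# Example: inter = 2, union = 4 -> IoU = 0.5; Dice = 4/6 ≈ 0.667
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(iou_and_dice(pred, target))
```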

Keywords