Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study

Ayooluwatomiwa I Oloruntoba; Tine Vestergaard; Toan D Nguyen; Zhen Yu; Maithili Sashindranath; Brigid Betz-Stablein; H Peter Soyer; Zongyuan Ge; Victoria Mar

doi:10.2196/35150

JMIR Dermatology (Sep 2022)

Journal volume & issue: Vol. 5, no. 3, p. e35150

Abstract

Background: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization.

Objective: The aim of our study was to use CNN models with the same architecture, trained on image sets acquired with either the same image capture device and technique (standardized) or with varied devices and capture techniques (nonstandardized), and test variability in performance when classifying skin cancer images in different populations.

Methods: In all, 3 CNNs with the same architecture were trained. CNN nonstandardized (CNN-NS) was trained on 25,331 images taken from the International Skin Imaging Collaboration (ISIC) using different image capture devices. CNN standardized (CNN-S) was trained on 177,475 MoleMap images taken with the same capture device, and CNN standardized number 2 (CNN-S2) was trained on a subset of 25,331 standardized MoleMap images (matched for number and classes of training images to CNN-NS). These 3 models were then tested on 3 external test sets: 569 Danish images, the publicly available ISIC 2020 data set consisting of 33,126 images, and The University of Queensland (UQ) data set of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish data set were used to determine model performance compared to teledermatologists.

Results: When tested on the 569 Danish images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889) and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; standardized models), with both outperforming CNN-NS (nonstandardized model; P=.001 and P=.009, respectively), which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When tested on 2 additional data sets (ISIC 2020 and UQ), CNN-S (P<.001 and P<.001, respectively) and CNN-S2 (P=.08 and P=.35, respectively) still outperformed CNN-NS. When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set, the models’ resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (sensitivity: P=.10; specificity: P=.053). Performance across all CNN models as well as teledermatologists was influenced by image quality.

Conclusions: CNNs trained on standardized images had improved performance and, therefore, greater generalizability in skin cancer classification when applied to unseen data sets. This finding is an important consideration for future algorithm development, regulation, and approval.
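The primary outcome measures named in the abstract (sensitivity, specificity, and AUROC with a 95% CI) can be illustrated with a minimal Python sketch. This is not the authors' code: the abstract does not say how the confidence intervals were computed, so the percentile bootstrap used here is an assumption, and the function names, the 0.5 threshold, and the toy labels and scores are all hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC (assumed method, not from the paper)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only one class; AUROC is undefined there.
        if 0 < sum(ys) < n:
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: 1 = malignant, 0 = benign; scores are model outputs.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9]

print(auroc(toy_labels, toy_scores))
print(sensitivity_specificity(toy_labels, toy_scores, threshold=0.5))
print(bootstrap_auroc_ci(toy_labels, toy_scores))
```

Comparing a model with teledermatologists, as described in the Results, would amount to choosing the threshold at which the model's sensitivity (or specificity) equals the clinicians' mean value and then reading off the other measure. The abstract does not give the exact matching procedure, so it is left out of the sketch.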

Published in JMIR Dermatology

ISSN: 2562-0959 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Medicine: Dermatology
Website: https://derma.jmir.org/
