Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models

Albert T. Young; Kristen Fernandez; Jacob Pfau; Rasika Reddy; Nhat Anh Cao; Max Y. von Franque; Arjun Johal; Benjamin V. Wu; Rachel R. Wu; Jennifer Y. Chen; Raj P. Fadadu; Juan A. Vasquez; Andrew Tam; Michael J. Keiser; Maria L. Wei

doi:10.1038/s41746-020-00380-6

npj Digital Medicine (Jan 2021)

Affiliations

Albert T. Young: Dermatology Service, San Francisco VA Health Care System
Kristen Fernandez: Dermatology Service, San Francisco VA Health Care System
Jacob Pfau: Dermatology Service, San Francisco VA Health Care System
Rasika Reddy: Dermatology Service, San Francisco VA Health Care System
Nhat Anh Cao: Dermatology Service, San Francisco VA Health Care System
Max Y. von Franque: Dermatology Service, San Francisco VA Health Care System
Arjun Johal: Dermatology Service, San Francisco VA Health Care System
Benjamin V. Wu: Dermatology Service, San Francisco VA Health Care System
Rachel R. Wu: Dermatology Service, San Francisco VA Health Care System
Jennifer Y. Chen: Dermatology Service, San Francisco VA Health Care System
Raj P. Fadadu: Dermatology Service, San Francisco VA Health Care System
Juan A. Vasquez: Dermatology Service, San Francisco VA Health Care System
Andrew Tam: Dermatology Service, San Francisco VA Health Care System
Michael J. Keiser: Department of Pharmaceutical Chemistry, Department of Bioengineering and Therapeutic Sciences, Institute for Neurodegenerative Diseases, and Bakar Computational Health Sciences Institute, University of California
Maria L. Wei: Dermatology Service, San Francisco VA Health Care System

DOI: https://doi.org/10.1038/s41746-020-00380-6
Journal volume & issue: Vol. 4, no. 1
pp. 1 – 8

Abstract

Artificial intelligence models match or exceed dermatologists in melanoma image classification. Less is known about their robustness against real-world variations, and clinicians may incorrectly assume that a model with an acceptable area under the receiver operating characteristic curve or related performance metric is ready for clinical use. Here, we systematically assessed the performance of dermatologist-level convolutional neural networks (CNNs) on real-world non-curated images by applying computational “stress tests”. Our goal was to create a proxy environment in which to comprehensively test the generalizability of off-the-shelf CNNs developed without training or evaluation protocols specific to individual clinics. We found inconsistent predictions on images captured repeatedly in the same setting or subjected to simple transformations (e.g., rotation). Such transformations resulted in false positive or negative predictions for 6.5–22% of skin lesions across test datasets. Our findings indicate that models meeting conventionally reported metrics need further validation with computational stress tests to assess clinic readiness.
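
The rotation stress test mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0.5 decision threshold, the restriction to 90-degree rotations, and the toy classifier standing in for a trained CNN are all assumptions made here for illustration. Images are represented as nested lists of pixel intensities so the example runs without external libraries.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotated_views(image):
    """Return the image at 0, 90, 180 and 270 degrees."""
    views = [image]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    return views

def is_inconsistent(image, predict_prob, threshold=0.5):
    """True if the binary call changes across rotated copies of one image."""
    calls = {predict_prob(view) >= threshold for view in rotated_views(image)}
    return len(calls) > 1

def flip_rate(images, predict_prob, threshold=0.5):
    """Fraction of lesions whose predicted class flips under rotation."""
    flips = [is_inconsistent(img, predict_prob, threshold) for img in images]
    return sum(flips) / len(flips)

def toy_predict_prob(image):
    """Hypothetical stand-in for a CNN (assumption): scores an image by the
    mean intensity of its top half, so it is deliberately not rotation
    invariant."""
    top = image[: len(image) // 2]
    values = [v for row in top for v in row]
    return sum(values) / len(values)

# One image that is bright only in its top half, one uniformly bright image.
half = [[1.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]
uniform = [[0.9] * 4 for _ in range(4)]
rate = flip_rate([half, uniform], toy_predict_prob)
```

In a real evaluation, `toy_predict_prob` would be replaced by the model under test, and the flip rate would be reported alongside conventional metrics such as the area under the receiver operating characteristic curve.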

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
