Detecting Spurious Correlations With Sanity Tests for Artificial Intelligence Guided Radiology Systems

Usman Mahmood; Robik Shrestha; David D. B. Bates; Lorenzo Mannelli; Giuseppe Corrias; Yusuf Emre Erdi; Christopher Kanan

doi:10.3389/fdgth.2021.671015

Frontiers in Digital Health (Aug 2021)

Affiliations

Usman Mahmood: Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
Robik Shrestha: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, NY, United States
David D. B. Bates: Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY, United States
Lorenzo Mannelli: Institute of Research and Medical Care (IRCCS) SDN, Institute of Diagnostic and Nuclear Research, Naples, Italy
Giuseppe Corrias: Department of Radiology, University of Cagliari, Cagliari, Italy
Yusuf Emre Erdi: Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
Christopher Kanan: Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, NY, United States

DOI: https://doi.org/10.3389/fdgth.2021.671015
Journal volume & issue: Vol. 3

Abstract

Artificial intelligence (AI) has been successful at solving numerous problems in machine perception. In radiology, AI systems are rapidly evolving and show progress in guiding treatment decisions, diagnosing and localizing disease on medical images, and improving radiologists' efficiency. A critical component of deploying AI in radiology is gaining confidence in a developed system's efficacy and safety. The current gold standard approach is to conduct an analytical validation of performance on a generalization dataset from one or more institutions, followed by a clinical validation study of the system's efficacy during deployment. Clinical validation studies are time-consuming, and best practices dictate limited re-use of analytical validation data, so it is ideal to know ahead of time whether a system is likely to fail analytical or clinical validation. In this paper, we describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons. We illustrate the sanity tests' value by designing a deep learning system to classify pancreatic cancer seen in computed tomography scans.

Published in Frontiers in Digital Health

ISSN: 2673-253X (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Public aspects of medicine; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/digital-health#