A data-driven framework for identifying patient subgroups on which an AI/machine learning model may underperform

Adarsh Subbaswamy; Berkman Sahiner; Nicholas Petrick; Vinay Pai; Roy Adams; Matthew C. Diamond; Suchi Saria

doi:10.1038/s41746-024-01275-6

npj Digital Medicine (Nov 2024)

Affiliations

Adarsh Subbaswamy: Department of Computer Science, Johns Hopkins University
Berkman Sahiner: Center for Devices and Radiological Health, U.S. Food and Drug Administration
Nicholas Petrick: Center for Devices and Radiological Health, U.S. Food and Drug Administration
Vinay Pai: Center for Devices and Radiological Health, U.S. Food and Drug Administration
Roy Adams: Department of Psychiatry and Behavioral Science, Johns Hopkins School of Medicine
Matthew C. Diamond: Center for Devices and Radiological Health, U.S. Food and Drug Administration
Suchi Saria: Department of Computer Science, Johns Hopkins University

DOI: https://doi.org/10.1038/s41746-024-01275-6
Journal volume & issue: Vol. 7, no. 1, pp. 1–11

Abstract

A fundamental goal of evaluating a clinical model is to ensure that it performs well across a diverse intended patient population. A primary challenge is that the data used in model development and testing often consist of many overlapping, heterogeneous patient subgroups that may not be explicitly defined or labeled. While a model’s average performance on a dataset may be high, its performance can be significantly lower for certain subgroups, and such gaps may be hard to detect. We describe an algorithmic framework for identifying subgroups with potential performance disparities (AFISP), which produces a set of interpretable phenotypes corresponding to subgroups for which the model’s performance may be relatively lower. This could allow model evaluators, including developers and users, to identify possible failure modes prior to wide-scale deployment. We illustrate AFISP by applying it to a patient deterioration model to detect significant subgroup performance disparities, and show that AFISP is significantly more scalable than existing algorithmic approaches.

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
