Performance deterioration of deep learning models after clinical deployment: a case study with auto-segmentation for definitive prostate cancer radiotherapy

Biling Wang; Michael Dohopolski; Ti Bai; Junjie Wu; Raquibul Hannan; Neil Desai; Aurelie Garant; Daniel Yang; Dan Nguyen; Mu-Han Lin; Robert Timmerman; Xinlei Wang; Steve B Jiang

doi:10.1088/2632-2153/ad580f

Machine Learning: Science and Technology (Jan 2024)

Affiliations

Biling Wang: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, United States of America
Michael Dohopolski: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Ti Bai: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Junjie Wu: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Raquibul Hannan: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Neil Desai: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Aurelie Garant: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Daniel Yang: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Dan Nguyen: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Mu-Han Lin: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Robert Timmerman: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
Xinlei Wang: Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, United States of America; Department of Mathematics, University of Texas at Arlington, Dallas, TX, United States of America
Steve B Jiang: Medical Artificial Intelligence and Automation Laboratory, University of Texas Southwestern Medical Center, Dallas, TX, United States of America; Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, United States of America

DOI: https://doi.org/10.1088/2632-2153/ad580f
Journal volume & issue: Vol. 5, no. 2, p. 025077

Abstract

Our study explores the long-term performance patterns of deep learning (DL) models deployed in the clinic and investigates their efficacy in relation to evolving clinical practices. We conducted a retrospective study simulating the clinical implementation of our DL model, involving 1328 prostate cancer patients treated between January 2006 and August 2022. We trained and validated a U-Net-based auto-segmentation model on data obtained from 2006 to 2011 and tested it on data from 2012 to 2022, simulating the model’s clinical deployment starting in 2012. We visualized the trends in model performance using exponentially weighted moving average (EMA) curves. Additionally, we performed the Wilcoxon rank-sum test to investigate Dice similarity coefficient (DSC) variations across distinct periods, and multiple linear regression to investigate the impact of clinical factors. Initially, from 2012 to 2014, the model showed high performance in segmenting the prostate, rectum, and bladder. Post-2015, a notable decline in EMA DSC was observed for the prostate and rectum, while bladder contours remained stable. Key factors impacting prostate contour quality included physician contouring styles, the use of various hydrogel spacers, CT scan slice thickness, MRI-guided contouring, and intravenous (IV) contrast (p < 0.0001, p < 0.0001, p = 0.0085, p = 0.0012, p < 0.0001, respectively). Rectum contour quality was notably influenced by slice thickness, physician contouring styles, and the use of various hydrogel spacers. Bladder contour quality was primarily affected by IV contrast. The deployed DL model exhibited a substantial decline in performance over time, aligning with the evolving clinical settings.
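The two quantities the abstract tracks, the DSC and its EMA trend, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `dice` and `ema`, the flattened binary-mask representation, and the smoothing factor `alpha` are assumptions made for illustration, and the abstract does not state the smoothing parameter actually used.

```python
from typing import List, Sequence


def dice(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), for two
    flattened binary masks of equal length (hypothetical helper)."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if total == 0 else 2.0 * overlap / total


def ema(values: Sequence[float], alpha: float = 0.1) -> List[float]:
    """Exponentially weighted moving average of a chronologically ordered
    series: s[0] = x[0], s[t] = alpha * x[t] + (1 - alpha) * s[t-1].
    The default alpha is illustrative only."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed: List[float] = []
    for x in values:
        if smoothed:
            smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
        else:
            smoothed.append(x)
    return smoothed


if __name__ == "__main__":
    # Toy per-patient DSC values in treatment order (made-up numbers).
    per_patient_dsc = [0.9, 0.9, 0.5]
    print(ema(per_patient_dsc, alpha=0.5))
```

Applied to per-patient DSC values sorted by treatment date, the smoothed series damps patient-to-patient noise, so a sustained drop such as the one reported after 2015 is easier to see than in the raw scores.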

Published in Machine Learning: Science and Technology

ISSN: 2632-2153 (Online)
Publisher: IOP Publishing
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://iopscience.iop.org/journal/2632-2153
