Diagnostics (May 2024)

Improving the Generalizability and Performance of an Ultrasound Deep Learning Model Using Limited Multicenter Data for Lung Sliding Artifact Identification

  • Derek Wu,
  • Delaney Smith,
  • Blake VanBerlo,
  • Amir Roshankar,
  • Hoseok Lee,
  • Brian Li,
  • Faraz Ali,
  • Marwan Rahman,
  • John Basmaji,
  • Jared Tschirhart,
  • Alex Ford,
  • Bennett VanBerlo,
  • Ashritha Durvasula,
  • Claire Vannelli,
  • Chintan Dave,
  • Jason Deglint,
  • Jordan Ho,
  • Rushil Chaudhary,
  • Hans Clausdorff,
  • Ross Prager,
  • Scott Millington,
  • Samveg Shah,
  • Brian Buchanan,
  • Robert Arntfield

DOI
https://doi.org/10.3390/diagnostics14111081
Journal volume & issue
Vol. 14, no. 11
p. 1081

Abstract

Deep learning (DL) models for medical image classification frequently struggle to generalize to data from outside institutions. Additional clinical data are also rarely collected to comprehensively assess and understand model performance amongst subgroups. Following the development of a single-center model to identify the lung sliding artifact on lung ultrasound (LUS), we pursued a validation strategy using external LUS data. Because annotated LUS data are scarce relative to other medical imaging data, we adopted a novel technique to optimize the use of limited external data to improve model generalizability. Externally acquired LUS data from three tertiary care centers, totaling 641 clips from 238 patients, were used to assess the baseline generalizability of our lung sliding model. We then employed our novel Threshold-Aware Accumulative Fine-Tuning (TAAFT) method to fine-tune the baseline model and determine the minimum amount of data required to achieve predefined performance goals. A subgroup analysis was also performed, and Grad-CAM++ explanations were examined. The final model was fine-tuned on one-third of the external dataset to achieve 0.917 sensitivity, 0.817 specificity, and 0.920 area under the receiver operating characteristic curve (AUC) on the external validation dataset, exceeding our predefined performance goals. Subgroup analyses identified the LUS characteristics that most challenged the model’s performance. Grad-CAM++ saliency maps highlighted clinically relevant regions on M-mode images. We report a multicenter study that exploits limited available external data to improve the generalizability and performance of our lung sliding model while identifying poorly performing subgroups to inform future iterative improvements. This approach may contribute to efficiencies for DL researchers working with smaller quantities of external validation data.
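The abstract does not spell out the TAAFT procedure, so the following Python sketch illustrates only the general idea it describes: add external data to the fine-tuning pool in increments and stop at the smallest amount that meets predefined performance goals. It is not the authors' implementation. The function names (`sensitivity_specificity`, `accumulate_until_goals`), the `fine_tune` and `evaluate` callables, and the one-fold-at-a-time schedule are all hypothetical assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def accumulate_until_goals(
    folds: Sequence[Sequence[Any]],
    fine_tune: Callable[[List[Any]], Any],
    evaluate: Callable[[Any], Dict[str, float]],
    goals: Dict[str, float],
) -> Tuple[Optional[int], Dict[str, float]]:
    """Grow the fine-tuning pool one fold at a time (assumed schedule).

    Returns the number of folds used when every goal is first met, together
    with the metrics at that point. If the goals are never met, returns
    (None, metrics from the last attempt).
    """
    pool: List[Any] = []
    metrics: Dict[str, float] = {}
    for n_used, fold in enumerate(folds, start=1):
        pool.extend(fold)
        # Hypothetical callable: fine-tunes the baseline model on the pool so far.
        model = fine_tune(pool)
        # Hypothetical callable: scores the model on held-out external data.
        metrics = evaluate(model)
        if all(metrics.get(name, 0.0) >= target for name, target in goals.items()):
            return n_used, metrics
    return None, metrics
```

In practice, `fine_tune` would wrap the training routine for the baseline model and `evaluate` would compute metrics such as sensitivity, specificity, and AUC on an external validation set that is kept separate from the fine-tuning folds. The goal values passed in `goals` are set by the user; the abstract does not state the thresholds the study used.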

Keywords