Ophthalmology Science (Sep 2024)

Strong versus Weak Data Labeling for Artificial Intelligence Algorithms in the Measurement of Geographic Atrophy

  • Amitha Domalpally, MD, PhD,
  • Robert Slater, PhD,
  • Rachel E. Linderman, PhD,
  • Rohit Balaji,
  • Jacob Bogost,
  • Rick Voland, PhD,
  • Jeong Pak, PhD,
  • Barbara A. Blodi, MD,
  • Roomasa Channa, MD,
  • Donald Fong, MD,
  • Emily Y. Chew, MD

Journal volume & issue
Vol. 4, no. 5
p. 100477

Abstract

Purpose: To understand the data labeling requirements for training deep learning models to measure geographic atrophy (GA) on fundus autofluorescence (FAF) images.

Design: Evaluation of artificial intelligence (AI) algorithms.

Subjects: Age-Related Eye Disease Study 2 (AREDS2) images were used for training and cross-validation, and GA clinical trial images were used for testing.

Methods: Training data consisted of 2 sets of FAF images: 1 with area measurements only and no indication of GA location (Weakly labeled) and 1 with GA segmentation masks (Strongly labeled).

Main Outcome Measures: Bland–Altman plots and scatter plots were used to compare GA area measurements between ground truth and the AI models. The Dice coefficient was used to assess the segmentation accuracy of the Strongly labeled model.

Results: In the cross-validation AREDS2 data set (n = 601), the mean (standard deviation [SD]) GA area measured by the human grader, the Weakly labeled AI model, and the Strongly labeled AI model was 6.65 (6.3) mm², 6.83 (6.29) mm², and 6.58 (6.24) mm², respectively. The mean difference between ground truth and AI was 0.18 mm² (95% confidence interval [CI], −7.57 to 7.92) for the Weakly labeled model and −0.07 mm² (95% CI, −1.61 to 1.47) for the Strongly labeled model. In the GlaxoSmithKline testing data set (n = 156), the mean (SD) GA area was 9.79 (5.6) mm², 8.82 (4.61) mm², and 9.55 (5.66) mm² for the human grader, the Strongly labeled AI model, and the Weakly labeled AI model, respectively. The mean difference between ground truth and AI for the 2 models was −0.97 mm² (95% CI, −4.36 to 2.41) and −0.24 mm² (95% CI, −4.98 to 4.49), respectively. The Dice coefficient was 0.99 for intergrader agreement, 0.89 for the cross-validation data, and 0.92 for the testing data.

Conclusions: Deep learning models can achieve reasonable accuracy even with Weakly labeled data. Training methods that combine large volumes of Weakly labeled images with a small number of Strongly labeled images offer a promising solution to the cost and time burden of data labeling.

Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
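For readers unfamiliar with the 2 outcome measures, the sketch below shows how the Dice coefficient and the Bland–Altman statistics reported above are conventionally computed, and how a strong label (a pixel mask) implies the corresponding weak label (an area in mm²) once the pixel scale is known. This is an illustrative sketch only, not the authors' code; the function names, the pixel-scale conversion, and the 1.96 × SD convention for the 95% limits of agreement are assumptions.

    import numpy as np

    def dice_coefficient(mask_a, mask_b):
        # Dice similarity between two binary segmentation masks:
        # 2 * |A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty.
        mask_a = np.asarray(mask_a, dtype=bool)
        mask_b = np.asarray(mask_b, dtype=bool)
        total = mask_a.sum() + mask_b.sum()
        if total == 0:
            return 1.0
        return 2.0 * np.logical_and(mask_a, mask_b).sum() / total

    def mask_area_mm2(mask, mm_per_pixel):
        # A strong label (pixel mask) determines the weak label (GA area in mm²)
        # given the image scale; the reverse does not hold.
        return np.asarray(mask, dtype=bool).sum() * mm_per_pixel ** 2

    def bland_altman_stats(ground_truth, predicted):
        # Mean difference and 95% limits of agreement (mean ± 1.96 SD),
        # the quantities summarized in the Bland–Altman comparisons above.
        diff = np.asarray(predicted, dtype=float) - np.asarray(ground_truth, dtype=float)
        mean_diff = diff.mean()
        sd = diff.std(ddof=1)
        return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd

    # Example with made-up GA areas (mm²) from a grader and an AI model.
    grader = [6.1, 8.4, 2.3, 11.0]
    model = [6.0, 8.9, 2.1, 10.5]
    print(bland_altman_stats(grader, model))

Because a mask always yields an area but an area never yields a mask, strongly labeled data can supervise both tasks, whereas weakly labeled data can supervise only the area measurement; this asymmetry is what makes the mixed-labeling strategy in the Conclusions attractive.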
