PLoS ONE (Jan 2024)

Machine-learning-based identification of patients with IgA nephropathy using a computerized medical billing database.

  • Ryoya Tsunoda,
  • Keitaro Kume,
  • Rina Kagawa,
  • Masaru Sanuki,
  • Hiroyuki Kitagawa,
  • Kaori Mase,
  • Kunihiro Yamagata

DOI
https://doi.org/10.1371/journal.pone.0312915
Journal volume & issue
Vol. 19, no. 12
p. e0312915

Abstract

Although the billing database of the universal healthcare system in Japan potentially includes large-cohort data of patients with immunoglobulin A nephropathy, diagnosis codes assigned for billing purposes should not be used directly for clinical research because of the risk of misdiagnosis. To solve this problem, we aimed to develop a novel method for identifying patients with immunoglobulin A nephropathy from billing data using machine learning. The medical records and bills of 3,743 patients who consulted nephrologists at a single center were extracted. Patients were labeled as having been diagnosed with immunoglobulin A nephropathy on the basis of a review of medical records. Diagnostic accuracy was then evaluated both with manually created criteria and with machine learning. For machine learning, the datasets were preprocessed in three patterns and analyzed with XGBoost using five-fold cross-validation. Of all the participants, 437 were labeled as having been diagnosed with immunoglobulin A nephropathy. Billing codes for immunoglobulin A nephropathy had been assigned to approximately half of them. The manually created criteria, consisting of the examinations and treatments recommended in the Japanese guidelines for immunoglobulin A nephropathy, showed sensitivity and specificity both below 0.8. In contrast, in the receiver operating characteristic curve analysis, machine learning yielded area under the curve values over 0.9 when the data were preprocessed from a clinical viewpoint. Applying machine learning to a dataset preprocessed from a clinical viewpoint thus achieved high performance in detecting patients with immunoglobulin A nephropathy. This methodology contributes to the construction of disease-specific cohorts from large-scale billing data.
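The two kinds of performance figures quoted in the abstract (sensitivity and specificity for the rule-based criteria, area under the receiver operating characteristic curve for the classifier) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels, the rule flags, and the scores below are all hypothetical, chosen only to show how such metrics are typically computed from chart-review labels.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for labels coded 1 = disease, 0 = no disease.

    Assumes both classes are present in y_true; otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed cases
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive receives a higher
    score than a randomly chosen negative, with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: chart-review labels for eight patients.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical yes/no output of a hand-written rule.
rule_flags = [1, 0, 1, 0, 1, 0, 0, 0]
# Hypothetical predicted probabilities from a classifier.
model_scores = [0.9, 0.4, 0.8, 0.1, 0.5, 0.2, 0.3, 0.05]

sens, spec = sensitivity_specificity(labels, rule_flags)
auc = roc_auc(labels, model_scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc:.3f}")
```

A binary rule yields a single sensitivity/specificity pair, whereas a classifier that outputs scores can be summarized over all thresholds by the area under the curve, which is why the two approaches are reported with different metrics.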