Journal of Health Economics and Outcomes Research (Feb 2019)

Applying Machine Learning Techniques to Identify Undiagnosed Patients with Exocrine Pancreatic Insufficiency

  • Bruce Pyenson,
  • Maggie Alston,
  • Jeffrey Gomberg,
  • Feng Han,
  • Nikhil Khandelwal,
  • Motoharu Dei,
  • Monica Son,
  • Jaime Vora

Journal volume & issue
Vol. 6, no. 2

Abstract

Read online

**Background:** Exocrine pancreatic insufficiency (EPI) is a serious condition characterized by a lack of functional exocrine pancreatic enzymes and the resultant inability to properly digest nutrients. EPI can be caused by a variety of disorders, including chronic pancreatitis, pancreatic cancer, and celiac disease. EPI remains underdiagnosed because of the nonspecific nature of clinical symptoms, lack of an ideal diagnostic test, and the inability to easily identify affected patients using administrative claims data. **Objectives:** To develop a machine learning model that identifies patients in a commercial medical claims database who likely have EPI but are undiagnosed. **Methods:** A machine learning algorithm was developed in Scikit-learn, a Python module. The study population, selected from the 2014 Truven MarketScan® Commercial Claims Database, consisted of patients with EPI-prone conditions. Patients were labeled with 290 condition category flags and split into actual positive EPI cases, actual negative EPI cases, and unlabeled cases. The study population was then randomly divided into a training subset and a testing subset. The training subset was used to determine the performance metrics of 27 models and to select the highest performing model, and the testing subset was used to evaluate performance of the best machine learning model. **Results:** The study population consisted of 2088 actual positive EPI cases, 1077 actual negative EPI cases, and 437 530 unlabeled cases. In the best performing model, the precision, recall, and accuracy were 0.91, 0.80, and 0.86, respectively. The best-performing model estimated that the number of patients likely to have EPI was about 12 times the number of patients directly identified as EPI-positive through a claims analysis in the study population. The most important features in assigning EPI probability were the presence or absence of diagnosis codes related to pancreatic and digestive conditions. **Conclusions:** Machine learning techniques demonstrated high predictive power in identifying patients with EPI and could facilitate an enhanced understanding of its etiology and help to identify patients for possible diagnosis and treatment.