Scientific Reports (Apr 2024)

Classifying early infant feeding status from clinical notes using natural language processing and machine learning

  • Dominick J. Lemas,
  • Xinsong Du,
  • Masoud Rouhizadeh,
  • Braeden Lewis,
  • Simon Frank,
  • Lauren Wright,
  • Alex Spirache,
  • Lisa Gonzalez,
  • Ryan Cheves,
  • Marina Magalhães,
  • Ruben Zapata,
  • Rahul Reddy,
  • Ke Xu,
  • Leslie Parker,
  • Chris Harle,
  • Bridget Young,
  • Adetola Louis-Jaques,
  • Bouri Zhang,
  • Lindsay Thompson,
  • William R. Hogan,
  • François Modave

DOI
https://doi.org/10.1038/s41598-024-58299-x
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 8

Abstract

The objective of this study was to develop and evaluate natural language processing (NLP) and machine learning models to predict infant feeding status from clinical notes in the Epic electronic health records system. The primary outcome was the classification of infant feeding status from clinical notes using Medical Subject Headings (MeSH) terms. Notes were annotated using TeamTat to uniquely classify each clinical note according to infant feeding status. We trained 6 machine learning models to classify infant feeding status: logistic regression, random forest, XGBoost gradient descent, k-nearest neighbors, and support-vector classifier. Models were compared on overall accuracy, precision, recall, and F1 score. Our modeling corpus was a balanced sample with an equal number of clinical notes in each class. We manually reviewed 999 notes that represented 746 mother-infant dyads with a mean gestational age of 38.9 weeks and a mean maternal age of 26.6 years. The most frequent feeding status classification in this study was exclusive breastfeeding [n = 183 (18.3%)], followed by exclusive formula bottle feeding [n = 146 (14.6%)], and exclusive feeding of expressed mother’s milk [n = 102 (10.2%)], with mixed feeding being the least frequent [n = 23 (2.3%)]. Our final analysis evaluated the classification of clinical notes as breast, formula/bottle, and missing. The machine learning models were trained on these three classes after balancing and downsampling. The XGBoost model outperformed all others, achieving an accuracy of 90.1%, a macro-averaged precision of 90.3%, a macro-averaged recall of 90.1%, and a macro-averaged F1 score of 90.1%. Our results demonstrate that natural language processing can be applied to clinical notes stored in electronic health records to classify infant feeding status.
Early identification of breastfeeding status using NLP on unstructured electronic health records data can be used to inform precision public health interventions focused on improving lactation support for postpartum patients.
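The abstract mentions two steps that are easy to misread: downsampling to a balanced corpus, and macro-averaged precision, recall, and F1. The sketch below illustrates both in plain Python. It is not the authors' code: the function names `downsample` and `macro_scores`, the random seed, and the toy labels are assumptions made for illustration, and only the three class names are taken from the abstract.

```python
import random
from collections import defaultdict


def downsample(labels, seed=0):
    """Return sorted indices of a class-balanced subset.

    Every class is randomly cut down to the size of the rarest class.
    (Hypothetical helper; the seed is an arbitrary choice.)
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, target))
    return sorted(keep)


def macro_scores(y_true, y_pred):
    """Return (precision, recall, F1), each macro-averaged.

    Macro-averaging computes the metric per class, then takes the
    unweighted mean, so every class counts equally.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(classes)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


if __name__ == "__main__":
    # Invented toy labels, not data from the study.
    labels = ["breast"] * 5 + ["formula/bottle"] * 3 + ["missing"] * 4
    balanced = downsample(labels)
    print(len(balanced), "notes kept after downsampling")

    y_true = ["breast", "breast", "formula/bottle",
              "formula/bottle", "missing", "missing"]
    y_pred = ["breast", "formula/bottle", "formula/bottle",
              "formula/bottle", "missing", "missing"]
    print(macro_scores(y_true, y_pred))
```

On a balanced corpus, macro-averaged recall equals overall accuracy, which is consistent with the matching 90.1% figures reported above.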