Global Epidemiology (Jun 2025)

Modeling the determinants of attrition in a two-stage epilepsy prevalence survey in Nairobi using machine learning

  • Daniel M. Mwanga,
  • Isaac C. Kipchirchir,
  • George O. Muhua,
  • Charles R. Newton,
  • Damazo T. Kadengye,
  • Abankwah Junior,
  • Albert Akpalu,
  • Arjune Sen,
  • Bruno Mmbando,
  • Cynthia Sottie,
  • Dan Bhwana,
  • Daniel Mtai Mwanga,
  • Daniel Nana Yaw,
  • David McDaid,
  • Dorcas Muli,
  • Emmanuel Darkwa,
  • Frederick Murunga Wekesah,
  • Gershim Asiki,
  • Gergana Manolova,
  • Guillaume Pages,
  • Helen Cross,
  • Henrika Kimambo,
  • Isolide S. Massawe,
  • Josemir W. Sander,
  • Mary Bitta,
  • Mercy Atieno,
  • Neerja Chowdhary,
  • Patrick Adjei,
  • Peter O. Otieno,
  • Ryan Wagner,
  • Richard Walker,
  • Sabina Asiamah,
  • Samuel Iddi,
  • Simone Grassi,
  • Sloan Mahone,
  • Sonia Vallentin,
  • Stella Waruingi,
  • Symon Kariuki,
  • Tarun Dua,
  • Thomas Kwasa,
  • Timothy Denison,
  • Tony Godi,
  • Vivian Mushi,
  • William Matuja

Journal volume & issue
Vol. 9
p. 100183

Abstract

Background: Attrition is a challenge for parameter estimation in both longitudinal and multi-stage cross-sectional studies. Here, we examine the utility of machine learning to predict attrition and identify associated factors in a two-stage population-based epilepsy prevalence study in Nairobi.

Methods: All individuals in the Nairobi Urban Health and Demographic Surveillance System (NUHDSS) (Korogocho and Viwandani) were screened for epilepsy in two stages. Attrition was defined as a probable epilepsy case identified at stage-I who did not attend stage-II (neurologist assessment). Categorical variables were one-hot encoded, class imbalance was addressed using the synthetic minority over-sampling technique (SMOTE), and numeric variables were scaled and centered. The dataset was split into training and testing sets (7:3 ratio), and seven machine learning models, including the ensemble Super Learner, were trained. Hyperparameters were tuned using 10-fold cross-validation, and model performance was evaluated using area under the curve (AUC), accuracy, Brier score, and F1 score over 500 bootstrap samples of the test data.

Results: Random forest (AUC = 0.98, accuracy = 0.95, Brier score = 0.06, F1 = 0.94), extreme gradient boosting (XGB) (AUC = 0.96, accuracy = 0.91, Brier score = 0.08, F1 = 0.90), and support vector machine (SVM) (AUC = 0.93, accuracy = 0.93, Brier score = 0.07, F1 = 0.92) were the best-performing base learners. The ensemble Super Learner had similarly high performance. Important predictors of attrition included proximity to industrial areas, male gender, employment, education, smaller households, and a history of complex partial seizures.

Conclusion: These findings can help researchers plan targeted mobilization for scheduled clinical appointments to improve follow-up rates. They will also inform development of a web-based algorithm to predict attrition risk and support targeted follow-up efforts in similar studies.
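The evaluation step described in the Methods (AUC, accuracy, Brier score, and F1 score summarized over bootstrap resamples of the test data) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the 0.5 classification threshold, the random seed, and the toy labels and probabilities are all assumptions made for the example. Only the four metrics and the 500 resamples come from the abstract.

```python
import random

def auc(y_true, y_prob):
    """Rank-based AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of cases whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def f1(y_true, y_prob, threshold=0.5):
    """Harmonic mean of precision and recall for the positive (attrition) class."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

METRICS = {"auc": auc, "accuracy": accuracy, "brier": brier, "f1": f1}

def bootstrap_metrics(y_true, y_prob, n_boot=500, seed=0):
    """Average each metric over n_boot resamples (with replacement) of the test set."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sketch is reproducible
    n = len(y_true)
    scores = {name: [] for name in METRICS}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        yp = [y_prob[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # AUC is undefined when a resample contains one class only
        for name, fn in METRICS.items():
            scores[name].append(fn(yt, yp))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Toy test-set labels (1 = did not attend stage-II) and predicted probabilities.
# These values are invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

summary = bootstrap_metrics(y_true, y_prob, n_boot=500)
print(summary)
```

In practice the predicted probabilities would come from a fitted model applied to the held-out 30% of the data. Percentile intervals could be taken from the per-resample scores instead of the mean alone.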

Keywords