Use of artificial intelligence for public health surveillance: a case study to develop a machine Learning-algorithm to estimate the incidence of diabetes mellitus in France

Romana Haneef; Sofiane Kab; Rok Hrzic; Sonsoles Fuentes; Sandrine Fosse-Edorh; Emmanuel Cosson; Anne Gallay

doi:10.1186/s13690-021-00687-0

Archives of Public Health (Sep 2021)

Affiliations

Romana Haneef: Department of Non-Communicable Diseases and Injuries, Santé Publique France
Sofiane Kab: Population-Based Epidemiological Cohorts Unit, INSERM UMS 011
Rok Hrzic: Department of International Health, Care and Public Health Research Institute – CAPHRI, Maastricht University
Sonsoles Fuentes: Department of Non-Communicable Diseases and Injuries, Santé Publique France
Sandrine Fosse-Edorh: Department of Non-Communicable Diseases and Injuries, Santé Publique France
Emmanuel Cosson: Department of Endocrinology-Diabetology-Nutrition, AP-HP, Avicenne Hospital, Paris 13 University, Sorbonne Paris Cité, CRNH-IdF, CINFO
Anne Gallay: Department of Non-Communicable Diseases and Injuries, Santé Publique France

DOI: https://doi.org/10.1186/s13690-021-00687-0
Journal volume & issue: Vol. 79, no. 1
pp. 1 – 13

Abstract

Background: Machine learning techniques are increasingly used in healthcare because they allow health outcomes to be estimated and predicted from large administrative data sets more efficiently. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last 2 years.

Methods: We selected a final data set from a population-based epidemiological cohort (CONSTANCES) linked with the French National Health Database (SNDS). To develop the algorithm, we adopted a supervised ML approach. The following steps were performed: (i) selection of the final data set, (ii) target definition, (iii) coding of variables for a given window of time, (iv) splitting of the final data into training and test data sets, (v) variable selection, (vi) model training, (vii) validation of the model with the test data set, and (viii) selection of the model. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm.

Results: The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Of the 3468 variables coded from the SNDS data linked to the CONSTANCES cohort, 23 were selected to train different algorithms. The final algorithm to estimate the incidence of diabetes was a Linear Discriminant Analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts and hospitalization without a procedure over the last 2 years. This algorithm has a sensitivity of 62%, a specificity of 67% and an accuracy of 67% [95% CI: 0.66–0.68].

Conclusions: Supervised ML is an innovative tool for developing new methods to exploit large health administrative databases. In the context of the InfAct project, we have developed and applied for the first time a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we have developed has moderate performance. The next step is to apply this algorithm to the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various machine learning techniques (MLTs) to estimate the incidence of various health conditions.
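The abstract reports the classifier's performance as sensitivity, specificity, accuracy and AUC. As a minimal sketch of how these four quantities are defined, the Python below computes them for a binary classifier using only the standard library. The function names, the toy labels and scores, and the 0.5 decision threshold are illustrative assumptions. They are not taken from the study, whose data and code are not reproduced here.

```python
from typing import List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for labels coded 1 = incident case, 0 = non-case."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity_specificity_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/N."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win
    (the Mann-Whitney formulation of the area under the ROC curve).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three cases, five non-cases.
toy_labels: List[int] = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores: List[float] = [0.9, 0.6, 0.3, 0.8, 0.4, 0.2, 0.1, 0.35]

# Hypothetical decision threshold of 0.5 turns scores into predicted labels.
toy_pred: List[int] = [1 if s >= 0.5 else 0 for s in toy_scores]

sens, spec, acc = sensitivity_specificity_accuracy(toy_labels, toy_pred)
toy_auc = auc(toy_labels, toy_scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f} AUC={toy_auc:.2f}")
```

Sensitivity, specificity and accuracy depend on the chosen threshold. AUC summarises ranking quality across all thresholds, which is one reason it is commonly used to compare candidate models.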

Published in Archives of Public Health

ISSN: 2049-3258 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Public aspects of medicine
Website: http://archpublichealth.biomedcentral.com
