Clinical Epigenetics (Jul 2023)

Stability selection enhances feature selection and enables accurate prediction of gestational age using only five DNA methylation sites

  • Kristine L. Haftorn,
  • Julia Romanowska,
  • Yunsung Lee,
  • Christian M. Page,
  • Per M. Magnus,
  • Siri E. Håberg,
  • Jon Bohlin,
  • Astanand Jugessur,
  • William R. P. Denault

DOI
https://doi.org/10.1186/s13148-023-01528-3
Journal volume & issue
Vol. 15, no. 1
pp. 1 – 14

Abstract

Read online

Abstract Background DNA methylation (DNAm) is robustly associated with chronological age in children and adults, and gestational age (GA) in newborns. This property has enabled the development of several epigenetic clocks that can accurately predict chronological age and GA. However, the lack of overlap in predictive CpGs across different epigenetic clocks remains elusive. Our main aim was therefore to identify and characterize CpGs that are stably predictive of GA. Results We applied a statistical approach called ‘stability selection’ to DNAm data from 2138 newborns in the Norwegian Mother, Father, and Child Cohort study. Stability selection combines subsampling with variable selection to restrict the number of false discoveries in the set of selected variables. Twenty-four CpGs were identified as being stably predictive of GA. Intriguingly, only up to 10% of the CpGs in previous GA clocks were found to be stably selected. Based on these results, we used generalized additive model regression to develop a new GA clock consisting of only five CpGs, which showed a similar predictive performance as previous GA clocks (R 2 = 0.674, median absolute deviation = 4.4 days). These CpGs were in or near genes and regulatory regions involved in immune responses, metabolism, and developmental processes. Furthermore, accounting for nonlinear associations improved prediction performance in preterm newborns. Conclusion We present a methodological framework for feature selection that is broadly applicable to any trait that can be predicted from DNAm data. We demonstrate its utility by identifying CpGs that are highly predictive of GA and present a new and highly performant GA clock based on only five CpGs that is more amenable to a clinical setting.

Keywords