Integration of datasets for individual prediction of DNA methylation-based biomarkers

Charlotte Merzbacher; Barry Ryan; Thibaut Goldsborough; Robert F. Hillary; Archie Campbell; Lee Murphy; Andrew M. McIntosh; David Liewald; Sarah E. Harris; Allan F. McRae; Simon R. Cox; Timothy I. Cannings; Catalina A. Vallejos; Daniel L. McCartney; Riccardo E. Marioni

doi:10.1186/s13059-023-03114-5

Genome Biology (Dec 2023)

Affiliations

Charlotte Merzbacher: School of Informatics, University of Edinburgh
Barry Ryan: School of Informatics, University of Edinburgh
Thibaut Goldsborough: School of Informatics, University of Edinburgh
Robert F. Hillary: Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh
Archie Campbell: Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh
Lee Murphy: Edinburgh Clinical Research Facility, University of Edinburgh
Andrew M. McIntosh: Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh
David Liewald: Department of Psychology, Lothian Birth Cohorts, University of Edinburgh
Sarah E. Harris: Department of Psychology, Lothian Birth Cohorts, University of Edinburgh
Allan F. McRae: Institute for Molecular Bioscience, University of Queensland
Simon R. Cox: Department of Psychology, Lothian Birth Cohorts, University of Edinburgh
Timothy I. Cannings: Maxwell Institute for Mathematical Sciences, School of Mathematics, University of Edinburgh
Catalina A. Vallejos: MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh
Daniel L. McCartney: Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh
Riccardo E. Marioni: Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh

Journal volume & issue: Vol. 24, no. 1
pp. 1 – 12

Abstract

Background: Epigenetic scores (EpiScores) can provide biomarkers of lifestyle and disease risk. Projecting new datasets onto a reference panel is challenging because technical and biological variation are difficult to separate in array data. Normalisation can standardise data distributions but may also remove population-level biological variation.

Results: We compare two birth cohorts (Lothian Birth Cohorts of 1921 and 1936; n = 387 for LBC1921 and n = 498 for LBC1936) with blood-based DNA methylation assessed at the same chronological age (79 years) and processed in the same lab but in different years and experimental batches. We examine the effect of 16 normalisation methods on a novel BMI EpiScore (trained in an external cohort, n = 18,413) and on Horvath’s pan-tissue DNA methylation age, when the cohorts are normalised separately and together. The BMI EpiScore explains a maximum of R² = 24.5% of the variance in BMI in LBC1936 (SWAN normalisation). Although R² differs between cohorts, the normalisation method makes a minimal difference to within-cohort estimates. Conversely, a range of absolute differences is seen in individual-level EpiScore estimates for BMI and age when cohorts are normalised separately versus together. While within-array methods result in identical EpiScores whether a cohort is normalised on its own or together with the second dataset, a range of differences is observed for between-array methods.

Conclusions: Normalisation methods that return similar EpiScores, whether cohorts are analysed separately or together, will minimise technical variation when projecting new data onto a reference panel. These methods are important for cases where raw data are unavailable and joint normalisation of cohorts is computationally expensive.
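The contrast between within-array and between-array normalisation can be illustrated with a small simulation. The sketch below is not the authors' pipeline and uses none of the 16 methods evaluated in the paper: it assumes an EpiScore is a weighted sum of CpG methylation values, and it uses two toy stand-ins (per-sample standardisation for a within-array method, quantile normalisation across samples for a between-array method). All data, weights, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def within_array_norm(beta):
    # Toy within-array method: each sample (column) is centred and scaled
    # using only its own probes, so other samples cannot influence it.
    return (beta - beta.mean(axis=0)) / beta.std(axis=0)

def between_array_norm(beta):
    # Toy between-array method: quantile normalisation. Each sample's values
    # are replaced by a reference distribution averaged over ALL samples.
    ranks = np.argsort(np.argsort(beta, axis=0), axis=0)
    reference = np.sort(beta, axis=0).mean(axis=1)
    return reference[ranks]

def episcore(beta, weights):
    # Assumed form of an EpiScore: weighted sum of methylation values.
    # beta has shape (n_probes, n_samples); the result has shape (n_samples,).
    return weights @ beta

def max_abs_diff(normalise, cohort_a, cohort_b, weights):
    # Score cohort A after normalising it alone, then after normalising it
    # jointly with cohort B, and report the largest individual-level change.
    separate = episcore(normalise(cohort_a), weights)
    joint_all = normalise(np.hstack([cohort_a, cohort_b]))
    joint = episcore(joint_all[:, :cohort_a.shape[1]], weights)
    return np.max(np.abs(separate - joint))

n_probes = 200
weights = rng.normal(size=n_probes)               # hypothetical CpG weights
cohort_a = rng.beta(2, 5, size=(n_probes, 30))    # simulated methylation values
# Cohort B gets a simulated batch effect (rescaled and shifted values).
cohort_b = 0.9 * rng.beta(2, 5, size=(n_probes, 40)) + 0.05

within_diff = max_abs_diff(within_array_norm, cohort_a, cohort_b, weights)
between_diff = max_abs_diff(between_array_norm, cohort_a, cohort_b, weights)

print("within-array, max |separate - joint|:", within_diff)
print("between-array, max |separate - joint|:", between_diff)
```

In this toy setting the within-array scores for cohort A are unchanged (up to floating-point error) by adding cohort B, whereas the between-array scores shift because the reference distribution depends on which samples are normalised together.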

Published in Genome Biology

ISSN: 1474-760X (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Genetics
Website: https://genomebiology.biomedcentral.com/
