Molecular Medicine (Aug 2017)

Efficient Genome-wide Association in Biobanks Using Topic Modeling Identifies Multiple Novel Disease Loci

  • Thomas H. McCoy,
  • Victor M. Castro,
  • Leslie A. Snapper,
  • Kamber L. Hart,
  • Roy H. Perlis

DOI
https://doi.org/10.2119/molmed.2017.00100
Journal volume & issue
Vol. 23, no. 1
pp. 285 – 294

Abstract

Read online

Abstract Biobanks and national registries represent a powerful tool for genomic discovery, but rely on diagnostic codes that can be unreliable and fail to capture relationships between related diagnoses. We developed an efficient means of conducting genome-wide association studies using combinations of diagnostic codes from electronic health records for 10,845 participants in a biobanking program at two large academic medical centers. Specifically, we applied latent Dirichilet allocation to fit 50 disease topics based on diagnostic codes, then conducted a genome-wide common-variant association for each topic. In sensitivity analysis, these results were contrasted with those obtained from traditional single-diagnosis phenome-wide association analysis, as well as those in which only a subset of diagnostic codes were included per topic. In meta-analysis across three biobank cohorts, we identified 23 disease-associated loci with p < 1e-15, including previously associated autoimmune disease loci. In all cases, observed significant associations were of greater magnitude than single phenome-wide diagnostic codes, and incorporation of less strongly loading diagnostic codes enhanced association. This strategy provides a more efficient means of identifying phenome-wide associations in biobanks with coded clinical data.