Gastro Hep Advances (Jan 2022)

A Machine Learning Approach to Identifying Causal Monogenic Variants in Inflammatory Bowel Disease

  • Daniel J. Mulder,
  • Sam Khalouei,
  • Michael Li,
  • Neil Warner,
  • Claudia Gonzaga-Jauregui,
  • Eric I. Benchimol,
  • Peter C. Church,
  • Thomas D. Walters,
  • Arun K. Ramani,
  • Anne M. Griffiths,
  • Amanda Ricciuto,
  • Aleixo M. Muise

Journal volume & issue
Vol. 1, no. 2
pp. 171 – 179

Abstract

Background and Aims: Diagnosis of monogenic disease is increasingly important for patient care and personalizing therapy. However, the current process is nonstandardized, expensive, and time-consuming. There is currently no accepted strategy to help identify disease-causing variants in monogenic inflammatory bowel disease (IBD). The aim of this study was to develop a prioritization strategy for monogenic IBD variant discovery through detailed analysis of a whole-exome sequencing (WES) data set.

Methods: All consenting pediatric patients with IBD presenting to our tertiary care hospital during the study period were enrolled and underwent WES (n = 1005). Available family members also underwent WES. Variants were analyzed en masse using the GEMINI framework and were further annotated using data from dbNSFP, Combined Annotation Dependent Depletion (CADD), and gnomAD. Known disease-causing variants (n = 36) were used as positive controls. Machine learning algorithms were optimized and then compared to assist with identifying monogenic IBD case characteristics.

Results: Initial gene-level analysis identified 11 genes not previously linked to IBD that could potentially harbor IBD-causing variants. Machine learning algorithms identified 4 primary variant characteristics (CADD score, dbNSFP score, relationship with a known immunodeficiency gene, and alternate allele frequency), and optimal threshold values for each were determined to assist with identifying monogenic IBD variants. Based on these characteristics, an automated variant prioritization pipeline was then created that reduces the >100,000 variants per patient to a mean of 15 prioritized variants. This pipeline is available online for all to use.

Conclusion: Leveraging a large WES data set, we demonstrate a statistically rigorous strategy for prioritization of variants for monogenic IBD diagnosis.
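
The threshold-based filtering described in the Results can be pictured with a minimal Python sketch. This is an illustration only, not the authors' pipeline. The abstract does not report the threshold values, so `CADD_MIN`, `DBNSFP_MIN`, and `ALT_AF_MAX` are hypothetical placeholders. The `Variant` fields, the single `dbnsfp_score` value, the use of the immunodeficiency-gene relationship as a hard filter, and the ranking by CADD score are likewise assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder thresholds: the abstract does not state the
# values determined in the study, so these are NOT the published cutoffs.
CADD_MIN = 20.0      # assumed minimum CADD score
DBNSFP_MIN = 0.5     # assumed minimum dbNSFP-derived score
ALT_AF_MAX = 0.001   # assumed maximum alternate allele frequency


@dataclass
class Variant:
    """Illustrative record holding the 4 characteristics named in the abstract."""
    gene: str
    cadd_score: float
    dbnsfp_score: float
    alt_allele_freq: float
    immunodeficiency_related: bool  # related to a known immunodeficiency gene


def passes_filters(v: Variant) -> bool:
    """Return True if the variant meets all four (assumed) criteria."""
    return (
        v.cadd_score >= CADD_MIN
        and v.dbnsfp_score >= DBNSFP_MIN
        and v.alt_allele_freq <= ALT_AF_MAX
        and v.immunodeficiency_related
    )


def prioritize(variants: List[Variant]) -> List[Variant]:
    """Keep variants passing every filter, ranked by CADD score (highest first).

    Ranking by CADD score is an assumption made for this sketch.
    """
    kept = [v for v in variants if passes_filters(v)]
    return sorted(kept, key=lambda v: v.cadd_score, reverse=True)


if __name__ == "__main__":
    # Made-up example records with placeholder gene names.
    example = [
        Variant("GENE_A", 30.0, 0.9, 0.0001, True),
        Variant("GENE_B", 10.0, 0.9, 0.0001, True),   # fails the CADD threshold
        Variant("GENE_C", 25.0, 0.8, 0.0005, True),
        Variant("GENE_D", 35.0, 0.9, 0.05, True),     # too common in the population
    ]
    for v in prioritize(example):
        print(v.gene, v.cadd_score)
```

Applying the criteria as a conjunction of simple cutoffs keeps each exclusion explainable; a real pipeline may weight or combine the characteristics differently.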

Keywords