Rare disease variant curation from literature: assessing gaps with creatine transport deficiency in focus

Erica L. Lyons; Daniel Watson; Mohammad S. Alodadi; Sharie J. Haugabook; Gregory J. Tawa; Fady Hannah-Shmouni; Forbes D. Porter; Jack R. Collins; Elizabeth A. Ottinger; Uma S. Mudunuri

doi:10.1186/s12864-023-09561-5

BMC Genomics (Aug 2023)

Affiliations

Erica L. Lyons: Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research
Daniel Watson: Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research
Mohammad S. Alodadi: Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research
Sharie J. Haugabook: Division of Preclinical Innovation, Therapeutic Development Branch, Therapeutics for Rare and Neglected Diseases (TRND) Program, National Center for Advancing Translational Sciences, National Institutes of Health
Gregory J. Tawa: Division of Preclinical Innovation, Therapeutic Development Branch, Therapeutics for Rare and Neglected Diseases (TRND) Program, National Center for Advancing Translational Sciences, National Institutes of Health
Fady Hannah-Shmouni: Division of Translational Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health
Forbes D. Porter: Division of Translational Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health
Jack R. Collins: Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research
Elizabeth A. Ottinger: Division of Preclinical Innovation, Therapeutic Development Branch, Therapeutics for Rare and Neglected Diseases (TRND) Program, National Center for Advancing Translational Sciences, National Institutes of Health
Uma S. Mudunuri: Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research

DOI: https://doi.org/10.1186/s12864-023-09561-5
Journal volume & issue: Vol. 24, no. 1
pp. 1–18

Abstract

Background: Approximately 4–8% of the world's population suffers from a rare disease. Rare diseases are often difficult to diagnose, and many do not have approved therapies. Genetic sequencing has the potential to shorten the current diagnostic process, increase mechanistic understanding, and facilitate research on therapeutic approaches, but it is limited by the difficulty of interpreting the pathogenicity of novel variants and of communicating known causative variants. It is unknown how many published rare disease variants are currently accessible in the public domain.

Results: This study investigated the translation of knowledge of variants reported in published manuscripts to publicly accessible variant databases. Variants, symptoms, biochemical assay results, and protein function from literature on the SLC6A8 gene associated with X-linked Creatine Transporter Deficiency (CTD) were curated and reported as a highly annotated dataset of variants with clinical context and functional details. Variants were harmonized, their availability in existing variant databases was analyzed, and pathogenicity assignments were compared with impact algorithm predictions. Of the pathogenic variants found in PubMed articles, 24% were not captured in any database used in this analysis, while only 65% of the published variants received an accurate pathogenicity prediction from at least one impact prediction algorithm.

Conclusions: Despite being published in the literature, pathogenicity data on patient variants may remain inaccessible for genetic diagnosis, therapeutic target identification, mechanistic understanding, or hypothesis generation. Clinical and functional details presented in the literature are important for making pathogenicity assessments. Impact predictions remain imperfect but are improving, especially for single nucleotide exonic variants; however, such predictions are less accurate or unavailable for intronic and multi-nucleotide variants. Developing text mining workflows that use natural language processing to identify diseases, genes, and variants, combined with impact prediction algorithms and integrated with details on clinical phenotypes and functional assessments, might be a promising approach to scaling literature mining of variants and assigning correct pathogenicity. The curated variant list created by this effort includes context details to support such variant curation efforts for rare diseases.

Published in BMC Genomics

ISSN: 1471-2164 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Chemical technology: Biotechnology; Science: Biology (General): Genetics
Website: http://bmcgenomics.biomedcentral.com
