A data-driven architecture using natural language processing to improve phenotyping efficiency and accelerate genetic diagnoses of rare disorders
Jignesh R. Parikh,
Casie A. Genetti,
Asli Aykanat,
Catherine A. Brownstein,
Klaus Schmitz-Abe,
Morgan Danowski,
Andrew Quitadomo,
Jill A. Madden,
Calum Yacoubian,
Richard Gain,
Tessa Williams,
Mary Meskell,
Andrew Brown,
Alison Frith,
Shira Rockowitz,
Piotr Sliz,
Pankaj B. Agrawal,
Thomas Defay,
Paul McDonagh,
John Reynders,
Sebastien Lefebvre,
Alan H. Beggs
Affiliations
Jignesh R. Parikh
J Square Labs, LLC, Natick, MA 01760, USA
Casie A. Genetti
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Asli Aykanat
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Catherine A. Brownstein
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Klaus Schmitz-Abe
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Morgan Danowski
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Andrew Quitadomo
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA; Computational Health Informatics Program, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Jill A. Madden
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Calum Yacoubian
Clinithink, Ltd., London N1 6DR, UK
Richard Gain
Clinithink, Ltd., London N1 6DR, UK
Tessa Williams
Clinithink, Ltd., London N1 6DR, UK
Mary Meskell
Clinithink, Ltd., London N1 6DR, UK
Andrew Brown
Clinithink, Ltd., London N1 6DR, UK
Alison Frith
Clinithink, Ltd., London N1 6DR, UK
Shira Rockowitz
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA; Computational Health Informatics Program, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Piotr Sliz
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA; Computational Health Informatics Program, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Pankaj B. Agrawal
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA; Division of Newborn Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Thomas Defay
Alexion Pharmaceuticals, Inc., Boston, MA 02210, USA
Paul McDonagh
Alexion Pharmaceuticals, Inc., Boston, MA 02210, USA
John Reynders
Alexion Pharmaceuticals, Inc., Boston, MA 02210, USA
Sebastien Lefebvre
Alexion Pharmaceuticals, Inc., Boston, MA 02210, USA; Corresponding author
Alan H. Beggs
The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA; Corresponding author
Summary: Effective genetic diagnosis requires the correlation of genetic variant data with detailed phenotypic information. However, manual encoding of clinical data into machine-readable forms is laborious and subject to observer bias. Natural language processing (NLP) of electronic health records has great potential to enhance reproducibility at scale but suffers from idiosyncrasies in physician notes and other medical records. We developed methods to optimize NLP outputs for automated diagnosis. We filtered NLP-extracted Human Phenotype Ontology (HPO) terms to more closely resemble manually extracted terms and identified filter parameters across a three-dimensional space for optimal gene prioritization. We then developed a tiered pipeline that reduces manual effort by prioritizing smaller subsets of genes to consider for genetic diagnosis. Our filtering pipeline enabled NLP-based extraction of HPO terms to serve as a sufficient replacement for manual extraction in 92% of prospectively evaluated cases. In 75% of cases, the correct causal gene was ranked higher with our applied filters than without any filters. We describe a framework that can maximize the utility of NLP-based phenotype extraction for gene prioritization and diagnosis. The framework is implemented within a cloud-based modular architecture that can be deployed across health and research institutions.
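To make the filtering idea in the summary concrete, the following is a minimal Python sketch of a grid search over three filter parameters for NLP-extracted terms. The summary does not name the three dimensions, so the parameters used here (`min_mentions`, `min_depth`, `max_terms`), the Jaccard-overlap gene scorer, and the toy term and gene labels are all hypothetical stand-ins chosen for illustration. This is not the authors' implementation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ExtractedTerm:
    term_id: str   # stand-in for an HPO identifier (toy labels below)
    mentions: int  # assumed: how often NLP found the term in the record
    depth: int     # assumed: a specificity proxy such as ontology depth


def filter_terms(terms, min_mentions, min_depth, max_terms):
    """Keep terms passing both thresholds, then cap the list length."""
    kept = [t for t in terms if t.mentions >= min_mentions and t.depth >= min_depth]
    kept.sort(key=lambda t: (-t.mentions, -t.depth, t.term_id))
    return [t.term_id for t in kept[:max_terms]]


def rank_genes(term_ids, gene_to_terms):
    """Toy scorer: rank genes by Jaccard overlap with the query terms."""
    query = set(term_ids)
    scores = {}
    for gene, annotated in gene_to_terms.items():
        union = query | annotated
        scores[gene] = len(query & annotated) / len(union) if union else 0.0
    return sorted(scores, key=lambda g: (-scores[g], g))


def causal_rank(term_ids, gene_to_terms, causal_gene):
    """1-based position of the known causal gene in the ranking."""
    return rank_genes(term_ids, gene_to_terms).index(causal_gene) + 1


def grid_search(cases, gene_to_terms, grid):
    """Return (mean causal-gene rank, parameters) for the best grid point."""
    best = None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        ranks = [
            causal_rank(filter_terms(terms, **params), gene_to_terms, gene)
            for terms, gene in cases
        ]
        mean_rank = sum(ranks) / len(ranks)
        if best is None or mean_rank < best[0]:
            best = (mean_rank, params)
    return best


# Toy data (entirely made up): two specific, frequently mentioned terms
# point to GENE_A; three vague, rarely mentioned terms point to GENE_B.
gene_to_terms = {"GENE_A": {"T1", "T2"}, "GENE_B": {"T3", "T4", "T5"}}
terms = [
    ExtractedTerm("T1", mentions=5, depth=4),
    ExtractedTerm("T2", mentions=3, depth=5),
    ExtractedTerm("T3", mentions=1, depth=1),
    ExtractedTerm("T4", mentions=1, depth=2),
    ExtractedTerm("T5", mentions=1, depth=1),
]
cases = [(terms, "GENE_A")]
grid = {"min_mentions": [1, 2], "min_depth": [0, 3], "max_terms": [10, 2]}

unfiltered_rank = causal_rank(filter_terms(terms, 1, 0, 10), gene_to_terms, "GENE_A")
best_mean_rank, best_params = grid_search(cases, gene_to_terms, grid)
print(unfiltered_rank, best_mean_rank, best_params)
```

In this toy case the noisy, low-frequency terms pull the causal gene down the ranking when no filter is applied, and several grid points restore it to the top. A real evaluation would use a validated phenotype-similarity scorer and a cohort of solved cases rather than a single synthetic record.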