Journal of Pathology Informatics (Jan 2025)

Enhancing human phenotype ontology term extraction through synthetic case reports and embedding-based retrieval: A novel approach for improved biomedical data annotation

  • Abdulkadir Albayrak,
  • Yao Xiao,
  • Piyush Mukherjee,
  • Sarah S. Barnett,
  • Cherisse A. Marcou,
  • Steven N. Hart

Journal volume & issue
Vol. 16
p. 100409

Abstract

Read online

With the increasing utilization of exome and genome sequencing in clinical and research genetics, accurate and automated extraction of human phenotype ontology (HPO) terms from clinical texts has become imperative. Traditional methods for HPO term extraction, such as PhenoTagger, often face limitations in coverage and precision. In this study, we propose a novel approach that leverages large language models (LLMs) to generate synthetic sentences with clinical context, which were semantically encoded into vector embeddings. These embeddings are linked to HPO terms, creating a robust knowledgebase that facilitates precise information retrieval. Our method circumvents the known issue of LLM hallucinations by storing and querying these embeddings within a true database, ensuring accurate context matching without the need for a predictive model. We evaluated the performance of three different embedding models, all of which demonstrated substantial improvements over PhenoTagger. Top recall (sensitivity), precision (positive-predictive value, PPV), and F1 are 0.64, 0.64, and 0.64, respectively, which were 31%, 10%, and 21% better than PhenoTagger. Furthermore, optimal performance was achieved when we combined the best performing embedding model with PhenoTagger (a.k.a. Fused model), resulting in recall (sensitivity), precision (PPV), and F1 values of 0.7, 0.7, and 0.7, respectively, which are 10%, 10%, and 10% better than the best embedding models. Our findings underscore the potential of this integrated approach to enhance the precision and reliability of HPO term extraction, offering a scalable and effective solution for biomedical data annotation.

Keywords