npj Digital Medicine (Jan 2025)

Clinical entity augmented retrieval for clinical information extraction

  • Ivan Lopez,
  • Akshay Swaminathan,
  • Karthik Vedula,
  • Sanjana Narayanan,
  • Fateme Nateghi Haredasht,
  • Stephen P. Ma,
  • April S. Liang,
  • Steven Tate,
  • Manoj Maddali,
  • Robert Joseph Gallo,
  • Nigam H. Shah,
  • Jonathan H. Chen

DOI
https://doi.org/10.1038/s41746-024-01377-1
Journal volume & issue
Vol. 8, no. 1
pp. 1 – 11

Abstract

Large language models (LLMs) with retrieval-augmented generation (RAG) have improved information extraction over previous methods, yet their reliance on embeddings often leads to inefficient retrieval. We introduce CLinical Entity Augmented Retrieval (CLEAR), a RAG pipeline that retrieves information using entities. We compared CLEAR to embedding RAG and full-note approaches for extracting 18 variables using six LLMs across 20,000 clinical notes. For CLEAR, embedding RAG, and full-note approaches, respectively, average F1 scores were 0.90, 0.86, and 0.79; inference times were 4.95, 17.41, and 20.08 s per note; average model queries were 1.68, 4.94, and 4.18 per note; and average input tokens were 1.1k, 3.8k, and 6.1k per note. In conclusion, CLEAR uses clinical entities for information retrieval and achieves a >70% reduction in token usage and inference time, with improved performance, compared to modern methods.
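The ">70% reduction" figure can be checked directly from the per-note averages quoted in the abstract. The following is a minimal sketch of that arithmetic, not code from the paper; the names `reported`, `relative_reduction`, and `reductions` are hypothetical, and the inputs are the rounded values as printed (for example, 1.1k tokens), so the results are approximate.

```python
# Per-note averages as reported in the abstract, in the order
# (CLEAR, embedding RAG, full-note). Values are rounded as published.
reported = {
    "inference_time_s": (4.95, 17.41, 20.08),
    "input_tokens_k": (1.1, 3.8, 6.1),
    "model_queries": (1.68, 4.94, 4.18),
}


def relative_reduction(clear: float, baseline: float) -> float:
    """Fraction by which CLEAR's value is lower than the baseline's."""
    return 1.0 - clear / baseline


# Hypothetical helper structure: reduction of CLEAR against each baseline.
reductions = {
    metric: {
        "vs_embedding_rag": relative_reduction(clear, rag),
        "vs_full_note": relative_reduction(clear, full),
    }
    for metric, (clear, rag, full) in reported.items()
}

for metric, r in reductions.items():
    print(
        f"{metric}: {r['vs_embedding_rag']:.1%} vs embedding RAG, "
        f"{r['vs_full_note']:.1%} vs full-note"
    )
```

Under these rounded inputs, the inference-time and input-token reductions come out above 70% against both baselines, consistent with the abstract's claim, while the reduction in model queries is smaller (roughly 60 to 66%).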