Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes
Xinsong Du,
John Novoa-Laurentiev,
Joseph M. Plasek,
Ya-Wen Chuang,
Liqin Wang,
Gad A. Marshall,
Stephanie K. Mueller,
Frank Chang,
Surabhi Datta,
Hunki Paek,
Bin Lin,
Qiang Wei,
Xiaoyan Wang,
Jingqi Wang,
Hao Ding,
Frank J. Manion,
Jingcheng Du,
David W. Bates,
Li Zhou
Affiliations
Xinsong Du
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA; Corresponding author. Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 399 Revolution Dr, Suite 777, Somerville, MA, 02145, USA.
John Novoa-Laurentiev
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA
Joseph M. Plasek
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
Ya-Wen Chuang
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA; Division of Nephrology, Taichung Veterans General Hospital, Taichung, 407219, Taiwan; Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, 402202, Taiwan; School of Medicine, College of Medicine, China Medical University, Taichung, 406040, Taiwan
Liqin Wang
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
Gad A. Marshall
Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA; Department of Neurology, Brigham and Women's Hospital, Boston, MA, 02115, USA
Stephanie K. Mueller
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
Frank Chang
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA
Surabhi Datta
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Hunki Paek
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Bin Lin
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Qiang Wei
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Xiaoyan Wang
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Jingqi Wang
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Hao Ding
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Frank J. Manion
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
Jingcheng Du
Intelligent Medical Objects, Rosemont, Illinois, 60018, USA
David W. Bates
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
Li Zhou
Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, 02115, USA; Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
Summary
Background: Large language models (LLMs) have shown promising performance in various healthcare domains, but their effectiveness in identifying specific clinical conditions in real medical records is less explored. This study evaluates LLMs for detecting signs of cognitive decline in real electronic health record (EHR) clinical notes and compares their error profiles with those of traditional models. The resulting insights are intended to inform strategies for improving performance.
Methods: This study, conducted at Mass General Brigham in Boston, MA, analysed clinical notes from the four years prior to a 2019 diagnosis of mild cognitive impairment in patients aged 50 and older. We developed prompts for two LLMs, Llama 2 and GPT-4, on Health Insurance Portability and Accountability Act (HIPAA)-compliant cloud-computing platforms using multiple approaches (e.g., hard prompting, retrieval-augmented generation, and error-analysis-based instructions) to select the optimal LLM-based method. Baseline models included a hierarchical attention-based neural network and XGBoost. Subsequently, we constructed an ensemble of the three models using a majority-vote approach. Confusion-matrix-based scores were used for model evaluation.
Findings: For model development, we used an annotated random sample of 4949 note sections from 1969 patients (women: 1046 [53.1%]; age: mean, 76.0 [SD, 13.3] years), filtered with keywords related to cognitive functions. For testing, we used an annotated random sample of 1996 note sections from 1161 patients (women: 619 [53.3%]; age: mean, 76.5 [SD, 10.2] years) without keyword filtering. GPT-4 demonstrated superior accuracy and efficiency compared to Llama 2, but did not outperform traditional models. The ensemble model outperformed the individual models on all evaluation metrics with statistical significance (p < 0.01), achieving a precision of 90.2% [95% CI: 81.9%–96.8%], a recall of 94.2% [95% CI: 87.9%–98.7%], and an F1-score of 92.1% [95% CI: 86.8%–96.4%]. Notably, compared with the best-performing single model, the ensemble significantly improved precision, from a range of 70%–79% to above 90%. Error analysis revealed that 63 samples were incorrectly predicted by at least one model; however, only 2 of these (3.2%) were errors shared by all models, indicating diverse error profiles among them.
Interpretation: LLMs and traditional machine learning models trained using local EHR data exhibited diverse error profiles. These models proved complementary when combined in an ensemble, which enhanced diagnostic performance. Future research should investigate integrating LLMs with smaller, localised models and incorporating medical data and domain knowledge to enhance performance on specific tasks.
Funding: This research was supported by National Institute on Aging grants (R44AG081006, R01AG080429) and a National Library of Medicine grant (R01LM014239).
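Illustrative sketch (not the authors' code): the majority-vote ensemble and the confusion-matrix-based scores mentioned in the Methods could be computed roughly as below. The function names, the variable names, and the toy labels are all hypothetical and chosen only for illustration; the study's actual models, data, and evaluation pipeline are not reproduced here.

```python
from typing import List, Sequence, Tuple


def majority_vote(*model_predictions: Sequence[int]) -> List[int]:
    """Combine binary (0/1) predictions from an odd number of models."""
    n_models = len(model_predictions)
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models so votes cannot tie")
    # A sample is labelled positive when more than half of the models vote 1.
    return [int(2 * sum(votes) > n_models) for votes in zip(*model_predictions)]


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 1 = note section shows cognitive decline, 0 = it does not.
y_true = [1, 1, 0, 0, 1, 0]
llm_pred = [1, 0, 1, 0, 1, 0]        # stand-in for an LLM-based classifier
attention_pred = [1, 1, 0, 1, 1, 0]  # stand-in for an attention-based network
xgboost_pred = [0, 1, 0, 0, 1, 1]    # stand-in for a gradient-boosted model

ensemble = majority_vote(llm_pred, attention_pred, xgboost_pred)

for name, pred in [
    ("llm", llm_pred),
    ("attention", attention_pred),
    ("xgboost", xgboost_pred),
    ("ensemble", ensemble),
]:
    p, r, f = precision_recall_f1(y_true, pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy data each stand-in model misclassifies different samples, so the vote corrects every individual error. This mirrors, in miniature only, the idea that models with diverse error profiles can be complementary.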