Journal of Medical Internet Research (Jul 2024)
Evaluating and Enhancing Large Language Models’ Performance in Domain-Specific Medicine: Development and Usability Study With DocOA
Abstract
Background: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored.

Objective: This study aimed to evaluate and enhance the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study.

Methods: A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM designed for OA management that integrates retrieval-augmented generation (RAG) and instructional prompts, was then developed. Through RAG, DocOA can identify the clinical evidence on which its answers are based, making those answers explainable. The study compared the performance of GPT-3.5, GPT-4, and DocOA using objective and human evaluations.

Results: General LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations, whereas DocOA showed significant improvements.

Conclusions: This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs across multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs.
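To make the RAG-plus-instructional-prompts design described in the Methods concrete, the following is a minimal sketch of that general pattern: retrieve the evidence passages most relevant to a question, then assemble a prompt that instructs the model to ground and cite its answer. The evidence snippets, source tags, helper names (`retrieve`, `build_prompt`), and prompt wording are all illustrative assumptions, not DocOA's actual corpus or implementation.

```python
# Minimal RAG sketch (assumptions throughout): retrieve top-k evidence
# snippets with TF-IDF similarity, then build a prompt that requires
# source-tagged citations, which is what makes the answer explainable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical evidence base: short guideline snippets with source tags.
EVIDENCE = [
    ("OARSI-2019", "Topical NSAIDs are recommended first-line for knee OA."),
    ("ACR-2020", "Exercise is strongly recommended for hip and knee OA."),
    ("ACR-2020", "Opioids are conditionally recommended against for OA."),
]

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k evidence snippets most similar to the question."""
    texts = [text for _, text in EVIDENCE]
    vectorizer = TfidfVectorizer().fit(texts + [question])
    doc_vecs = vectorizer.transform(texts)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vecs)[0]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [EVIDENCE[i] for i in top]

def build_prompt(question: str) -> str:
    """Assemble an instructional prompt that demands source-tagged answers."""
    context = "\n".join(f"[{tag}] {text}" for tag, text in retrieve(question))
    return (
        "You are an osteoarthritis management assistant.\n"
        "Answer using ONLY the evidence below and cite each claim "
        "with its [source] tag.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    # The assembled prompt would be sent to the LLM of choice.
    print(build_prompt("What first-line treatments help knee osteoarthritis?"))
```

Because the prompt carries the retrieved passages with their source tags, the model's answer can point back to the specific evidence it used, which is the explainability property the abstract attributes to DocOA.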