Journal of Medical Internet Research (Aug 2025)
Using a Diverse Test Suite to Assess Large Language Models on Fast Health Care Interoperability Resources Knowledge: Comparative Analysis
Abstract
Background: Recent natural language processing breakthroughs, particularly the emergence of large language models (LLMs), have demonstrated remarkable capabilities on general knowledge benchmarks. However, there are limited data on how well these models perform on, and understand, the Fast Healthcare Interoperability Resources (FHIR) standard. The complexity and specialized nature of FHIR present challenges for LLMs, which are typically trained on broad datasets and may have a limited understanding of the nuances required for domain-specific tasks. Improving health data interoperability can greatly benefit the use of clinical data and interaction with electronic health records.

Objective: This study presents the FHIR Workbench, a comprehensive suite of datasets designed to evaluate the ability of LLMs to understand and apply the FHIR standard.

Methods: In total, 4 evaluation datasets were created to assess the FHIR knowledge and capabilities of LLMs. The tasks include multiple-choice questions on general FHIR concepts and on the FHIR Representational State Transfer (REST) application programming interface, as well as identifying the resource type of a given FHIR resource and generating FHIR resources from unstructured clinical patient notes. In addition, we evaluated open-source LLMs, such as Qwen 2.5 Coder and DeepSeek-V3, and commercial LLMs, including GPT-4o and Gemini 2, on these tasks in a zero-shot setting. To provide context for interpreting LLM performance, a subset of the datasets was also evaluated by 6 recruited participants with varying levels of FHIR expertise.

Results: Our evaluation across multiple FHIR tasks revealed nuanced performance differences.
Commercial models demonstrated exceptional capabilities, with GPT-4o achieving a 0.9990 F1 score.

Conclusions: This study highlights the competitive performance of both open-source models, such as Qwen and DeepSeek, and commercial models, such as GPT-4o and Gemini, on FHIR-related tasks. While open-source models are advancing rapidly, commercial models retain an advantage on specific, complex tasks. The FHIR Workbench offers a valuable platform for evaluating the capabilities of these models and promoting improvements in health data interoperability.
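To make the evaluation setup concrete, the following is a minimal sketch of how the resource-type identification task might be scored with F1, alongside the kind of FHIR R4 Patient resource the generation task asks models to produce from a clinical note. The labels, predictions, and helper function are illustrative assumptions, not the paper's actual data or code.

```python
import json

# Hypothetical gold labels vs. model predictions for the resource-type
# identification task (illustrative only, not from the study).
gold = ["Patient", "Observation", "Encounter", "Patient", "Condition"]
pred = ["Patient", "Observation", "Patient", "Patient", "Condition"]

def micro_f1(gold, pred):
    """Micro-averaged F1 over exact-match labels.

    For single-label classification where every example receives exactly
    one prediction, micro F1 reduces to accuracy."""
    tp = sum(g == p for g, p in zip(gold, pred))
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A minimal FHIR R4 Patient resource of the kind the generation task
# expects a model to emit from an unstructured note (values invented).
patient = {
    "resourceType": "Patient",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-05-14",
}

print(round(micro_f1(gold, pred), 4))          # 0.8
print(json.dumps(patient)["0"] if False else patient["resourceType"])  # Patient
```

A generated resource can then be checked structurally (valid JSON, correct `resourceType`, required fields present) before any finer-grained comparison with a reference resource.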