MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Mianxin Liu; Weiguo Hu; Jinru Ding; Jie Xu; Xiaoyang Li; Lifeng Zhu; Zhian Bai; Xiaoming Shi; Benyou Wang; Haitao Song; Pengfei Liu; Xiaofan Zhang; Shanshan Wang; Kang Li; Haofen Wang; Tong Ruan; Xuanjing Huang; Xin Sun; Shaoting Zhang

doi:10.26599/bdma.2024.9020044

Big Data Mining and Analytics (Dec 2024)

Affiliations

Mianxin Liu: Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Weiguo Hu: Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
Jinru Ding: Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Jie Xu: Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Xiaoyang Li: Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
Lifeng Zhu: Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
Zhian Bai: Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
Xiaoming Shi: Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Benyou Wang: Chinese University of Hong Kong, Shenzhen 518172, China
Haitao Song: Shanghai Artificial Intelligence Research Institute, Shanghai 200240, and also with Shanghai Jiao Tong University, Shanghai 200240, China
Pengfei Liu: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Xiaofan Zhang: Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, China
Shanshan Wang: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
Kang Li: West China Hospital, Sichuan University, Chengdu 610041, China
Haofen Wang: School of Design and Innovation, Tongji University, Shanghai 200092, China
Tong Ruan: Department of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, China
Xuanjing Huang: School of Computer Science, Fudan University, Shanghai 200433, China
Xin Sun: Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai 200092, China
Shaoting Zhang: Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China

DOI: https://doi.org/10.26599/bdma.2024.9020044
Journal volume & issue: Vol. 7, no. 4, pp. 1116–1128

Abstract

Ensuring that medical Large Language Models (LLMs) are effective and beneficial to people before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, has yet to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the largest evaluation dataset to date (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separation between questions and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals’ perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.
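
The abstract does not spell out how the dynamic evaluation mechanisms work. As an illustration only, one common way to discourage answer memorization in multiple-choice benchmarks is to randomize the order of the options on each evaluation run, so that a memorized letter no longer identifies the correct answer. The sketch below assumes that approach; it is not taken from MedBench, and the names `MultipleChoiceItem`, `shuffle_options`, and `render_prompt` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: tuple[str, ...]  # option texts, without letter labels
    answer_index: int         # position of the correct option in `options`


def shuffle_options(item: MultipleChoiceItem, rng: random.Random) -> MultipleChoiceItem:
    """Return a copy of `item` with its options in a random order.

    The correct answer is tracked by its original position, so the
    returned index points to wherever that option lands after shuffling.
    """
    order = list(range(len(item.options)))
    rng.shuffle(order)  # order[j] is the original index now shown at position j
    shuffled = tuple(item.options[i] for i in order)
    return MultipleChoiceItem(item.question, shuffled, order.index(item.answer_index))


def render_prompt(item: MultipleChoiceItem) -> str:
    """Format the item as a lettered prompt; the answer is never included."""
    lines = [item.question]
    lines += [f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(item.options)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example item, used only to show the call pattern.
    item = MultipleChoiceItem(
        question="Which vitamin deficiency causes scurvy?",
        options=("Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"),
        answer_index=2,
    )
    variant = shuffle_options(item, random.Random())
    print(render_prompt(variant))
```

In this sketch the answer key stays in the `answer_index` field and never enters the rendered prompt, which loosely mirrors the separation between questions and ground truth described in the abstract.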

Published in Big Data Mining and Analytics

ISSN: 2096-0654 (Print); 2097-406X (Online)
Publisher: Tsinghua University Press
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=8254253
