大数据 (Sep 2024)
PeMeBench: Chinese pediatric medical Q&A benchmark testing method
Abstract
Large language models (LLMs) have demonstrated significant application potential in the medical field. However, evaluating their performance in medical scenarios remains a challenge. Existing medical benchmarks, which consist predominantly of multiple-choice questions, struggle to assess LLM performance in the pediatric domain comprehensively and accurately. To address this issue, PeMeBench, the first Chinese pediatric question-answering benchmark, is proposed. Drawing on a dual-perspective evaluation design and on diagnostic and treatment guidelines covering 10 pediatric disease systems, PeMeBench divides pediatric medical question-answering tasks into five subdomains: disease knowledge, treatment plans, medication dosages, disease prevention, and pharmacological effects. It comprises more than 10,000 open-ended question-answering items and introduces a multi-grained automated evaluation scheme that integrates entity retrieval with the detection of hallucinated sentences. The benchmark aims to provide a comprehensive and precise assessment of LLM performance in pediatric healthcare, to expose the potential limitations of these models, and to lay a foundation for more intelligent medical services.
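As a rough illustration of what a multi-grained scheme of this kind could look like, the sketch below scores an open-ended answer at two granularities: entity level (how many reference entities the answer retrieves) and sentence level (what share of sentences a detector flags as hallucinated). It is not the PeMeBench implementation. The function names, the substring-based entity matching, and the pluggable `is_hallucinated` detector are all assumptions made for illustration; in practice the detector would likely be a trained classifier or an LLM judge.

```python
import re
from typing import Callable, Dict, Iterable, List


def entity_recall(answer: str, reference_entities: Iterable[str]) -> float:
    """Entity-level score: fraction of reference entities found in the answer.

    Assumption: case-insensitive substring matching stands in for whatever
    entity retrieval method the benchmark actually uses.
    """
    entities = [e for e in reference_entities if e]
    if not entities:
        return 0.0
    lowered = answer.lower()
    hits = sum(1 for e in entities if e.lower() in lowered)
    return hits / len(entities)


def split_sentences(text: str) -> List[str]:
    """Split after CJK sentence punctuation, or after . ! ? followed by whitespace."""
    parts = re.split(r"(?<=[。！？])|(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p and p.strip()]


def hallucination_rate(answer: str, is_hallucinated: Callable[[str], bool]) -> float:
    """Sentence-level score: fraction of sentences flagged by the detector.

    `is_hallucinated` is a hypothetical callable supplied by the caller.
    """
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if is_hallucinated(s))
    return flagged / len(sentences)


def evaluate_answer(
    answer: str,
    reference_entities: Iterable[str],
    is_hallucinated: Callable[[str], bool],
) -> Dict[str, float]:
    """Report both granularities separately rather than inventing a combined score."""
    return {
        "entity_recall": entity_recall(answer, reference_entities),
        "hallucination_rate": hallucination_rate(answer, is_hallucinated),
    }


if __name__ == "__main__":
    # Toy example with a keyword-based stand-in for a real hallucination detector.
    answer = "Amoxicillin is given orally. It cures all viral infections."
    reference = ["amoxicillin", "orally", "penicillin allergy"]
    print(evaluate_answer(answer, reference, lambda s: "viral" in s))
```

Keeping the two scores separate avoids committing to a weighting that the abstract does not specify.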