Scientific Data (Mar 2025)
An astronomical question answering dataset for evaluating large language models
Abstract
Abstract Large language models (LLMs) have recently demonstrated exceptional capabilities across a variety of linguistic tasks including question answering (QA). However, it remains challenging to assess their performance in astronomical QA due to the lack of comprehensive benchmark datasets. To bridge this gap, we construct Astro-QA, the first benchmark dataset specifically for QA in astronomy. The dataset contains a collection of 3,082 questions of six types in both English and Chinese, along with standard (reference) answers and related material. These questions encompass several core branches of astronomy, including astrophysics, astrometry, celestial mechanics, history of astronomy, and astronomical techniques and methods. Furthermore, we propose a new measure called DGscore that integrates different measures for objective and subjective questions and incorporates a weighting scheme based on type- and question-specific difficulty coefficients to accurately assess the QA performance of each LLM. We validate the Astro-QA dataset through extensive experimentation with 27 open-source and commercial LLMs. The results show that it can serve as a reliable benchmark dataset to evaluate the capacity of LLM in terms of instruction following, knowledge reasoning, and natural language generation in the astronomical domain, which can calibrate current progress and facilitate future research of astronomical LLMs.