An astronomical question answering dataset for evaluating large language models

Jie Li; Fuyong Zhao; Panfeng Chen; Jiafu Xie; Xiangrui Zhang; Hui Li; Mei Chen; Yanhao Wang; Ming Zhu

doi:10.1038/s41597-025-04613-9

Scientific Data (Mar 2025)

Affiliations

Jie Li: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University
Fuyong Zhao: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University
Panfeng Chen: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University
Jiafu Xie: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University
Xiangrui Zhang: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University
Hui Li: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University
Mei Chen: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University
Yanhao Wang: School of Data Science and Engineering, East China Normal University
Ming Zhu: National Astronomical Observatories, Chinese Academy of Sciences

DOI: https://doi.org/10.1038/s41597-025-04613-9
Journal volume & issue: Vol. 12, no. 1, pp. 1–14

Abstract

Large language models (LLMs) have recently demonstrated exceptional capabilities across a variety of linguistic tasks, including question answering (QA). However, it remains challenging to assess their performance in astronomical QA due to the lack of comprehensive benchmark datasets. To bridge this gap, we construct Astro-QA, the first benchmark dataset specifically for QA in astronomy. The dataset contains 3,082 questions of six types in both English and Chinese, along with standard (reference) answers and related material. These questions cover several core branches of astronomy, including astrophysics, astrometry, celestial mechanics, history of astronomy, and astronomical techniques and methods. Furthermore, we propose a new measure called DGscore, which integrates different measures for objective and subjective questions and incorporates a weighting scheme based on type- and question-specific difficulty coefficients to accurately assess the QA performance of each LLM. We validate the Astro-QA dataset through extensive experimentation with 27 open-source and commercial LLMs. The results show that it can serve as a reliable benchmark dataset for evaluating the capabilities of LLMs in instruction following, knowledge reasoning, and natural language generation in the astronomical domain, which can help calibrate current progress and facilitate future research on astronomical LLMs.

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
