TG-CSR: A human-labeled dataset grounded in nine formal commonsense categories

Henrique Santos; Alice M. Mulvehill; Ke Shen; Mayank Kejriwal; Deborah L. McGuinness

Data in Brief (Dec 2023)

Affiliations

Henrique Santos: Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180, USA (corresponding author)
Alice M. Mulvehill: Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180, USA
Ke Shen: University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, USA
Mayank Kejriwal: University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, USA
Deborah L. McGuinness: Rensselaer Polytechnic Institute, 110 8th St., Troy, NY 12180, USA

Journal volume & issue: Vol. 51, p. 109666

Abstract

Machine Common Sense Reasoning is the subfield of Artificial Intelligence that aims to enable machines to behave or make decisions similarly to humans in everyday, ordinary situations. To measure progress, benchmarks in the form of question-answering datasets have been developed and published in the community to evaluate machine commonsense models, including large language models. We describe the individual label data produced by six human annotators, which were originally used to compute the ground truth for the datasets that compose the Theoretically-Grounded Commonsense Reasoning (TG-CSR) benchmark. Annotators were provided with spreadsheets containing the original TG-CSR prompts and, following a set of instructions, were asked to insert labels in specific spreadsheet cells during annotation sessions. TG-CSR data is organized in JSON files; the individual raw label data is provided in a spreadsheet file, and the individual normalized label data in JSONL files. The release of individual labels can enable the analysis of the labeling process itself, including studies of noise and consistency across annotators.
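
As an illustration of the kind of consistency analysis the abstract mentions, the following is a minimal Python sketch that reads per-annotator labels from JSONL lines and computes simple pairwise agreement. The field names "annotator", "item_id", and "label" are hypothetical placeholders, not the actual TG-CSR schema, and raw percentage agreement is only one of several possible consistency measures.

```python
import json
from collections import defaultdict
from itertools import combinations


def load_labels(lines):
    """Parse JSONL lines into {annotator: {item_id: label}}.

    The keys "annotator", "item_id" and "label" are assumed for
    illustration; adapt them to the real file layout.
    """
    labels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        labels[record["annotator"]][record["item_id"]] = record["label"]
    return dict(labels)


def pairwise_agreement(labels):
    """Return {(a, b): fraction of shared items where a and b agree}.

    Pairs of annotators with no items in common are omitted.
    """
    result = {}
    for a, b in combinations(sorted(labels), 2):
        shared = labels[a].keys() & labels[b].keys()
        if shared:
            matches = sum(labels[a][i] == labels[b][i] for i in shared)
            result[(a, b)] = matches / len(shared)
    return result
```

Passing an open JSONL file object to load_labels and the result to pairwise_agreement would yield one agreement score per annotator pair; chance-corrected measures could be substituted where raw agreement is too coarse.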

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
