Natural Questions: A Benchmark for Question Answering Research

Kwiatkowski, Tom; Palomaki, Jennimaria; Redfield, Olivia; Collins, Michael; Parikh, Ankur; Alberti, Chris; Epstein, Danielle; Polosukhin, Illia; Devlin, Jacob; Lee, Kenton; Toutanova, Kristina; Jones, Llion; Kelcey, Matthew; Chang, Ming-Wei; Dai, Andrew M.; Uszkoreit, Jakob; Le, Quoc; Petrov, Slav

doi:10.1162/tacl_a_00276

Transactions of the Association for Computational Linguistics (Nov 2019)

DOI: https://doi.org/10.1162/tacl_a_00276
Journal volume & issue: Vol. 7, pp. 453–466

Abstract

We present the Natural Questions corpus, a question answering data set. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations, sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
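The annotation scheme described in the abstract (a long answer, zero or more short answers, or null) can be pictured with a small data structure. The following is only an illustrative sketch: the class name `Annotation`, the helper `has_long_answer`, and the `min_votes` threshold are hypothetical names and values chosen for this example, and they are not taken from the abstract or from the released data format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Annotation:
    """One annotator's judgment for a (question, Wikipedia page) pair."""
    # None stands for "null": no long answer is present on the page.
    long_answer: Optional[str] = None
    # An empty tuple stands for "null": no short answer is present.
    short_answers: Tuple[str, ...] = ()


def has_long_answer(annotations: Sequence[Annotation], min_votes: int = 2) -> bool:
    """Return True if at least `min_votes` annotators gave a non-null long answer.

    `min_votes` is an assumption made for illustration only.
    """
    votes = sum(1 for a in annotations if a.long_answer is not None)
    return votes >= min_votes


# A made-up 5-way annotated example: two annotators found a long answer,
# one of them also marked a short answer, and three marked null.
five_way = [
    Annotation("Paragraph 3", ("Paris",)),
    Annotation("Paragraph 3"),
    Annotation(),
    Annotation(),
    Annotation(),
]
```

With these made-up annotations, `has_long_answer(five_way)` counts two non-null long answers out of five, so the outcome depends on the threshold chosen.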

Published in Transactions of the Association for Computational Linguistics

ISSN: 2307-387X (Online)
Publisher: The MIT Press
Country of publisher: United States
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing
Website: https://direct.mit.edu/tacl
