Transactions of the Association for Computational Linguistics (Nov 2019)

Natural Questions: A Benchmark for Question Answering Research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov

DOI: https://doi.org/10.1162/tacl_a_00276
Journal volume & issue: Vol. 7, pp. 453–466

Abstract

We present the Natural Questions corpus, a question answering data set. The questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples, 5-way annotated, sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
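The 5-way development and test annotations feed directly into the robust metrics mentioned above: an example is treated as having a gold answer only when enough annotators (two of the five, in the paper) marked a non-null answer, and a non-null prediction is credited when it matches an annotator's answer. Below is a minimal Python sketch of that aggregation for long answers; the Annotation record and function names are hypothetical simplifications for illustration, not the official evaluation script, which operates on the released JSON format with byte- and token-level offsets.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class Annotation:
        # Hypothetical simplified record of one annotator's labels.
        # Span of the long answer passage, or None for a null annotation.
        long_answer: Optional[Tuple[int, int]]
        # Spans of short answer entities; empty when no short answer was given.
        short_answers: List[Tuple[int, int]] = field(default_factory=list)

    def has_gold_long_answer(annotations: List[Annotation],
                             threshold: int = 2) -> bool:
        """An example counts as having a long answer only if at least
        `threshold` of its annotators marked a non-null long answer."""
        return sum(a.long_answer is not None for a in annotations) >= threshold

    def long_answer_correct(prediction: Optional[Tuple[int, int]],
                            annotations: List[Annotation]) -> bool:
        """Credit a non-null prediction if the example has a gold long
        answer and the prediction matches some annotator's span; credit
        a null prediction if the example has no gold long answer."""
        if prediction is None:
            return not has_gold_long_answer(annotations)
        return (has_gold_long_answer(annotations)
                and any(a.long_answer == prediction for a in annotations))

Precision and recall over a system's non-null predictions can then be computed against these aggregated gold labels; an analogous aggregation applies to short answers.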