Data in Brief (Feb 2024)

SNLI Indo: A recognizing textual entailment dataset in Indonesian derived from the Stanford Natural Language Inference dataset

  • I Made Suwija Putra,
  • Daniel Siahaan,
  • Ahmad Saikhu

Journal volume & issue
Vol. 52
p. 109998

Abstract

Recognizing textual entailment (RTE) is an essential task in natural language processing (NLP): determining the inference relationship between two text fragments, a premise and a hypothesis, where the relationship is entailment (true), contradiction (false), or neutral (undetermined). Neural networks are the most popular approach to RTE and have produced the best-performing RTE models. Neural network approaches, deep learning in particular, are data-driven, so the quantity and quality of the data significantly influence their performance. We therefore introduce SNLI Indo, a large-scale RTE dataset in the Indonesian language, derived from the Stanford Natural Language Inference (SNLI) corpus by translating the original sentence pairs. SNLI is a large-scale dataset of premise-hypothesis pairs generated using a crowdsourcing framework. The SNLI dataset comprises 569,027 sentence pairs: 549,365 for training, 9,840 for model validation, and 9,822 for testing. We translated the original sentence pairs from English to Indonesian using the Google Cloud Translation API. SNLI Indo addresses the resource gap in NLP for the Indonesian language. Although large datasets are available in other languages, English in particular, the SNLI Indo dataset enables more effective development of deep learning models for RTE in Indonesian.
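
As an illustration only, the structure described in the abstract (premise-hypothesis pairs with one of three labels, translated sentence by sentence) might be sketched as follows. The names `Pair`, `translate_pairs`, and `toy_translate` are hypothetical and do not come from the dataset or the article; the glossary-based translator is a stand-in so the sketch runs offline, whereas the authors report using the Google Cloud Translation API. The split sizes are the ones stated in the abstract.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List

# The three inference labels named in the abstract.
LABELS = ("entailment", "contradiction", "neutral")

# Split sizes as reported in the abstract.
SPLIT_SIZES = {"train": 549_365, "validation": 9_840, "test": 9_822}


@dataclass(frozen=True)
class Pair:
    """One premise-hypothesis pair (field names are an assumption)."""
    premise: str
    hypothesis: str
    label: str


def translate_pairs(pairs: Iterable[Pair],
                    translate: Callable[[str], str]) -> List[Pair]:
    """Translate both sentences of every pair; labels are carried over unchanged."""
    translated = []
    for pair in pairs:
        if pair.label not in LABELS:
            raise ValueError(f"unexpected label: {pair.label!r}")
        translated.append(replace(pair,
                                  premise=translate(pair.premise),
                                  hypothesis=translate(pair.hypothesis)))
    return translated


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for a machine translation call (tiny fixed glossary)."""
    glossary = {
        "A man is sleeping.": "Seorang pria sedang tidur.",
        "A person is awake.": "Seseorang sedang terjaga.",
    }
    return glossary.get(text, text)


if __name__ == "__main__":
    sample = [Pair("A man is sleeping.", "A person is awake.", "contradiction")]
    print(translate_pairs(sample, toy_translate))
    print(sum(SPLIT_SIZES.values()))  # total number of pairs across the splits
```

Passing the translator in as a function keeps the pair handling separate from whichever translation service is used, so the stand-in can be swapped for a real client without touching the rest of the sketch.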

Keywords