ParsiNLU: A Suite of Language Understanding Challenges for Persian

Daniel Khashabi; Arman Cohan; Siamak Shakeri; Pedram Hosseini; Pouya Pezeshkpour; Malihe Alikhani; Moin Aminnaseri; Marzieh Bitaab; Faeze Brahman; Sarik Ghazarian; Mozhdeh Gheini; Arman Kabiri; Rabeeh Karimi Mahabadi; Omid Memarrast; Ahmadreza Mosallanezhad; Erfan Noury; Shahab Raji; Mohammad Sadegh Rasooli; Sepideh Sadeghi; Erfan Sadeqi Azer; Niloofar Safi Samghabadi; Mahsa Shafaei; Saber Sheybani; Ali Tazarv; Yadollah Yaghoobzadeh

doi:10.1162/tacl_a_00419

Transactions of the Association for Computational Linguistics (Jan 2021)

Affiliations

Daniel Khashabi: Allen Institute for AI, USA
Arman Cohan: Allen Institute for AI, USA
Siamak Shakeri: Google, USA
Pedram Hosseini: George Washington University, USA
Pouya Pezeshkpour: UC Irvine, USA
Malihe Alikhani: University of Pittsburgh, USA
Moin Aminnaseri: TaskRabbit, USA
Marzieh Bitaab: Arizona State University, USA
Faeze Brahman: UC Santa Cruz, USA
Sarik Ghazarian: University of Southern California, USA
Mozhdeh Gheini
Arman Kabiri: IMRSV Data Labs, Canada
Rabeeh Karimi Mahabadi: EPFL, Switzerland
Omid Memarrast: University of Illinois - Chicago, USA
Ahmadreza Mosallanezhad: Arizona State University, USA
Erfan Noury: University of Maryland Baltimore County, USA
Shahab Raji: Rutgers University, USA
Mohammad Sadegh Rasooli: University of Pennsylvania, USA
Sepideh Sadeghi: Google, USA
Erfan Sadeqi Azer: Google, USA
Niloofar Safi Samghabadi: Expedia Inc., USA
Mahsa Shafaei
Saber Sheybani: Indiana University - Bloomington, USA
Ali Tazarv: UC Irvine, USA
Yadollah Yaghoobzadeh: Microsoft, Canada

DOI: https://doi.org/10.1162/tacl_a_00419
Journal volume & issue: Vol. 9
pp. 1147 – 1162

Abstract

Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on Persian, one of the widely spoken languages in the world, for which few NLU datasets are nevertheless available. The availability of high-quality evaluation datasets is a necessity for reliable assessment of progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian that includes a range of language understanding tasks, such as reading comprehension and textual entailment. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.

Published in Transactions of the Association for Computational Linguistics

ISSN: 2307-387X (Online)
Publisher: The MIT Press
Country of publisher: United States
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing
Website: https://direct.mit.edu/tacl