AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas

Katharina Kann; Abteen Ebrahimi; Manuel Mager; Arturo Oncevay; John E. Ortega; Annette Rios; Angela Fan; Ximena Gutierrez-Vasques; Luis Chiruzzo; Gustavo A. Giménez-Lugo; Ricardo Ramos; Ivan Vladimir Meza Ruiz; Elisabeth Mager; Vishrav Chaudhary; Graham Neubig; Alexis Palmer; Rolando Coto-Solano; Ngoc Thang Vu

doi:10.3389/frai.2022.995667

Frontiers in Artificial Intelligence (Dec 2022)

Affiliations

Katharina Kann: Department of Computer Science, University of Colorado Boulder, Boulder, CO, United States
Abteen Ebrahimi: Department of Computer Science, University of Colorado Boulder, Boulder, CO, United States
Manuel Mager: Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany
Arturo Oncevay: School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
John E. Ortega: Courant Institute of Mathematical Sciences, New York University, New York, NY, United States
Annette Rios: Institut für Computerlinguistik, University of Zurich, Zurich, Switzerland
Angela Fan: Facebook AI Research, Menlo Park, CA, United States
Ximena Gutierrez-Vasques: URPP Language and Space, University of Zurich, Zurich, Switzerland
Luis Chiruzzo: Institute of Computation, Universidad de la República, Montevideo, Uruguay
Gustavo A. Giménez-Lugo: Department of Informatics, Universidade Tecnológica Federal do Paraná, Curitiba, Brazil
Ricardo Ramos: Universidad Tecnológica de Tlaxcala, Huamantla, Mexico
Ivan Vladimir Meza Ruiz: Department of Computer Science, Universidad Nacional Autónoma de México, Mexico City, Mexico
Elisabeth Mager: Facultad de Estudios Superiores Acatlán, Universidad Nacional Autónoma de México, Mexico City, Mexico
Vishrav Chaudhary: Microsoft Turing Research, Redmond, WA, United States
Graham Neubig: Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA, United States
Alexis Palmer: Department of Linguistics, University of Colorado Boulder, Boulder, CO, United States
Rolando Coto-Solano: Department of Linguistics, Dartmouth College, Hanover, NH, United States
Ngoc Thang Vu: Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany

DOI: https://doi.org/10.3389/frai.2022.995667
Journal volume & issue: Vol. 5

Abstract

Little attention has been paid to the development of human language technology for truly low-resource languages—i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for low-resource languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear if zero-shot learning of deeper semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, a natural language inference dataset covering 10 Indigenous languages of the Americas. We conduct experiments with pretrained models, exploring zero-shot learning in combination with model adaptation. Furthermore, as AmericasNLI is a multiway parallel dataset, we use it to benchmark the performance of different machine translation models for those languages. Finally, using a standard transformer model, we explore translation-based approaches for natural language inference. We find that the zero-shot performance of pretrained models without adaptation is poor for all languages in AmericasNLI, but model adaptation via continued pretraining results in improvements. All machine translation models are rather weak, but, surprisingly, translation-based approaches to natural language inference outperform all other models on that task.

Published in Frontiers in Artificial Intelligence

ISSN: 2624-8212 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/artificial-intelligence#
