Documenting Geographically and Contextually Diverse Language Data Sources

Angelina McMillan-Major; Francesco De Toni; Zaid Alyafeai; Stella Biderman; Kimbo Chen; Gérard Dupont; Hady Elsahar; Chris Emezue; Alham Fikri Aji; Suzana Ilić; Nurulaqilla Khamis; Colin Leong; Maraim Masoud; Aitor Soroa; Pedro Ortiz Suarez; Daniel van Strien; Zeerak Talat; Yacine Jernite

doi:10.3384/nejlt.2000-1533.2024.5217

Northern European Journal of Language Technology (Jan 2025)

Affiliations

Angelina McMillan-Major: University of Washington

Journal volume & issue: Vol. 10, no. 1

Abstract

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has raised concerns about the rights of data subjects represented in data collections. These concerns are exacerbated by a lack of documentation and analysis tools, which makes it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss the lessons we learned.

Published in Northern European Journal of Language Technology

ISSN: 2000-1533 (Online)
Publisher: Linköping University Electronic Press
Country of publisher: Sweden
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing
Website: https://www.nejlt.org/
