Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave (Dec 2022)

Data preparation in crowdsourcing for pedagogical purposes

  • Tanara Zingano Kuhn,
  • Špela Arhar Holdt,
  • Iztok Kosem,
  • Carole Tiberius,
  • Kristina Koppel,
  • Rina Zviel-Girshin

DOI
https://doi.org/10.4312/slo2.0.2022.2.62-100
Journal volume & issue
Vol. 10, no. 2

Abstract

One way to stimulate the use of corpora in language education is to create pedagogically appropriate corpora labeled with different types of problems (sensitive content, offensive language, structural problems). However, manually labeling corpora is extremely time-consuming, so a more efficient approach is needed. We thus propose combining two approaches to create problem-labeled pedagogical corpora of Dutch, Estonian, Slovene and Brazilian Portuguese: games with a purpose and crowdsourcing. We conducted initial experiments to establish the suitability of the crowdsourcing task, and used the lessons learned to design the Crowdsourcing for Language Learning (CrowLL) game, in which players identify problematic sentences, classify them, and indicate problematic excerpts. This paper focuses on data preparation, given the crucial role this stage plays in any crowdsourcing project that creates language learning resources. We present the methodology for data preparation, describing in detail the selection of source corpora, pedagogically oriented GDEX configurations, and the creation of lemma lists, with a special focus on common and language-dependent decisions. Finally, we discuss the challenges that emerged and the solutions implemented so far.

Keywords