Data preparation in crowdsourcing for pedagogical purposes

Tanara Zingano Kuhn; Špela Arhar Holdt; Iztok Kosem; Carole Tiberius; Kristina Koppel; Rina Zviel-Girshin

doi:10.4312/slo2.0.2022.2.62-100

Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave (Dec 2022)

Affiliations

Tanara Zingano Kuhn: University of Coimbra, Research Centre for General and Applied Linguistics, Portugal
Špela Arhar Holdt: University of Ljubljana, Faculty of Arts; University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Iztok Kosem: University of Ljubljana, Faculty of Arts; Jožef Stefan Institute, Ljubljana, Slovenia
Carole Tiberius: Dutch Language Institute, Leiden, Netherlands
Kristina Koppel: Institute of the Estonian Language, Tallinn, Estonia
Rina Zviel-Girshin: Ruppin Academic Center, Israel

Journal volume & issue: Vol. 10, no. 2

Abstract

One way to stimulate the use of corpora in language education is by making pedagogically appropriate corpora, labeled with different types of problems (sensitive content, offensive language, structural problems). However, manually labeling corpora is extremely time-consuming and a better approach should be found. We thus propose a combination of two approaches to the creation of problem-labeled pedagogical corpora of Dutch, Estonian, Slovene and Brazilian Portuguese: the use of games with a purpose and of crowdsourcing for the task. We conducted initial experiments to establish the suitability of the crowdsourcing task, and used the lessons learned to design the Crowdsourcing for Language Learning (CrowLL) game in which players identify problematic sentences, classify them, and indicate problematic excerpts. The focus of this paper is on data preparation, given the crucial role that such a stage plays in any crowdsourcing project dealing with the creation of language learning resources. We present the methodology for data preparation, offering a detailed presentation of source corpora selection, pedagogically oriented GDEX configurations, and the creation of lemma lists, with a special focus on common and language-dependent decisions. Finally, we offer a discussion of the challenges that emerged and the solutions that have been implemented so far.

ISSN: 2335-2736 (Online)
Publisher: University of Ljubljana Press (Založba Univerze v Ljubljani)
Country of publisher: Slovenia
LCC subjects: Language and Literature: Philology. Linguistics
Website: https://journals.uni-lj.si/slovenscina2
