Studies in African Linguistics (Dec 2023)
Evaluating the representativeness of the Setswana corpus using behavioral data
Abstract
This paper evaluates the representativeness of Setswana corpus data using measures that are independent of corpora. Two frequency measures were used: one obtained from a subjective frequency rating survey and another from a corpus of Setswana. Strong correlations (r = .75, p < .001) between survey ratings and corpus frequencies suggest that the corpus reflects native speaker intuitions. In addition, the study tested for frequency effects using an unprimed visual lexical decision task in which participants judged whether a letter string on a screen was an existing word or a made-up non-word. In the analysis of reaction times, survey ratings and corpus frequencies showed similar correlations with reaction times, although survey ratings provided a better fit. The study therefore makes a methodological contribution: its results illustrate that, in the absence of established corpus databases, participant intuitions can be used in linguistic research. This observation concurs with previous research on European languages, which found that native speakers can reliably estimate the frequencies of words.
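As a rough illustration of the comparison between survey ratings and corpus frequencies described above, the following sketch computes a Pearson correlation between log-transformed corpus counts and mean subjective ratings. All values are hypothetical toy numbers, not data from the study. The log transform, the 1-7 rating scale, and the names `pearson_r`, `corpus_counts`, and `mean_ratings` are assumptions made for illustration only; the abstract does not specify how the frequencies were scaled or how the correlation was computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, NOT data from the study.
corpus_counts = [1200, 340, 55, 8, 2]      # occurrences of each word in a corpus
mean_ratings = [6.5, 5.8, 4.1, 3.0, 1.9]   # mean subjective rating (assumed 1-7 scale)

# Assumption: raw counts are log-transformed before correlating,
# a common choice for heavily skewed word frequencies.
log_freq = [math.log10(c + 1) for c in corpus_counts]

r = pearson_r(log_freq, mean_ratings)
print(f"r = {r:.2f}")
```

With real data, each list would hold one value per word in the stimulus set, and a significance test would be needed to obtain a p-value alongside r.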