Size of corpora and collocations: The case of Russian

Maria Khokhlova; Vladimir Benko

doi:10.4312/slo2.0.2020.2.58-77

Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave (Aug 2020)

Affiliations

Maria Khokhlova: St Petersburg State University, Russia
Vladimir Benko: Slovak Academy of Sciences, Bratislava, Slovakia

DOI: https://doi.org/10.4312/slo2.0.2020.2.58-77
Journal volume & issue: Vol. 8, no. 2

Abstract

With the arrival of information technologies in linguistics, compiling a large corpus of data, and of web texts in particular, has become a mere technical matter. These new opportunities have revived the question of corpus size, which can be formulated as follows: are larger corpora better for linguistic research or, more precisely, do lexicographers need to analyze larger numbers of collocations? The paper reports experiments on collocation identification in low-frequency lexis using corpora of different sizes (1 million, 10 million, 100 million and 1.2 billion words). We selected low-frequency adjectives, nouns and verbs from the Russian Frequency Dictionary and tested the following hypotheses: 1) collocations in low-frequency lexis are better represented in larger corpora; 2) frequent collocations recorded in dictionaries have few occurrences in small corpora; 3) statistical measures for collocation extraction behave differently in corpora of different sizes. The results show that corpora of under 100 million words are not representative enough to study collocations, especially those with nouns and verbs. MI and Dice tend to extract less reliable collocations as the corpus grows, whereas t-score and Fisher’s exact test perform better on larger corpora.
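For readers unfamiliar with the four association measures named in the abstract, the following is a minimal sketch of their common textbook definitions, computed from the co-occurrence count of a word pair. It is an illustration under stated assumptions, not the authors' implementation: the function names and the example frequencies are hypothetical, and the paper may use variants of these formulas (for instance, a logarithmic form of Dice).

```python
import math

# Notation (all assumed for illustration):
#   o   = observed co-occurrence count of the pair (x, y)
#   f_x = corpus frequency of word x
#   f_y = corpus frequency of word y
#   n   = corpus size in tokens


def expected(f_x, f_y, n):
    """Expected co-occurrence count if x and y were independent."""
    return f_x * f_y / n


def mi(o, f_x, f_y, n):
    """Mutual information: log2 of observed over expected."""
    return math.log2(o / expected(f_x, f_y, n))


def t_score(o, f_x, f_y, n):
    """t-score: (observed - expected) scaled by sqrt(observed)."""
    return (o - expected(f_x, f_y, n)) / math.sqrt(o)


def dice(o, f_x, f_y):
    """Dice coefficient; note that it does not depend on corpus size."""
    return 2 * o / (f_x + f_y)


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_p(o, f_x, f_y, n):
    """Right-tailed Fisher's exact test: probability of at least `o`
    co-occurrences under independence (hypergeometric tail).
    Assumes f_x + f_y <= n, which holds for typical corpus counts."""
    log_denom = _log_comb(n, f_y)
    upper = min(f_x, f_y)
    return sum(
        math.exp(_log_comb(f_x, k) + _log_comb(n - f_x, f_y - k) - log_denom)
        for k in range(o, upper + 1)
    )


# Hypothetical counts, not taken from the paper.
o, f_x, f_y, n = 10, 100, 200, 10_000
print("MI      :", mi(o, f_x, f_y, n))
print("t-score :", t_score(o, f_x, f_y, n))
print("Dice    :", dice(o, f_x, f_y))
print("Fisher p:", fisher_p(o, f_x, f_y, n))
```

In these definitions MI divides by the expected count, so a pair with very few occurrences can still score highly, while t-score is weighted by the observed count and therefore rewards pairs with more evidence. This is one plausible reading of why the measures could behave differently as the corpus grows; the paper itself should be consulted for the actual explanation.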

Published in Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave

ISSN: 2335-2736 (Online)
Publisher: University of Ljubljana Press (Založba Univerze v Ljubljani)
Country of publisher: Slovenia
LCC subjects: Language and Literature: Philology. Linguistics
Website: https://journals.uni-lj.si/slovenscina2
