Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave (Dec 2021)
Collocation ranking: frequency vs semantics
Abstract
Collocations play a very important role in language description, especially in identifying meanings of words. Modern lexicography’s inevitable part of meaning deduction are lists of collocates ranked by some statistical measurement. In the paper, we present a comparison between two approaches to the ranking of collocates: (a) the logDice method, which is dominantly used and frequency-based, and (b) the fastText word embeddings method, which is new and semantic-based. The comparison was made on two Slovene datasets, one representing general language headwords and their collocates, and the other representing headwords and their collocates extracted from a language for special purposes corpus. In the experiment, two methods were used: for the quantitative part of the evaluation, we used supervised machine learning with the area-under-the-curve (AUC) ROC score and support-vector machines (SVMs) algorithm, and in the qualitative part the ranking results of the two methods were evaluated by lexicographers. The results were somewhat inconsistent; while the quantitative evaluation confirmed that the machine-learning-based approach produced better collocate ranking results than the frequency-based one, lexicographers in most cases considered the listings of collocates of both methods very similar.
Keywords