Collocation ranking: frequency vs semantics

Nikola Ljubešić; Nataša Logar; Iztok Kosem

doi:10.4312/slo2.0.2021.2.41-70

Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave (Dec 2021)

Affiliations

Nikola Ljubešić: Jožef Stefan Institute, Ljubljana, Slovenia; University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Nataša Logar: University of Ljubljana, Faculty of Social Sciences, Slovenia
Iztok Kosem: University of Ljubljana, Faculty of Arts, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia

DOI: https://doi.org/10.4312/slo2.0.2021.2.41-70
Journal volume & issue: Vol. 9, no. 2

Abstract

Collocations play an important role in language description, especially in identifying the meanings of words. Lists of collocates ranked by a statistical measure are an indispensable part of how modern lexicography derives meanings. In this paper, we compare two approaches to ranking collocates: (a) the logDice method, which is the predominant approach and is frequency-based, and (b) a method based on fastText word embeddings, which is new and semantics-based. The comparison was carried out on two Slovene datasets, one containing general-language headwords and their collocates, and the other containing headwords and their collocates extracted from a corpus of language for special purposes. The evaluation had two parts: in the quantitative part, we used supervised machine learning with a support-vector machine (SVM) algorithm, scored by the area under the ROC curve (AUC); in the qualitative part, lexicographers evaluated the rankings produced by the two methods. The results were somewhat inconsistent: while the quantitative evaluation confirmed that the machine-learning-based approach produced better collocate rankings than the frequency-based one, lexicographers in most cases considered the collocate lists produced by the two methods to be very similar.
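
For readers unfamiliar with the frequency-based baseline, the following is a minimal sketch of logDice ranking using the commonly cited definition, logDice = 14 + log2(2 * f_xy / (f_x + f_y)). It is not the authors' code: the function names and all counts are hypothetical, and the abstract does not describe the corpus processing that would produce such counts.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score from raw corpus frequencies.

    f_xy: co-occurrence frequency of headword and collocate
    f_x:  frequency of the headword
    f_y:  frequency of the collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocates(headword_freq, candidates):
    """Sort candidate collocates by descending logDice.

    candidates maps each collocate to a pair
    (co-occurrence frequency, collocate frequency).
    """
    scored = [
        (collocate, log_dice(f_xy, headword_freq, f_y))
        for collocate, (f_xy, f_y) in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts, for illustration only.
example = {
    "a": (100, 1000),    # frequent pair, moderately frequent collocate
    "b": (100, 100000),  # same pair count, but a very common collocate
    "c": (10, 20),       # rare collocate that mostly occurs with the headword
}
ranking = rank_collocates(1000, example)
```

With these made-up counts the order is a, c, b: the score depends only on the three frequencies, so a very common collocate is penalised even when its co-occurrence count is high. A semantics-based ranking would instead draw on word vectors, which this sketch does not attempt to reproduce.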

Published in Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave

ISSN: 2335-2736 (Online)
Publisher: University of Ljubljana Press (Založba Univerze v Ljubljani)
Country of publisher: Slovenia
LCC subjects: Language and Literature: Philology. Linguistics
Website: https://journals.uni-lj.si/slovenscina2
