Testing word embeddings for Polish

Agnieszka Mykowiecka; Małgorzata Marciniak; Piotr Rychlik

doi:10.11649/cs.1468

Cognitive Studies | Études cognitives (Dec 2017)

Affiliations

Agnieszka Mykowiecka: Instytut Podstaw Informatyki Polskiej Akademii Nauk [Institute of Computer Science, Polish Academy of Sciences], Warszawa [Warsaw]
Małgorzata Marciniak: Instytut Podstaw Informatyki Polskiej Akademii Nauk [Institute of Computer Science, Polish Academy of Sciences], Warszawa [Warsaw]
Piotr Rychlik: Instytut Podstaw Informatyki Polskiej Akademii Nauk [Institute of Computer Science, Polish Academy of Sciences], Warszawa [Warsaw]

DOI: https://doi.org/10.11649/cs.1468
Journal volume & issue: no. 17

Abstract

Testing word embeddings for Polish

Distributional semantics postulates that word meaning can be represented by numeric vectors derived from the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of models based on lemmas and on word forms, created with the Continuous Bag of Words (CBOW) and skip-gram approaches from different Polish corpora. The comparison uses the results of two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition. The results show that it is not possible to identify one universal approach to vector creation applicable to various tasks. The most important factor is the quality and size of the data, but different strategy choices can also lead to significantly different results.

Testowanie wektorowych reprezentacji dystrybucyjnych słów języka polskiego [Testing distributional vector representations of Polish words]

Distributional semantics rests on the assumption that the meaning of words is expressed by vectors representing, directly or indirectly, the contexts in which a word is used in a large collection of texts. This article concerns the evaluation of many such models constructed for Polish. It compares the effectiveness of models based on lemmas and on word forms, created with neural networks from the data of two different corpora of Polish. The evaluation is based on the results of two typical tasks solved with the methods of distributional semantics, i.e. recognizing synonymy and analogy between specific pairs of words. The results show that no single universal approach to building distributional models can be identified, because their effectiveness varies with the application. The most important factor affecting model quality is the quality and size of the data, but the choice of network training strategy can also lead to substantially different results.
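
The analogy recognition task named in the abstract can be illustrated with a minimal sketch. It assumes the common vector-offset formulation with cosine similarity; the abstract does not confirm that this is the authors' exact procedure. The function names are hypothetical, and the three-dimensional vectors are hand-made toy values, not embeddings from the paper's models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The query vector is b - a + c; the answer is the vocabulary word
    (other than a, b, c) whose vector is closest to it by cosine.
    """
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], query))


# Hand-made toy vectors (an assumption for illustration only).
toy = {
    "król": [0.9, 0.8, 0.1],       # king
    "królowa": [0.9, 0.1, 0.8],    # queen
    "mężczyzna": [0.1, 0.9, 0.1],  # man
    "kobieta": [0.1, 0.1, 0.9],    # woman
    "dom": [0.5, 0.5, 0.5],        # house (distractor)
}

answer = solve_analogy(toy, "mężczyzna", "kobieta", "król")
print(answer)
```

With real embeddings the same procedure would be run over the full vocabulary of a trained model, and accuracy would be the share of analogy questions for which the expected word is returned.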

Published in Cognitive Studies | Études cognitives

ISSN: 2080-7147 (Print); 2392-2397 (Online)
Publisher: Institute of Slavic Studies, Polish Academy of Sciences
Country of publisher: Poland
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing; Language and Literature: Philology. Linguistics: Language. Linguistic theory. Comparative grammar: Semantics; Language and Literature: Philology. Linguistics: Language. Linguistic theory. Comparative grammar: Lexicography
Website: https://journals.ispan.edu.pl/index.php/cs-ec/
