Word Similarity Datasets for Thai: Construction and Evaluation

Ponrudee Netisopakul; Gerhard Wohlgenannt; Aleksei Pulich

doi:10.1109/ACCESS.2019.2944151

IEEE Access (Jan 2019)

Affiliations

Ponrudee Netisopakul: Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang (KMITL), Bangkok, Thailand
Gerhard Wohlgenannt: Faculty of Software Engineering and Computer Systems, ITMO University, St. Petersburg, Russia
Aleksei Pulich: Faculty of Software Engineering and Computer Systems, ITMO University, St. Petersburg, Russia

DOI: https://doi.org/10.1109/ACCESS.2019.2944151
Journal volume & issue: Vol. 7, pp. 142907 – 142915

Abstract

Distributional semantics in the form of word embeddings is an essential ingredient of many modern natural language processing systems. The quantification of semantic similarity between words can be used to evaluate the ability of a system to perform semantic interpretation. To this end, a number of word similarity datasets have been created for the English language over the last decades. For the Thai language, few such resources are available. In this work, we create three Thai word similarity datasets by translating and re-rating the popular WordSim-353, SimLex-999 and SemEval-2017-Task-2 datasets. The three datasets contain 1852 word pairs in total and differ in difficulty, domain coverage, and notion of similarity (relatedness vs. similarity). These differences help to give a broader picture of the properties of an evaluated word embedding model. We include baseline evaluations with existing Thai embedding models, and identify the high ratio of out-of-vocabulary words as one of the biggest challenges in the evaluation process. All datasets, evaluation results, and a tool for easy evaluation of new Thai embedding models are available online to the NLP community.
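
The evaluation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' tool: the function names, the toy vectors, and the placeholder words are hypothetical, and the choice of Spearman rank correlation is an assumption (it is a common metric for word similarity benchmarks, but the abstract does not name the metric used). The sketch scores each word pair by cosine similarity, skips pairs containing an out-of-vocabulary word, and reports both the correlation with the human ratings and the ratio of skipped pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs, vectors):
    """Score (word1, word2, human_rating) pairs against an embedding dict.

    Pairs with an out-of-vocabulary (OOV) word are skipped.
    Returns (spearman_rho, oov_ratio).
    """
    gold, predicted, skipped = [], [], 0
    for word1, word2, rating in pairs:
        if word1 not in vectors or word2 not in vectors:
            skipped += 1
            continue
        gold.append(rating)
        predicted.append(cosine(vectors[word1], vectors[word2]))
    oov_ratio = skipped / len(pairs)
    rho = spearman(gold, predicted) if len(gold) > 1 else float("nan")
    return rho, oov_ratio


# Hypothetical toy data: placeholder words and 2-dimensional vectors.
toy_vectors = {"w1": [1.0, 0.0], "w2": [1.0, 0.1], "w3": [0.0, 1.0], "w4": [0.5, 0.5]}
toy_pairs = [
    ("w1", "w2", 9.0),
    ("w1", "w4", 5.0),
    ("w1", "w3", 1.0),
    ("w1", "unknown", 7.0),  # "unknown" has no vector, so this pair is OOV
]

rho, oov_ratio = evaluate(toy_pairs, toy_vectors)
print(rho, oov_ratio)
```

Skipping OOV pairs is only one possible policy. It makes scores from models with different vocabularies hard to compare, because each model is then evaluated on a different subset of pairs, which is one reason a high OOV ratio complicates evaluation.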

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
