Engineering Science and Technology, an International Journal (Dec 2018)
Diacritic restoration of Turkish tweets with word2vec
Abstract
Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data for countless fields. This growth has attracted the interest of researchers, and many machine learning and natural language processing studies have been conducted on social media data. However, the language used in social media contains far more noise than formal written language. In this article, we present a study on diacritic restoration, one of the major difficulties of social media text normalization, in order to reduce the noise problem. Diacritics are marks used to change the sound values of letters, and they are used in many languages besides Turkish. We propose a 3-step model to overcome the diacritic restoration problem. In the first step, a candidate word generator produces possible word forms; in the second step, a language validator selects the valid word forms; and in the final step, Word2vec is used to create vector representations of the words and to choose the most appropriate word by cosine similarity. The proposed method was tested on two ad-hoc created datasets and on a real dataset. Experiments on the small ad-hoc created dataset and the real dataset yielded a relative error reduction of 37.8% with an average performance of 94.5%. In addition, tests on more than 6 M words from the large ad-hoc created dataset yielded strong performance, with an error rate of 3.9%. Furthermore, the proposed method was tested on a binary classification problem built from highway traffic data in order to evaluate its effect on classification performance, and a 3.1% increase in classification performance was achieved.
Keywords: Text mining, Diacritics restoration, Twitter, Tweet normalization
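The 3-step model described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the letter map, the toy lexicon (standing in for the language validator) and the hand-written 2-dimensional vectors (standing in for trained Word2vec embeddings) are all assumptions made for illustration, and only lowercase input is handled.

```python
from itertools import product
from math import sqrt

# Assumed letter map: each ASCII letter and the Turkish forms it may stand for.
VARIANTS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

def generate_candidates(word):
    """Step 1: enumerate every diacritized form of an ASCII-folded word."""
    options = [VARIANTS.get(ch, ch) for ch in word]
    return ["".join(chars) for chars in product(*options)]

def validate(candidates, lexicon):
    """Step 2: keep only the candidates that the (toy) lexicon accepts."""
    return [c for c in candidates if c in lexicon]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def restore(word, context, lexicon, vectors):
    """Step 3: pick the valid candidate most similar to the context words."""
    valid = validate(generate_candidates(word), lexicon)
    if not valid:
        return word          # nothing validated: leave the token unchanged
    if len(valid) == 1:
        return valid[0]      # unambiguous: no similarity scoring needed
    ctx = [vectors[w] for w in context if w in vectors]

    def score(cand):
        # Mean cosine similarity between the candidate and the context words.
        if cand not in vectors or not ctx:
            return 0.0
        return sum(cosine(vectors[cand], v) for v in ctx) / len(ctx)

    return max(valid, key=score)

# Hypothetical toy data: "oldu" (became) and "öldü" (died) are both valid words.
LEXICON = {"oldu", "öldü", "mutlu", "kazada", "çok"}
VECTORS = {
    "oldu": (1.0, 0.1),
    "öldü": (0.1, 1.0),
    "mutlu": (0.9, 0.2),    # "happy"
    "kazada": (0.2, 0.9),   # "in the accident"
}

print(restore("oldu", ["kazada"], LEXICON, VECTORS))
print(restore("oldu", ["mutlu"], LEXICON, VECTORS))
print(restore("cok", [], LEXICON, VECTORS))
```

In this sketch the similarity step only runs when more than one candidate survives validation, which keeps unambiguous words cheap to resolve. A real system would likely replace the set lookup with a morphological analyzer and the toy vectors with embeddings trained on a large Turkish corpus.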