PROBABILITY DISTRIBUTION OVER THE SET OF CLASSES IN ARABIC DIALECT CLASSIFICATION TASK

O. V. Durandin; N. R. Hilal; D. Y. Strebkov; N. Y. Zolotykh

doi:10.17586/2226-1494-2017-17-1-110-116

Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Jan 2017)

Affiliations

O. V. Durandin: postgraduate, Lobachevsky State University of Nizhni Novgorod (UNN), Nizhny Novgorod, 603950, Russian Federation; senior lecturer, Higher School of Economics National Research University, Nizhny Novgorod, 603155, Russian Federation
N. R. Hilal: postgraduate, Lobachevsky State University of Nizhni Novgorod (UNN), Nizhny Novgorod, 603950, Russian Federation; project manager and linguist, “Dictum” Ltd., Nizhny Novgorod, 603070, Russian Federation
D. Y. Strebkov: software engineer, “Dictum” Ltd., Nizhny Novgorod, 603070, Russian Federation
N. Y. Zolotykh: D.Sc., Professor, Lobachevsky State University of Nizhni Novgorod (UNN), Nizhny Novgorod, 603950, Russian Federation

DOI: https://doi.org/10.17586/2226-1494-2017-17-1-110-116
Journal volume & issue: Vol. 17, no. 1
pp. 110–116

Abstract

Subject of Research. We propose an approach to machine learning classification problems that uses information about the probability distribution over the set of class labels in the training data. The algorithm is illustrated on a complex natural language processing task: classification of Arabic dialects.

Method. Each object in the training set is associated with a probability distribution over the set of class labels rather than with a single class label. The proposed approach takes this distribution into account when solving the classification problem in order to improve the quality of the resulting classifier.

Main Results. The approach is illustrated on automatic classification of Arabic dialects. The analyzed data were mined from the Twitter social network, contain marker words, and belong to six Arabic dialects (Saudi, Levantine, Algerian, Egyptian, Iraqi, and Jordanian) and to Modern Standard Arabic (MSA). The results show that taking probability distributions over the set of classes into account increases the quality of the classifier: in the experiments, even a relatively naive use of the probability distributions improves the precision of the classifier from 44% to 67%.

Practical Relevance. The approach and the corresponding algorithm can be used effectively when manual annotation by experts requires significant financial and time resources, but it is possible to create a system of heuristic rules. Implementing the proposed algorithm makes it possible to significantly reduce data preparation costs without a substantial loss in classification precision.
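
The idea described under Method, training on a probability distribution over class labels instead of a single label per object, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a multinomial Naive Bayes model in which every training text contributes fractional counts to each class in proportion to its label probability. The function names, the tokens, and the label distributions below are hypothetical placeholders, not data from the paper.

```python
import math
from collections import defaultdict


def train_soft_nb(docs, label_dists, alpha=1.0):
    """Fit a multinomial Naive Bayes model from soft labels.

    docs        -- list of token lists
    label_dists -- list of dicts {class_label: probability}, one per doc
    alpha       -- additive (Laplace) smoothing constant
    """
    class_mass = defaultdict(float)                        # fractional doc counts
    token_mass = defaultdict(lambda: defaultdict(float))   # fractional token counts
    vocab = set()
    for tokens, dist in zip(docs, label_dists):
        vocab.update(tokens)
        for label, p in dist.items():
            if p <= 0.0:
                continue  # a zero-probability class receives no evidence
            class_mass[label] += p
            for tok in tokens:
                token_mass[label][tok] += p
    return {
        "class_mass": dict(class_mass),
        "token_mass": {lab: dict(cnt) for lab, cnt in token_mass.items()},
        "vocab": vocab,
        "alpha": alpha,
    }


def predict_proba(model, tokens):
    """Return a normalized distribution over classes for one token list."""
    total = sum(model["class_mass"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    log_scores = {}
    for label, mass in model["class_mass"].items():
        counts = model["token_mass"].get(label, {})
        denom = sum(counts.values()) + alpha * vocab_size
        score = math.log(mass / total)  # class prior from fractional counts
        for tok in tokens:
            if tok in model["vocab"]:   # tokens unseen in training are ignored
                score += math.log((counts.get(tok, 0.0) + alpha) / denom)
        log_scores[label] = score
    # Softmax over log scores, shifted by the maximum for numerical stability.
    top = max(log_scores.values())
    exp_scores = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {lab: e / norm for lab, e in exp_scores.items()}


# Hypothetical toy data: placeholder tokens, with soft labels such as a
# system of heuristic rules might produce.
docs = [
    ["egy1", "common"],
    ["msa1", "common"],
    ["egy1", "egy2"],
    ["msa1", "msa2"],
]
label_dists = [
    {"EGY": 0.8, "MSA": 0.2},
    {"MSA": 0.9, "EGY": 0.1},
    {"EGY": 1.0},
    {"MSA": 0.7, "EGY": 0.3},
]

model = train_soft_nb(docs, label_dists)
print(predict_proba(model, ["egy1"]))
print(predict_proba(model, ["msa1"]))
```

With hard labels, each text would add a full count to exactly one class. Here an uncertain text spreads its evidence across several classes, which is one simple way to use the distribution rather than discard it.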

Published in Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki

ISSN: 2226-1494 (Print); 2500-0373 (Online)
Publisher: Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)
Country of publisher: Russian Federation
LCC subjects: Science: Physics: Optics. Light; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://ntv.ifmo.ru/en/english.htm
