Homograph recognition algorithm based on Euclidean metric

Elisa S. Izrailova; Arslanbek V. Astemirov; Ayshat S. Badaeva; Zelimhan A. Sultanov; Salaudin M. Umarkhadzhiev; Mokhmad-Salekh L. Khekhaev; Madina L. Yasaeva

doi:10.17586/2226-1494-2024-24-1-41-50

Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Feb 2024)

Affiliations

Elisa S. Izrailova: ORCiD; Senior Researcher, Academy of Sciences of the Chechen Republic, Grozny, 364043, Russian Federation; Junior Researcher, Kh. Ibragimov Complex Institute of the Russian Academy of Sciences, Grozny, 364051, Russian Federation
Arslanbek V. Astemirov: ORCiD; Scientific Researcher, Academy of Sciences of the Chechen Republic, Grozny, 364043, Russian Federation; Junior Researcher, Kh. Ibragimov Complex Institute of the Russian Academy of Sciences, Grozny, 364051, Russian Federation
Ayshat S. Badaeva: ORCiD; Scientific Researcher, Academy of Sciences of the Chechen Republic, Grozny, 364043, Russian Federation; Junior Researcher, Kh. Ibragimov Complex Institute of the Russian Academy of Sciences, Grozny, 364051, Russian Federation
Zelimhan A. Sultanov: ORCiD; Scientific Researcher, Academy of Sciences of the Chechen Republic, Grozny, 364043, Russian Federation; Junior Researcher, Kh. Ibragimov Complex Institute of the Russian Academy of Sciences, Grozny, 364051, Russian Federation
Salaudin M. Umarkhadzhiev: ORCiD; D.Sc (Physics & Mathematics), Associate Professor, Head of Department, Academy of Sciences of the Chechen Republic, Grozny, 364043, Russian Federation; Head of Laboratory, Kh. Ibragimov Complex Institute of the Russian Academy of Sciences, Grozny, 364051, Russian Federation, sc 37089765500
Mokhmad-Salekh L. Khekhaev: ORCiD; Scientific Researcher, Academy of Sciences of the Chechen Republic, Grozny, 364043, Russian Federation; Junior Researcher, Kh. Ibragimov Complex Institute of the Russian Academy of Sciences, Grozny, 364051, Russian Federation
Madina L. Yasaeva: ORCiD; Scientific Researcher, Academy of Sciences of the Chechen Republic, Grozny, 364043, Russian Federation

DOI: https://doi.org/10.17586/2226-1494-2024-24-1-41-50
Journal volume & issue: Vol. 24, no. 1
pp. 41 – 50

Abstract

Resolving homonymy-related ambiguity in the Chechen language has become especially relevant since the creation of speech synthesis systems. The main weakness of Chechen speech synthesizers is errors in reading homographs that differ in vowel length: the length of such vowels is not marked in writing. Diphthongs, which are written the same way as the monophthongs closest to them in sound, also cause problems. Improving the quality of synthesized Chechen speech requires a program for automatic homograph recognition. To this end, the article considers the task of word sense disambiguation (WSD). Supervised (algorithmic) methods based on a pre-labeled database were selected for the Chechen language. These methods are the most common solutions for word sense disambiguation. They can be implemented only when large annotated corpora exist, and such corpora are unavailable for most of the world's languages, including Chechen. Chechen is a low-resource language, for which the best approach in terms of labor and time is a semi-supervised hybrid method of homograph recognition that combines algorithmic and statistical methods. The article presents the authors' algorithm, which recognizes a homograph from six adjacent words in the sentence. The method is implemented as a program. Preparing the input data for the algorithm includes manually labeling sentences with the senses of the homographs they contain. The program's results were evaluated with standard metrics: F1 was 39 % and accuracy was 45 %.
A comparison with other methods and models showed that the accuracy of the presented algorithm is closest to that of algorithms based on the Lesk method. For English, the Lesk method achieved F1 scores of 41.1 % (simple Lesk) and 51.1 % (extended Lesk). Methods based on neural networks give higher WSD accuracy for most languages, but they require large corpora, which are not always available for low-resource languages, including Chechen.
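
The abstract names the ingredients of the approach (manually labeled sentences, six adjacent words, a Euclidean metric) but not the details. The Python sketch below shows one way these pieces could fit together and is not the authors' implementation. The following are assumptions made for illustration only: the six words are taken as three on each side of the homograph, each context is encoded as a bag-of-words count vector, each sense is summarized by the mean of its training vectors, and the toy English sentences stand in for Chechen data. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

WINDOW = 3  # assumption: three words on each side, six neighbours in total


def context_window(tokens, index, window=WINDOW):
    """Return up to `window` words on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def to_vector(words, vocabulary):
    """Encode words as a count vector over a fixed vocabulary order."""
    counts = Counter(words)
    return [float(counts[w]) for w in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(labelled, window=WINDOW):
    """Build a vocabulary and one mean vector (centroid) per sense.

    labelled: list of (tokens, homograph_index, sense_label) tuples.
    """
    contexts = [(context_window(t, i, window), s) for t, i, s in labelled]
    vocabulary = sorted({w for words, _ in contexts for w in words})
    grouped = defaultdict(list)
    for words, sense in contexts:
        grouped[sense].append(to_vector(words, vocabulary))
    centroids = {
        sense: [sum(col) / len(vecs) for col in zip(*vecs)]
        for sense, vecs in grouped.items()
    }
    return vocabulary, centroids


def predict(tokens, index, vocabulary, centroids, window=WINDOW):
    """Pick the sense whose centroid is nearest to the context vector."""
    vec = to_vector(context_window(tokens, index, window), vocabulary)
    return min(centroids, key=lambda s: euclidean(vec, centroids[s]))


# Toy, hand-labelled examples for the English homograph "lead"
# (placeholder data, not taken from the article).
LABELLED = [
    ("the pipe was made of lead and iron".split(), 5, "metal"),
    ("heavy lead weights sank the net".split(), 1, "metal"),
    ("she will lead the team to victory".split(), 2, "verb"),
    ("guides lead the team across the pass".split(), 1, "verb"),
]

VOCABULARY, CENTROIDS = train(LABELLED)

if __name__ == "__main__":
    print(predict("they lead the team home".split(), 1, VOCABULARY, CENTROIDS))
    print(predict("a box made of lead and iron".split(), 4, VOCABULARY, CENTROIDS))
```

A nearest-centroid rule is only one of several ways to apply a Euclidean metric to labeled contexts. A nearest-neighbour rule over individual training sentences would fit the same description equally well.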

Published in Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki

ISSN: 2226-1494 (Print); 2500-0373 (Online)
Publisher: Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)
Country of publisher: Russian Federation
LCC subjects: Science: Physics: Optics. Light; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://ntv.ifmo.ru/en/english.htm
