Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation

Davlatyor Mengliev; Vladimir Barakhnin; Nilufar Abdurakhmonova; Mukhriddin Eshkulov

Data in Brief (Jun 2024)

Affiliations

Davlatyor Mengliev: Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100, Urgench city, Uzbekistan; Novosibirsk State University, 2, Pirogova str., Novosibirsk city, 630090, Russia; Corresponding author.
Vladimir Barakhnin: Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., 220100, Urgench city, Uzbekistan; Novosibirsk State University, 2, Pirogova str., Novosibirsk city, 630090, Russia; Federal Research Center for Information and Computational Technologies, 6, Academician M.A. Lavrentiev avenue, Novosibirsk, 630090, Russia
Nilufar Abdurakhmonova: National University of Uzbekistan named after Mirzo-Ulugbek, 4, Universitet str., Olmazor distr., Tashkent city, 100174, Uzbekistan
Mukhriddin Eshkulov: Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city, 130100, Uzbekistan

Journal volume & issue: Vol. 54, p. 110413

Abstract

This paper presents a dataset and approaches to named entity recognition (NER) for the Uzbek language, a low-resource language. Despite the growth of NLP applications, Uzbek remains underrepresented, which underscores the importance of this work. The dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, the authors developed two algorithms that use this dataset to identify named entities in Uzbek-language texts. The paper also describes the methodology for creating the dataset, the design of the algorithms, and their application to Uzbek. The study not only provides an important dataset for future NER tasks in Uzbek, but also offers a methodological basis for vocabulary-based or machine-learning NER in other low-resource languages (e.g. Karakalpak). The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining applications, and other analytical tools for the Uzbek language, contributing to the development of these areas in the regions where such solutions are deployed.
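To make the idea of vocabulary-based NER mentioned in the abstract concrete, the following is a minimal sketch of a greedy longest-match lexicon lookup. It is not the authors' algorithm: the `LEXICON` entries, the label names (`PER`, `LOC`), the tokenizer pattern, and the function name `tag_entities` are all assumptions made for illustration and are not taken from the dataset.

```python
import re

# Hypothetical mini-lexicon: entries and labels are illustrative only,
# not taken from the published dataset.
LEXICON = {
    "toshkent": "LOC",
    "urganch": "LOC",
    "alisher navoiy": "PER",
}

# Longest entry, measured in tokens, bounds the lookup window.
MAX_SPAN = max(len(entry.split()) for entry in LEXICON)


def tag_entities(text):
    """Return (surface form, label) pairs found by greedy longest-match lookup."""
    # Keep apostrophe-like characters inside words (e.g. o'zbek, oʻzbek).
    tokens = re.findall(r"\w+(?:['ʻ’]\w+)*", text)
    found = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = LEXICON.get(candidate.lower())
            if label is not None:
                found.append((candidate, label))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return found


print(tag_entities("Alisher Navoiy Toshkent shahrida yashagan"))
```

This sketch matches exact surface forms only. Because Uzbek is agglutinative, an inflected form such as "Toshkentda" would be missed unless suffixes are stripped or the form is normalized before lookup, which is one reason a dataset of annotated word forms is useful for this kind of approach.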

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
