Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy

Mamtimin Qasim; Wushour Silamu

doi:10.3390/data10040043

Data (Mar 2025)

Affiliations

Mamtimin Qasim: School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China
Wushour Silamu: School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China

DOI: https://doi.org/10.3390/data10040043
Journal volume & issue: Vol. 10, no. 4, p. 43

Abstract

Script identification is the first step in many natural language processing and text mining tasks, yet at present there is no open-source script identification algorithm for text. In this study, we therefore analyze the Unicode encoding of each script type and construct regular expressions in order to design an improved script identification algorithm. Because some scripts share common characters, it is impossible to count and summarize them. As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, whenever a new script identification feature is required, the regular expression for each script must be readjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script provided on the official Unicode website and identify the shared characters, which allows us to design an improved script identification algorithm. This approach takes all 169 Unicode script types into account. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered out during script identification; instead, these items are included in the script identification results, which preserves the integrity of the information in the text. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while improving on it in these respects.
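
The idea described in the abstract, matching script-specific Unicode ranges with regular expressions while keeping shared characters in the output, can be illustrated with a minimal Python sketch. This is not the authors' algorithm. The names `SCRIPT_RANGES`, `build_pattern`, and `identify_scripts` are hypothetical, the table covers only four scripts rather than all 169, the ranges are rough block-level approximations rather than the exact Script property data published by Unicode, and the rule of attaching shared characters to the preceding run is an assumption made for illustration.

```python
import re

# Hypothetical, heavily abbreviated table. Each value is the body of a regex
# character class. These are block-level approximations, NOT the exact
# per-character Script assignments published by Unicode.
SCRIPT_RANGES = {
    "Latin": r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F",
    "Cyrillic": r"\u0400-\u04FF",
    "Arabic": r"\u0600-\u06FF",
    "Han": r"\u4E00-\u9FFF",
}


def build_pattern(ranges):
    """Build one regex with a named group per script plus a 'Common' group."""
    all_ranges = "".join(ranges.values())
    parts = [f"(?P<{name}>[{cls}]+)" for name, cls in ranges.items()]
    # Anything outside the listed ranges (digits, punctuation, spaces, ...)
    # is treated as shared here. This is an assumption of the sketch.
    parts.append(f"(?P<Common>[^{all_ranges}]+)")
    return re.compile("|".join(parts))


PATTERN = build_pattern(SCRIPT_RANGES)


def identify_scripts(text):
    """Split text into (script, run) pairs without discarding any character.

    Shared characters are attached to the preceding run; leading shared
    characters are attached to the first script-specific run that follows.
    """
    runs = []
    for match in PATTERN.finditer(text):
        script, chunk = match.lastgroup, match.group()
        if runs and (script == "Common" or runs[-1][0] == script):
            # Extend the current run and keep its script label.
            runs[-1] = (runs[-1][0], runs[-1][1] + chunk)
        elif runs and runs[-1][0] == "Common":
            # Leading shared characters adopt the first real script.
            runs[-1] = (script, runs[-1][1] + chunk)
        else:
            runs.append((script, chunk))
    return runs


if __name__ == "__main__":
    for script, run in identify_scripts("Hello, мир! 123"):
        print(script, repr(run))
```

In this sketch, supporting another script only requires adding an entry to `SCRIPT_RANGES`, since the shared-character class is derived from the table instead of being written by hand for each script. Concatenating the returned runs reproduces the input text, so numbers and punctuation are retained rather than filtered out.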

Published in Data

ISSN: 2306-5729 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Bibliography. Library science. Information resources
Website: http://www.mdpi.com/journal/data
