Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition

Rina Buoy; Masakazu Iwamura; Sovila Srun; Koichi Kise

doi:10.1109/ACCESS.2023.3332361

IEEE Access (Jan 2023)

Affiliations

Rina Buoy: Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
Masakazu Iwamura: Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
Sovila Srun: Department of Information Technology Engineering, Faculty of Engineering, Royal University of Phnom Penh, Phnom Penh, Cambodia
Koichi Kise: Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan

Journal volume & issue: Vol. 11, pp. 128044–128060

Abstract

Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit word boundaries. In addition, from a natural language processing (NLP) perspective, most non-Latin languages are considered low-resource due to the scarcity of large-scale data. This paper presents a convolutional Transformer-based text recognition method for low-resource non-Latin scripts, which uses local two-dimensional (2D) feature maps. By using an improved image chunking and merging strategy, the proposed method can handle images of arbitrarily long textlines, which may occur in non-Latin writing without explicit word boundaries, without resizing them to a fixed size. It has a low time complexity in self-attention layers and allows efficient training. The Khmer script is used as the representative of non-Latin scripts because it shares many features with other non-Latin scripts, which makes the construction of an optical character recognition (OCR) method for Khmer as hard as that for other non-Latin scripts. Thus, by analogy with the AI-complete concept, a Khmer OCR method can be considered one of the non-Latin-complete methods and can be used as a low-resource non-Latin baseline method. The proposed 2D method was trained on synthetic datasets and outperformed the baseline models on both synthetic and real datasets. Fine-tuning experiments using Khmer handwritten palm leaf manuscripts and other non-Latin scripts demonstrated the feasibility of transfer learning from the Khmer OCR method. To contribute to the low-resource language community, the training and evaluation datasets will be made publicly available.
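
The abstract does not spell out the chunking and merging strategy. The sketch below only illustrates the general idea of processing a long textline in fixed-width, overlapping chunks and stitching the per-column outputs back together. The function names (chunk_spans, merge_chunks), the overlap parameter, and the rule of splitting each overlap at its midpoint are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' improved strategy.

```python
def chunk_spans(width, chunk_width, overlap):
    """Return (start, end) column spans that cover [0, width).

    Hypothetical helper: consecutive spans share `overlap` columns,
    and the last span may be narrower than `chunk_width`.
    """
    if not 0 <= overlap < chunk_width:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_width")
    stride = chunk_width - overlap
    spans = []
    start = 0
    while True:
        end = min(start + chunk_width, width)
        spans.append((start, end))
        if end >= width:
            break
        start += stride
    return spans


def merge_chunks(chunks, overlap):
    """Concatenate per-column chunk outputs, keeping every column once.

    Assumed merging rule: each overlap region is split at its midpoint,
    so the left chunk keeps the first part and the right chunk the rest.
    """
    merged = []
    last = len(chunks) - 1
    for i, chunk in enumerate(chunks):
        left_trim = overlap // 2 if i > 0 else 0
        right_trim = overlap - overlap // 2 if i < last else 0
        merged.extend(chunk[left_trim:len(chunk) - right_trim])
    return merged


if __name__ == "__main__":
    # Stand-in for per-column features of a textline that is 10 columns wide.
    columns = list(range(10))
    spans = chunk_spans(len(columns), chunk_width=4, overlap=2)
    chunks = [columns[s:e] for s, e in spans]
    print(spans)
    print(merge_chunks(chunks, overlap=2))
```

In this toy setup each chunk would be passed through the recognizer independently, so the input never has to be resized to a fixed width; only the merging step needs to know how much neighbouring chunks overlap.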

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
