Character recognition system for pegon typed manuscript

Yova Ruldeviyani; Heru Suhartanto; Beltsazar Anugrah Sotardodo; Muhammad Hanif Fahreza; Andre Septiano; Muhammad Febrian Rachmadi

Heliyon (Aug 2024)

Affiliations

Yova Ruldeviyani (corresponding author): Faculty of Computer Science, Universitas Indonesia, Depok, Jawa Barat, 16424, Indonesia
Heru Suhartanto: Faculty of Computer Science, Universitas Indonesia, Depok, Jawa Barat, 16424, Indonesia
Beltsazar Anugrah Sotardodo: Faculty of Computer Science, Universitas Indonesia, Depok, Jawa Barat, 16424, Indonesia
Muhammad Hanif Fahreza: Faculty of Computer Science, Universitas Indonesia, Depok, Jawa Barat, 16424, Indonesia
Andre Septiano: Faculty of Computer Science, Universitas Indonesia, Depok, Jawa Barat, 16424, Indonesia
Muhammad Febrian Rachmadi: Faculty of Computer Science, Universitas Indonesia, Depok, Jawa Barat, 16424, Indonesia

Journal volume & issue: Vol. 10, no. 16, p. e35959

Abstract

The Pegon script is an Arabic-based writing system used for the Javanese, Sundanese, Madurese, and Indonesian languages. The script is now found mainly among collectors and in private Islamic boarding schools (pesantren), which creates a need for its preservation. One preservation method is digitization: transcribing manuscripts into machine-encoded text through optical character recognition (OCR). No published literature exists on OCR systems for this specific script. This research explores OCR for typed Pegon manuscripts and introduces novel synthesized and real annotated datasets for the task. These datasets are used to evaluate the proposed OCR methods, particularly those adapted from existing Arabic OCR systems. Results show that deep learning techniques outperform conventional ones, which fail to detect Pegon text. The proposed system uses YOLOv5 for line segmentation and a CTC-CRNN architecture for line text recognition, achieving an F1-score of 0.94 for segmentation and a character error rate (CER) of 0.03 for recognition.
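
The two metrics reported above, F1-score and CER, can be illustrated with a minimal Python sketch. It assumes the standard definitions: CER as the Levenshtein edit distance between the recognized and reference text divided by the reference length, and F1 as the harmonic mean of precision and recall over matched line detections. The abstract does not state the exact evaluation protocol (for example, text normalization or how a detected line is matched to a ground-truth line), so the function names and counts below are illustrative only.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts and strings, not taken from the paper.
print(cer("abcd", "abxd"))   # one substitution over four characters
print(f1_score(94, 6, 6))    # 94 matched lines, 6 spurious, 6 missed
```

Python strings are compared here by Unicode code point, so in Arabic-based scripts such as Pegon a base letter and its combining vowel mark count as separate characters. Whether the reported CER treats them this way is an assumption.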

Published in Heliyon

ISSN: 2405-8440 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General); Social Sciences: Social sciences (General)
Website: https://www.cell.com/heliyon/home
