Computational modelling of an optical character recognition system for Yorùbá printed text images

Olalekan Joseph ONI; Franklin Oladiipo ASAHIAH

Scientific African (Sep 2020)

Affiliations

Olalekan Joseph ONI: Corresponding author.; Computing and Intelligent Systems Research Group, Department of Computer Science & Engineering, Obafemi Awolowo University, Ile-Ife, 220005 Nigeria
Franklin Oladiipo ASAHIAH: Computing and Intelligent Systems Research Group, Department of Computer Science & Engineering, Obafemi Awolowo University, Ile-Ife, 220005 Nigeria

Journal volume & issue: Vol. 9
p. e00415

Abstract

This study acquired a dataset of scanned images of Standard Yorùbá printed text, formulated a Yorùbá character image recognition model, implemented it, and evaluated its performance, with the aim of developing an Optical Character Recognition (OCR) model for Yorùbá printed text images. The image dataset, at 300 dots per inch (dpi), was acquired by generating text-line images from the Yorùbá New Testament Bible (Bibeli Mimo) corpus, encoded in Unicode UTF-8. The Long Short-Term Memory (LSTM) model, a variant of the Recurrent Neural Network (RNN), was used to formulate the Standard Yorùbá character image recognition model, which was implemented with the Python OCRopus framework. Performance was evaluated using the Character Error Rate (CER) based on the Levenshtein edit distance algorithm. The results show a CER of 3.138% for the Times New Roman font, which is better recognition than for the other font styles. The model achieved a CER of 7.435% on the DejaVuSans font image dataset and 15.141% on the Arial font image dataset. Introducing a language-model-based Standard Yorùbá spell-checker and corrector as a post-processing stage reduced the CER to 1.182% for Times New Roman, 4.098% for DejaVuSans, and 5.87% for Arial. The study concluded that the farther an image's text font is from the font(s) used in training the network, the higher the character error rate of the recognition, and that including a post-processing stage reduces the character error rate.
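
The abstract reports results as a character error rate derived from the Levenshtein edit distance. As an illustration only, a minimal Python sketch of that metric follows. The function names `levenshtein` and `character_error_rate` are hypothetical, and the NFC normalization step is an assumption added here so that precomposed and decomposed forms of a Yorùbá diacritic compare as equal; the abstract does not state how the authors handled normalization or which implementation they used.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent: edit distance divided by reference length."""
    # Assumption: normalize both strings to NFC before comparing, so that
    # "à" (one code point) and "a" + combining grave are treated alike.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a recognized line that differs from a four-character ground-truth line by one substituted character has a CER of 25%. Because insertions count as errors, the value can exceed 100% when the OCR output is much longer than the reference.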

Published in Scientific African

ISSN: 2468-2276 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Science
Website: https://www.journals.elsevier.com/scientific-african
