IEEE Access (Jan 2021)
MMU-OCR-21: Towards End-to-End Urdu Text Recognition Using Deep Learning
Abstract
Optical Character Recognition (OCR) is a technique that generates text from an image. Given the importance of OCR in real-world settings, a plethora of techniques have been developed for Western as well as Asian languages. Urdu is a prominent South Asian language, and a number of solutions for Urdu OCR have been proposed. However, few attempts have been made to develop end-to-end deep learning-based solutions for recognizing printed Urdu text. Several benchmark corpora for Urdu OCR have also been developed for training and evaluating OCR techniques, but the existing corpora have two main limitations. First, most of them contain only one of character, word, or text images, usually rendered in a single font, Nastaleeq. Second, the existing datasets are too small for the deep learning techniques that have achieved groundbreaking results in OCR. To that end, in this study we propose a very large Multi-level and Multi-script Urdu corpus (MMU-OCR-21). It is the largest Urdu corpus of printed text to date and is well suited to deep learning techniques. In total, the corpus comprises over 602,472 images, including text-line and word images in three prominent fonts, together with their respective ground truth. We also report experiments with multiple state-of-the-art deep learning techniques on text-line and word-level images.
Keywords