Cursive-Text: A Comprehensive Dataset for End-to-End Urdu Text Recognition in Natural Scene Images

Asghar Ali Chandio; Md. Asikuzzaman; Mark Pickering; Mehwish Leghari

Data in Brief (Aug 2020)

Affiliations

Asghar Ali Chandio: School of Engineering and Information Technology, University of New South Wales, Canberra, Australia; Department of Information Technology, Quaid-e-Awam University of Engineering, Science and Technology, Pakistan; Corresponding author(s)
Md. Asikuzzaman: School of Engineering and Information Technology, University of New South Wales, Canberra, Australia
Mark Pickering: School of Engineering and Information Technology, University of New South Wales, Canberra, Australia
Mehwish Leghari: Department of Information Technology, Quaid-e-Awam University of Engineering, Science and Technology, Pakistan; Institute of Information and Communication Technology, University of Sindh, Pakistan

Journal volume & issue: Vol. 31, p. 105749

Abstract

Reading text in natural scene images is an active research area in computer vision and pattern recognition, as it requires text detection, text recognition and script identification. In this data article, a comprehensive dataset for Urdu text detection and recognition in natural scene images is presented and analysed. To develop the dataset, more than 2500 natural scene images were captured using a digital camera and a built-in mobile phone camera. Three separate datasets were developed: isolated Urdu character images, cropped word images and end-to-end text spotting. The isolated Urdu character and cropped word image datasets contain a much larger number of samples than existing Arabic natural scene text datasets. The Urdu text spotting dataset contains images with Urdu, English and Sindhi text instances; however, the focus is on the Urdu text instances. The ground truths for each image in the isolated character, cropped word and text spotting datasets are provided separately. The proposed datasets can be used to perform Urdu text detection and recognition or end-to-end recognition in natural scenes. They can also help in developing Arabic and Persian natural scene text detection and recognition systems, as the Urdu script is derived from these scripts and shares many similar letters. The datasets can further support the development of multi-language translation systems that help foreign tourists read and translate multilingual text in natural scene images. To evaluate the datasets, state-of-the-art machine learning and deep neural network methods were used to build text detection and recognition models, which achieved the best classification accuracies. To the best of the authors’ knowledge, this is the first dataset proposed for Urdu text detection, recognition or end-to-end text recognition in natural scene images.
The aim of this data article is to present a benchmark work in the field of document analysis and recognition.

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
