KSTRV1: A scene text recognition dataset for central Kurdish in (Arabic-Based) script

Sardar Omar Salih; Karwan Jacksi

Data in Brief (Jun 2025)

Affiliations

Sardar Omar Salih: Web Technology Dept., Duhok Technical Institute, Duhok Polytechnic University, Duhok, Iraq; Information Technology Dept., Technical College of Informatics, Akre University of Applied Sciences, Duhok, Iraq
Karwan Jacksi: Semantic Web Lab., University of Zakho, Kurdistan Region of Iraq, Iraq; Corresponding author.

Journal volume & issue: Vol. 60, p. 111648

Abstract

Scene Text Recognition (STR) has advanced significantly in recent years, yet languages utilizing Arabic-based scripts, such as Kurdish, remain underrepresented in existing datasets. This paper introduces KSTRV1, the first large-scale dataset designed for Kurdish Scene Text Recognition (KSTR), addressing the lack of resources for non-Latin scripts. The dataset comprises 1,420 natural scene images and 19,872 cropped word samples, covering Kurdish (Sorani and Badini dialects), Arabic, and English. Additionally, 20,000 synthetic text instances have been generated to enhance the dataset’s diversity, quantity, and quality by incorporating varied fonts, orientations, distortions, and background complexities.

KSTRV1 captures the multilingual landscape of the Kurdistan Region while addressing real-world challenges like occlusion, lighting variations, and script complexity. The dataset includes detailed annotations with bounding boxes, language identification, and text orientation labels, ensuring comprehensive support for training and evaluating STR models. By providing both natural and synthetic data, KSTRV1 enables the development of robust text recognition models, particularly for Central Kurdish, a low-resource language.

The KSTRV1 dataset is publicly available at https://doi.org/10.5281/zenodo.15038953 and is expected to significantly contribute to research in multilingual STR, document analysis, and optical character recognition (OCR), facilitating more inclusive and accurate text recognition systems.

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
