Data in Brief (Jun 2025)

KSTRV1: A scene text recognition dataset for central Kurdish in (Arabic-Based) script

  • Sardar Omar Salih,
  • Karwan Jacksi

Journal volume & issue
Vol. 60
p. 111648

Abstract

Scene Text Recognition (STR) has advanced significantly in recent years, yet languages utilizing Arabic-based scripts, such as Kurdish, remain underrepresented in existing datasets. This paper introduces KSTRV1, the first large-scale dataset designed for Kurdish Scene Text Recognition (KSTR), addressing the lack of resources for non-Latin scripts. The dataset comprises 1,420 natural scene images and 19,872 cropped word samples, covering Kurdish (Sorani and Badini dialects), Arabic, and English. Additionally, 20,000 synthetic text instances have been generated to enhance the dataset’s diversity, quantity, and quality by incorporating varied fonts, orientations, distortions, and background complexities.

KSTRV1 captures the multilingual landscape of the Kurdistan Region while addressing real-world challenges like occlusion, lighting variations, and script complexity. The dataset includes detailed annotations with bounding boxes, language identification, and text orientation labels, ensuring comprehensive support for training and evaluating STR models. By providing both natural and synthetic data, KSTRV1 enables the development of robust text recognition models, particularly for Central Kurdish, a low-resource language.

The KSTRV1 dataset is publicly available at https://doi.org/10.5281/zenodo.15038953 and is expected to significantly contribute to research in multilingual STR, document analysis, and optical character recognition (OCR), facilitating more inclusive and accurate text recognition systems.

Keywords