A hybrid approach to Bangla handwritten OCR: combining YOLO and an advanced CNN

Aye T. Maung; Sumaiya Salekin; Mohammad A. Haque

doi:10.1007/s44163-025-00251-7

Discover Artificial Intelligence (Jun 2025)

Affiliations

Aye T. Maung: Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology
Sumaiya Salekin: Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology
Mohammad A. Haque: Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology

DOI: https://doi.org/10.1007/s44163-025-00251-7
Journal volume & issue: Vol. 5, no. 1
pp. 1 – 26

Abstract

Optical Character Recognition (OCR) plays a vital role in automating data entry from handwritten forms into digital systems. However, a significant gap exists in the research on OCR techniques tailored for handwritten texts in complex languages such as Bangla. Challenges in Bangla script arise from the presence of modifiers, compound characters, and diacritic marks, making accurate recognition difficult. Our research introduces a scalable and effective OCR pipeline for Bangla handwritten documents that addresses these complexities. The proposed pipeline leverages the YOLO (You Only Look Once) model for character detection, accurately isolating base alphabets, consonant conjuncts, and characters with modifiers (matras). For character recognition, the pipeline utilizes the EfficientNet-B4 model, which demonstrated a recognition accuracy of 93.87% for grapheme roots, 98.22% for vowel diacritics, and 98.0% for consonant diacritics on publicly available datasets, combined and adapted for our use. Additionally, the system’s resilience was enhanced using a Word2Vec-based spelling correction layer, reducing the Character Error Rate (CER) from 10.37% to 2.47%. Comparative evaluations on in-house data show that the proposed pipeline with spelling correction achieves the highest precision (0.9701) and lowest CER (0.0247), outperforming the Google Cloud Vision API’s OCR. In contrast, the Vision API has the highest CER (0.1389) and lower precision (0.8220), highlighting the effectiveness of the proposed approach for Bangla OCR.
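
The abstract reports results as Character Error Rate (CER) but does not spell out how it is computed. The sketch below shows the commonly used definition, the Levenshtein edit distance between the recognized text and the reference divided by the reference length. It is an illustration under that assumption, not the authors' evaluation code; the function names `edit_distance` and `character_error_rate` are hypothetical, and the code counts Unicode code points, whereas an evaluation of Bangla text might reasonably count grapheme clusters instead.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1        # drop a reference character
            insertion = curr[j - 1] + 1   # extra character in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed definition).

    Lengths are measured in Unicode code points, so a Bangla base
    character and its diacritic count as separate units here.
    """
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a CER of 0.0247 corresponds to roughly 2.5 character edits per 100 reference characters. Note that CER can exceed 1.0 when the recognized text is much longer than the reference.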

Published in Discover Artificial Intelligence

ISSN: 2731-0809 (Online)
Publisher: Springer
Country of publisher: Switzerland
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.springer.com/journal/44163
