IEEE Access (Jan 2021)
uTHCD: A New Benchmarking for Tamil Handwritten OCR
Abstract
The robustness of a typical handwritten character recognition system relies on the availability of comprehensive supervised data samples. Considerable work has been reported in the literature on creating databases for several Indic scripts, but the Tamil script has only one standardized database to date. This paper presents the work done to create an exhaustive and extensive unconstrained Tamil Handwritten Character Database (uTHCD). The samples were collected from around 850 native Tamil volunteers, including school-going children, homemakers, university students, and faculty. The database consists of about 91,000 samples, with nearly 600 samples in each of 156 classes. This isolated character database is made publicly available as raw images and as a compressed Hierarchical Data Format (HDF) file. The paper also presents several possible use cases of the proposed uTHCD database, using Convolutional Neural Networks (CNNs) to classify handwritten Tamil characters. Several experiments demonstrate that training on the proposed database helps traditional and contemporary classifiers perform on par with or better than training on the existing dataset when tested on unseen data. With this database, we expect to set a new benchmark in Tamil handwritten character recognition and to provide a launchpad for developing robust language technologies for the Tamil script.
Keywords