Burapha-TH: A Multi-Purpose Character, Digit, and Syllable Handwriting Dataset

Athita Onuean; Uraiwan Buatoom; Thatsanee Charoenporn; Taehong Kim; Hanmin Jung

doi:10.3390/app12084083

Applied Sciences (Apr 2022)

Affiliations

Athita Onuean: Faculty of Informatics, Burapha University, Chonburi 20131, Thailand
Uraiwan Buatoom: Faculty of Science and Arts, Chanthaburi Campus, Burapha University, Chanthaburi 22170, Thailand
Thatsanee Charoenporn: AAII, Faculty of Data Science, Musashino University, Tokyo 135-8181, Japan
Taehong Kim: Korea Institute of Oriental Medicine, Daejeon 34054, Korea
Hanmin Jung: Korea Institute of Science and Technology Information, Daejeon 34141, Korea

Journal volume & issue: Vol. 12, no. 8, p. 4083

Abstract

Handwriting recognition research needs public image datasets to evaluate algorithm correctness and runtime performance. Existing Thai script image datasets, however, lack variety in standard handwriting types. This paper presents a new offline Thai handwriting image dataset named Burapha-TH. The dataset has 68 character classes, 10 digit classes, and 320 syllable classes. To construct it, 1072 native Thai speakers wrote on collection datasheets, which were then digitized using a 300 dpi scanner. De-skewing, box detection, and segmentation algorithms were applied to the raw scans to extract the images. Several deep convolutional models were evaluated on the proposed dataset. The results show that the VGG-13 model (with batch normalization) achieved accuracy rates of 95.00%, 98.29%, and 96.16% on the character, digit, and syllable classes, respectively. Unlike all other known Thai handwriting datasets, Burapha-TH retains the existing noise, the white background, and all artifacts generated by scanning. This comprehensive, raw, and more realistic dataset will be helpful for a variety of future research purposes.
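The abstract mentions box detection and segmentation but gives no algorithmic detail. As a rough illustration only, here is a minimal sketch of how a handwritten sample might be located and cropped from one scanned cell. It is not the authors' method. The function names, the fixed threshold of 128, and the 2-pixel margin are all assumptions made for this example. The crop keeps the original pixel values, in the spirit of a dataset that retains background and scanning noise.

```python
def ink_bounding_box(gray, threshold=128):
    """Return (top, left, bottom, right) around pixels darker than `threshold`.

    `gray` is a list of rows of 0-255 grayscale values (hypothetical input
    format). Bottom and right are exclusive. Returns None when no pixel is
    darker than the threshold.
    """
    rows = [r for r, row in enumerate(gray) if any(p < threshold for p in row)]
    if not rows:
        return None
    width = len(gray[0])
    cols = [c for c in range(width) if any(row[c] < threshold for row in gray)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_to_ink(gray, threshold=128, margin=2):
    """Crop to the ink bounding box plus a margin, leaving pixel values untouched."""
    box = ink_bounding_box(gray, threshold)
    if box is None:
        return gray  # blank cell: nothing to crop
    top, left, bottom, right = box
    height, width = len(gray), len(gray[0])
    top, left = max(top - margin, 0), max(left - margin, 0)
    bottom, right = min(bottom + margin, height), min(right + margin, width)
    return [row[left:right] for row in gray[top:bottom]]


# Synthetic 40x60 white "scan" with a 10x10 dark square standing in for a glyph.
page = [[255] * 60 for _ in range(40)]
for r in range(10, 20):
    for c in range(25, 35):
        page[r][c] = 0

glyph = crop_to_ink(page)
```

A real pipeline would normally de-skew the page first and use a more robust threshold (for example, one derived from the image histogram), since a fixed cutoff is sensitive to scanner brightness.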

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
