Applied Sciences (Apr 2022)

Burapha-TH: A Multi-Purpose Character, Digit, and Syllable Handwriting Dataset

  • Athita Onuean,
  • Uraiwan Buatoom,
  • Thatsanee Charoenporn,
  • Taehong Kim,
  • Hanmin Jung

DOI
https://doi.org/10.3390/app12084083
Journal volume & issue
Vol. 12, no. 8
p. 4083

Abstract

In handwriting recognition research, a public image dataset is necessary to evaluate algorithm correctness and runtime performance. Unfortunately, existing Thai script image datasets lack variety in standard handwriting types. This paper presents a new offline Thai handwriting image dataset named Burapha-TH. The dataset has 68 character classes, 10 digit classes, and 320 syllable classes. To construct the dataset, 1072 native Thai speakers wrote on collection datasheets, which were then digitized using a 300 dpi scanner. De-skewing, box detection, and segmentation algorithms were applied to the raw scans to extract the images. Several deep convolutional models were evaluated on the proposed dataset. The results show that the VGG-13 model (with batch normalization) achieved accuracy rates of 95.00%, 98.29%, and 96.16% on the character, digit, and syllable classes, respectively. Unlike all other known Thai handwriting datasets, Burapha-TH retains existing noise, the white background, and all artifacts generated by scanning. This comprehensive, raw, and more realistic dataset will be helpful for a variety of future research purposes.
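
The abstract does not say which box detection or segmentation algorithms were used, so the following is only an illustrative sketch of one common approach, projection-profile segmentation, and not the authors' method. It assumes a de-skewed, binarized scan represented as a list of rows in which 1 marks ink and 0 marks background; the names ink_runs and segment_boxes are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed encoding: 1 = ink, 0 = background


def ink_runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where profile >= min_ink."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i                      # a run of ink begins
        elif value < min_ink and start is not None:
            runs.append((start, i))        # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the edge of the image
    return runs


def segment_boxes(image: Image) -> List[Tuple[int, int, int, int]]:
    """Return one (top, left, bottom, right) box per run of ink columns."""
    if not image:
        return []
    # Vertical projection: amount of ink in each column.
    col_profile = [sum(col) for col in zip(*image)]
    boxes = []
    for left, right in ink_runs(col_profile):
        # Horizontal projection restricted to this column run.
        row_profile = [sum(row[left:right]) for row in image]
        rows = ink_runs(row_profile)
        top, bottom = rows[0][0], rows[-1][1]
        boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    scan = [
        [0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 1],
        [0, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0],
    ]
    print(segment_boxes(scan))
```

Box coordinates are half-open, so a crop is image[top:bottom] with each row sliced as row[left:right]. Because Burapha-TH retains scanner noise, a real pipeline would likely need a higher min_ink threshold or a denoising step before the profiles are computed; that choice is also an assumption here.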

Keywords