Heritage Science (Oct 2024)
An intelligent character segmentation system coupled with deep learning based recognition for the digitization of ancient Tamil palm leaf manuscripts
Abstract
Palm-leaf manuscripts, rich with ancient knowledge in areas such as history, art, and medicine, are vital cultural treasures. Digitizing these organic, fragile manuscripts is essential to safeguard the knowledge they hold, and doing so requires effective character segmentation and recognition algorithms. Few studies in the literature address Tamil character recognition. The proposed work handles row-overlapped characters, noise introduced by lighting issues and dirt, removal of punch holes, automatic cropping of the written content, and filtering of noisy or improper segmentations. The work is carried out as a four-step process: (1) palm leaf manuscript acquisition, (2) pre-processing, (3) segmentation of Tamil characters, and (4) Tamil character recognition. During acquisition, scanners are used to record palm leaf manuscripts from the Tamil Nadu-oriented manuscript library. In the pre-processing step, the Fast Non-Local Means (Fast-NLM) method, paired with median filtering, is used to denoise the scanner output image. The pixels that make up the characters and borders (i.e., the foreground) are then identified using Sauvola thresholding. The proposed methodology introduces efficient techniques to remove punch-hole impressions from the pre-processed image and to crop the written content from the edges. After pre-processing, segmentation of Tamil characters is performed as a three-step process, (a) manuscript, (b) line, and (c) character segmentation, which addresses conjoined lines and partially or completely empty segmentations that existing techniques do not address. This work introduces an Augmented HPP line-splitting algorithm that accurately segments written lines, handling wrong-segmentation cases that existing techniques did not consider.
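To make the line-segmentation step concrete, the following is a minimal sketch of plain projection-profile line splitting, assuming that HPP denotes the horizontal projection profile. It is not the authors' Augmented HPP algorithm, whose details the abstract does not give. The function names, the `min_ink` and `min_height` parameters, and the toy image are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def horizontal_projection(binary: List[List[int]]) -> List[int]:
    """Count foreground pixels (value 1) in every row of a binarized image."""
    return [sum(row) for row in binary]


def split_lines(binary: List[List[int]],
                min_ink: int = 1,
                min_height: int = 2) -> List[Tuple[int, int]]:
    """Return (top, bottom_exclusive) row ranges of candidate text lines.

    A line is a run of consecutive rows whose ink count is at least
    `min_ink`. Runs shorter than `min_height` rows are discarded as noise.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    profile = horizontal_projection(binary)
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the run currently being tracked
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a run of inked rows begins
        elif ink < min_ink and start is not None:
            if y - start >= min_height:    # keep only sufficiently tall runs
                lines.append((start, y))
            start = None                   # the run has ended
    # Close a run that reaches the bottom edge of the image.
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines


# Toy 9x4 binarized image: two text lines and a one-row speck at row 4.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]
line_ranges = split_lines(page)
```

On this toy input the speck at row 4 is dropped by the height filter. A plain profile like this cannot separate lines whose strokes overlap vertically, which is the kind of conjoined-line case the abstract says the augmented algorithm targets.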
The system achieves an average segmentation accuracy of 98.25%, which far outperforms existing techniques. It also proposes a novel punch-hole removal algorithm that locates and removes the punch-hole impressions in the manuscript image. This algorithm, along with the automated content cropping technique, increases recognition accuracy and eliminates the need for manual labor. These features make the proposed methodology highly suitable for real-time archaeological and historical research involving manuscripts. All 247 letters and 12 numeric digits are analyzed and separated into 125 distinct writable characters. The segmented characters are then used to recognize all 247 letters and 12 digits in Tamil with a multi-class CNN of 125 classes, which drastically reduces the complexity of the neural network compared to having 257 output nodes. The CNN achieves an accuracy of 96.04%. Compared with existing recognition systems for Tamil and other scripts, this work is effective in that it considers real-time images and a larger number of characters.
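The effect of the smaller class count on the network's output layer can be illustrated with a short calculation. This is a sketch under stated assumptions: the feature width of 512 is hypothetical and does not come from the paper, and only the final fully connected layer is counted. The letter inventory uses the standard breakdown of the Tamil script (12 vowels, 18 consonants, their 216 combinations, and the aytham).

```python
# Standard Tamil letter inventory: vowels, consonants,
# every vowel-consonant combination, and the aytham.
VOWELS, CONSONANTS, AYTHAM = 12, 18, 1
letters = VOWELS + CONSONANTS + VOWELS * CONSONANTS + AYTHAM


def dense_layer_params(in_features: int, n_classes: int) -> int:
    """Weights plus biases of a single fully connected output layer."""
    return in_features * n_classes + n_classes


FEATURES = 512  # hypothetical feature width, NOT taken from the paper

# Output layer sized for the 125 writable-character classes.
glyph_head = dense_layer_params(FEATURES, 125)
# Output layer sized for the 257-node alternative quoted in the abstract.
letter_head = dense_layer_params(FEATURES, 257)

print(letters, glyph_head, letter_head)
```

Under this assumption the 125-class output layer carries roughly half the parameters of the 257-node one. The saving in a real network depends on its actual architecture.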
Keywords