Информатика и автоматизация (Dec 2019)

Semantic Text Segmentation from Synthetic Images of Full-Text Documents

  • Lukáš Bureš,
  • Ivan Gruber,
  • Petr Neduchal,
  • Miroslav Hlaváč,
  • Marek Hrúz

DOI
https://doi.org/10.15622/sp.2019.18.6.1381-1406
Journal volume & issue
Vol. 18, no. 6
pp. 1381 – 1406

Abstract


An algorithm (divided into multiple modules) for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR). The algorithm is modular: individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this, a novel approach based on a Variational AutoEncoder (VAE) was used to train a generative model. The model enables background images similar to the training ones to be generated on the fly. The module for printing the text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable results (for natural-looking aged documents). Several page layouts are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare real-world images to the generated ones. The recognition rates are very similar, indicating that the synthetic images closely resemble real documents. Moreover, the errors made by the OCR system are very similar in both cases. On the generated images, a fully convolutional encoder-decoder neural network for semantic segmentation of individual characters was trained. This architecture reaches a recognition accuracy of 99.28% on a test set of synthetic documents.
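The text-printing module described above (characters placed with small positional and brightness noise, plus a structured annotation of the result) can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the glyph is a placeholder square rather than a rasterized font, and all names and parameters (`synthesize_line`, `pos_jitter`, `bright_noise`) are hypothetical.

```python
import numpy as np

def synthesize_line(text, glyph_size=8, page_width=200,
                    pos_jitter=1, bright_noise=0.1, seed=0):
    """Toy sketch of a noisy text-printing module: each character is
    stamped as a dark square with small positional and brightness noise,
    and a structured annotation (character + bounding box) is recorded.
    A real generator would rasterize an actual font instead."""
    rng = np.random.default_rng(seed)
    # White page background; a real system would use a VAE-generated texture.
    page = np.ones((glyph_size * 3, page_width), dtype=np.float32)
    annotation = []
    x, y = 2, glyph_size  # nominal pen position and baseline
    for ch in text:
        dx, dy = rng.integers(-pos_jitter, pos_jitter + 1, size=2)  # positional noise
        ink = float(np.clip(0.1 + rng.normal(0.0, bright_noise), 0.0, 1.0))  # brightness noise
        x0, y0 = x + dx, y + dy
        page[y0:y0 + glyph_size, x0:x0 + glyph_size] = ink  # stamp placeholder glyph
        annotation.append({"char": ch, "bbox": (x0, y0, glyph_size, glyph_size)})
        x += glyph_size + 2  # advance to the next character cell
    return page, annotation

page, ann = synthesize_line("OCR")
```

The returned annotation plays the role of the detailed ground truth the paper pairs with each synthetic image, which is what makes the images usable for training segmentation models.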

Keywords