EmmDocClassifier: Efficient Multimodal Document Image Classifier for Scarce Data

Shrinidhi Kanchi; Alain Pagani; Hamam Mokayed; Marcus Liwicki; Didier Stricker; Muhammad Zeshan Afzal

doi:10.3390/app12031457

Applied Sciences (Jan 2022)

Affiliations

Shrinidhi Kanchi: Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
Alain Pagani: German Research Institute for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
Hamam Mokayed: Department of Computer Science, Luleå University of Technology, 971 87 Luleå, Sweden
Marcus Liwicki: Department of Computer Science, Luleå University of Technology, 971 87 Luleå, Sweden
Didier Stricker: Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
Muhammad Zeshan Afzal: Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany

Journal volume & issue: Vol. 12, no. 3, p. 1457

Abstract

Document classification is one of the most critical steps in the document analysis pipeline. Approaches to document classification fall into two types: image-based and multimodal. Image-based approaches rely solely on the inherent visual cues of the document images. In contrast, multimodal approaches co-learn visual and textual features and have proved more effective. Nonetheless, these approaches require a huge amount of data. This paper presents a novel approach for document classification that works with a small amount of data and outperforms other approaches. The proposed approach incorporates a hierarchical attention network (HAN) for the textual stream and EfficientNet-B0 for the image stream. The HAN in the textual stream uses dynamic word embeddings from a fine-tuned BERT and incorporates both word-level and sentence-level features. While earlier approaches rely on training on a large corpus (RVL-CDIP), we show that our approach works with a small amount of data (Tobacco-3482). To this end, we trained the neural network on Tobacco-3482 from scratch. As a result, we outperform the state of the art, obtaining an accuracy of 90.3%, which corresponds to a relative error reduction of 7.9%.
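The two figures at the end of the abstract (90.3% accuracy and a 7.9% relative error reduction) are linked by simple arithmetic. The sketch below assumes the usual definition of relative error reduction, (baseline error − new error) / baseline error, which the abstract does not spell out. The function names are illustrative, and the baseline accuracy is not stated in the abstract; it is only back-computed here from the two reported numbers.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's error that the new model removes.

    Assumed definition: (baseline_error - new_error) / baseline_error.
    """
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline accuracy must be below 1.0")
    return (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc: float, reduction: float) -> float:
    """Invert the formula above to recover the baseline accuracy."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must lie in [0, 1)")
    new_err = 1.0 - new_acc
    return 1.0 - new_err / (1.0 - reduction)


# Figures reported in the abstract.
reported_acc = 0.903
reported_reduction = 0.079

# Derived value, not a number quoted from the paper.
baseline = implied_baseline_accuracy(reported_acc, reported_reduction)
round_trip = relative_error_reduction(baseline, reported_acc)

print(f"implied baseline accuracy: {baseline:.3f}")
print(f"round-trip relative error reduction: {round_trip:.3f}")
```

Under this assumed definition, the reported figures imply a previous best accuracy of roughly 89.5% on Tobacco-3482. A gain of under one percentage point in accuracy therefore removes about 8% of the remaining errors.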

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
