Unsupervised Exemplar-Based Learning for Improved Document Image Classification

Sherif Abuelwafa; Marco Pedersoli; Mohamed Cheriet

doi:10.1109/ACCESS.2019.2940884

IEEE Access (Jan 2019)

Affiliations

Sherif Abuelwafa: École de technologie supérieure, University of Quebec, Montreal, QC, Canada
Marco Pedersoli: École de technologie supérieure, University of Quebec, Montreal, QC, Canada
Mohamed Cheriet: École de technologie supérieure, University of Quebec, Montreal, QC, Canada

DOI: https://doi.org/10.1109/ACCESS.2019.2940884
Journal volume & issue: Vol. 7
pp. 133738 – 133748

Abstract

Many recent state-of-the-art approaches for document image classification are based on supervised feature learning, which requires a large amount of labeled training data. In real-world document image classification problems, labeled data is scarce, while a large amount of unlabeled data is often available at almost no cost. In this paper, we present an approach for learning visual features for document analysis in an unsupervised way, which improves document image classification performance without increasing the amount of annotated data. The proposed approach trains a neural network model on an auxiliary task in which every training example is associated with a different label (exemplar) and expanded to multiple images through a data augmentation technique. The learned model, which is trained in an unsupervised way, is then used to boost document classification performance. This learned model proved consistently effective in two different settings: i) as an unsupervised feature extractor to represent document images for an unsupervised classification task (i.e., clustering); and ii) in the parameter initialization of a supervised classification task trained with a small amount of annotated data. We perform experiments on the Tobacco-3482 dataset and demonstrate the capability of our approach to improve i) the unsupervised classification accuracy by up to 2.4%; and ii) the supervised classification accuracy by 1.5% without any extra data, or by 5% when using 3000 additional unannotated samples.
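
The auxiliary task described in the abstract can be illustrated with a minimal Python sketch. It only shows how surrogate labels might be assigned: each unlabeled image becomes its own class (exemplar) and is expanded into several augmented copies. The function names `augment` and `build_exemplar_dataset`, the toy 2x3 "images", and the specific transformations (a wrap-around horizontal shift and a random vertical flip) are assumptions made for illustration; the abstract does not specify which augmentations or data structures the authors use.

```python
import random


def augment(image, rng):
    """Return a randomly transformed copy of a 2D image (list of rows).

    The transformations here (wrap-around horizontal shift, optional
    vertical flip) are placeholders, not the augmentations of the paper.
    """
    width = len(image[0])
    shift = rng.randrange(width)
    shifted = [row[shift:] + row[:shift] for row in image]
    if rng.random() < 0.5:
        shifted = shifted[::-1]  # flip the row order
    return shifted


def build_exemplar_dataset(images, copies_per_image, seed=0):
    """Give each unlabeled image its own surrogate class and expand it."""
    rng = random.Random(seed)
    samples, labels = [], []
    for surrogate_label, image in enumerate(images):
        for _ in range(copies_per_image):
            samples.append(augment(image, rng))
            labels.append(surrogate_label)  # one class per source image
    return samples, labels


# Three toy "document images"; real inputs would be scanned pages.
unlabeled = [
    [[0, 1, 2], [3, 4, 5]],
    [[9, 8, 7], [6, 5, 4]],
    [[1, 1, 0], [0, 1, 1]],
]
samples, labels = build_exemplar_dataset(unlabeled, copies_per_image=4)
print(len(samples), labels)
# 12 [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

A classifier trained to predict these surrogate labels needs no human annotation. According to the abstract, the resulting network is then reused either as a feature extractor for clustering or as the initialization for supervised training on a small labeled set.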

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
