IEEE Access (Jan 2021)

Self-Supervised Representation Learning for Document Image Classification

  • Shoaib Ahmed Siddiqui,
  • Andreas Dengel,
  • Sheraz Ahmed

DOI
https://doi.org/10.1109/ACCESS.2021.3133200
Journal volume & issue
Vol. 9
pp. 164358–164367

Abstract

Supervised learning, despite being extremely effective, relies on expensive, time-consuming, and error-prone annotations. Self-supervised learning has recently emerged as a strong alternative to supervised learning across a range of domains, since collecting large amounts of unlabeled data can be achieved by simply crawling the internet. These self-supervised methods automatically discover features relevant to representing an input example by solving self-defined proxy tasks. In this paper, we question the potential of the commonly employed purely supervised training regimes (starting either from ImageNet-pretrained networks or from random initialization) in contrast to representations learned directly from large document image datasets using self-supervised methods. For this purpose, we leverage a large-scale document image collection (RVL-CDIP) to train a ResNet-50 image encoder using two different self-supervision methods (SimCLR and Barlow Twins). Employing a linear classifier on top of self-supervised embeddings from ResNet-50 results in an accuracy of 86.75%, as compared to 71.43% from the corresponding ImageNet-pretrained embeddings. Similarly, evaluating on the Tobacco-3482 dataset using self-supervised embeddings from ResNet-50 yields an accuracy of 88.52%, in contrast to 74.16% from the corresponding ImageNet-pretrained embeddings. We show that in the case of limited labeled data, this wide gap in performance between self-supervised and fully supervised models persists even after fine-tuning the pretrained models. However, a significant reduction in this gap is observed with an increasing amount of labeled data, including the case where the model is trained from scratch. Our results show that representations learned using self-supervised representation learning techniques are a viable option for document image classification, specifically in the context of limited labeled data, which is a common restriction in industrial use cases.
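
To make the evaluation protocol described above concrete, the following is a minimal PyTorch sketch of linear evaluation on frozen self-supervised embeddings: a ResNet-50 backbone (weights assumed to come from SimCLR or Barlow Twins pretraining on RVL-CDIP) with a single linear classifier trained on top. The checkpoint path, hyperparameters, and training-step helper are illustrative placeholders and are not taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_CLASSES = 16  # RVL-CDIP has 16 document classes

# Backbone: ResNet-50 with its final classification layer removed,
# so it outputs 2048-dimensional pooled embeddings.
encoder = resnet50(weights=None)
encoder.fc = nn.Identity()
# encoder.load_state_dict(torch.load("ssl_rvlcdip_resnet50.pth"))  # hypothetical SSL checkpoint

# Freeze the encoder: only the linear head is trained (linear evaluation).
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Linear classifier on top of the frozen embeddings.
linear_head = nn.Linear(2048, NUM_CLASSES)
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step of the linear probe on frozen features."""
    with torch.no_grad():
        feats = encoder(images)      # (B, 2048) frozen embeddings
    logits = linear_head(feats)      # (B, NUM_CLASSES)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same frozen-encoder setup applies when probing on Tobacco-3482; only the classifier head (with 10 output classes) and the labeled training set change, while the self-supervised backbone remains fixed.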

Keywords