IEEE Access (Jan 2024)

Data Augmentation With Semantic Enrichment for Deep Learning Invoice Text Classification

  • Wei Wen Chi,
  • Tiong Yew Tang,
  • Narishah Mohamed Salleh,
  • Muaadh Mukred,
  • Hussain AlSalman,
  • Muhammad Zohaib

DOI
https://doi.org/10.1109/ACCESS.2024.3387860
Journal volume & issue
Vol. 12
pp. 57326 – 57344

Abstract

Natural language processing (NLP) offers considerable potential for automating accounting tasks that involve text data. This research studies the application of NLP to automatically categorizing invoices based on their text descriptions. The study employs semantic enrichment, data augmentation, and deep learning to address the challenges posed by the short text and multi-class imbalance inherent in invoice descriptions. Semantic enrichment was performed using the labels as an information source. The training data was artificially enlarged using one of three methods: WordNet synonym replacement, Global Vectors for Word Representation (GloVe) word replacement, or Bidirectional Encoder Representations from Transformers (BERT) word replacement. Each training dataset was then used to train one non-deep-learning classifier, Linear Support Vector Machine (LSVM), and two deep learning classifiers, Bidirectional Long Short-Term Memory (Bi-LSTM) and BERT. Overall, the semantically enriched, WordNet-augmented training set paired with the BERT classifier yielded the best results: it preserved semantics and reduced noise and overfitting while improving per-class accuracy, with gains of up to 20 percentage points (ppts) in macro F1 score and 6.7 ppts in accuracy.
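The synonym-replacement augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' pipeline in detail, so everything here is an assumption made for illustration: the toy synonym table `TOY_SYNONYMS` stands in for real WordNet lookups, the helper names `synonym_replace` and `augment_to_balance` are hypothetical, and oversampling every minority class up to the size of the largest class is only one possible balancing strategy.

```python
import random
from collections import Counter

# Hypothetical toy synonym table standing in for WordNet lookups (assumption).
TOY_SYNONYMS = {
    "repair": ["fix", "mend"],
    "fee": ["charge", "cost"],
    "laptop": ["notebook"],
}


def synonym_replace(text, rng, synonyms=TOY_SYNONYMS):
    """Return a copy of text with one replaceable word swapped for a synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    if not candidates:
        return text  # no known synonym: the text is returned unchanged
    i = rng.choice(candidates)
    words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


def augment_to_balance(samples, seed=0):
    """Append synonym-replaced copies of minority-class samples until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    augmented = list(samples)  # original samples are kept, in order
    for label, count in counts.items():
        pool = [text for text, lab in samples if lab == label]
        for _ in range(target - count):
            augmented.append((synonym_replace(rng.choice(pool), rng), label))
    return augmented


# Illustrative invoice descriptions (made up for this sketch).
invoices = [
    ("laptop purchase", "equipment"),
    ("monitor purchase", "equipment"),
    ("keyboard purchase", "equipment"),
    ("printer repair fee", "maintenance"),
]
balanced = augment_to_balance(invoices)
```

In this sketch the single "maintenance" description gains two augmented variants, so both classes end up with three samples. A real setup would draw synonyms from WordNet (or from GloVe or BERT word replacement, as in the study) rather than from a fixed table.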

Keywords