Data Augmentation With Semantic Enrichment for Deep Learning Invoice Text Classification

Wei Wen Chi; Tiong Yew Tang; Narishah Mohamed Salleh; Muaadh Mukred; Hussain AlSalman; Muhammad Zohaib

doi:10.1109/ACCESS.2024.3387860

IEEE Access (Jan 2024)

Affiliations

Wei Wen Chi: Department of Business Analytics, Sunway Business School, Sunway University, Bandar Sunway, Selangor, Malaysia
Tiong Yew Tang: Department of Business Analytics, Sunway Business School, Sunway University, Bandar Sunway, Selangor, Malaysia
Narishah Mohamed Salleh: Department of Business Analytics, Sunway Business School, Sunway University, Bandar Sunway, Selangor, Malaysia
Muaadh Mukred: Department of Business Analytics, Sunway Business School, Sunway University, Bandar Sunway, Selangor, Malaysia
Hussain AlSalman: Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Muhammad Zohaib: Software Engineering Department, Lappeenranta-Lahti University of Technology, Lappeenranta, Finland

DOI: https://doi.org/10.1109/ACCESS.2024.3387860
Journal volume & issue: Vol. 12, pp. 57326–57344

Abstract

Natural language processing (NLP) offers considerable potential for automating accounting tasks that deal with text data. This research studies the application of NLP to automatically categorizing invoices based on their text descriptions. The study employs semantic enrichment, data augmentation, and deep learning to address the NLP challenges posed by the short-text and multi-class-imbalanced nature of invoice descriptions. Semantic enrichment was performed using labels as an information source. The training data was artificially enlarged using one of three methods: WordNet synonym replacement, Global Vectors for Word Representation (GloVe) word replacement, or Bidirectional Encoder Representations from Transformers (BERT) word replacement. Each training dataset was then used to train one non-deep-learning classifier, Linear Support Vector Machine (LSVM), and two deep learning classifiers, Bidirectional Long Short-Term Memory (Bi-LSTM) and BERT. Overall, the semantically enriched, WordNet-augmented training set paired with the BERT classifier yielded the best results: it preserved semantics, reduced noise and overfitting, and improved per-class accuracy, achieving gains of up to 20 percentage points (ppts) in macro F1 score and 6.7 ppts in accuracy.
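
As a rough illustration of synonym-replacement augmentation for imbalanced short texts, the following Python sketch oversamples minority classes with synonym-substituted copies. It is not the authors' implementation: the `SYNONYMS` table is a hypothetical stand-in for WordNet lookups, the sample invoice descriptions and labels are invented, and the choice to top up each class to a fixed `target_per_class` is an assumption made for the example.

```python
import random

# Hypothetical toy synonym table standing in for WordNet lookups (assumption).
SYNONYMS = {
    "purchase": ["buy", "acquisition"],
    "repair": ["fix", "mend"],
    "fee": ["charge"],
}


def synonym_replace(text, rng, n=1):
    """Replace up to n words that have an entry in SYNONYMS with a random synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


def augment(samples, target_per_class, seed=0):
    """Add synonym-replaced copies until each class has target_per_class samples."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    out = list(samples)
    for label, texts in by_label.items():
        need = max(0, target_per_class - len(texts))
        for _ in range(need):
            out.append((synonym_replace(rng.choice(texts), rng), label))
    return out


# Invented invoice descriptions; "maintenance" is the minority class.
samples = [
    ("laptop purchase", "office equipment"),
    ("printer purchase", "office equipment"),
    ("monitor purchase", "office equipment"),
    ("vehicle repair fee", "maintenance"),
]
augmented = augment(samples, target_per_class=3)
for text, label in augmented:
    print(f"{label}: {text}")
```

Only the underrepresented class receives new samples here, and each new sample differs from its source by a single substituted word, which is one way such a method might keep the original meaning largely intact.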

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
