IEEE Access (Jan 2023)
Solving Data Imbalance in Text Classification With Constructing Contrastive Samples
Abstract
Contrastive learning (CL) has been successfully applied in Natural Language Processing (NLP) as a powerful representation learning method and has shown promising results in various downstream tasks. Recent research has highlighted the importance of constructing effective contrastive samples through data augmentation. However, current data augmentation methods rely primarily on random word deletion, substitution, and cropping, which may introduce noisy samples and hinder representation learning. In this article, we propose a novel approach that addresses data imbalance in text classification by constructing contrastive samples. Our method uses a Label-indicative Component to generate high-quality positive samples for the minority class, and introduces a Hard Negative Mixing strategy to synthesize challenging negative samples at the feature level. By applying supervised contrastive learning to these samples, we obtain superior text representations, which significantly benefit text classification on imbalanced data. Our approach effectively mitigates distributional biases and promotes noise-resistant representation learning. To validate the effectiveness of our method, we conducted experiments on benchmark datasets (THUCNews, AG's News, 20NG) as well as the imbalanced FDCNews dataset. The code for our method is publicly available at the following GitHub repository: https://github.com/hanggun/CLDMTC.
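To make the two core ingredients concrete, the sketch below illustrates feature-level hard negative mixing and a supervised contrastive loss in PyTorch. The function names (`mix_hard_negatives`, `supcon_loss`) and the convex-combination mixing rule (in the style of MoCHi, Kalantidis et al., 2020) are illustrative assumptions, not the paper's exact implementation; the supervised contrastive loss follows Khosla et al. (2020), where same-label samples in a batch are treated as positives.

```python
# Minimal sketch of feature-level hard negative mixing and supervised
# contrastive loss. Hypothetical names and mixing rule; see the paper's
# repository (https://github.com/hanggun/CLDMTC) for the actual method.
import torch
import torch.nn.functional as F

def mix_hard_negatives(anchor, negatives, num_synthetic=4):
    """Synthesize negatives at the feature level by mixing the hardest
    real negatives (MoCHi-style assumption). `anchor` is (d,),
    `negatives` is (n, d); all features assumed L2-normalized."""
    # Negatives most similar to the anchor are the hardest.
    sims = negatives @ anchor                                  # (n,)
    hard = negatives[sims.argsort(descending=True)[:num_synthetic + 1]]
    # Convex combinations of adjacent hard negatives yield synthetic ones.
    lam = torch.rand(num_synthetic, 1)
    mixed = lam * hard[:-1] + (1 - lam) * hard[1:]
    return F.normalize(mixed, dim=-1)

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al., 2020): for each anchor,
    same-label batch samples are positives, the rest are negatives.
    `features` is (batch, d), assumed L2-normalized."""
    sim = features @ features.T / temperature                  # (batch, batch)
    eye = torch.eye(len(features), dtype=torch.bool, device=features.device)
    sim.masked_fill_(eye, float('-inf'))                       # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Average log-probability over each anchor's positives (zero where none).
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    return -(pos_log_prob / pos_mask.sum(1).clamp(min=1)).mean()
```

In this reading, minority-class representations get extra positives from the Label-indicative Component while `mix_hard_negatives` supplies harder negatives than the batch alone provides, both of which sharpen the contrastive signal that `supcon_loss` optimizes.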
Keywords