IEEE Access (Jan 2023)
Solving Data Imbalance in Text Classification With Constructing Contrastive Samples
Abstract
Contrastive learning (CL) has been successfully applied in Natural Language Processing (NLP) as a powerful representation learning method and has shown promising results in various downstream tasks. Recent research has highlighted the importance of constructing effective contrastive samples through data augmentation. However, current data augmentation methods rely primarily on random word deletion, substitution, and cropping, which may introduce noisy samples and hinder representation learning. In this article, we propose a novel approach that addresses data imbalance in text classification by constructing contrastive samples. Our method uses a Label-indicative Component to generate high-quality positive samples for the minority class, and introduces a Hard Negative Mixing strategy to synthesize challenging negative samples at the feature level. By applying supervised contrastive learning to these samples, we obtain superior text representations, which significantly benefit text classification on imbalanced data. Our approach effectively mitigates distributional biases and promotes noise-resistant representation learning. To validate the effectiveness of our method, we conducted experiments on benchmark datasets (THUCNews, AG's News, 20NG) as well as the imbalanced FDCNews dataset. The code for our method is publicly available at the following GitHub repository: https://github.com/hanggun/CLDMTC.
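To make the two core ingredients concrete, the sketch below illustrates feature-level hard negative mixing and a supervised contrastive loss in PyTorch. The function names (`mix_hard_negatives`, `supcon_loss`) and the convex-combination mixing rule (in the style of MoCHi, Kalantidis et al., 2020) are illustrative assumptions, not the paper's exact implementation; the supervised contrastive loss follows Khosla et al. (2020), where same-label samples in a batch are treated as positives.

```python
# Minimal sketch of feature-level hard negative mixing and supervised
# contrastive loss. Hypothetical names and mixing rule; see the paper's
# repository (https://github.com/hanggun/CLDMTC) for the actual method.
import torch
import torch.nn.functional as F

def mix_hard_negatives(anchor, negatives, num_synthetic=4):
    """Synthesize negatives at the feature level by mixing the hardest
    real negatives (MoCHi-style assumption). `anchor` is (d,),
    `negatives` is (n, d); all features assumed L2-normalized."""
    # Negatives most similar to the anchor are the hardest.
    sims = negatives @ anchor                                  # (n,)
    hard = negatives[sims.argsort(descending=True)[:num_synthetic + 1]]
    # Convex combinations of adjacent hard negatives yield synthetic ones.
    lam = torch.rand(num_synthetic, 1)
    mixed = lam * hard[:-1] + (1 - lam) * hard[1:]
    return F.normalize(mixed, dim=-1)

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss (Khosla et al., 2020): for each anchor,
    same-label batch samples are positives, the rest are negatives.
    `features` is (batch, d), assumed L2-normalized."""
    sim = features @ features.T / temperature                  # (batch, batch)
    eye = torch.eye(len(features), dtype=torch.bool, device=features.device)
    sim.masked_fill_(eye, float('-inf'))                       # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Average log-probability over each anchor's positives (zero where none).
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    return -(pos_log_prob / pos_mask.sum(1).clamp(min=1)).mean()
```

In this reading, minority-class representations get extra positives from the Label-indicative Component while `mix_hard_negatives` supplies harder negatives than the batch alone provides, both of which sharpen the contrastive signal that `supcon_loss` optimizes.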
Keywords