Jisuanji kexue yu tansuo (Sep 2024)
Research on Processing and Application of Imbalanced Textual Data on Social Platforms
Abstract
With the informatization of society, it is of great practical value to extract useful information from the massive textual data available online using natural language processing (NLP) tools. However, texts collected from social platforms suffer from issues such as a low proportion of valuable data and class imbalance. This paper proposes two methods to deal with these problems, named SimDyFeFL (SimBERT & dynamic feedback Focal Loss) and EdaDyFeFL (EDA & dynamic feedback Focal Loss); the former is applicable to crisis-related information recognition tasks in Chinese, and the latter to cyber troll detection tasks in English. Specifically, the SimBERT and EDA (easy data augmentation) methods are used to augment the original data, in which class sizes differ greatly, so that all classes contain a similar number of samples, and a Focal Loss function with a dynamic feedback process is then incorporated to weight each class. On this basis, BERT (bidirectional encoder representations from transformers), RoBERTa (robustly optimized BERT pre-training approach), and BERT_DPCNN (BERT deep pyramid convolutional neural networks) text classification models are designed for three-stage comparative experiments to validate the effectiveness of the proposed methods. Extensive experiments on two real datasets in Chinese and English show that the text classification models improved with SimDyFeFL and EdaDyFeFL perform significantly better: the accuracy of the Chinese model is increased by 7.70 percentage points, and the accuracy of the English model is increased by 5.15 percentage points. Compared with the best results on the Kaggle platform, the accuracy of the English model is 2.92 percentage points higher, and the Macro F1 score and Weighted F1 score are 2.83 percentage points and 2.95 percentage points higher, respectively.
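The abstract's core loss-side idea is a Focal Loss whose per-class weights are adjusted through a feedback process. The following is a minimal PyTorch sketch of a class-weighted Focal Loss with an externally driven weight update; the class names, the `update_alpha` helper, and the recall-based update rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

class WeightedFocalLoss(torch.nn.Module):
    """Focal Loss with per-class weights alpha; gamma down-weights easy examples."""
    def __init__(self, num_classes, gamma=2.0):
        super().__init__()
        self.gamma = gamma
        # Per-class weights, updated from feedback (e.g., validation recall).
        self.register_buffer("alpha", torch.ones(num_classes))

    def forward(self, logits, targets):
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        # Probability and log-probability assigned to the true class of each sample.
        pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        alpha_t = self.alpha[targets]
        # FL = -alpha_t * (1 - pt)^gamma * log(pt)
        loss = -alpha_t * (1.0 - pt) ** self.gamma * log_pt
        return loss.mean()

    def update_alpha(self, per_class_recall):
        # Hypothetical feedback step: raise the weight of classes the model
        # currently recalls poorly (not the paper's exact update rule).
        self.alpha = (1.0 - per_class_recall).clamp(min=0.1)
```

In use, `update_alpha` would be called after each evaluation round so that minority or hard classes receive larger weights in the next training epoch, complementing the SimBERT/EDA augmentation that balances raw sample counts.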
Keywords