A Novel Classification Method Based on a Two-Phase Technique for Learning Imbalanced Text Data

Der-Chiang Li; Szu-Chou Chen; Yao-San Lin; Wen-Yen Hsu

doi:10.3390/sym14030567

Symmetry (Mar 2022)


Affiliations

Der-Chiang Li: Department of Industrial and Information Management, National Cheng Kung University, Tainan City 70101, Taiwan
Szu-Chou Chen: Institute of Information Management, National Cheng Kung University, Tainan City 70101, Taiwan
Yao-San Lin: Singapore Centre for Chinese Language, Nanyang Technological University, Singapore 279623, Singapore
Wen-Yen Hsu: Institute of Information Management, National Cheng Kung University, Tainan City 70101, Taiwan

DOI: https://doi.org/10.3390/sym14030567
Journal volume & issue: Vol. 14, no. 3, p. 567

Abstract

Imbalanced data heavily degrade the performance of learning models. In an imbalanced text dataset, minority-class samples are often assigned to the majority class, which loses minority information and lowers accuracy. Handling datasets with a high imbalance ratio is therefore a serious challenge. Here, we propose a novel classification method for learning tasks with imbalanced text data. The aim is to provide a data preprocessing method that researchers can apply to such tasks, sparing them the effort of searching for more specialized learning tools. The proposed method has two core stages. In stage one, balanced datasets are generated using an asymmetric cost-sensitive support vector machine; in stage two, the balanced dataset is classified using a symmetric cost-sensitive support vector machine. The learning parameters in both stages are tuned with a genetic algorithm to create an optimal model. A Yelp review dataset was used to validate the effectiveness of the proposed method. The experimental results showed that the proposed method performed better on the targeted dataset, with at least 75% accuracy, and revealed that this new method significantly improved the learning approach.
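
The abstract does not give the SVM formulation, so the following is only a minimal sketch of what "asymmetric" versus "symmetric" cost-sensitive could mean, assuming the standard soft-margin objective with a separate misclassification cost per class. The function names, the inverse-frequency cost heuristic, and the toy data are illustrative assumptions, not the authors' implementation; in the paper the costs are tuned with a genetic algorithm, which is not shown here.

```python
from typing import List, Sequence, Tuple


def hinge(margin: float) -> float:
    """Hinge loss for a signed margin y * f(x)."""
    return max(0.0, 1.0 - margin)


def cost_sensitive_objective(
    w: Sequence[float],
    b: float,
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    c_pos: float,
    c_neg: float,
) -> float:
    """0.5 * ||w||^2 plus class-weighted hinge losses.

    c_pos != c_neg gives an asymmetric cost; c_pos == c_neg a symmetric one.
    """
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        cost = c_pos if yi == 1 else c_neg
        loss += cost * hinge(yi * score)
    return reg + loss


def inverse_frequency_costs(y: Sequence[int]) -> Tuple[float, float]:
    """Assumed heuristic: cost inversely proportional to class frequency."""
    n = len(y)
    n_pos = sum(1 for yi in y if yi == 1)
    n_neg = n - n_pos
    return n / (2.0 * n_pos), n / (2.0 * n_neg)


# Hypothetical toy data: one minority sample (+1), three majority samples (-1).
X: List[List[float]] = [[0.5, 0.5], [0.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y: List[int] = [1, -1, -1, -1]
w, b = [1.0, -1.0], 0.0  # a fixed candidate hyperplane, not a trained model

c_pos, c_neg = inverse_frequency_costs(y)
asym = cost_sensitive_objective(w, b, X, y, c_pos, c_neg)
sym = cost_sensitive_objective(w, b, X, y, 1.0, 1.0)
print(c_pos, c_neg, asym, sym)
```

Under these assumptions, the only margin violation comes from the minority sample, so the asymmetric costs penalize this hyperplane more heavily than the symmetric ones do, which is the pressure that pushes a cost-sensitive classifier away from ignoring the minority class.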

Published in Symmetry

ISSN: 2073-8994 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/symmetry/
