SMOTETomek-Based Resampling for Personality Recognition

Zhe Wang; Chunhua Wu; Kangfeng Zheng; Xinxin Niu; Xiujuan Wang

doi:10.1109/ACCESS.2019.2940061

IEEE Access (Jan 2019)

Affiliations

Zhe Wang: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Chunhua Wu: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Kangfeng Zheng: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Xinxin Niu: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Xiujuan Wang: School of Computer Science, Beijing University of Technology, Beijing, China

DOI: https://doi.org/10.1109/ACCESS.2019.2940061
Journal volume & issue: Vol. 7, pp. 129678–129689

Abstract

The main challenge in user personality recognition is low accuracy caused by small sample sizes and severely imbalanced sample distributions. This paper analyzes how imbalanced data distribution and the overlap between positive and negative samples affect machine learning classification models. Both problems can be addressed once the data are effectively resampled, so the classification model is built on a data resampling technique that improves classification accuracy. We present a personality prediction method, PSO-SMOTETomek, based on particle swarm optimization (PSO) and synthetic minority oversampling technique + Tomek link (SMOTETomek) resampling. In addition to resampling the data samples with SMOTETomek, the method performs PSO feature optimization for each set of feature combinations. Validated by simulation, our analysis shows that PSO-SMOTETomek is effective on small datasets and improves personality recognition accuracy by up to around 10%. The results are better than those of previous similar studies. The average accuracies on the plain text and non-plain text datasets are 75.34% and 78.78%, respectively, and the average accuracies on the short text and long text datasets are 75.34% and 64.25%, respectively. The experimental results indicate that short text yields better classification performance than long text, and that plain text data can still achieve high personality discrimination accuracy even though it carries no relevant external information. The proposed model can facilitate the design and implementation of a personality recognition system, and it significantly outperforms existing state-of-the-art models.
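
The resampling step named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`smote`, `remove_tomek_links`, `smote_tomek`), the toy data, and the choice to drop both members of each Tomek link are assumptions made for illustration, and the PSO feature optimization stage is not shown. The sketch assumes the minority class has at least two samples. In practice a maintained library such as imbalanced-learn (`imblearn.combine.SMOTETomek`) would normally be used instead.

```python
import math
import random


def nearest(idx, points, candidates):
    """Index in `candidates` of the point closest to points[idx], excluding idx."""
    return min((j for j in candidates if j != idx),
               key=lambda j: math.dist(points[idx], points[j]))


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until the minority matches the largest class."""
    rng = rng or random.Random(0)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = max(y.count(c) for c in set(y)) - len(min_idx)
    X_new, y_new = list(X), list(y)
    for _ in range(n_needed):
        i = rng.choice(min_idx)
        # k nearest original minority neighbours of sample i
        neighbours = sorted((j for j in min_idx if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()
        # Interpolate between sample i and the chosen neighbour j
        X_new.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_new.append(minority)
    return X_new, y_new


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link.

    A Tomek link is a pair of samples from different classes that are
    each other's nearest neighbour.
    """
    n = len(X)
    nn = [nearest(i, X, range(n)) for i in range(n)]
    drop = {i for i in range(n) if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(n) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X, y, minority, k=3, seed=0):
    """Oversample with SMOTE, then clean class overlap with Tomek links."""
    X_os, y_os = smote(X, y, minority, k=k, rng=random.Random(seed))
    return remove_tomek_links(X_os, y_os)


# Toy, made-up data: six majority samples (label 0), two minority samples (label 1)
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
     [0.2, 0.4], [0.4, 0.1], [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = smote_tomek(X, y, minority=1)
print({label: y_res.count(label) for label in sorted(set(y_res))})
```

The oversampling step addresses the class imbalance, while the Tomek link removal targets the overlap between positive and negative samples that the abstract identifies as the second source of error.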

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
