How to balance the bioinformatics data: pseudo-negative sampling

Yongqing Zhang; Shaojie Qiao; Rongzhao Lu; Nan Han; Dingxiang Liu; Jiliu Zhou

doi:10.1186/s12859-019-3269-4

BMC Bioinformatics (Dec 2019)

Affiliations

Yongqing Zhang: School of Computer Science, Chengdu University of Information Technology
Shaojie Qiao: School of Software Engineering, Chengdu University of Information Technology
Rongzhao Lu: School of Computer Science, Chengdu University of Information Technology
Nan Han: School of Management, Chengdu University of Information Technology
Dingxiang Liu: School of Cybersecurity, Chengdu University of Information Technology
Jiliu Zhou: School of Computer Science, Chengdu University of Information Technology

DOI: https://doi.org/10.1186/s12859-019-3269-4
Journal volume & issue: Vol. 20, no. S25
pp. 1 – 13

Abstract

Background: Imbalanced datasets are common in bioinformatics classification problems: the number of negative samples is much larger than the number of positive samples. This imbalance causes the performance on the minority class of positive samples to be underestimated. How to balance bioinformatics data is therefore a challenging problem.

Results: In this study, we propose a new data sampling approach, called pseudo-negative sampling, for the case in which negative samples greatly outnumber positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond Pearson correlation coefficient (MMPCC), which chooses pseudo-negative samples from the negative samples and treats them as positive samples. In addition, MMPCC uses an incremental search technique to select optimal pseudo-negative samples, which reduces the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and less redundancy with negative ones.

Conclusions: To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. The experimental results show that MMPCC performs better than other sampling methods in terms of Sensitivity, Specificity, Accuracy, and the Matthews correlation coefficient. This indicates that pseudo-negative samples are particularly helpful for the imbalanced dataset problem. Moreover, the gain in Sensitivity on the minority samples obtained with pseudo-negative samples grows with the improvement in prediction accuracy on all datasets.
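The selection idea described in the Results can be illustrated with a minimal Python sketch. The abstract does not give the MMPCC scoring formula, so everything below is an assumption made for illustration: the function names `pearson_rows` and `select_pseudo_negatives` are hypothetical, relevance is taken as the mean Pearson correlation between a negative sample and the positive samples, redundancy is taken as the mean absolute correlation with the candidates already chosen (the usual greedy max-relevance min-redundancy form), and the two are combined by simple subtraction. The sketch also assumes that no sample is constant across its features.

```python
import numpy as np


def pearson_rows(a, b):
    """Pearson correlation between every row of a and every row of b."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    # Assumes no row is constant, otherwise the norm below is zero.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def select_pseudo_negatives(pos, neg, k):
    """Greedily pick k row indices of neg (illustrative scoring, not the paper's)."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    if not 1 <= k <= len(neg):
        raise ValueError("k must be between 1 and the number of negative samples")

    # Relevance: mean correlation of each negative sample with the positives.
    relevance = pearson_rows(neg, pos).mean(axis=1)
    # Pairwise absolute correlation among negatives, used for redundancy.
    neg_corr = np.abs(pearson_rows(neg, neg))

    # Start with the most relevant negative sample.
    selected = [int(np.argmax(relevance))]
    redundancy_sum = neg_corr[:, selected[0]].copy()

    # Incremental search: add one candidate per step instead of scoring subsets.
    while len(selected) < k:
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never pick the same sample twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        redundancy_sum += neg_corr[:, nxt]
    return selected


# Toy data: 2 positive and 3 negative samples with 4 features each.
pos = np.array([[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0]])
neg = np.array([[4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.5], [1.0, 3.0, 2.0, 4.0]])
print(select_pseudo_negatives(pos, neg, 2))
```

The returned indices identify the negative samples that would be relabeled as positive before training. The greedy loop keeps a running redundancy sum, so each step costs one pass over the candidates rather than a search over all subsets of size k.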

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
