A Two-Phase Approach for Semi-Supervised Feature Selection

Amit Saxena; Shreya Pare; Mahendra Singh Meena; Deepak Gupta; Akshansh Gupta; Imran Razzak; Chin-Teng Lin; Mukesh Prasad

doi:10.3390/a13090215

Algorithms (Aug 2020)

Affiliations

Amit Saxena: Department of Computer Science and Information Technology, Guru Ghasidas University, Bilaspur, Chhattisgarh 495009, India
Shreya Pare: School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia
Mahendra Singh Meena: School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia
Deepak Gupta: Department of Computer Science & Engineering, National Institute of Technology Arunachal Pradesh, Yupia 791112, India
Akshansh Gupta: Central Electronics Engineering Research Institute, Delhi 110028, India
Imran Razzak: School of Information Technology, Deakin University, Geelong, VIC 3217, Australia
Chin-Teng Lin: School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia
Mukesh Prasad: School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia

DOI: https://doi.org/10.3390/a13090215
Journal volume & issue: Vol. 13, no. 9, p. 215

Abstract

This paper proposes a novel approach for selecting a subset of features in semi-supervised datasets, where only some of the patterns are labeled. The process is completed in two phases. In the first phase (Phase-I), the dataset is divided into two parts: the first part contains the labeled patterns, and the second part contains the unlabeled patterns. A small number of features is identified using the well-known maximum-relevance (computed from the first part) and minimum-redundancy (computed from the whole dataset) feature selection approaches, both based on the correlation coefficient. From this identified set, the subset of features that produces a high classification accuracy on the labeled patterns, using any supervised classifier, is selected for later processing. In the second phase (Phase-II), the patterns belonging to the first and second parts are clustered separately into as many clusters as there are classes in the dataset. For each cluster of the first part, the class to which the majority of its patterns belong, known from the given labels, is taken as the class of that cluster. Pairs of cluster centroids are then formed between the two parts: the centroid of the second part nearest to a centroid of the first part is paired with it. Because the class of the first-part centroid is known, the same class can be assigned to the paired cluster of the second part, whose class is unknown. If the actual classes of the patterns in the second part are known, they can be used to test the classification accuracy on the second part. The proposed two-phase approach performs well in terms of classification accuracy and number of features selected on the given benchmark datasets.
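
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the function names (`select_features`, `label_unlabeled_clusters`), the scoring rule that combines relevance and redundancy as a simple difference, and the choice to assign each second-part cluster the class of its nearest first-part centroid without enforcing a one-to-one pairing. The clustering step itself is assumed to have been done beforehand by any clustering method, and the classifier-based choice of the final subset is omitted.

```python
from collections import Counter
from math import dist, sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def column(rows, j):
    """Extract feature j from a list of patterns."""
    return [row[j] for row in rows]


def select_features(labeled_X, labeled_y, all_X, n_select):
    """Phase-I sketch: greedy max-relevance / min-redundancy ranking.

    Relevance uses only the labeled part; redundancy uses the whole dataset.
    The 'relevance minus mean redundancy' score is an assumption.
    """
    n_features = len(all_X[0])
    relevance = [abs(pearson(column(labeled_X, j), labeled_y))
                 for j in range(n_features)]
    selected = []
    while len(selected) < min(n_select, n_features):
        best, best_score = None, float("-inf")
        for j in range(n_features):
            if j in selected:
                continue
            if selected:
                redundancy = sum(
                    abs(pearson(column(all_X, j), column(all_X, s)))
                    for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def label_unlabeled_clusters(labeled_clusters, unlabeled_clusters):
    """Phase-II sketch: give each unlabeled cluster the majority class of the
    labeled cluster whose centroid is nearest (Euclidean distance).

    labeled_clusters: list of clusters, each a list of (point, class) pairs.
    unlabeled_clusters: list of clusters, each a list of points.
    """
    labeled_info = []
    for cluster in labeled_clusters:
        majority = Counter(c for _, c in cluster).most_common(1)[0][0]
        labeled_info.append((centroid([p for p, _ in cluster]), majority))
    assigned = []
    for cluster in unlabeled_clusters:
        c = centroid(cluster)
        nearest = min(labeled_info, key=lambda item: dist(item[0], c))
        assigned.append(nearest[1])
    return assigned
```

In this sketch, a feature that is nearly a copy of an already selected feature receives a low score even when it is strongly correlated with the labels, which is the intended effect of the minimum-redundancy term. If a strict one-to-one pairing of centroids is required, the nearest-centroid step would need to be replaced by an explicit matching procedure.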

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
