sCwc/sLcc: Highly Scalable Feature Selection Algorithms

Kilho Shin; Tetsuji Kuboyama; Takako Hashimoto; Dave Shepard

doi:10.3390/info8040159

Information (Dec 2017)

Affiliations

Kilho Shin: Graduate School of Applied Informatics, University of Hyogo, Kobe 651-2197, Japan
Tetsuji Kuboyama: Computer Centre, Gakushuin University, Tokyo 171-0031, Japan
Takako Hashimoto: Institute of Economic Research, Chiba University of Commerce, Chiba 272-8512, Japan
Dave Shepard: Center for Digital Humanities, University of California, Los Angeles, Los Angeles, CA 90095, USA

DOI: https://doi.org/10.3390/info8040159
Journal volume & issue: Vol. 8, no. 4, p. 159

Abstract

Feature selection is a useful tool for identifying which features, or attributes, of a dataset cause or explain the phenomena that the dataset describes, and for improving the efficiency and accuracy of learning algorithms that discover such phenomena. Consequently, feature selection has been studied intensively in machine learning research. However, although feature selection algorithms with excellent accuracy have been developed, they are seldom used to analyze high-dimensional data, because such data usually include so many instances and features that traditional feature selection algorithms become inefficient. To overcome this limitation, we improved the run-time performance of two of the most accurate feature selection algorithms known in the literature. The result is two accurate and fast algorithms, sCwc and sLcc. Multiple experiments with real social media datasets demonstrate that our algorithms run remarkably faster than the original algorithms. For example, we have two datasets, one with 15,568 instances and 15,741 features, and another with 200,569 instances and 99,672 features. sCwc performed feature selection on these datasets in 1.4 seconds and 405 seconds, respectively. In addition, sLcc turned out to be as fast as sCwc on average. This is a remarkable improvement, because the original algorithms are estimated to need several hours to dozens of days to process the same datasets. We also introduce a fast implementation of our algorithms. sCwc requires no tuning parameter, while sLcc requires a threshold parameter, which can be used to control the number of features that the algorithm selects.

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
