A Fast Parallel Random Forest Algorithm Based on Spark

Linzi Yin; Ken Chen; Zhaohui Jiang; Xuemei Xu

doi:10.3390/app13106121

Applied Sciences (May 2023)

Affiliations

Linzi Yin: School of Physics and Electronics, Central South University, Changsha 410012, China
Ken Chen: School of Physics and Electronics, Central South University, Changsha 410012, China
Zhaohui Jiang: School of Automation, Central South University, Changsha 410012, China
Xuemei Xu: School of Physics and Electronics, Central South University, Changsha 410012, China

DOI: https://doi.org/10.3390/app13106121
Journal volume & issue: Vol. 13, no. 10, p. 6121

Abstract

To improve computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy and thereby achieve higher classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, based on the Apache Spark computing framework, a forest sampling index (FSI) table is defined to speed up the parallel training of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while maintaining classification accuracy, and that it is superior to Spark-MLRF in terms of performance and scalability.
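
The binning idea in the abstract can be illustrated with a minimal single-machine sketch. It assumes the standard Gini impurity, not the paper's redefined Gini coefficient, and plain equal-frequency quantile cuts, not the paper's approximate method; the function names (gini, equal_frequency_candidates, best_split) and the num_bins parameter are hypothetical and chosen only for this illustration. The point it shows is that only num_bins - 1 thresholds are scored per continuous feature, not one threshold per distinct value.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity (assumption: not the paper's modified coefficient)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def equal_frequency_candidates(values, num_bins):
    """Return thresholds that cut the sorted values into roughly equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, num_bins):
        cut = ordered[(k * n) // num_bins]
        # Skip repeated thresholds caused by ties in the data.
        if not cuts or cut != cuts[-1]:
            cuts.append(cut)
    return cuts


def best_split(values, labels, num_bins):
    """Score only the bin boundaries and return (threshold, weighted Gini)."""
    n = len(values)
    best = None
    for threshold in equal_frequency_candidates(values, num_bins):
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        if not left or not right:
            continue  # a split must put samples on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best
```

With eight values and num_bins = 4, for example, three thresholds are evaluated where an exhaustive search would evaluate seven. The paper's Spark-based parallelization and FSI table are not modeled here.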

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
