Journal of Big Data (Jul 2019)

Random forest implementation and optimization for Big Data analytics on LexisNexis’s high performance computing cluster platform

  • Victor M. Herrera,
  • Taghi M. Khoshgoftaar,
  • Flavio Villanustre,
  • Borko Furht

DOI
https://doi.org/10.1186/s40537-019-0232-1
Journal volume & issue
Vol. 6, no. 1
pp. 1 – 36

Abstract

In this paper, we explain in detail how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis, where the algorithm was previously unavailable. Random Forest’s learning process is based on the principle of recursive partitioning. Although recursion per se is not allowed in ECL (HPCC’s programming language), we were able to implement the recursive partitioning algorithm as an iterative split/partition process. In addition, we analyze the flaws found in our initial implementation and describe the modifications required to overcome the bottleneck within the iterative split/partition process, namely the gathering of data for the selected independent variables used in each node’s best-split analysis. In short, we describe how our initial Random Forest implementation was optimized into an efficient distributed machine learning implementation for Big Data. By taking full advantage of the HPCC Systems Platform’s Big Data processing and analytics capabilities, we turned the data gathering method from an inefficient Pass them All and Filter approach into an effective and completely parallelized Fetching on Demand approach. Finally, based on a learning process runtime comparison between these two approaches, we confirm the speedup of our optimized Random Forest implementation.
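
The idea of replacing recursive partitioning with an iterative split/partition process can be illustrated with a minimal single-machine sketch. It is written in Python rather than ECL and is not the paper's implementation: the function names, the Gini-based split criterion, and the heap-style node numbering are all assumptions made for illustration. It grows one tree only, omitting the bootstrap sampling and per-node feature subsampling a full Random Forest would add.

```python
from collections import Counter, deque


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, idx, features):
    """Return (feature, threshold, score) with the lowest weighted Gini, or None.

    Only the columns listed in `features` are read for the node's instances.
    """
    best = None
    for f in features:
        for t in sorted({rows[i][f] for i in idx}):
            left = [labels[i] for i in idx if rows[i][f] <= t]
            right = [labels[i] for i in idx if rows[i][f] > t]
            if not left or not right:
                continue  # a split must send instances to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(idx)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def build_tree_iteratively(rows, labels, max_depth=5):
    """Grow a decision tree with a work queue instead of recursive calls."""
    tree = {}  # node_id -> ("leaf", label) or ("split", feature, threshold)
    features = range(len(rows[0]))
    queue = deque([(1, list(range(len(rows))), 0)])  # (node_id, instance ids, depth)
    while queue:
        node_id, idx, depth = queue.popleft()
        node_labels = [labels[i] for i in idx]
        split = None
        if depth < max_depth and len(set(node_labels)) > 1:
            split = best_split(rows, labels, idx, features)
        if split is None:
            # Pure node, depth limit reached, or no valid split: emit a leaf.
            tree[node_id] = ("leaf", Counter(node_labels).most_common(1)[0][0])
            continue
        f, t, _ = split
        tree[node_id] = ("split", f, t)
        # Partition the node's instances and enqueue both children.
        queue.append((2 * node_id, [i for i in idx if rows[i][f] <= t], depth + 1))
        queue.append((2 * node_id + 1, [i for i in idx if rows[i][f] > t], depth + 1))
    return tree


def predict(tree, row):
    """Walk from the root (node 1) down to a leaf and return its label."""
    node_id = 1
    while tree[node_id][0] == "split":
        _, f, t = tree[node_id]
        node_id = 2 * node_id if row[f] <= t else 2 * node_id + 1
    return tree[node_id][1]
```

Each pass through the loop performs one split/partition step, so the tree grows without any function calling itself. In a distributed setting the pending nodes of a level could be processed in parallel, which is where the cost of gathering the selected variables' data for each node becomes the dominant concern.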

Keywords