大数据 (May 2024)
Bootstrap sample partition data model and distributed ensemble learning
Abstract
A sequential implementation of Bootstrap sampling and Bagging ensemble learning is computationally inefficient and does not scale to large Bagging ensemble models with many component models. Inspired by distributed big data computing, this paper proposes a new Bootstrap sample partition (BSP) big data model and a distributed ensemble learning method for large-scale ensemble learning. The BSP data model extends a dataset into a set of Bootstrap samples stored in the Hadoop distributed file system (HDFS). The distributed ensemble learning method randomly selects a subset of samples from the BSP data model and reads them into the Java virtual machines of the cluster. A serial algorithm is then executed in each virtual machine to build a machine learning model on its sample, independently and in parallel with the other virtual machines. Finally, all sub-results are collected and processed in the master node to produce the ensemble result; a sample preference strategy for the BSP data blocks can optionally be applied. Both the BSP data model generation and the component model building are computed under a non-MapReduce computing paradigm, so all component models are computed in parallel without data communication among the nodes. The proposed algorithms are implemented in Spark as internal operators that can be used in Spark applications. Experiments demonstrate that the BSP data model of a dataset can be generated efficiently by the new distributed algorithm; it improves the reusability of data samples, increases computational efficiency by over 50% in large-scale Bagging ensemble learning, and improves prediction accuracy by approximately 2%.
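To make the workflow concrete, the following is a minimal, self-contained Spark (Scala) sketch of the idea, not the paper's implementation: the toy dataset, the decision-stump learner, and all names (BspBaggingSketch, stumpThreshold, b, and so on) are illustrative assumptions. Each Bootstrap sample becomes one RDD element, component models are fitted independently with no inter-node communication, and the driver plays the role of the master node that combines the sub-results by majority vote. The paper's BSP model additionally persists the samples as HDFS blocks so they can be reused; the sketch keeps them in memory for brevity.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object BspBaggingSketch {

  // Toy record: one numeric feature and a binary label.
  final case class Point(x: Double, y: Int)

  // Hypothetical component learner: a one-feature decision stump that picks
  // the threshold minimising training error on its Bootstrap sample.
  def stumpThreshold(sample: Seq[Point]): Double =
    sample.map(_.x).distinct.minBy { t =>
      sample.count(p => (if (p.x >= t) 1 else 0) != p.y)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bsp-bagging-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy dataset; the paper reads real data from HDFS instead.
    val rng = new Random(42)
    val data = Vector.fill(200) {
      val x = rng.nextDouble()
      Point(x, if (x >= 0.6) 1 else 0)
    }

    // "BSP data model": materialise b Bootstrap samples (drawn with
    // replacement, same size as the original data) as elements of one RDD,
    // one per partition, so each can be processed independently.
    val b = 16
    val bsp = sc.parallelize(1 to b, b).map { seed =>
      val r = new Random(seed.toLong)
      Vector.fill(data.size)(data(r.nextInt(data.size)))
    }

    // Bagging: fit one component model per Bootstrap sample, in parallel and
    // with no communication between tasks; collect the sub-results at the
    // driver (the "master node" in the paper's terms).
    val thresholds = bsp.map(stumpThreshold).collect()

    // Ensemble result: majority vote over the component models.
    def predict(x: Double): Int =
      if (thresholds.count(x >= _) * 2 >= thresholds.length) 1 else 0

    println(s"ensemble vote for x = 0.7 -> ${predict(0.7)}")
    spark.stop()
  }
}
```

Placing one Bootstrap sample per partition mirrors the non-MapReduce paradigm described above: every task is a pure map over its own data block, so no shuffle or reduce stage is needed and component models never exchange data during training.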