Jisuanji kexue yu tansuo (May 2022)
Improved Parallel Random Forest Algorithm Combining Information Theory and Norm
Abstract
To address the problems of excessive redundant and irrelevant features, low information content in the training features, and low parallelization efficiency in MapReduce-based random forest algorithms for big data, this paper proposes a parallel random forest algorithm based on information theory and norms (PRFITN). First, the algorithm uses the DRIGFN (dimension reduction based on information gain and Frobenius norm) strategy to reduce the number of redundant and irrelevant features. Second, it introduces a feature grouping strategy based on information theory (FGSIT): features are grouped according to this strategy, and stratified sampling is then used so that the training features selected when constructing each decision tree in the random forest carry sufficient information, which improves classification accuracy. Finally, to improve the parallel efficiency of the cluster, a key-value pair redistribution strategy (RSKP) is presented to distribute key-value pairs quickly and evenly and to obtain the global classification results. Experimental results show that the algorithm achieves better classification performance in big data environments, especially on datasets with many features.
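The abstract does not give the formulas behind FGSIT, so the following Python sketch illustrates only the general idea of grouping features by an information-theoretic score and then sampling from every group. It is not the paper's implementation. All function names are hypothetical, the score is plain information gain on discrete features, and splitting the ranked features into contiguous groups of equal size is an assumption made for illustration. The Frobenius-norm step of DRIGFN and the MapReduce parallelization are omitted.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Information gain of one discrete feature column with respect to the labels."""
    total = len(labels)
    by_value = {}
    for value, label in zip(column, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(len(part) / total * entropy(part) for part in by_value.values())
    return entropy(labels) - conditional


def group_features(gains, n_groups):
    """Rank feature indices by gain (highest first) and cut the ranking into groups.

    Assumption: contiguous, equally sized groups; the paper may group differently.
    """
    ranked = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    size = math.ceil(len(ranked) / n_groups)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def stratified_feature_sample(groups, per_group, rng):
    """Draw up to per_group feature indices from every group (stratified sampling)."""
    chosen = []
    for group in groups:
        chosen.extend(rng.sample(group, min(per_group, len(group))))
    return chosen


if __name__ == "__main__":
    # Toy data: four discrete feature columns and one label vector.
    columns = [
        [0, 0, 1, 1],  # feature 0: matches the labels exactly
        [0, 1, 0, 1],  # feature 1: carries no information about the labels
        [0, 0, 0, 1],  # feature 2: partially informative
        [1, 1, 1, 1],  # feature 3: constant
    ]
    labels = [0, 0, 1, 1]

    gains = [information_gain(col, labels) for col in columns]
    groups = group_features(gains, n_groups=2)
    features_for_tree = stratified_feature_sample(groups, per_group=1, rng=random.Random(0))
    print(gains, groups, features_for_tree)
```

Sampling from every group rather than from the full feature set means each tree receives at least one feature from the high-gain group, which is the property the abstract attributes to stratified sampling. In a full random forest this selection would be repeated independently for each tree.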
Keywords