工程科学学报 (Jul 2023)
Differential privacy protection random forest algorithm and its application in steel materials
Abstract
Data-driven material informatics is considered the fourth paradigm of materials research and development (R&D), which can greatly reduce R&D costs and shorten the R&D cycle. However, the data-driven method increases the risk of privacy disclosure when sharing and using materials data and sensitive information such as key processes in materials R&D. Therefore, privacy-preserving machine learning is a key issue in material informatics. The mainstream privacy protection methods in the current times include differential privacy, secure multi-party computation, federated learning, etc. The differential privacy model proposes strict definitions and metrics for quantitative evaluation of privacy protection, and the noise added by differential privacy is independent of the data scale. Only a small amount of noise is required to achieve a high level of protection, which considerably improves data usability. A novel differential privacy preserving random forest algorithm (DPRF) is proposed based on the fact that random forest is one of the most widely used models in material informatics. DPRF introduces the Laplace mechanism and exponential mechanism of differential privacy during the decision process tree building. First, the total privacy budget for the DPRF algorithm is set and then equally divided into each decision tree. During the tree-building process, the splitting features are randomly selected in the decision tree by the exponential mechanism and noise is added to the number of nodes by the Laplace mechanism, which is effective for differential privacy protection for the random forest. In experiments such as steel fatigue prediction experiments, the efficacies of DPRF under centralized or distributed data storage are verified. By setting different privacy budgets, the R2 of the predicted results of the DPRF algorithm can reach more than 0.8 for each target feature after adding differential privacy, which is not much different from the original random forest algorithm. A distributed data storage scenario shows that with the increase of privacy budget, the R2 of each target property prediction gradually increases. Comparing the effect of different tree depths in DPRF, it is shown that the overall R2 of the target prediction tends to increase and then later decrease .as the maximum depth of the tree increases. Overall, the best prediction accuracy is achieved when the maximum depth of the tree is set at 5. In summary, DPRF has good prediction accuracy in terms of achieving differential privacy protection of random forests. Specifically, in a distributed and decentralized data environment, DPRF can strike a balance between privacy-preserving strength and prediction accuracy by setting privacy budgets, tree depth, etc., which shows a wide range of application prospects of our algorithm.
Keywords