IEEE Access (Jan 2024)
Recursive Elimination of “Outliers” to Get Benchmark Dataset
Abstract
Benchmark datasets normally have relatively conserved relationships and a low fraction of outliers, as indicated by a higher coefficient of determination (R²) and a lower mean absolute error (MAE) in regression models. Inspired by the process of peeling an onion, we introduce a recursive data elimination (RDE) strategy for removing “outliers” to obtain a benchmark dataset. Outliers are labeled using a Williams plot of residual versus leverage (denoted RDE_W), and the performance is compared with that obtained using the residual alone (denoted RDE). Validation was performed in single-target and multiple-target settings by predicting mechanical properties, including Young’s modulus, tensile strength, and elongation at break, for 643 polyurethane elastomers (a dataset released here for the first time), and compressive strength for 1030 concrete samples. In the single-target setting, the RDE_W strategy achieved an 8.06% increase in R² and a 19.87% reduction in MAE compared with RDE. In the multiple-target setting, the improvement was approximately 3%. SVM outperformed the XGB, NN, RF, Lasso, and DT algorithms under the RDE_W strategy. Additional tests also confirmed the advantages of RDE_W over RDE for generating high-quality benchmark datasets. We release the data and code to facilitate the construction of high-quality benchmark datasets and the development of new approaches to better understand, explore, and design advanced materials.
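The recursive elimination described above can be illustrated with a minimal sketch. It is not the released code: it assumes a one-feature least-squares line in place of the regression models named in the abstract, and it assumes the conventional Williams-plot cutoffs (standardized residual beyond ±3, leverage above the warning value 3(p + 1)/n), which the abstract does not state. The function names fit_line, williams_flags, and recursive_elimination are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def williams_flags(xs, ys, resid_cut=3.0):
    """Flag points outside the Williams-plot limits for a one-feature fit.

    Assumed cutoffs: |standardized residual| > resid_cut, or
    leverage > 3(p + 1)/n with p = 1 feature.
    """
    n = len(xs)
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    # Leverage is the hat-matrix diagonal; closed form for simple regression.
    leverage = [1.0 / n + (x - x_mean) ** 2 / sxx for x in xs]
    h_star = 3.0 * (1 + 1) / n
    flags = []
    for r, h in zip(residuals, leverage):
        std_r = r / s if s > 0 else 0.0
        flags.append(abs(std_r) > resid_cut or h > h_star)
    return flags


def recursive_elimination(xs, ys, max_rounds=10, min_points=5):
    """Refit, drop flagged points, and repeat until no point is flagged."""
    xs, ys = list(xs), list(ys)
    removed = 0
    for _ in range(max_rounds):
        if len(xs) <= min_points:
            break
        flags = williams_flags(xs, ys)
        if not any(flags):
            break
        removed += sum(flags)
        xs = [x for x, f in zip(xs, flags) if not f]
        ys = [y for y, f in zip(ys, flags) if not f]
    return xs, ys, removed
```

Calling recursive_elimination(xs, ys) returns the retained points and the number removed. Under the same assumptions, the residual-only variant (RDE) would correspond to dropping the leverage condition in williams_flags.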
Keywords