Scientific Reports (Oct 2024)

Estimation of gross calorific value of coal based on the cubist regression model

  • Junlin Chen,
  • Yuli He,
  • Yuexia Liang,
  • Wenjia Wang,
  • Xiong Duan

DOI
https://doi.org/10.1038/s41598-024-74469-3
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 18

Abstract

The gross calorific value (GCV) of coal is an important parameter for evaluating coal quality, and regression analysis can be used to predict it. In this study, we propose a GCV prediction model based on cubist regression. To build a sound regression model, the input variables were selected using correlation analysis and a recursive feature elimination algorithm. This yielded three sets of variables as the optimal inputs for the regression models: proximate analysis variables (Set 1: moisture, standard ash, and volatile matter), element analysis variables (Set 2: carbon, sulfur, and oxygen), and comprehensive index variables (Set 3: carbon, volatile matter, standard ash, sulfur, moisture, and hydrogen). The cubist model was compared with multiple linear regression, random forest regression, and several previously reported prediction models: gradient boosting regression tree, support vector regression (SVR), backpropagation neural networks, and particle swarm optimization–artificial neural network (PSO-ANN). Of the three sets of input variables, all seven regression models fit best on the comprehensive index variables. The cubist model showed higher prediction accuracy and lower error than most other models (R², mean absolute error, root mean square error, and average absolute relative deviation percentage were 0.990, 0.476, 0.668, and 0.086% for the proximate analysis variables; 0.992, 0.381, 0.596, and 0.140% for the element analysis variables; and 0.999, 0.161, 0.219, and 0.087% for the comprehensive index variables, respectively). The cubist model combines the advantages of decision trees and linear regression, which not only gives it good accuracy but also makes it highly interpretable, because its predictions are based on multiple linear sub-equations.
In addition, the cubist model has a clear advantage in running speed, especially compared with SVR and PSO-ANN, which require complex parameter optimization. In summary, the cubist model takes into account prediction accuracy, model interpretability, and computational efficiency, and provides a new and effective method for GCV prediction.
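
The four error metrics quoted in the abstract (R², mean absolute error, root mean square error, and average absolute relative deviation percentage) can be computed as in the minimal sketch below. It assumes the common textbook definitions, which the abstract itself does not spell out: R² as one minus the ratio of residual to total sum of squares, and AARD% as the mean of |measured − predicted| / |measured| times 100. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, RMSE and AARD% for paired measured/predicted values.

    Assumed definitions (not confirmed by the abstract):
      R^2   = 1 - SS_res / SS_tot
      MAE   = mean(|y - y_hat|)
      RMSE  = sqrt(mean((y - y_hat)^2))
      AARD% = 100 * mean(|y - y_hat| / |y|)
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    if any(t == 0 for t in y_true):
        raise ValueError("AARD% is undefined when a measured value is zero")

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "aard_pct": 100.0 / n * sum(abs(r / t) for r, t in zip(residuals, y_true)),
    }


if __name__ == "__main__":
    # Hypothetical GCV values (MJ/kg), for illustration only.
    measured = [20.0, 25.0, 30.0]
    predicted = [21.0, 25.0, 29.0]
    for name, value in regression_metrics(measured, predicted).items():
        print(f"{name}: {value:.4f}")
```

MAE and RMSE carry the units of GCV, whereas R² and AARD% are dimensionless, so the four numbers reported for each variable set are not on a common scale.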