IEEE Access (Jan 2022)
Ensemble Machine Learning Techniques Using Computer Simulation Data for Wild Blueberry Yield Prediction
Abstract
Precision agriculture is difficult to achieve in practice. Several studies have forecast agricultural yields using machine learning algorithms (MLA), but few have used ensemble machine learning algorithms (EMLA). In the current study, we use a dataset generated by a computer simulation program together with meteorological data collected over 30 years in Maine, United States (USA). The primary goal of this research is to increase forecast accuracy using the best features, as a contribution toward overcoming hunger challenges. We adopted stacking regression (SR) and cascading regression (CR) with a novel combination of MLA on the wild blueberry dataset, using the features that indicated the best regulation for wild blueberry agroecosystems. Four feature selection techniques were applied: variance inflation factor (VIF), sequential forward feature selection (SFFS), sequential backward elimination feature selection (SBEFS), and extreme gradient boosting based on feature importance (XFI). We applied Bayesian optimization to popular MLA to obtain the hyperparameters that give the most accurate wild blueberry yield prediction. The SR uses a two-layer structure: level-0 contains a light gradient boosting machine (LGBM), gradient boost regression (GBR), and extreme gradient boosting (XGBoost), and level-1 produces the output prediction using a Ridge regressor. The CR uses the same MLA as the SR, but arranged in series, so that each stage feeds its new prediction to the next MLA and the previous prediction is removed. We assessed the CR and SR using the root mean square error (RMSE) and the coefficient of determination ($R^{2}$). The proposed SR showed the best performance, with $R^{2}$ of 0.984 and RMSE of 179.898, compared with another study that reported $R^{2}$ of 0.938 and RMSE of 343.026 on the seven features selected by XFI. The SR achieved the highest $R^{2}$ of 0.985 on all features and on the features selected by SBEFS. Our SR outperformed both the CR and the other study on wild blueberry yield prediction.
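For reference, the two evaluation metrics named in the abstract are conventionally defined as $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^{2}}$ and $R^{2} = 1 - \sum_{i}(y_i - \hat{y}_i)^{2} / \sum_{i}(y_i - \bar{y})^{2}$, where $y_i$ is an observed yield, $\hat{y}_i$ the predicted yield, and $\bar{y}$ the mean observed yield. The following is a minimal Python sketch of these standard definitions, not the authors' evaluation code. The function names `rmse` and `r_squared` and the toy values are illustrative assumptions and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not data from the study).
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

print(round(rmse(observed, predicted), 4))       # 0.7071
print(round(r_squared(observed, predicted), 4))  # 0.9
```

RMSE is expressed in the units of the target, so a lower value means predictions lie closer to the observed yields, while $R^{2}$ is unitless and approaches 1 as the model explains more of the variance in yield.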
Keywords