International Journal of Applied Earth Observations and Geoinformation (Apr 2023)
A county-level soybean yield prediction framework coupled with XGBoost and multidimensional feature engineering
Abstract
Yield prediction is essential to food security, food trade, and field management. However, because yield formation involves complex mechanisms, accurate and timely yield prediction remains challenging in remote sensing-based crop monitoring. In this study, we developed a county-level soybean yield prediction framework for the United States that integrates extreme gradient boosting (XGBoost) with multidimensional feature engineering and relies on publicly available datasets. Our approach achieved a test coefficient of determination (R²) of 0.82 and a root-mean-square error (RMSE) of 0.246 t/ha for over 959 counties in 12 states throughout the midwestern U.S. Under a “train–validate–test” assessment strategy, XGBoost outperformed other county-level soybean yield prediction models given identical inputs, including linear regression (LR), random forest (RF), k-nearest neighbor (KNN), artificial neural network (ANN), support vector regression (SVR), long short-term memory (LSTM), and deep neural network (DNN). Accurate soybean yield predictions can be obtained as early as the pod-setting stage. We used feature importance and Shapley additive explanations (SHAP) to quantify the impact of input features on the XGBoost model in the training and prediction stages, respectively. The enhanced vegetation index (EVI) during the pod-setting period is the most important feature, but yield prediction does not depend on only a few key features. Detrending yields with longer-term historical yield data increased R² from 0.58 to 0.82 and decreased RMSE from 0.374 t/ha to 0.246 t/ha. Generating phenology-based features through multidimensional feature engineering improved R² from 0.79 to 0.82 and decreased RMSE from 0.268 t/ha to 0.246 t/ha.
The framework can be easily implemented and extended in the future in combination with early crop identification.
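To make the detrending step and the two reported metrics concrete, the following is a minimal sketch in plain Python. It assumes a simple per-county linear trend fitted by ordinary least squares; the abstract does not state which detrending model the authors used, so this choice is an assumption. The function names and the sample yield series are hypothetical and purely illustrative, not data from the study.

```python
import math


def fit_linear_trend(years, yields):
    """Fit yield = intercept + slope * year by ordinary least squares."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, yields))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def detrend(years, yields):
    """Return yield residuals after removing the fitted linear trend.

    Assumption: a linear trend stands in for the (unspecified) trend
    model applied to the longer-term historical yield data.
    """
    intercept, slope = fit_linear_trend(years, yields)
    return [y - (intercept + slope * x) for x, y in zip(years, yields)]


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the inputs (e.g. t/ha)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical yield series (t/ha) for one county, for illustration only.
years = [2010, 2011, 2012, 2013, 2014]
yields = [2.8, 3.0, 2.7, 3.2, 3.3]
residuals = detrend(years, yields)
```

In such a setup, a model would be trained on the residuals rather than the raw yields, and the fitted trend would be added back to the predicted residuals before computing R² and RMSE against the observed yields.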