پژوهشنامه مدیریت حوزه آبخیز [Journal of Watershed Management Research] (Oct 2024)
Reconstruction of Missing Daily Streamflow Data using the MissForest Algorithm in Southern Baluchestan Basin, Iran
Abstract
Background: Long-term hydrometeorological records support basin-level water resources planning and management through physical models such as hydrological and hydraulic models. However, these records often contain missing data, which makes analysis difficult or sometimes impossible. Data gaps complicate interpretation, hinder model calibration, and bias statistics. This study evaluated the validity of MissForest, a non-parametric machine learning algorithm based on random forests, for filling gaps in daily streamflow series in a region with scarce data and strong climate variability.
Methods: Daily streamflow data from the gauge stations of the Southern Baluchestan catchment were analyzed over a long-term hydrological period (09/23/1972 to 09/22/2018). First, stations were screened with a conventional criterion that accepts a missing rate of less than 50% in the streamflow data, and the mechanisms and patterns of the missing data were then investigated. Accordingly, the number of gauge stations was reduced to seven. Then, the distribution of missing daily streamflows across the months of the year and the relative frequency of gap lengths over the period were examined. Next, the performance of the reconstruction algorithm was tested under two artificial missing-data scenarios: a) removed contiguous segments, in which a single segment (7, 14, 21, 30, 60, 180, or 365 days long) was randomly removed from the entire record (1972–2018) at each gauge; and b) removed single data points, in which a set of 30, 60, 90, 120, 180, or 365 individual observed daily values was randomly removed from the entire record (1972–2018) at each gauge. MissForest was then applied to fill the real gaps in the records together with the artificial gaps.
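The two artificial-gap scenarios described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's code. The function names `remove_segment` and `remove_points`, the use of `None` to mark a missing day, and the rule that a removed segment must lie entirely on observed days are all assumptions. The abstract specifies only the segment lengths and the point counts. The demo series is synthetic.

```python
import random

SEGMENT_LENGTHS = (7, 14, 21, 30, 60, 180, 365)  # scenario (a), from the abstract
POINT_COUNTS = (30, 60, 90, 120, 180, 365)       # scenario (b), from the abstract


def remove_segment(series, length, rng):
    """Scenario (a): blank one contiguous run of `length` observed days.

    Assumption: the window is drawn only from fully observed stretches,
    so every removed value can later be compared with its true value.
    Returns a masked copy and the list of removed indices.
    """
    starts = [
        i for i in range(len(series) - length + 1)
        if all(v is not None for v in series[i:i + length])
    ]
    if not starts:
        raise ValueError("no fully observed window of the requested length")
    start = rng.choice(starts)
    removed = list(range(start, start + length))
    masked = list(series)
    for i in removed:
        masked[i] = None
    return masked, removed


def remove_points(series, count, rng):
    """Scenario (b): blank `count` observed days chosen at random.

    Returns a masked copy and the sorted list of removed indices.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if count > len(observed):
        raise ValueError("not enough observed values to remove")
    removed = sorted(rng.sample(observed, count))
    masked = list(series)
    for i in removed:
        masked[i] = None
    return masked, removed


# Demo on a synthetic series (hypothetical values, not data from the study).
rng = random.Random(42)
flows = [round(rng.uniform(0.0, 5.0), 2) for _ in range(1000)]
flows[100:110] = [None] * 10  # stands in for a pre-existing real gap
segment_masked, segment_idx = remove_segment(flows, SEGMENT_LENGTHS[3], rng)
points_masked, points_idx = remove_points(flows, POINT_COUNTS[1], rng)
```

Returning the removed indices alongside the masked series lets the filled values be scored only at positions where the true observation is known.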
Our analysis includes reconstructions of the 1972–2018 period at each of the streamflow gauges. Finally, the performance of MissForest in infilling daily streamflow data was tested by comparing the filled series with the observed data using three goodness-of-fit (GoF) indicators: the coefficient of determination (R²), the percent bias (PBIAS), and the Kling-Gupta efficiency (KGE).
Results: The MissForest algorithm generally performed satisfactorily, reconstructing missing data accurately, reliably, quickly, and automatically. Its performance depends strongly on the number of predictor records, the record length, and the streamflow type. Reconstruction of the real gaps in the streamflow data was also possible with this algorithm. River flow time series with a natural flow regime were simulated with good performance; performance dropped slightly where flow is altered by water storage and diversion for irrigation, especially downstream of dams. Performance was also less than optimal for daily flow series with severe changes in the flow regime, such as peak discharges. This drop in performance is related more to the hydroclimatic conditions of the studied watershed than to the structure of the algorithm. The reconstructed hydrographs allow flow variability and its interaction with key climate variables to be analyzed.
Conclusion: MissForest is presented as a machine learning imputation method with high credibility and performance in reconstructing missing daily streamflow data. It can also be applied automatically to fill gaps in river flow records at the daily scale. Future studies are suggested to analyze how watersheds with different hydro-physical-climatic characteristics affect the performance of the MissForest algorithm.
Other issues for future studies include testing the proposed method in other climatic and geographical regions, analyzing its sensitivity to the rainfall and flow regime, and comparing its performance with other common methods.
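The three goodness-of-fit indicators named in the Methods can be computed as in the sketch below. The abstract does not state which formulations were used, so the choices here are assumptions: R² is taken as the squared Pearson correlation, PBIAS uses the sign convention in which positive values mean overestimation, and KGE follows the original 2009 decomposition into correlation, variability ratio, and bias ratio. The helper `score_removed` and the sample numbers are hypothetical.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def _std(xs):
    # Population standard deviation; undefined ratios arise for constant series.
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(obs, sim):
    """Linear correlation between observed and filled values."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    return cov / (_std(obs) * _std(sim))


def r_squared(obs, sim):
    """Assumed form: coefficient of determination as squared correlation."""
    return pearson_r(obs, sim) ** 2


def pbias(obs, sim):
    """Percent bias; assumed sign convention: positive = overestimation."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)


def kge(obs, sim):
    """Kling-Gupta efficiency, assumed 2009 form; 1.0 is a perfect match."""
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)    # variability ratio
    beta = _mean(sim) / _mean(obs)   # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def score_removed(original, filled, removed):
    """Score a filled series only at the artificially removed indices."""
    obs = [original[i] for i in removed]
    sim = [filled[i] for i in removed]
    return {"R2": r_squared(obs, sim), "PBIAS": pbias(obs, sim), "KGE": kge(obs, sim)}


# Hypothetical values for illustration only (not results from the study).
observed = [1.2, 0.8, 3.5, 2.1, 0.4, 5.0]
imputed = [1.0, 0.9, 3.1, 2.4, 0.5, 4.2]
scores = score_removed(observed, imputed, removed=[1, 2, 3, 5])
```

Restricting the comparison to the removed indices keeps the untouched observations, which match trivially, from inflating the scores.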