IEEE Access (Jan 2024)
Time Series Reconstruction With Feature-Driven Imputation: A Comparison of Base Learning Algorithms
Abstract
Handling missing values is a critical step in preparing and analyzing data. Imputation, the process of filling in missing values, helps ensure that a dataset is complete, accurate, and reliable, which significantly reduces the risk of bias and error in subsequent analysis. The key contribution of this work is an assessment of the efficiency of the imputation by feature importance (IBFI) framework when paired with several base learning algorithms. The study investigates individual and ensemble machine learning methods as base learners, namely support vector machines with a linear kernel (SVML), boosted linear regression (BLR), deep boosting (DBP), and K-Nearest Neighbor (K-NN), for predicting missing values under different missingness patterns. Missingness is explicitly introduced into the dataset under three mechanisms, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), at rates of 15%, 25%, and 45%. Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) are among the commonly used performance metrics employed to gauge the effectiveness of the IBFI framework. The dataset comprises soil radon and thoron gas concentration time series along with meteorological parameters and spans a 14-month period, during which four earthquake events were recorded. The deep boosting model (DBP) consistently outperforms the other base learning models in imputing missing values across the variables within the IBFI framework. Specifically, DBP achieves an average RMSE of 573.165 for the Radon variable under MCAR scenarios. For the Thoron variable, DBP achieves average MAPE values of 0.7405, 0.7249, and 0.8212 under MCAR, MNAR, and MAR conditions, respectively. Additionally, DBP yields competitive results when imputing missing entries in the Temperature, Relative Humidity, and Pressure variables.
These findings highlight the effectiveness of DBP in accurately predicting missing values. The study concludes that IBFI with the deep boosting model performs imputation more accurately than IBFI with the other base learning models. It therefore recommends DBP as the base learning algorithm within the IBFI framework for uncovering hidden patterns in time series data such as soil radon gas. Replicating the study on heterogeneous datasets would improve understanding of the generalization and broader applicability of IBFI.
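The evaluation protocol described above (introduce missingness at a fixed rate, impute, then score the imputed entries with RMSE and MAPE) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the synthetic series, the seed, and the mean-fill imputer are hypothetical stand-ins, not the IBFI framework or the study's data. MAPE is returned as a fraction; the abstract does not state whether its MAPE values are fractions or percentages.

```python
import numpy as np


def introduce_mcar(values, fraction, seed=0):
    """Hypothetical helper: blank out `fraction` of a 1-D series completely at random.

    Returns the corrupted copy (NaN where missing) and the boolean missingness mask.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n_missing = int(round(fraction * values.size))
    idx = rng.choice(values.size, size=n_missing, replace=False)
    mask = np.zeros(values.size, dtype=bool)
    mask[idx] = True
    corrupted = values.copy()
    corrupted[mask] = np.nan
    return corrupted, mask


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error as a fraction; assumes no zeros in y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


if __name__ == "__main__":
    # Synthetic stand-in for a gas concentration series (not the study's data).
    t = np.arange(200)
    series = 100.0 + 10.0 * np.sin(2 * np.pi * t / 24)

    for rate in (0.15, 0.25, 0.45):
        corrupted, mask = introduce_mcar(series, rate, seed=42)
        # Placeholder imputer: fill gaps with the mean of the observed values.
        # IBFI would instead train a base learner (e.g. DBP) on other features.
        imputed = corrupted.copy()
        imputed[mask] = np.nanmean(corrupted)
        # Score only the entries that were artificially removed.
        print(rate, rmse(series[mask], imputed[mask]), mape(series[mask], imputed[mask]))
```

Scoring only the masked entries keeps the metrics from being diluted by values that were never missing. MAR and MNAR masks would need a mask that depends on other variables or on the value itself, which this sketch does not attempt.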
Keywords