대한환경공학회지 (Mar 2021)
Outlier Detection of Water Quality Data Using Ensemble Empirical Mode Decomposition
Abstract
Objectives : This study was conducted to propose a new methodology for efficiently identifying and removing various outliers that occur in data collected through automated water quality monitoring systems. In the present study, water temperature data were collected from domestic G_water supply system, and the performance of the proposed methodology was tested for water temperature data collected from domestic G_water supply system. Methods : We applied the following analytical procedure to identify outliers in the water quality data: First, a normality test was performed on the collected data. If normality condition was satisfied, the Z-score was used. However, if the normality condition was not satisfied, outliers were identified using the quartile, and the limitations of the existing methodology were analyzed. Second, we decomposed the intrinsic mode function using empirical mode decomposition and ensemble empirical mode decomposition for the collected data, and then considered the occurrence of modal mixing. Finally, a group of intrinsic mode functions was selected using statistical characteristics to identify outliers. In addition, the performance of the method was verified after removing and interpolating outliers using regression analysis and Cook’s distance. Results and Discussion : In the case of water temperature data, as normality condition was not satisfied, outlier identification was carried out by applying the modified quartile method. It was confirmed that outliers distributed within the seasonal component could not be identified at all. In the case of empirical mode decomposition, modal mixing occurred because of the effect of outliers. However, in the case of the ensemble empirical mode decomposition, modal mixing was resolved and the distinct seasonal components were decomposed as intrinsic mode functions. The intrinsic mode functions were synthesized, which showed statistical correlation with the raw water temperature data. As a result of developing a regression model using the synthesized intrinsic mode functions and raw water temperature data and performing outlier search based on Cook’s distances, we concluded that various outliers distributed within the seasonal component could be effectively identified. Conclusions : Considering that satisfactory results could be derived from statistical analysis of the data collected from the automated water quality monitoring system, it can be concluded that outlier identification procedures are essential. However, in the case of the conventional univariate outlier search method, it is apparent that the outlier search performance is significantly poor for data with strong inherent variability, and the interpolation method for the searched outlier cannot be performed. Conversely, the outlier identification method based on ensemble empirical mode decomposition and regression analysis proposed in this study shows excellent discrimination performance for outliers distributed in data with strong inherent variability. Moreover, this method has the advantage of reducing the analyst’s dependence on subjective judgment by presenting statistical cutoff criteria. An additional advantage of the method is that data can be interpolated after removing outliers using intrinsic mode functions. Therefore, the outlier search and interpolation method proposed in this study is expected to have greater applicability as a more effective analysis tool compared to the existing univariate outlier search method.
Keywords