پژوهشهای ریاضی (Dec 2022)
Identification of outliers types in multivariate time series using genetic algorithm
Abstract
Multivariate time series data are often modeled with a vector autoregressive moving average (VARMA) model. The presence of outliers, however, can violate the stationarity assumption and lead to a misspecified model, biased parameter estimates and inaccurate predictions. Detecting these points and handling them properly is therefore necessary, especially for modeling and parameter estimation in VARMA models. Once outliers are detected, their effects can be removed from the series to obtain modified data, and the VARMA estimates computed from the modified data are the least affected by outliers. Outlier detection also matters for identifying external events over time; for example, outliers in river water monitoring data can reveal the times of floods. Parameter estimation for a VAR model is less time-consuming than for a VARMA model and, under the invertibility condition, a VARMA model can be approximated by a VAR(p) model with large p. We therefore fit VAR models to data that are generated from VARMA models and contaminated by outliers. Multivariate time series observations may be contaminated by different types of outliers. Because the effects of these types differ between the multivariate and univariate cases, such observations must be assessed with a multivariate approach. In this research, we use a genetic algorithm (GA) to develop a procedure for detecting different types of outliers (additive, innovation, level shift and temporary change) in a multivariate time series. The GA searches for the outlier locations that minimize an Akaike-like information criterion (AIC), which balances two goals: "minimize the number of outliers" and "maximize the likelihood function". A GA is a numerical optimization algorithm based on the ideas of natural selection and natural genetics.
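The four outlier types and the criterion named above can be written compactly. The notation below follows the form commonly used in the multivariate outlier literature and is an assumption here, since the abstract gives no formulas: Y_t is the observed k-dimensional series, X_t the outlier-free VARMA series with operators \Phi(B) and \Theta(B), \omega the vector of outlier sizes, h the outlier time, and \delta a decay rate. The penalized form of the criterion, with a penalty constant c and an outlier count m(\xi) for a candidate pattern \xi, is likewise only a plausible sketch of an "AIC-like" trade-off, not the exact criterion of the paper.

```latex
% Y_t: observed series; X_t: outlier-free series; I_t^{(h)} = 1 if t = h, else 0
\[
Y_t = X_t + \alpha(B)\,\omega\, I_t^{(h)}, \qquad
\alpha(B) =
\begin{cases}
  \Psi(B) = \Phi(B)^{-1}\Theta(B) & \text{innovation outlier} \\
  I & \text{additive outlier} \\
  (1 - B)^{-1} I & \text{level shift} \\
  (1 - \delta B)^{-1} I, \quad 0 < \delta < 1 & \text{temporary change}
\end{cases}
\]

% Assumed generic form of an AIC-like criterion for a candidate outlier pattern \xi
\[
\mathrm{AIC}(\xi) = -2 \log L(\hat{\theta} \mid \xi) + c\, m(\xi)
\]
```

Under this reading, a larger likelihood lowers the first term and fewer flagged outliers lower the second, and the GA searches over patterns \xi for the smallest value.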
A GA does not require strong assumptions to find the optimum of a function, and it can search for the optimal solution in a space with several local optima. For example, if a function has several relative maxima, a GA can still find its absolute maximum. To minimize a function, a GA first generates several candidate solutions, either at random or by user choice; this set of solutions is called the initial population, and each solution is called a chromosome. Reproductive operators then combine chromosomes (crossover) and introduce random jumps into them (mutation). If the function value of a newly produced chromosome is lower than that of existing chromosomes, it can be added to the population or can replace a chromosome with a worse (higher) function value. This process is repeated until convergence occurs or the maximum number of iterations is reached. Furthermore, we introduce another method of detecting outliers, the Tsay, Peña and Pankratz (TPP) method. TPP uses test statistics based on the outlier sizes and the VAR parameters, and detects outliers in three stages. In stage I, it detects outliers one at a time and removes their effects, iterating until no further outlier is found. In stage II, the effects of the outliers detected in stage I are estimated simultaneously, outliers with insignificant effects are removed, and the VAR parameters are re-estimated from the modified series of this stage. In stage III, stages I and II are repeated with the new VAR parameter estimates. In each TPP iteration, one outlier is detected and its effect is removed from the series, giving a modified series. The parameters are then re-estimated from the modified series, and the search for the next outlier continues with these estimates. This may lead to biased estimates and wrong detection of the next outlier.
In other words, in the TPP method one outlier may hide another (masking), or an outlier may cause an ordinary observation to be flagged as an outlier (swamping). The method also often misidentifies the outlier type. In each iteration of the GA, by contrast, a random candidate pattern of outliers is first generated, and a temporary modified series is obtained by removing the effect of this pattern from the series. The parameters are then estimated from this series and the pattern is evaluated. This reduces the influence of previously identified outliers on the detection of the full outlier pattern. Indeed, if the random pattern contains all the outliers, almost all of their effects are eliminated in the modified series. Using this temporary modified series, the GA therefore obtains more accurate estimates and detects outliers more accurately. The simulation results confirm the validity of the GA method, whose percentage of correctly detected outliers is higher than that of the TPP method, although the GA requires more computation time. Moreover, although a VAR model is used in both detection methods, the percentage of correct detections for data generated from VARMA models is similar to that for data generated from VAR models. Gas-furnace data were also analyzed and modeled, and the GA and TPP methods detected similar outliers. Fitting a VAR(6) model to these data shows that, relative to the TPP-modified data, the GA-modified data reduce the error variance of the input gas by 17% and the error variance of carbon dioxide by 43%.
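The GA search over outlier patterns described above can be illustrated with a deliberately simplified sketch. It is not the procedure of the paper: the VAR model is swapped for a constant-mean model on a univariate series, only additive outliers are considered, and the names `criterion` and `genetic_search`, the penalty value 6.0, and the population settings are all hypothetical choices made for illustration. Each chromosome is a 0/1 vector marking candidate outlier positions, and the fitness is an AIC-like score that rewards a small residual variance and penalizes each flagged point.

```python
import math
import random


def criterion(series, pattern, penalty=6.0):
    """AIC-like score: n * log(residual variance) + penalty * (flagged points).

    Toy stand-in: constant-mean model, and every flagged point is treated as an
    additive outlier whose effect is removed completely (zero residual).
    """
    clean = [y for y, flag in zip(series, pattern) if not flag]
    if len(clean) < 2:
        return math.inf
    mean = sum(clean) / len(clean)
    rss = sum((y - mean) ** 2 for y in clean)
    n = len(series)
    return n * math.log(max(rss / n, 1e-12)) + penalty * sum(pattern)


def genetic_search(series, pop_size=30, generations=60, p_mut=0.02, seed=0):
    """Minimize `criterion` over 0/1 outlier patterns with a basic GA."""
    rng = random.Random(seed)
    n = len(series)
    # Initial population: the "no outliers" pattern plus sparse random patterns.
    population = [[0] * n] + [
        [1 if rng.random() < 0.05 else 0 for _ in range(n)]
        for _ in range(pop_size - 1)
    ]
    history = []  # best score at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda c: criterion(series, c))
        history.append(criterion(series, population[0]))
        parents = population[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        population = parents + children
    best = min(population, key=lambda c: criterion(series, c))
    history.append(criterion(series, best))
    return best, history


# Usage: white noise with two large additive outliers planted at t = 20 and t = 45.
data_rng = random.Random(1)
series = [data_rng.gauss(0.0, 1.0) for _ in range(60)]
series[20] += 8.0
series[45] -= 8.0
best, history = genetic_search(series)
print("flagged positions:", [t for t, flag in enumerate(best) if flag])
```

Because the better half of each population is carried over unchanged, the best score never gets worse from one generation to the next, and because each candidate pattern is scored as a whole, no single flagged point is fixed before the others are considered, which is the property the abstract contrasts with the one-at-a-time TPP search.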