International Transactions on Electrical Energy Systems (Jan 2022)
Preprocessing Approach for Power Transformer Maintenance Data Mining Based on k-Nearest Neighbor Completion and Principal Component Analysis
Abstract
The accuracy of a knowledge extraction algorithm applied to a large database depends on the quality of the data preprocessing and on the methods used. The massive amounts of data collected every day also put storage capacity at a premium. In practice, many databases contain attributes with outliers, redundant values, and, even more frequently, missing values. Because missing data and outliers are ubiquitous, imputation techniques are needed to mitigate their influence. To address this problem, as well as the problem of data size, this paper proposes a data preprocessing approach that combines k-nearest neighbor (KNN) completion, for the imputation of missing data, with principal component analysis (PCA), for the processing of redundant data. The approach reduces the data size while producing a high-quality sample after imputation of missing and outlier data. A rigorous comparison is made between our approach and two others. The dissolved gas data from Rio Tinto Alcan’s transformer T0001 were imputed by KNN with k = 5. For the 6 imputed gases, the average percentage error is about 2% with KNN imputation, compared with 17.5% for average (mean) imputation and 23.65% for multiple imputation. For data compression, 2 principal axes were retained based on the elbow rule and the Kaiser threshold.
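The two-stage pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic (not the T0001 measurements), the function names `knn_impute` and `pca_kaiser` are hypothetical, and the distance measure (root-mean-square difference over jointly observed columns) and the column-mean fallback are assumptions. Only k = 5, the six variables, and the Kaiser threshold (eigenvalue greater than 1 on the correlation matrix) are taken from the text.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill each NaN with the mean of the k nearest rows that observe that column.
    Distance is the RMS difference over columns observed in both rows (assumption)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)  # fallback when no donor row exists
    for i in np.where(missing.any(axis=1))[0]:
        dists = np.full(X.shape[0], np.inf)
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~missing[i] & ~missing[j]
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for col in np.where(missing[i])[0]:
            # Nearest comparable rows that actually have a value in this column.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not missing[j, col]][:k]
            filled[i, col] = X[donors, col].mean() if donors else col_means[col]
    return filled

def pca_kaiser(X):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)           # ascending order
    order = np.argsort(eigvals)[::-1]                 # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_keep = max(1, int(np.sum(eigvals > 1.0)))       # Kaiser threshold
    scores = Z @ eigvecs[:, :n_keep]
    return scores, eigvals, n_keep

# Synthetic stand-in for six correlated gas concentrations (not real data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X_true = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

# Remove roughly 10% of the entries at random, then impute and compress.
mask = rng.random(X_true.shape) < 0.1
X_missing = X_true.copy()
X_missing[mask] = np.nan

X_filled = knn_impute(X_missing, k=5)
scores, eigvals, n_keep = pca_kaiser(X_filled)
print("components kept:", n_keep)
print("explained variance ratio:", eigvals[:n_keep] / eigvals.sum())
```

The elbow rule mentioned in the abstract is a visual check on the sorted eigenvalues (`eigvals` here) and is not automated in this sketch. In practice, a library routine such as scikit-learn's `KNNImputer` could replace the hand-written loop.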