IEEE Access (Jan 2024)
Isolation Forest With Exclusion of Attributes Based on Shapley Index
Abstract
Anomaly detection is an important task in data analysis that aims to identify patterns deviating from known norms or typical behavior. Such anomalies often indicate significant, and sometimes critical, issues such as fraud, network intrusions, and system failures. Traditional anomaly detection algorithms focus primarily on the attributes of individual observations within a dataset, typically establishing a ‘normal’ profile and flagging deviations from it as anomalies. This paper introduces an enhancement to the Isolation Forest algorithm, a well-known anomaly detection method valued for its effectiveness and efficiency, especially on large datasets. Isolation Forest randomly partitions the data space to construct binary trees, and the anomaly score of a data point is determined by the length of its path from the root of a tree to the leaf that isolates it, which enables fully unsupervised detection of outliers. The proposed methodology repeatedly builds Isolation Forest models on versions of the dataset from which individual attributes are excluded. We use the SHAP (SHapley Additive exPlanations) method, which originates in game theory, to determine the impact of individual features on the model output. For the Isolation Forest trained on the full dataset, SHAP provides the coefficients of influence of each attribute on the prediction result. Both negative and positive influences are treated as significant when computing the anomaly score. The results of all sub-models are then combined into a weighted average, with the weights derived from the SHAP values. A comparative analysis of evaluation metrics showed a substantial improvement attributable to the proposed methodology.
In most cases, the evaluation metrics improved by 3.5 to 6 percentage points, and one metric improved by 12 percent. These results demonstrate that the integrated approach not only improves the prediction accuracy of the Isolation Forest algorithm but also offers a more interpretable view of the data. This advance in anomaly detection methodology has significant implications for fields where quick and accurate detection of outliers is paramount.
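The attribute-exclusion scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `shap_weights` and `exclusion_ensemble_scores` are hypothetical, and the rule that the sub-model trained without attribute j is weighted by the normalized mean absolute SHAP value of attribute j is an assumption, since the abstract does not state how weights are assigned. Absolute values are used because both negative and positive influences are treated as significant.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def shap_weights(shap_values):
    """Convert per-sample SHAP values (n_samples, n_features) into one
    non-negative weight per attribute that sums to 1.

    Assumption: weight = normalized mean absolute SHAP value.
    """
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    total = importance.sum()
    if total == 0:
        # No attribute has any influence: fall back to uniform weights.
        return np.full(importance.shape, 1.0 / importance.size)
    return importance / total


def exclusion_ensemble_scores(X, weights, n_estimators=100, random_state=0):
    """Weighted average of anomaly scores from sub-models, each trained
    with one attribute excluded. Higher score = more anomalous.

    Assumption: the sub-model without attribute j receives weights[j].
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    if len(weights) != n_features:
        raise ValueError("need exactly one weight per attribute")
    scores = np.zeros(X.shape[0])
    for j in range(n_features):
        X_sub = np.delete(X, j, axis=1)  # drop attribute j
        model = IsolationForest(
            n_estimators=n_estimators, random_state=random_state
        ).fit(X_sub)
        # score_samples is higher for normal points, so negate it.
        scores += weights[j] * (-model.score_samples(X_sub))
    return scores
```

In practice the SHAP values would come from the model trained on the full dataset, for example via `shap.TreeExplainer(full_model).shap_values(X)` from the `shap` package, assuming the installed version supports scikit-learn's `IsolationForest`. That call is kept out of the sketch so it depends only on NumPy and scikit-learn.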
Keywords