IEEE Access (Jan 2023)
A Data Analytics Methodology to Visually Analyze the Impact of Bias and Rebalancing
Abstract
Data Analytics has become a key component of many business processes that influence several aspects of our daily lives. Any misinterpretation of, or flaw in, Data Analytics results can therefore cause significant damage, especially when it stems from one of the most often overlooked issues: the unaware use of biased data. When data bias goes unnoticed, it warps the meaning of the data, with a devastating effect on Data Analytics results. Although it is widely argued that the most common way to deal with data bias is to rebalance biased datasets, rebalancing is not a neutral transformation: it has several potentially undesired side effects that are likely to harm the results of data analyses. Therefore, in order to analyze the underlying bias in datasets, in this work we present (i) a comprehensive methodology based on visualization techniques, which assists users in defining their analytical requirements in order to automatically detect and visually represent data bias, helping them determine whether it is appropriate to artificially rebalance their dataset; (ii) a novel metamodel for visually representing bias; (iii) a motivating real-world running example used to analyze the impact of bias in Data Analytics; and (iv) an assessment of the improvements introduced by our proposal through a complete real-world case study using a Fire Department Calls for Service dataset, which demonstrates that rebalancing datasets is not always the best option. It is crucial to study the context in which decisions are going to be made. Moreover, it is also important to perform a pre-analysis to understand the nature of the datasets and how biased they are.
Keywords