IEEE Access (Jan 2022)

Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research

  • Sergey Smetanin,
  • Mikhail Komarov

DOI
https://doi.org/10.1109/ACCESS.2022.3149897
Journal volume & issue
Vol. 10
pp. 18886 – 18898

Abstract

Read online

A growing body of literature has examined the potential of machine learning algorithms in constructing social indicators based on the automatic classification of digital traces. However, as long as the classification algorithms’ predictions are not completely error-free, the estimate of the relative occurrence of a particular class may be affected by misclassification bias, thereby affecting the value of the calculated social indicator. Although a significant amount of studies have investigated misclassification bias correction techniques, they commonly rely on a set of assumptions that are likely to be violated in practice, which calls into question the effectiveness of these methods. Thus, there is a knowledge gap with respect to the assessment of misclassification bias’s impact on a specific social indicator formula without strict reference to the number of classes. Moreover, given the erroneous nature of automatic classification algorithms, the quality of a predicted indicator can be assessed not only using regression quality metrics, as was done in existing literature, but also using correlation metrics. In this paper, we propose a simulation approach for assessing the impact of misclassification bias on the calculated social indicators in terms of regression and correlation metrics. The proposed approach focuses on indicators calculated based on the distribution of classes and can process any number of classes. The proposed approach allows selecting the most appropriate classification model for a particular social indicator, and vice versa. Moreover, it allows for assessment of the optimistic level of correlation between the indicator calculated based on the results of the classification algorithm and the true underlying indicator.

Keywords