IEEE Access (Jan 2020)

Correcting Biases in Online Social Media Data Based on Target Distributions in the Physical World

  • Zhu Wang,
  • Zhiwen Yu,
  • Renjie Fan,
  • Bin Guo

DOI
https://doi.org/10.1109/ACCESS.2020.2966790
Journal volume & issue
Vol. 8
pp. 15256 – 15264

Abstract

Social media is an important data source. Billions of posts, likes, and connections are created by people all around the world every day. The promises of such social media data are plentiful, including understanding “what the world thinks” about a social issue, brand, product, celebrity, or other entity, as well as enabling better decision-making in a variety of fields including public policy, transportation, healthcare, and economics. However, while the validity of such data-driven research depends largely on the accuracy and representativeness of the data used, online social media data collected with common mechanisms are usually biased compared with the distribution of related features in the physical world. For example, sampling issues associated with such data sources, especially selection bias, can have far-reaching implications for data analysis and interpretation. Therefore, how to calibrate biases in an online social media dataset to achieve unbiased results has become a significant and urgent problem. In this paper, we propose to address the bias calibration issue by adopting a data resampling approach. Specifically, we develop a data resampling algorithm, based on the stochastic stability theory of Markov chains, that collects data samples from the given biased dataset so as to calibrate possible biases. By regarding the data resampling process as state transitions of a stochastic variable, the algorithm leverages the stationary distribution of Markov chains to build an acceptance matrix that controls the resampling process, and thus optimizes the original dataset towards target distributions in the physical world. Experimental results demonstrate that the proposed algorithm can effectively output datasets whose distributions are similar to the target ones.
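The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It uses a standard independence Metropolis-Hastings chain over a single categorical feature: records are proposed uniformly from the biased dataset, and an acceptance probability is chosen so that the chain's stationary distribution matches the target. The function name `resample_to_target` and its parameters (`key`, `target`, `burn_in`, `seed`) are hypothetical names introduced for illustration.

```python
import random
from collections import Counter


def resample_to_target(records, key, target, n_samples, burn_in=1000, seed=0):
    """Resample `records` (with replacement) so that the categories returned by
    `key` approximately follow `target`, a dict mapping category -> probability.

    Assumption: an independence Metropolis-Hastings sampler, used here as a
    stand-in for the acceptance-matrix construction described in the abstract.
    """
    rng = random.Random(seed)

    # Empirical (biased) category distribution q; this is also the proposal
    # distribution, because proposals are drawn uniformly from the records.
    counts = Counter(key(r) for r in records)
    q = {c: n / len(records) for c, n in counts.items()}

    # Every category the target needs must occur at least once in the data.
    missing = [c for c, p in target.items() if p > 0 and c not in q]
    if missing:
        raise ValueError(f"target categories absent from data: {missing}")

    # Start from a record whose category has positive target probability.
    current = rng.choice([r for r in records if target.get(key(r), 0.0) > 0])

    out = []
    for step in range(burn_in + n_samples):
        proposal = rng.choice(records)
        i, j = key(current), key(proposal)
        # Acceptance probability for moving from category i to category j.
        # With this choice the chain satisfies detailed balance with respect
        # to `target`, so `target` is its stationary distribution.
        ratio = (target.get(j, 0.0) * q[i]) / (target[i] * q[j])
        if rng.random() < min(1.0, ratio):
            current = proposal
        if step >= burn_in:
            out.append(current)
    return out
```

For example, given records that are 80% one group and 20% another and a target of `{"young": 0.5, "old": 0.5}`, the accepted samples should drift toward an even split as `n_samples` grows. Because the chain revisits records, the output is a resample with replacement, and consecutive samples are correlated; thinning or a longer run may be needed in practice.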

Keywords