IEEE Access (Jan 2019)

Quality Management of Workers in an In-House Crowdsourcing-Based Framework for Deduplication of Organizations’ Databases

  • Morteza Saberi,
  • Omar Khadeer Hussain,
  • Elizabeth Chang

DOI
https://doi.org/10.1109/ACCESS.2019.2924979
Journal volume & issue
Vol. 7
pp. 90715 – 90730

Abstract

Organizations in the current era of big data generate massive volumes of data, and they must maintain its quality for it to be useful in decision-making. Dirty data is a problem for every organization. One aspect of dirty data is the presence of duplicate records, which negatively affect the organization's operations in many ways. Many existing approaches address this problem with traditional data cleansing methods. In this paper, we address it with an in-house crowdsourcing-based framework named DedupCrowd. One of the main challenges of crowdsourcing-based approaches is monitoring the performance of the crowd, which is necessary to maintain the integrity of the whole process. We propose a statistical quality control-based technique to regulate the performance of the crowd. We apply the proposed framework in the context of a contact center, where the Customer Service Representatives serve as the crowd that assists in duplicate detection. Using comprehensive working examples, we show how the different modules of DedupCrowd work not only to monitor the performance of the crowd but also to assist in duplicate detection.

Keywords