Exploring and cleaning big data with random sample data blocks

Salman Salloum; Joshua Zhexue Huang; Yulin He

doi:10.1186/s40537-019-0205-4

Journal of Big Data (Jun 2019)

Affiliations

Salman Salloum: Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University
Joshua Zhexue Huang: Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University
Yulin He: Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University

DOI: https://doi.org/10.1186/s40537-019-0205-4
Journal volume & issue: Vol. 6, no. 1, pp. 1–28

Abstract

Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and obtain samples of clean data. We provide a theoretical analysis of using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. Experimental results on three real data sets show that the approximate results from RSP-Explore rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
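
The block-level sampling idea in the abstract can be illustrated with a small single-machine sketch. This is not the authors' RSP-Explore implementation: it assumes that a random shuffle followed by an even split is an acceptable stand-in for generating RSP blocks, it uses synthetic data, and the function names `make_rsp_blocks` and `estimate_mean` are hypothetical.

```python
import numpy as np

def make_rsp_blocks(data, num_blocks, rng):
    """Shuffle the records once, then cut them into near-equal blocks.

    Assumption: shuffling before splitting makes each block behave like
    a random sample of the whole data set.
    """
    shuffled = data[rng.permutation(len(data))]
    return np.array_split(shuffled, num_blocks)

def estimate_mean(blocks, num_selected, rng):
    """Estimate the mean from a block-level sample.

    Whole blocks are selected at random without replacement, and their
    per-block means are combined, weighted by block size.
    """
    chosen = rng.choice(len(blocks), size=num_selected, replace=False)
    block_means = [blocks[i].mean() for i in chosen]
    block_sizes = [len(blocks[i]) for i in chosen]
    return float(np.average(block_means, weights=block_sizes))

rng = np.random.default_rng(0)
# Synthetic stand-in for a large data set (illustrative only).
data = rng.normal(loc=5.0, scale=2.0, size=100_000)
blocks = make_rsp_blocks(data, num_blocks=100, rng=rng)

# Using more blocks should bring the estimate closer to the full-data mean.
for g in (1, 5, 20, 100):
    print(g, estimate_mean(blocks, g, rng))
```

When all blocks are selected, the weighted estimate coincides with the mean of the full data, so the loop shows estimates from progressively larger block-level samples up to that limit.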

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
