Journal of Big Data (Jan 2022)

Addressing big data variety using an automated approach for data characterization

  • Georgios Vranopoulos,
  • Nathan Clarke,
  • Shirley Atkinson

DOI
https://doi.org/10.1186/s40537-021-00554-3
Journal volume & issue
Vol. 9, no. 1
pp. 1 – 28

Abstract

Read online

Abstract The creation of new knowledge from manipulating and analysing existing knowledge is one of the primary objectives of any cognitive system. Most of the effort on Big Data research has been focussed upon Volume and Velocity, while Variety, “the ugly duckling” of Big Data, is often neglected and difficult to solve. A principal challenge with Variety is being able to understand and comprehend the data. This paper proposes and evaluates an automated approach for metadata identification and enrichment in describing Big Data. The paper focuses on the use of self-learning systems that will enable automatic compliance of data against regulatory requirements along with the capability of generating valuable and readily usable metadata towards data classification. Two experiments towards data confidentiality and data identification were conducted in evaluating the feasibility of the approach. The focus of the experiments was to confirm that repetitive manual tasks can be automated, thus reducing the focus of a Data Scientist on data identification and thereby providing more focus towards the extraction and analysis of the data itself. The origin of the datasets used were Private/Business and Public/Governmental and exhibited diverse characteristics in relation to the number of files and size of the files. The experimental work confirmed that: (a) the use of algorithmic techniques attributed to the substantial decrease in false positives regarding the identification of confidential information; (b) evidence that the use of a fraction of a data set along with statistical analysis and supervised learning is sufficient in identifying the structure of information within it. With this approach, the issues of understanding the nature of data can be mitigated, enabling a greater focus on meaningful interpretation of the heterogeneous data.

Keywords