IEEE Access (Jan 2024)

Manipulating Data Lakes Intelligently With Java Annotations

  • Lap Man Hoi,
  • Wei Ke,
  • Sio Kei Im

DOI
https://doi.org/10.1109/ACCESS.2024.3372618
Journal volume & issue
Vol. 12
pp. 34903 – 34917

Abstract

Data lakes are typically large repositories in which enterprises store data in a variety of formats. From a storage perspective, data can be categorized as structured, semi-structured, or unstructured. On the one hand, because data forms and transformation procedures are complex, many enterprises simply pour valuable data into data lakes without organizing or managing them effectively. This can create data silos (or data islands) or even data swamps, leaving some data permanently invisible. Although such data are integrated into a data lake, they are merely stored physically in the same environment and cannot be correlated with other data to realize their value. On the other hand, processing data from a data lake into a desired format, such as converting structured data to semi-structured data, is a difficult and tedious task that requires advanced programming skills. This article proposes a novel software framework called Java Annotation for Manipulating Data Lakes (JAMDL) that can manage heterogeneous data. The approach uses Java annotations to express the properties of data as metadata (data about data) so that the data can be converted into different formats and managed efficiently in a data lake. Furthermore, the article suggests using artificial intelligence (AI) translation models to generate Data Manipulation Language (DML) operations for data manipulation, and AI recommendation models to improve the visibility of data when data precipitation occurs.
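
To make the annotation idea concrete, the following is a minimal sketch of how a Java annotation can carry metadata that drives a structured-to-semi-structured conversion. It is not JAMDL's actual API: the annotation `LakeField`, the class `Customer`, and the method `toJson` are hypothetical names introduced here for illustration only, and the JSON writer handles just flat records with numbers, booleans, and strings.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.StringJoiner;

class AnnotationMetadataSketch {

    // Hypothetical annotation (an assumption, not part of JAMDL): records the
    // key a field should use when the record is written as semi-structured data.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface LakeField {
        String name();
    }

    // A structured record, such as one row of a relational table.
    static class Customer {
        @LakeField(name = "customer_id")
        int id = 42;

        @LakeField(name = "full_name")
        String fullName = "Ada Lovelace";

        // Not annotated, so the converter skips it.
        String internalNote = "not exported";
    }

    // Reads the annotation metadata by reflection and emits a flat JSON object.
    // Field order follows whatever order reflection returns.
    static String toJson(Object record) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Field field : record.getClass().getDeclaredFields()) {
            LakeField meta = field.getAnnotation(LakeField.class);
            if (meta == null) {
                continue; // no metadata, so the field is not exported
            }
            Object value;
            try {
                field.setAccessible(true);
                value = field.get(record);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
            String rendered;
            if (value == null || value instanceof Number || value instanceof Boolean) {
                rendered = String.valueOf(value); // null, numbers, booleans are unquoted
            } else {
                // Escape backslashes and double quotes only; enough for this sketch.
                rendered = "\"" + value.toString().replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
            }
            json.add("\"" + meta.name() + "\":" + rendered);
        }
        return json.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(new Customer()));
    }
}
```

The design point is that the conversion logic never names a concrete class: everything it needs (which fields to export and under what key) comes from the annotations, so supporting a new record type means annotating it rather than writing a new converter.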

Keywords