IEEE Access (Jan 2024)
Data Ingestions as a Service (DIaaS): A Unified Interface for Heterogeneous Data Ingestion, Transformation, and Metadata Management for Data Lake
Abstract
Data ingestion tools are critical component of Data Lake. Existing data ingestion tools face challenges of handling large variety, formats, sources of data. There exists void for unified data ingestion interface to handle the above research problems. This study proposes an innovative and integrated framework for data ingestion in a data lake, addressing the challenges posed by heterogeneous data sources, formats, and metadata management. The framework comprises three novel modules: First Unified Data Integration Connectors (UDIC), which provide seamless connectivity and data retrieval capabilities from diverse sources including databases, data warehouses, file systems, cloud storage, and APIs; Second, Adaptive Data Variety Transformation (ADVT), a module that intelligently handles the transformation and processing of structured, semi-structured, and unstructured data types, ensuring efficient ingestion into the data lake; and third, Intelligent Metadata Management (IMM), a module that captures, stores, and manages metadata associated with the ingested data, offering advanced search, discovery, and enrichment functionalities. Comparative study corroborates features offered by the service with existing data ingestion tools to evaluate the novelty and significance of the study. Performance validation shows varying ingestion latencies across different data types: approximately 148.1 microseconds per record for structured data, 234.2 microseconds per record for semi-structured data, 65.6 microseconds per kilobyte (KB) for video data, and 42.7 microseconds per KB for image data. These results underscore the importance of considering data structure and size in optimizing ingestion processes. Overall, this research aims to revolutionize data ingestion in data lake environments by providing a unified solution for handling diverse data sources, formats, and metadata management.
Keywords