Data Science Journal (Dec 2024)
Semantic Schema Extraction in NoSQL Databases using BERT Embeddings
Abstract
NoSQL databases, valued for flexibility and scalability, pose analytics challenges due to their schema-less nature. Automatic schema extraction is crucial, with existing techniques limited in handling nested structures. Leveraging Natural Language Processing (NLP) advancements, this paper introduces a novel BERT Embeddings-Based approach for extracting schemas from NoSQL databases. The method analyzes semantic relationships within triplets from JSON documents through four stages: triplet extraction, preprocessing, BERT Embedding generation, and similarity analysis. Evaluation on real datasets demonstrates over 83% accuracy in extracting valid nested schema components. The study reveals interdisciplinary intersections, using NLP to unveil structures in scenarios lacking explicit schemas, showcasing significant potential for autonomous schema extraction from raw, unstructured data formats.
Keywords