IEEE Access (Jan 2024)

Knowledge Graph Generation and Application for Unstructured Data Using Data Processing Pipeline

  • Sushmi Thushara Sukumar,
  • Chung-Horng Lung,
  • Marzia Zaman,
  • Ritesh Panday

DOI
https://doi.org/10.1109/ACCESS.2024.3462635
Journal volume & issue
Vol. 12
pp. 136759 – 136770

Abstract

With the rapid advancement of technology and the vast volume of unstructured data available on the Internet, there is a pressing need to extract information effectively from diverse data formats. Otherwise, valuable pieces of information may be lost. To address this issue, researchers are using Machine Learning (ML) and Natural Language Processing (NLP) techniques to extract information from unstructured text, including through the use of Knowledge Graphs (KGs). This paper presents end-to-end experimental studies of KG construction from unstructured text using open-source techniques and concrete real-world examples in different problem domains. The unstructured data was passed through a text processing pipeline consisting of coreference resolution, named entity linking, and relationship extraction. The pipeline is designed to automatically store the extracted entities and their relationships in the Neo4j graph database. Experiments were conducted on a real-world unstructured BBC News Dataset to analyze the output of the pipeline. The experience reported here can help practitioners adopt KG creation to capture valuable information from large volumes of unstructured text. The relationship extraction step was evaluated with two techniques, OpenNRE and REBEL, in terms of extracted entities, relationship types, accuracy (61.4% with OpenNRE and 87% with REBEL), and processing time. Further, the data processing pipeline was applied to the unstructured dataset of the Transportation Safety Board’s (TSB) Findings for aviation safety analysis. The results showed that the structured relationships identified by the pipeline provided valuable indicators, as they captured critical aviation safety information such as the flight, aircraft type, and event. The pipeline can be fine-tuned with a domain-specific knowledge base to improve accuracy and entity detection.
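
To make the storage step concrete, the following is a minimal sketch, not the authors' implementation, of how one extracted (head, relation, tail) triple might be turned into a parameterized Cypher statement for Neo4j. The `Triple` type, the `Entity` node label, the `name` property, and the helper names are assumptions made for illustration only. The sketch builds the query text and parameters and does not connect to a database.

```python
from __future__ import annotations

import re
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted relationship (hypothetical shape, not from the paper)."""
    head: str
    relation: str
    tail: str


def relation_type(label: str) -> str:
    """Turn a free-text relation label into a Cypher-safe relationship type."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_").upper()
    if not cleaned:
        raise ValueError(f"empty relation label: {label!r}")
    # Prefix types that would otherwise start with a digit.
    return "R_" + cleaned if cleaned[0].isdigit() else cleaned


def to_cypher(triple: Triple) -> tuple[str, dict[str, str]]:
    """Build a MERGE query and its parameters for one triple.

    Entity names are passed as parameters. The relationship type is
    interpolated into the query text, so it is sanitized first.
    The `Entity` label and `name` property are assumed, not prescribed.
    """
    rel = relation_type(triple.relation)
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{rel}]->(t)"
    )
    return query, {"head": triple.head, "tail": triple.tail}


if __name__ == "__main__":
    # Illustrative triple, invented for this example.
    query, params = to_cypher(Triple("Ottawa", "capital of", "Canada"))
    print(query)
    print(params)
```

Using MERGE rather than CREATE means that an entity mentioned in several sentences maps to a single node, which is one plausible way to keep the graph free of duplicates. The resulting query and parameters could then be passed to a Neo4j client session.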

Keywords