BioMedInformatics (Nov 2023)
Enhancing Semantic Web Technologies Using Lexical Auditing Techniques for Quality Assurance of Biomedical Ontologies
Abstract
Semantic web technologies (SWT) represent data in a format that is easier for machines to understand. Validating the knowledge in data graphs created using SWT is critical to ensure that the axioms accurately represent the so-called “real” world. However, data graph validation is a significant challenge in the semantic web domain. The Shapes Constraint Language (SHACL) is the latest W3C standard developed with the goal of validating data-graphs. SHACL (pronounced as shackle) is a relatively new standard and hitherto has predominantly been employed to validate generic data graphs like WikiData and DBPedia. In generic data graphs, the name of a class does not affect the shape of a class, but this is not the case with biomedical ontology data graphs. The shapes of classes in biomedical ontology data graphs are highly influenced by the names of the classes, and the SHACL shape creation methods developed for generic data graphs fail to consider this characteristic difference. Thus, the existing SHACL shape creation methods do not perform well for domain-specific biomedical ontology data graphs. Maintaining the quality of biomedical ontology data graphs is crucial to ensure accurate analysis in safety-critical applications like Electronic Health Record (EHR) systems referencing such data graphs. Thus, in this work, we present a novel method to create enhanced SHACL shapes that consider the aforementioned characteristic difference to better validate biomedical ontology data graphs. We leverage the knowledge available from lexical auditing techniques for biomedical ontologies and incorporate this knowledge to create smart SHACL shapes. We also create SHACL shapes (baseline SHACL graph) without incorporating the lexical knowledge of the class names, as is performed by existing methods, and compare the performance of our enhanced SHACL shapes with the baseline SHACL shapes. The results demonstrate that the enhanced SHACL shapes augmented with lexical knowledge of the class names identified 176 violations which the baseline SHACL shapes, void of this lexical knowledge, failed to detect. Thus, the enhanced SHACL shapes presented in this work significantly improve the validation performance of biomedical ontology data graphs, thereby reducing the errors present in such data graphs and ensuring safe use in the life-critical applications referencing them.
Keywords