IEEE Access (Jan 2024)

Automated Text Structuring: Natural Language Processing and Regular Expressions in XML Tag Filling

  • Ivan P. Malashin,
  • Vadim S. Tynchenko,
  • Andrei P. Gantimurov,
  • Vladimir A. Nelyub,
  • Aleksei S. Borodulin

DOI
https://doi.org/10.1109/ACCESS.2024.3511674
Journal volume & issue
Vol. 12
pp. 190582 – 190597

Abstract

The conversion of documents into XML markup requires efficient algorithms and automated solutions. This work focuses on tagging documents so that they conform to the NISO STS standard, which ensures compatibility across systems. A method combining Natural Language Processing (NLP) and regular expressions (regex) for automated XML tag filling is proposed. NLP improves content understanding, while regex enables precise pattern matching. This approach streamlines the conversion process, reducing manual effort and producing standardized tagging. Experiments validate the effectiveness of the method in achieving accurate XML markup aligned with NISO STS guidelines. The research advances automated data structuring, exemplified by the GOST R ontology expressed within NISO STS, and provides a template for XML structuring of documents based on other ontologies.
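
To make the regex half of the idea concrete, here is a minimal sketch of filling XML tags from a pattern match. It is not the authors' implementation: the pattern `DESIGNATION`, the function `fill_std_meta`, and the sample sentence are hypothetical. The element names (`std-meta`, `std-ident`, `originator`, `doc-number`, `std-ref`) are only loosely modelled on NISO STS and have not been validated against the schema. The NLP step described in the abstract is left out entirely.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern for a designation such as "GOST R 7.0.97-2016".
# Real designations vary; this covers only the simple dotted-number form.
DESIGNATION = re.compile(
    r"\b(?P<originator>GOST R)\s+(?P<number>\d+(?:\.\d+)*)-(?P<year>\d{4})\b"
)


def fill_std_meta(text):
    """Return a <std-meta> element built from the first designation in text,
    or None when no designation is found."""
    match = DESIGNATION.search(text)
    if match is None:
        return None

    meta = ET.Element("std-meta")

    # Identifier parts go into separate child tags (illustrative layout).
    ident = ET.SubElement(meta, "std-ident")
    ET.SubElement(ident, "originator").text = match.group("originator")
    ET.SubElement(ident, "doc-number").text = match.group("number")

    # The full matched designation is kept as a dated reference string.
    ref = ET.SubElement(meta, "std-ref", {"type": "dated"})
    ref.text = match.group(0)
    return meta


if __name__ == "__main__":
    sample = "This standard, GOST R 7.0.97-2016, covers organizational documents."
    element = fill_std_meta(sample)
    print(ET.tostring(element, encoding="unicode"))
```

In a fuller pipeline, an NLP component would presumably decide which text spans are candidates for which tags before patterns like this one are applied. That division of labour is an assumption drawn from the abstract's wording, not a detail it states.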

Keywords