IEEE Access (Jan 2023)

Restoration of Data Structures Using Machine Learning Techniques

  • Branislava Cvijetic,
  • Zaharije Radivojevic

DOI
https://doi.org/10.1109/ACCESS.2023.3323846
Journal volume & issue
Vol. 11
pp. 113077 – 113099

Abstract

Tabular data is the most common format for representing real-world information. Almost all programs that store or process data, such as relational database systems, spreadsheets, and statistical analysis software, can import or export tabular data. These programs are not robust enough to automatically handle messy delimited files or files that contain data from multiple tables. Messy datasets may also use multiple delimiters, lack column names, and contain rows in which columns have been substituted or deleted. This paper proposes the STCExtract algorithm for reconstructing the table structures and data into which the contents of an input file can be arranged. The STCExtract algorithm is designed to be domain-independent and modular with respect to the machine learning algorithm used and other parameters. The algorithm works in two phases: the first phase recognizes the original data tables, and the second phase recognizes the columns of those tables. The STCExtract algorithm was evaluated through extensive experiments using multiple real datasets. Multiple messy datasets were generated for the four experiments. Three experiments were conducted to determine the optimal parameters for the STCExtract algorithm, and the fourth evaluated the proposed algorithm. The results show that the STCExtract algorithm correctly reconstructed the structure of the tables with an accuracy of 94.4% to 100%. In the second phase, when the data were allocated to columns, its accuracy ranged from 59.7% to 90.2%.

Keywords