علوم و فنون مدیریت اطلاعات (Sep 2023)

Data Quality in Process Mining: A Systematic Review

  • Ahmad Salehi,
  • Mohammad Aghdasi,
  • Toktam Khatibi,
  • Majid SheikhMohammadI

DOI
https://doi.org/10.22091/stim.2022.7800.1737
Journal volume & issue
Vol. 9, no. 3
pp. 160 – 103

Abstract

Purpose: Process mining connects data mining and machine learning with business process management techniques. A business process is a series of independent and interdependent activities that use one or more resources (such as time, employees, and money) to transform inputs (data, materials, etc.) into the required outputs. Process analysis techniques make it possible to examine the actual behavior of organizations, including the performance of individuals, departments, and resources. The results of process analysis, which typically include the organization's business process models, can be compared with the organization's documents and requirements, so that processes can be compared, reviewed, monitored, and enhanced. Process mining methods operate on event logs stored in information systems, and applying them to low-quality input data does not yield accurate conclusions about an organization's business processes. In recent years, researchers have focused on evaluating and enhancing the quality of the input data for process mining techniques. The objective of this study is to identify and categorize the most significant data quality issues and to recognize the approaches proposed to address them in process mining.
Methods: This research employs a systematic review, with the intent of analyzing all valid evidence in order to answer the research questions. It investigates 102 academic studies published between 2007 and 2021, including conference papers, journal articles, and theses. To this end, a systematic three-part research methodology was employed. In the first part, the research definition, the research field was defined first, followed by the research objectives and questions; in the concluding step of this part, the scope of the research was defined.
In the second part, the research methodology and the inclusion criteria for the studies found during the search for scientific resources were defined; the identified studies were then evaluated in terms of their citations and classified. In the third part, devoted to evaluating the research, the final review of the study was conducted, and the findings and conclusions were then derived from the investigation of the preceding studies. Important data and evidence were extracted from the collected studies, allowing the necessary tables and graphs to be created.
Findings: The findings show that researchers have paid increasing attention to data quality challenges in process mining in recent years, with the greatest number of studies published in 2019 and 2020. The majority of articles were published in three scientific databases, namely Springer, IEEE, and Elsevier. Of the studies examined, 51% were presented at prestigious conferences, 36% were published in prestigious scientific journals, and the remaining 13% were dissertations and university reports. The selected studies investigate 20 data quality issues that can arise in the input data. These issues have been categorized into five levels: trace, event, case, activity, and timestamp. Four foundational approaches have been identified for evaluating and resolving data quality challenges in process mining: 1) data quality frameworks, 2) preprocessing, 3) anomaly detection, and 4) repair. Our findings indicate that preprocessing techniques that seek to remove chaotic and infrequent behaviors from the event log have received more attention than other techniques.
In addition, these results show that anomaly detection and the reconstruction of missing events have become popular research topics within process mining in recent years. The studies on data quality in process mining reveal an abundance of approaches and methods for addressing data quality challenges. The investigation also revealed that the use of colored Petri nets as a mathematical method has been considered in all of the selected studies.
Conclusions: The data needed for process mining methods can be obtained from various sources. One of the major advantages of process mining is that it is not limited to a specific type of system. Any workflow-based system, such as ticketing, resource management, databases, data warehouses, legacy systems, and even manually collected data, can be analyzed as long as its records can be distinguished by case ID, activity, and timestamp attributes. In real-world scenarios, most data is not collected for process mining purposes or is unsuitable for use in process mining analyses. Data that is recorded manually or scattered among various isolated systems is especially likely to contain errors. Despite the efforts made to improve the quality of input data in process mining, it is still necessary to develop efficient frameworks and methods to identify, evaluate, and address data quality challenges in real business processes, which are often characterized by high volume and complexity. The results of this research can offer a fresh perspective for researchers, data science specialists, and business analysts.
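
As an illustration of the requirement that event data carry case ID, activity, and timestamp attributes, the following minimal Python sketch flags a few basic quality problems in a small event log. It is not a method taken from the reviewed studies: the column names (`case_id`, `activity`, `timestamp`), the function `check_event_log`, the ISO 8601 timestamp format, and the sample records are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical attribute names; real logs may name these columns differently.
REQUIRED = ("case_id", "activity", "timestamp")

def check_event_log(events):
    """Return (row_index, issue) pairs for basic data quality problems."""
    issues = []
    last_seen = {}  # case_id -> latest valid timestamp seen so far
    for i, event in enumerate(events):
        # Missing or empty required attributes.
        missing = [key for key in REQUIRED if not event.get(key)]
        if missing:
            issues.append((i, "missing " + ", ".join(missing)))
            continue
        # Timestamps are assumed to be ISO 8601 strings.
        try:
            ts = datetime.fromisoformat(event["timestamp"])
        except ValueError:
            issues.append((i, "unparseable timestamp"))
            continue
        # Events of one case are expected in chronological order.
        case = event["case_id"]
        if case in last_seen and ts < last_seen[case]:
            issues.append((i, "timestamp out of order within case"))
        last_seen[case] = max(ts, last_seen.get(case, ts))
    return issues

# Fabricated sample records, for illustration only.
log = [
    {"case_id": "c1", "activity": "register", "timestamp": "2021-03-01T09:00:00"},
    {"case_id": "c1", "activity": "", "timestamp": "2021-03-01T09:05:00"},
    {"case_id": "c1", "activity": "approve", "timestamp": "2021-03-01T08:00:00"},
    {"case_id": "c2", "activity": "register", "timestamp": "01/03/2021"},
]
issues = check_event_log(log)
for row, issue in issues:
    print(row, issue)
```

A check of this kind only surfaces candidate problems; whether a flagged event should be filtered out or repaired depends on the process being analyzed.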

Keywords