Iranian Journal of Information Processing & Management (Mar 2023)
A Conceptual Framework for Preprocessing and Improving Quality of Event Log in Process Mining
Abstract
In today's challenging world, organizational growth is not possible without the efficient use of data. Process mining uses machine learning methods and business process management concepts to extract hidden knowledge about business processes from data stored in information systems. Process discovery is the first step in process mining; its main goal is to transform an event log into a process model. However, process discovery methods cannot be used without appropriate data, because any analysis based on low-quality data leads to poor insights and bad decisions that negatively affect the performance of the organization or business. This paper provides a new conceptual framework for preprocessing the data fed into process discovery methods in order to improve the quality of the extracted model. The framework was developed through a qualitative research process based on grounded theory. For this purpose, 102 articles on data quality in process mining were reviewed. After filtering and integrating the issues reported in the literature, the most critical data quality challenges in this field were identified: “noisy/infrequent events”, “outlier events”, “anomalous events”, “missing values”, “incorrect time format”, “ambiguous timestamps”, “synonymous activities”, and “size and complexity”. Then, the basic steps of data preprocessing and cleaning were defined, comprising the activities of “repair”, “anomaly detection”, “filtering”, and “dimensional reduction”. The final preprocessing framework builds on these data quality issues and the identified activities. Four standard datasets derived from real-world processes were used to assess the framework's performance: four standard process discovery algorithms were applied first to the raw data and then to the data preprocessed by the proposed framework.
The results showed that preprocessing the input data improves the quality criteria of the models extracted by the process discovery algorithms. Furthermore, to evaluate the validity of the proposed framework, its performance was compared with three preprocessing methods: “sampling”, “statistical preprocessing”, and “prototype selection”; the results indicate that the proposed approach is more efficient. The results of this study can be used as guidelines by data and business analysts to identify and resolve data quality problems in process mining projects.
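As a rough illustration of what the “repair” and “filtering” activities can look like on an event log, the following minimal Python sketch may help. It is not the framework proposed in the paper. The tuple layout of the log, the `SYNONYMS` map, the `preprocess` function, and the `min_variant_share` threshold are all assumptions introduced for this example, and dropping incomplete events is only a crude stand-in for a real repair step.

```python
from collections import Counter, defaultdict

# Hypothetical synonym map (assumption): in practice this would come
# from domain knowledge about the activity labels in the log.
SYNONYMS = {"Approve Request": "Approve", "approve": "Approve"}


def preprocess(events, min_variant_share=0.05):
    """Clean a toy event log.

    events: iterable of (case_id, activity, timestamp) tuples (assumed layout).
    Returns a dict mapping case_id to its ordered list of activities.
    """
    # "Repair" stand-in: drop events with missing values and
    # map synonymous activity labels to one canonical label.
    cleaned = [
        (case, SYNONYMS.get(activity, activity), ts)
        for case, activity, ts in events
        if case is not None and activity is not None and ts is not None
    ]

    # Group events into traces, ordered by timestamp within each case.
    traces = defaultdict(list)
    for case, activity, _ts in sorted(cleaned, key=lambda e: (e[0], e[2])):
        traces[case].append(activity)

    # "Filtering": keep only cases whose trace variant is frequent enough.
    variants = Counter(tuple(acts) for acts in traces.values())
    total = len(traces)
    return {
        case: acts
        for case, acts in traces.items()
        if variants[tuple(acts)] / total >= min_variant_share
    }
```

Calling `preprocess(log, min_variant_share=0.3)` on a small log would keep only the cases whose activity sequence accounts for at least 30% of all cases; the resulting traces could then be passed to a process discovery algorithm.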
Keywords