IEEE Access (Jan 2024)

The Usage of Template Mining in Log File Classification

  • Peter Marjai,
  • Attila Kiss

DOI
https://doi.org/10.1109/ACCESS.2024.3426959
Journal volume & issue
Vol. 12
pp. 96378 – 96386

Abstract

The continuing growth of large-scale, complex software systems has led to growing interest in using the log files created during software runtime. These files can serve various purposes, such as error prediction, performance evaluation, learning usage patterns, and improving reliability. As software systems become increasingly complicated, distinguishing log files generated by different components of the software becomes a new task. Classifying log files is important for several reasons, including resource optimization, compliance and auditing, automation and analysis, and understanding general system health. By classifying log files, organizations can better understand the health and performance of their systems: they can identify patterns, potential security threats, anomalies, errors, and malicious behaviors, and they can also optimize storage. In a log file, each line represents a specific event that has occurred. Such events can be identified with template miners, which assign a unique ID to each event. In this paper, instead of using the full-sized log files, we replace each line with its corresponding event ID and use the resulting smaller file for classification. We use several classification algorithms, such as Random Forest, k-NN, AdaBoost, and Decision Tree, to assign the files to groups corresponding to their origin types. 75% of the data is used for training, while the remaining 25% is used for testing. We conduct several experiments to verify the effectiveness of our method, evaluating precision, recall, F-score, and accuracy and measuring the time it takes to classify the files. Our results show that, while the performance of some of the investigated methods decreases slightly when used with the proposed approach, classifying the log files takes significantly less time, which can be beneficial, especially for large collections of log files.
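The idea of replacing each log line with an event ID before classification can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the template miner used, so the `NaiveTemplateMiner` class below is a hypothetical stand-in that treats any token containing a digit as a variable. The `knn_predict` helper is likewise an illustrative, standard-library nearest-neighbour classifier over event-ID counts, standing in for the Random Forest, k-NN, AdaBoost, and Decision Tree models mentioned above. The sample log lines are invented for the example.

```python
import re
from collections import Counter

# Assumption: any whitespace-delimited token that contains a digit is a variable.
_VARIABLE = re.compile(r"\S*\d\S*")


class NaiveTemplateMiner:
    """Toy template miner (hypothetical, for illustration only)."""

    def __init__(self):
        self.template_ids = {}

    def event_id(self, line):
        # Mask variable tokens so lines describing the same event share a template.
        template = _VARIABLE.sub("<*>", line.strip())
        # Assign the next free integer ID the first time a template is seen.
        return self.template_ids.setdefault(template, len(self.template_ids))


def to_event_ids(lines, miner):
    """Replace every log line with its event ID."""
    return [miner.event_id(line) for line in lines]


def knn_predict(train, query, k=1):
    """Label `query` by majority vote among the k nearest training files.

    `train` is a list of (event_id_list, label) pairs; distance is the
    L1 distance between event-ID count vectors.
    """
    query_counts = Counter(query)

    def distance(ids):
        counts = Counter(ids)
        keys = set(counts) | set(query_counts)
        return sum(abs(counts[key] - query_counts[key]) for key in keys)

    nearest = sorted(train, key=lambda item: distance(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented sample data: two labelled log files and one unlabelled file.
miner = NaiveTemplateMiner()
web_ids = to_event_ids(
    ["Connection from 10.0.0.1 accepted",
     "Connection from 10.0.0.7 accepted",
     "Request served in 12ms"],
    miner,
)
db_ids = to_event_ids(
    ["Query took 35ms", "Checkpoint 17 complete", "Query took 8ms"],
    miner,
)
new_ids = to_event_ids(
    ["Connection from 10.0.0.9 accepted", "Request served in 40ms"],
    miner,
)
predicted = knn_predict([(web_ids, "web"), (db_ids, "db")], new_ids)
```

In this sketch each file shrinks to a short list of integers, which is the reduced representation passed to the classifier. A real pipeline would use an established template miner and a larger labelled collection split into training and test sets.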

Keywords