Information (Mar 2024)
Algorithm-Based Data Generation (ADG) Engine for Dual-Mode User Behavioral Data Analytics
Abstract
Modern data analytics depends on large volumes of user data. However, sufficient data for a given task can usually be amassed only in specific data-gathering contexts, namely those that involve limited security information or are associated with older applications. There are many scenarios in which a domain is too new, too specialized, or too secure, or in which data are too sparse, to adequately support data analytics. In such cases, synthetic data generation becomes necessary to enable further analysis. To address this challenge, we have developed an Algorithm-based Data Generation (ADG) Engine that generates data without requiring any initial data, relying instead on user behavior patterns that cover both normal and abnormal behavior. The ADG Engine uses a structured database system to track users across different types of activity, and it draws on this information to make the generated data as realistic as possible. Our work focuses on data analytics: the engine generates abnormalities within the data and lets users customize the ratio of normal to abnormal data. Where obtaining additional data through conventional means would be impractical or impossible, particularly data with specific characteristics such as a given anomaly percentage, algorithmically generated datasets provide a viable alternative. In this paper, we introduce the ADG Engine, which creates coherent datasets, entirely from scratch, for multiple users engaged in different activities across various platforms. The ADG Engine applies normal and abnormal ratios within each data platform through core algorithms for time-based and numeric-based anomaly generation. The resulting abnormal proportion, compared against the expected values, ranges from 0.13 to 0.17 of the data instances in each column. Together with the normal/abnormal ratio, these results strongly suggest that the ADG Engine accomplishes its primary task.
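As an illustration of ratio-controlled anomaly generation, the following minimal sketch creates one numeric column from scratch and checks the realized abnormal fraction against the requested one. It is not the ADG Engine's implementation: the function names, the value ranges for normal and abnormal entries, and the fixed seed are all assumptions made for this example. A time-based variant could follow the same pattern by drawing timestamps outside an assumed normal activity window.

```python
import random


def generate_column(n, abnormal_ratio, seed=0):
    """Generate n numeric values with a requested fraction of anomalies.

    Hypothetical helper: the value ranges below are illustrative only.
    Returns (values, labels), where labels[i] is True for abnormal rows.
    """
    rng = random.Random(seed)
    n_abnormal = round(n * abnormal_ratio)
    # Pick which row positions will hold abnormal values.
    abnormal_idx = set(rng.sample(range(n), n_abnormal))

    values, labels = [], []
    for i in range(n):
        if i in abnormal_idx:
            # Assumed numeric anomaly: far outside the normal range.
            values.append(rng.uniform(1000.0, 5000.0))
            labels.append(True)
        else:
            # Assumed normal behavior: a bounded, plausible range.
            values.append(rng.uniform(10.0, 100.0))
            labels.append(False)
    return values, labels


def abnormal_fraction(labels):
    """Fraction of rows flagged as abnormal."""
    return sum(labels) / len(labels)


values, labels = generate_column(1000, 0.15)
frac = abnormal_fraction(labels)
print(f"requested 0.15, realized {frac:.3f}")
```

Comparing the realized fraction with the requested ratio in this way mirrors, in simplified form, the per-column check of abnormal percentages against expected values described above.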
Keywords