IEEE Access (Jan 2019)
Tripartite Active Learning for Interactive Anomaly Discovery
Abstract
Most existing approaches to anomaly detection focus on statistical features of the data. In many cases, however, users are interested only in a subset of the statistical outliers, depending on the specific domain of interest, e.g., network attacks or financial fraud. Guidance from human experts is therefore indispensable for building predictive models in such applications. Obtaining labels from human experts is time-consuming and expensive, whereas obtaining labels from nonexpert labelers is relatively easy and cost-effective. The labeling accuracy of a nonexpert, however, is usually difficult to assess. It therefore remains an open problem how to leverage both machine intelligence and the knowledge of labelers with diverse backgrounds to construct a machine learning model for domain-specific anomaly detection. To this end, this paper proposes a tripartite active learning framework for interactive anomaly discovery in large datasets based on crowdsourced labels. The method consists of two stages. In the first stage, an unsupervised learning algorithm is employed to extract statistical outliers from the dataset. This algorithm has low computational complexity and memory requirements and is thus well suited to large datasets. In the second stage, we develop an iterative algorithm consisting of two steps. The algorithm first evaluates and trains labelers based on gold instances provided by the expert labelers. It then assigns each of the most informative samples to its most confident labeler for relabeling and updates the detector based on the new labels. Capacity constraints are taken into account in the active learning approach to guarantee a fair allocation of labeling instances as well as robustness against erroneous labels. Experiments show that the proposed algorithm provides an effective means of interactive anomaly detection.
To the best of our knowledge, this is the first work to consider the design of a tripartite machine learning system for domain-specific anomaly detection.
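The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the greedy rule, the function name assign_samples, and the inputs informativeness, confidence, capacity, and budget are all hypothetical, and capacity is assumed here to be a per-labeler limit per round. Labeler confidence is assumed to have been estimated beforehand, for example from accuracy on gold instances.

```python
from typing import Dict, Sequence


def assign_samples(
    informativeness: Sequence[float],
    confidence: Sequence[Sequence[float]],
    capacity: Sequence[int],
    budget: int,
) -> Dict[int, int]:
    """Greedily map the `budget` most informative samples to labelers.

    confidence[j][i] is labeler j's estimated accuracy on sample i
    (hypothetical input, e.g. derived from gold instances).
    capacity[j] caps how many samples labeler j may receive (assumption).
    Returns {sample_index: labeler_index}.
    """
    remaining = list(capacity)
    # Rank samples from most to least informative.
    ranked = sorted(range(len(informativeness)), key=lambda i: -informativeness[i])
    assignment: Dict[int, int] = {}
    for i in ranked[:budget]:
        # Only labelers with spare capacity are eligible.
        available = [j for j, room in enumerate(remaining) if room > 0]
        if not available:
            break
        # Pick the eligible labeler that is most confident on this sample.
        best = max(available, key=lambda j: confidence[j][i])
        assignment[i] = best
        remaining[best] -= 1
    return assignment


# Made-up numbers for illustration only.
scores = [0.9, 0.2, 0.8, 0.7]
conf = [
    [0.95, 0.60, 0.90, 0.85],  # labeler 0
    [0.70, 0.80, 0.75, 0.80],  # labeler 1
]
plan = assign_samples(scores, conf, capacity=[2, 2], budget=3)
```

With these made-up numbers, samples 0 and 2 go to labeler 0, and sample 3 falls to labeler 1 once labeler 0 has reached its capacity. This shows how a capacity limit spreads work across labelers instead of routing everything to the single most confident one.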
Keywords