IEEE Access (Jan 2021)
Diaspore: Diagnosing Performance Interference in Apache Spark
Abstract
Apache Spark is increasingly used to execute big data applications on cluster computing platforms. To increase system utilization, cluster operators often configure their clusters so that multiple co-located applications can simultaneously share the resources of a cluster node. With resource sharing, applications can compete for shared node resources, thereby interfering with each other’s performance. Many Spark applications take a long time to execute. Performance interference from other applications can thus cause a Spark application to fail or take even longer to execute, wasting cluster resources and frustrating users. This motivates the need for an automated technique that can detect interference quickly and diagnose its root cause to facilitate mitigation. Most existing approaches are not designed to offer quick interference detection and diagnosis. For example, they typically require extensive training data for every application of interest under various possible input data sizes and resource allocations. In this paper, we systematically investigate the design of a Machine Learning (ML) based technique that addresses this open problem. We implement a tool called Diaspore that integrates our findings. We evaluate the tool with a diverse set of 13 Spark applications executing on a real cluster. Experimental results show that Diaspore requires only small-scale training data, i.e., executions under small input sizes and resource allocations. Furthermore, our results show that the tool can offer accurate predictions for applications not present in the training data. Consequently, Diaspore reduces the training time needed to offer predictions. Finally, the feature engineering underlying Diaspore ensures that the tool can detect and diagnose interference quickly in an online manner by sampling only a small fraction of a long-running application’s execution. This can allow cluster operators to mitigate interference in an agile manner.
Keywords