IEEE Access (Jan 2019)
Understanding and Statically Detecting Synchronization Performance Bugs in Distributed Cloud Systems
Abstract
In today's information society, the Internet of Things (IoT) plays an increasingly important role in daily life. With so many IoT devices deployed, Cyber-Physical Systems (CPS) call for powerful distributed infrastructures that supply big data computing, intelligence, and storage services. As these distributed software infrastructures grow more complex, new and intricate bugs continue to appear, causing large economic losses. Synchronization performance problems, in which improper synchronization degrades performance and can even lead to service exceptions, heavily affect the entire distributed cluster and threaten the reliability of the system. Like other performance problems, synchronization performance problems are acknowledged to be difficult to diagnose and fix. To understand these bugs better, we collect 26 performance issues from three real-world distributed systems (HDFS, Hadoop MapReduce, and HBase) and analyze their root causes, fix strategies, and algorithmic complexity. We then implement a static detection tool consisting of a critical section identifier, a loop identifier, an inner loop identifier, an expensive loop identifier, and a pruning component. We evaluate the tool on the three distributed systems using the sampled bugs. In the evaluation, the tool finds all of the target bugs. It also reports more previously unknown potential performance problems than prior work. With its limited performance overhead, the tool is also efficient.
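To make the targeted pattern concrete, the following minimal sketch shows an expensive loop running inside a critical section, together with one possible restructuring. It is an illustration only: the class `BlockRegistry`, its methods, and the `process` callback are hypothetical names and are not taken from HDFS, Hadoop MapReduce, or HBase, and the snapshot-based fix is an assumed example rather than a fix strategy reported by the study.

```python
import threading


class BlockRegistry:
    """Hypothetical registry; all names here are illustrative assumptions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = {}

    def add(self, block_id, size):
        with self._lock:
            self._blocks[block_id] = size

    def report_slow(self, process):
        # Problematic pattern: the loop and the per-item work both run
        # while the lock is held, so any thread calling add() must wait
        # for the whole loop to finish.
        with self._lock:
            return [process(bid, size) for bid, size in self._blocks.items()]

    def report_fast(self, process):
        # One possible restructuring: copy a snapshot under the lock,
        # then do the per-item work after the lock is released.
        with self._lock:
            snapshot = list(self._blocks.items())
        return [process(bid, size) for bid, size in snapshot]
```

Both methods return the same result; they differ only in how long the lock is held. The cost of the critical section in `report_slow` grows with the number of entries and with the cost of `process`, which is the kind of loop inside a critical section that the tool's identifiers are described as looking for.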
Keywords