IEEE Access (Jan 2019)
Accelerating Wait-Free Algorithms: Pragmatic Solutions on Cache-Coherent Multicore Architectures
Abstract
Parallelizing performance-critical applications is essential for harnessing the power of multicore processors, which are now ubiquitous. Even though wait-free algorithms offer the appeal of completing each operation of a parallelized application in a finite number of steps, high-performance wait-free algorithms at high levels of concurrency remain rare. In this paper, we demonstrate one primary reason for this inefficiency: existing wait-free algorithms are not optimized for processors' caches and write buffers, two key hardware components that accelerate memory accesses. As an example, we study a wait-free multi-producer-single-consumer queue algorithm that faces performance problems common to wait-free algorithms. We accelerate the queue algorithm by (1) allowing producers to buffer enqueue requests locally and to write them into the shared queue in batches, thereby exploiting the program's spatial locality, and (2) eliminating expensive atomic operations, giving up some degree of internal consistency so that write buffers are not drained frequently. The outcome is a write-buffer- and cache-friendly queue algorithm that is wait-free and efficient on off-the-shelf multicore processors. Our experiments show that the optimized queue algorithm outperforms prior queue algorithms on three different architectures (x86, POWER8, and ARMv8). On x86, it outperforms WFQueue, the state-of-the-art solution, by 2-3x, and CCQueue, a representative combining-based solution, by 4-12x. The techniques presented in this paper apply straightforwardly to other wait-free algorithms; our queue example demonstrates that they can be adopted while preserving the control flow of the original algorithms without dramatic changes.
Keywords