Parallel Scan for Stream ArchitecturesReport
One of the most useful algorithmic primitives for parallel processing is scan (also known as prefix scan, prefix sum, prefix reduction, etc.). Because the computational granularity of concurrent scan tasks is so small, the memory bandwidth between physical processing elements and globally-visible storage banks is the limiting hardware resource. Current implementations of parallel scan for GPGPU stream architectures do not maximize memory bandwidth: they either make inefficient use of device memory accesses, are computationally-bound due to high dynamic instruction counts, or both. In this work we present three implementations of parallel scan that address these bandwidth inefficiencies. These new implementations are memory-bound, utilize 100% of achievable memory bandwidth, and only require the use of a constant amount of global device memory for the storage of intermediate results. On our target platform, all three provide a 1.7x performance speedup over the scan primitives provided by the CUDA Parallel Primitives (CUDPP) library, exhibiting up to a 64% reduction in dynamic instruction count. Our particular scan implementations are valuable in their own right, but more importantly we have developed a generalized design methodology that should allow us to construct bandwidth-optimal implementations for any stream device having similar machine and programming models.
All rights reserved (no additional license for public reuse)
Merrill, Duane, and Andrew Grimshaw. "Parallel Scan for Stream Architectures." University of Virginia Dept. of Computer Science Tech Report (2009).
University of Virginia, Department of Computer Science