Radix Sort for Stream ArchitecturesReport
This report presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Our radix sorting methods demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). Compared to the state-of-the-art, our implementations exhibit speedup of at least 2x for all fully-programmable generations of NVIDIA GPUs, and up to 3.7x for current GT200-based models. For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture.
We obtain our sorting performance by using a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi- scan). These abstractions allow us to improve the overall utilization of memory and computational resources while maintaining the flexibility of a reusable component. We require 38% fewer bytes to be moved through the global memory subsystem and a 64% reduction in the number of thread-cycles needed for computation.
As part of this work, we demonstrate a method for encoding multiple compaction problems into a single, composite parallel scan. This technique provides our local sorting strategies with a 2.5x speedup over bitonic sorting networks for small problem instances, i.e., sequences that can be entirely sorted within the shared memory local to a single GPU core.
All rights reserved (no additional license for public reuse)
Merrill, Duane, and Andrew Grimshaw. "Radix Sort for Stream Architectures." University of Virginia Dept. of Computer Science Tech Report (2010).
University of Virginia, Department of Computer Science