Radix Sort for Stream Architectures

Report
Authors:Merrill, Duane, Department of Computer ScienceUniversity of Virginia Grimshaw, Andrew, Department of Computer ScienceUniversity of Virginia
Abstract:

This report presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Our radix sorting methods demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). Compared to the state-of-the-art, our implementations exhibit speedup of at least 2x for all fully-programmable generations of NVIDIA GPUs, and up to 3.7x for current GT200-based models. For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture.
We obtain our sorting performance by using a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi- scan). These abstractions allow us to improve the overall utilization of memory and computational resources while maintaining the flexibility of a reusable component. We require 38% fewer bytes to be moved through the global memory subsystem and a 64% reduction in the number of thread-cycles needed for computation.
As part of this work, we demonstrate a method for encoding multiple compaction problems into a single, composite parallel scan. This technique provides our local sorting strategies with a 2.5x speedup over bitonic sorting networks for small problem instances, i.e., sequences that can be entirely sorted within the shared memory local to a single GPU core.

Rights:
All rights reserved (no additional license for public reuse)
Language:
English
Source Citation:

Merrill, Duane, and Andrew Grimshaw. "Radix Sort for Stream Architectures." University of Virginia Dept. of Computer Science Tech Report (2010).

Publisher:
University of Virginia, Department of Computer Science
Published Date:
2010