Leveraging Memory Level Parallelism Using Dynamic Warp Subdivision
SIMD organizations have been shown to deliver high throughput for data-parallel applications. They operate multiple datapaths under the same instruction sequencer; the set of lanes operating in lockstep is sometimes referred to as a warp, and a single lane as a thread. However, the ability of SIMD to gather from disparate addresses rather than aligned vectors means that a single long-latency memory access suspends the entire warp until it completes. This under-utilizes the computation resources and sacrifices memory-level parallelism, because threads that hit in the cache are not able to proceed and issue more memory requests. Eventually, the pipeline may stall and performance is penalized. We therefore propose warp-subdivision techniques that dynamically construct run-ahead "warp-splits" from threads that hit the cache, so that they can run ahead and prefetch cache lines that may be used by threads that fall behind. Several optimization strategies are investigated, and we evaluate the techniques over two types of memory systems: a bulk-synchronous cache organization and a coherent cache hierarchy. The former has private caches communicating with main memory, with coherence handled by global barriers; the latter has private caches coherently sharing an inclusive, on-chip last-level cache (LLC). Experiments with eight data-parallel benchmarks show that our technique improves performance on average by 15% on the bulk-synchronous cache organization, with a maximum speedup of 1.6X, and by 17% on the coherent cache hierarchy, with a maximum speedup of 1.9X. This is achieved with an area overhead of less than 2%.
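The core idea of the abstract can be illustrated with a minimal, hypothetical sketch (names and structure are illustrative, not taken from the paper): at a divergent load, the warp is partitioned so that threads whose addresses hit the cache continue as a run-ahead warp-split, while the missing threads alone are suspended, instead of the whole warp stalling.

```python
# Minimal sketch of dynamic warp subdivision at a load instruction.
# All names (Thread, split_on_load) are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int   # lane index within the warp
    addr: int  # address this lane is loading from

def split_on_load(warp, cache):
    """Partition a warp at a load: lanes whose address hits the cache
    form a run-ahead warp-split; the remaining lanes are suspended
    until their memory requests return."""
    hits = [t for t in warp if t.addr in cache]
    misses = [t for t in warp if t.addr not in cache]
    return hits, misses

# A warp of 4 threads; only addresses 0x10 and 0x20 are cached.
cache = {0x10, 0x20}
warp = [Thread(0, 0x10), Thread(1, 0x20), Thread(2, 0x30), Thread(3, 0x40)]
run_ahead, suspended = split_on_load(warp, cache)
# Without subdivision, lanes 0 and 1 would wait for lanes 2 and 3;
# with subdivision, lanes 0 and 1 proceed and may prefetch lines that
# the fallen-behind lanes will need.
```

In a baseline SIMD pipeline, any miss suspends the entire warp; the split above is what allows the hitting lanes to keep issuing memory requests and expose additional memory-level parallelism.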
All rights reserved (no additional license for public reuse)
Meng, Jiayuan, David Tarjan, and Kevin Skadron. "Leveraging Memory Level Parallelism Using Dynamic Warp Subdivision." University of Virginia Dept. of Computer Science Tech Report (2009).
University of Virginia, Department of Computer Science