Hierarchical Domain Partitioning for Hierarchical Architecture
The history of parallel computing shows that good performance is heavily dependent on data locality. Prior knowledge of data access patterns allows for optimizations that reduce data movement, achieving lower data access latencies. Compilers and runtime systems, however, have difficulty speculating about locality among threads. Future multicore architectures are likely to present a hierarchical model of parallelism, with multiple threads on a core and multiple cores on a chip. With such a system, data affinity and localization become even more important for using per-core resources efficiently. We show how an application programming interface (API) with the right abstractions can conveniently indicate data locality and that a system can use this information to place threads in a way that minimizes cache miss rates and interconnect traffic. This information is often well understood and easily expressed by the programmer but is typically lost to the system, forcing runtime environments to rediscover it on the fly, a far more costly approach. Our system is particularly well suited for the trend in manycore architectures towards large numbers of simple cores connected by a decentralized interconnect fabric. We study a set of data-parallel benchmarks and show that our technique yields up to a 25% performance gain with a 17% reduction in energy.
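As a rough illustration of the idea in the abstract, the sketch below partitions a one-dimensional data domain in two levels: first across cores, then across the threads of each core, so that threads sharing a core work on adjacent data. It is not the report's API; the names `split_range` and `hierarchical_partition`, and the choice of a 1D domain with contiguous blocks, are assumptions made only for this example.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # half-open index range [lo, hi)


def split_range(start: int, stop: int, parts: int) -> List[Range]:
    """Split [start, stop) into `parts` contiguous, near-equal ranges."""
    size, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` parts absorb one leftover element each.
        hi = lo + size + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def hierarchical_partition(
    n: int, cores: int, threads_per_core: int
) -> Dict[Tuple[int, int], Range]:
    """Map (core, thread) to a data range (hypothetical helper).

    Level 1 assigns one contiguous block per core; level 2 subdivides
    that block among the core's threads, so neighboring threads on a
    core touch neighboring data.
    """
    placement = {}
    for core, (c_lo, c_hi) in enumerate(split_range(0, n, cores)):
        for thread, t_range in enumerate(
            split_range(c_lo, c_hi, threads_per_core)
        ):
            placement[(core, thread)] = t_range
    return placement


placement = hierarchical_partition(n=16, cores=2, threads_per_core=2)
for (core, thread), (lo, hi) in sorted(placement.items()):
    print(f"core {core}, thread {thread}: elements [{lo}, {hi})")
```

In this sketch the placement is computed up front from information the programmer already has (domain size and the machine's core and thread counts), rather than being rediscovered by the runtime while the program executes.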
All rights reserved (no additional license for public reuse)
Meng, J, S Che, J Huang, J Li, J Sheaffer, and K Skadron. "Hierarchical Domain Partitioning for Hierarchical Architecture." University of Virginia Dept. of Computer Science Tech Report (2008).
University of Virginia, Department of Computer Science