Fault-Tolerance in Coarse Grain Data Flow
Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. It is certain that host failures and other faults will be an everyday occurrence in these systems. Unfortunately, contemporary parallel processing systems were not designed with fault-tolerance as a design objective. The data-flow model, long a mainstay of parallel processing, offers hope. The model's functional nature, which makes it so amenable to parallel processing, also facilitates straightforward fault-tolerant implementations. It is the combination of ease of parallelization and fault-tolerance that we feel will increase the importance of the model in the future, and lead to the widespread use of functional components. Using Mentat, an object-oriented, data-flow based, parallel processing system, we demonstrate two orthogonal methods of providing application fault-tolerance. The first method provides transparent replication of actors and requires modification to the existing Mentat run-time system. Providing direct support for replicating actors enables the programmer to easily build fault-tolerant applications regardless of the complexity of their data-flow graph representation. The second method - the checkboard method - is applicable to applications that contain independent and restartable computations, such as "bag of tasks" programs, Monte Carlo simulations, and pipelines, and involves some simple restructuring of code. While these methods are presented separately, they could in fact be combined. For both methods, we present experimental data to illustrate the trade-offs between fault-tolerance, performance, and resource consumption.
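The restart-based approach rests on one property: each task is independent and side-effect free, so a task lost to a host failure can simply be resubmitted. The sketch below is a generic illustration of that idea (not code from the report, and not the Mentat API); `flaky_square` is a hypothetical task that simulates one host failure.

```python
def run_with_restart(task, execute, max_attempts=3):
    """Run one independent task, resubmitting it if the (simulated) host fails."""
    for _ in range(max_attempts):
        try:
            return execute(task)
        except RuntimeError:
            # The task is restartable (no side effects), so retrying is safe.
            continue
    raise RuntimeError(f"task {task!r} failed {max_attempts} times")

def flaky_square(x, _state={'failed': False}):
    # Hypothetical task: the very first invocation dies, restarts succeed.
    if not _state['failed']:
        _state['failed'] = True
        raise RuntimeError("host failure")
    return x * x

# A tiny "bag of tasks": every task completes despite the injected failure.
results = [run_with_restart(t, flaky_square) for t in range(5)]
print(results)
```

In a real wide-area setting the retry would resubmit the task to a different host rather than re-invoke a local function, but the control structure - detect failure, resubmit, bound the attempts - is the same.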
All rights reserved (no additional license for public reuse)
Nguyen-Tuong, Anh, Andrew Grimshaw, and John Karpovich. "Fault-Tolerance in Coarse Grain Data Flow." University of Virginia Dept. of Computer Science Tech Report (1995).
University of Virginia, Department of Computer Science