Error Recovery in Critical Infrastructure Systems

Authors: Knight, John, Department of Computer Science, University of Virginia; Elder, Matthew, Department of Computer Science, University of Virginia; Du, Xing, Department of Computer Science, University of Virginia

Critical infrastructure applications provide services upon which society depends heavily; such applications require survivability in the face of faults that might cause a loss of service. These applications are themselves dependent on distributed information systems for all aspects of their operation, and so the survivability of those information systems is an important issue. Fault tolerance is a key mechanism by which survivability can be achieved in these information systems. Much of the literature on fault-tolerant distributed systems focuses on local error recovery by masking the effects of faults. We describe a direction for error recovery in the face of catastrophic faults, where the effects of the faults cannot be masked using available resources. The goal is to provide continued service, either alternate or degraded, by reconfiguring the system rather than masking faults. We outline the requirements for a reconfigurable system architecture and present an error recovery system that enables systematic structuring of error recovery specifications and implementations.

Knight, John, Matthew Elder, and Xing Du. "Error Recovery in Critical Infrastructure Systems." University of Virginia Dept. of Computer Science Tech Report (1999).

University of Virginia, Department of Computer Science
