Definition of fault tolerance
Fault tolerance is the ability of a system to continue performing its intended function in spite of faults. In a broad sense, fault tolerance is associated with reliability, with successful operation, and with the absence of breakdowns. A fault-tolerant system should be able to handle faults in individual hardware or software components, power failures or other kinds of unexpected disasters and still meet its specification.
Why do we need fault-tolerance?
• It is practically impossible to build a perfect system
  – suppose a component has a reliability of 99.99%
  – a system consisting of 100 non-redundant components will have a reliability of about 99.00%
  – a system consisting of 10,000 components will have a reliability of about 36.79%
• It is hard to foresee all the factors
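The figures above can be reproduced with a short calculation. The following is a minimal sketch, assuming that component failures are independent and that the system works only if every component works (a series arrangement); the function name `series_reliability` is illustrative and does not come from the text.

```python
def series_reliability(component_reliability: float, n_components: int) -> float:
    """Reliability of n independent, non-redundant components.

    Assumption: the system fails as soon as any single component fails,
    so the individual reliabilities multiply.
    """
    return component_reliability ** n_components


for n in (1, 100, 10_000):
    r = series_reliability(0.9999, n)
    print(f"{n:>6} components: {r:.2%}")
```

Under these assumptions the system reliability falls off exponentially with the number of components, which is why even very reliable parts do not add up to a reliable system without redundancy.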
A system is said to fail if it ceases to perform its intended function. System is used in this book in the generic sense of a group of independent but interrelated elements comprising a unified whole. Therefore, the techniques presented are also applicable to a variety of products, devices and subsystems. A failure can be a total cessation of function, or the performance of some function at a subnormal quality or quantity, such as deterioration or instability of operation. The aim of fault-tolerant design is to minimize the probability of failures, whether those failures simply annoy the customers or result in lost fortunes, human injury or environmental disaster.
Fault tolerance and redundancy
• Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment
  – replicated hardware component
  – parity check bit attached to digital data
  – a line of program verifying the correctness of the result
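The parity check bit is the easiest of these examples to show in code. The sketch below assumes even parity over a list of 0/1 values; the helper names `add_even_parity` and `parity_ok` are hypothetical and chosen only for illustration.

```python
def add_even_parity(data_bits: list[int]) -> list[int]:
    """Append one redundant bit so that the total number of 1s is even."""
    parity = sum(data_bits) % 2
    return data_bits + [parity]


def parity_ok(word: list[int]) -> bool:
    """Return True when the word still contains an even number of 1s."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])  # three 1s, so the parity bit is 1
assert parity_ok(word)

corrupted = word.copy()
corrupted[2] ^= 1                     # flip one bit to simulate a fault
assert not parity_ok(corrupted)
```

The extra bit carries no information in a fault-free environment, which is exactly what makes it redundant. A single parity bit only detects an odd number of flipped bits; it cannot locate or correct the error.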
Applications of fault-tolerance
• safety-critical applications
  – critical to human safety
    • aircraft flight control
  – environmental disaster must be avoided
    • chemical plants, nuclear plants
  – requirements
    • 99.99999% probability to be operational at the end of a 3-hour period
• mission-critical applications
  – it is important to complete the mission
  – repair is impossible or prohibitively expensive
    • Pioneer 10 was launched 2 March 1972, passed Pluto 13 June 1983
  – requirements
    • 95% probability to be operational at the end of mission (e.g. 10 years)
    • may be degraded or reconfigured before then (operator interaction possible)
• business-critical applications
  – users want to have a high probability of receiving service when it is requested
  – transaction processing (banking, stock exchange or other time-shared systems)
    • ATM: < 10 hours/year unavailable
    • airline reservation: < 1 min/day unavailable
• maintenance postponement applications
  – avoid unscheduled maintenance
  – should continue to function until next planned repair (economic benefits)
  – examples:
    • remotely controlled systems
    • telephone switching systems (in remote areas)
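To compare these requirements, it can help to convert them into availabilities and failure rates. The following is a rough sketch: the availability figures follow directly from the stated downtime limits, while the failure rates additionally assume a constant failure rate, i.e. R(t) = exp(-rate * t), which is a modelling assumption not made in the text. The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years
MINUTES_PER_DAY = 24 * 60   # 1440


def availability(downtime: float, period: float) -> float:
    """Fraction of the period during which the service is up."""
    return 1.0 - downtime / period


def max_failure_rate(reliability: float, duration_hours: float) -> float:
    """Largest constant failure rate (per hour) that still meets the target.

    Assumption: exponential model, R(t) = exp(-rate * t).
    """
    return -math.log(reliability) / duration_hours


atm = availability(10, HOURS_PER_YEAR)                  # about 0.99886
airline = availability(1, MINUTES_PER_DAY)              # about 0.99931
flight = max_failure_rate(0.9999999, 3)                 # about 3.3e-8 per hour
mission = max_failure_rate(0.95, 10 * HOURS_PER_YEAR)   # about 5.9e-7 per hour

print(f"ATM availability:       {atm:.5f}")
print(f"Airline availability:   {airline:.5f}")
print(f"Flight control rate:    {flight:.2e} failures/hour")
print(f"10-year mission rate:   {mission:.2e} failures/hour")
```

Under the constant-rate assumption, the 3-hour safety-critical requirement is the more demanding of the two reliability targets by more than an order of magnitude, even though the mission lasts far longer.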
The main goal of fault tolerance is to increase the dependability of a system.
Dependability is the ability of a system to deliver its intended level of service to its users.