Survey paper for CMSC 691j, Fall 2001
Architecture of Safety-Critical Real-Time Applications
Gaurav Gupta, Graduate Student, Dept. of Computer Science and Electrical Engineering, University of Maryland Baltimore County
Index Terms: Fault-tolerance, Safety-critical, Architecture, Failure rates, Topology.
This paper is a survey of different architectures for safety-critical real-time applications. These systems have an acceptable probability of failure in the range of 10^-4 to 10^-10 failures per hour, depending on the type of application. Most of the architectures are explained with reference to aircraft control applications, but a few other safety-critical and mission-critical applications are also discussed. The introduction outlines the architectural principles of real-time applications and some basic concepts of fault-tolerant computing. The following sections explain traditional architectures like FTMP, SIFT and FTP that formed the building blocks for the more recent architectures MARS, MAFT, GUARDS and DACAPO. The paper highlights the differences in approach, design, performance and application among these architectures.
The application of computers stretches from daily accounting to nuclear power plants. But these computer systems are subject to failures, and the consequences can be disastrous depending on the criticality of the application. Handling these failures became more important when designers began incorporating computers into safety-critical real-time systems such as spacecraft control and navigation. A few early approaches used fault avoidance to achieve reliability, but such components were very costly to build. With the growth of microprocessors, fault tolerance became possible alongside fault avoidance, reducing the overall cost. In the last 30 years the safety-critical real-time applications of computers have expanded to include aircraft, ground transportation, ships, nuclear power plants and medical equipment. The dependable architectures designed in the early 70s for safety-critical applications included dual-dual and triplex systems. These systems mainly used redundancy to allow the system to continue operating in spite of the presence of one or more faults.
The basic principle of fault tolerance is to develop and certify a computing system that can perform in a satisfactory fashion in the presence of faults. The OAO and Apollo guidance systems were the first two fault-tolerant systems. The OAO computers used an approach in which each transistor was replaced with four transistors. This approach failed because on an integrated circuit a single fault can damage all four transistors simultaneously. The Apollo guidance system used a different technique in which three identical processors ran the same task. Their results were voted on, so that an error in any single processor could be identified. This technique, called Triple Modular Redundancy (TMR), formed the basis of many future architectures. ESS (Electronic Switching System) used two systems and compared their outputs; if the two results were not the same, the system was diagnosed to detect the faulty node. For space missions, where the system remains unmaintained for a long time, the JPL-STAR computer was used. The computer was divided into functional units such as memory, arithmetic processor, I/O processor and test-and-repair processor. Each unit was designed to detect its own faults and send a signal about them. The test-and-repair unit replaces a faulty unit with a spare if one is available.
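The TMR technique described above can be sketched as a majority vote over three replicated results (an illustrative sketch; the function and variable names are my own, not from the Apollo design):

```python
def tmr_vote(a, b, c):
    """Majority-vote three redundant results.

    Returns (value, faulty_index): faulty_index is None when all three
    channels agree, or the index (0-2) of the channel that disagrees
    with the other two.  A single faulty channel is thus both masked
    and identified.
    """
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    raise RuntimeError("no majority: more than one channel faulty")

# One faulty channel is masked and identified:
value, faulty = tmr_vote(42, 42, 41)
assert value == 42 and faulty == 2
```

Note that TMR masks only a single fault per voting step; a second simultaneous fault defeats the majority, which is why the architectures below add spares and reconfiguration.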
The very first uses of real-time systems were seen in the Apollo guidance, navigation and control computers. For reliability considerations they used only a three-input NOR gate that could be easily tested and controlled. Inspired by this, in 1970 military aircraft used this approach to develop fly-by-wire flight controls. This digital fly-by-wire (DFBW) system used bit-wise exact consensus of the outputs for fault detection but did not meet the requirement of Byzantine resilience, explained later. A data processing system (DPS) was designed to meet the fail-operational/fail-safe (FO/FS) requirement through the use of software design diversity. It essentially provided a backup system loaded with independently developed and coded software capable of soft vehicle recovery and continuation, or safe return from any mission.
Commercial companies like Boeing also pioneered the use of fault-tolerant computers for real-time flight control applications. The acceptable failure rate of these systems must be less than 10^-9/hr, which is far lower than for military and space missions, but is required only for short phases such as landing and takeoff. In the last 30 years flight control computers have improved tremendously. Today aircraft like the Airbus A-320 use a full DFBW flight control system with no backup. They use a concept of three quad-redundant computers, with each of the quads implemented in dissimilar hardware and programmed with dissimilar software. Two other major applications of safety-critical real-time systems are aircraft engine control and UUV (Unmanned Underwater Vehicle) control.
The requirements for a fault-tolerant computer architecture vary with the application. In this paper I define the reliability requirements for these systems in terms of the maximum acceptable probability of failure. Safety-critical applications are the most demanding in terms of reliability. To give a sense of how demanding these systems are, I compare a few critical systems in order of their failure rates. A commercial fly-by-wire transport like the Airbus A-320 requires a probability of 10^-10 failures per hour of flight; military aircraft, unmanned launch vehicles and autonomous underwater vehicles require 10^-6 to 10^-7 failures per hour. Mission-critical applications require only a 10^-4 to 10^-6 probability of mission failure. Online transaction processors for banks and stock exchanges demand high availability rather than reliability.
Another requirement of such real-time applications is a time response that has to be met, otherwise the system fails. These systems also require the capability of validation, as discussed earlier. Since lifetime testing of these systems is not possible, they are tested extensively in laboratories and in simulation under all expected conditions.
The initial approaches to fault tolerance were very ad hoc. The Byzantine Generals problem gave a new dimension to the design of architectures for such critical real-time applications.
C. Byzantine Resilience
Failed components can show random behavior, such as sending wrong and contradictory information that corrupts the system. These faults are called Byzantine faults. A system for safety-critical applications must be able to detect and handle such random component failures. A traditional failure modes and effects analysis (FMEA) requires exhaustive validation to predict the extent and effect of possible failures; it is a tedious, time-consuming and extremely expensive process. If, instead, a system is designed to tolerate any fault, including one showing intelligent malicious behavior, it will automatically be Byzantine-resilient. In a Byzantine-resilient system the programmer is unaware of the hardware redundancy and its management. This facilitates separate validation of entities like the application program, the operating system and the redundancy management.
Architectures like FTMP, FTP and AIPS include Byzantine resilience. Majority voting with redundancy is used to mask errors, and spares are provided to restore error masking after a failure. Redundancy management is the key to keeping a redundant system working correctly. Fault containment regions (FCRs) are created to restrict error propagation: the redundant elements of a system are partitioned into FCRs so that a fault in one region does not affect the others. The regions are logically as well as physically isolated from each other. An FCR restricts faults from propagating but cannot stop the propagation of erroneous data. Thus, to provide error containment along with fault containment, voting is used as well. Voting is deployed at every step of the system: reading data, computation, and sending data to other FCRs.
Another important aspect of these architectures is the voting mechanism. Two approaches are commonly used: exact consensus and approximate consensus. Exact consensus requires the outputs of all channels to agree bit-by-bit under no-fault conditions; it was used in DFBW and the space shuttle DPS. In the approximate consensus approach the outputs of the redundant channels need only agree within some predefined threshold. This approach has many drawbacks: there is no mathematical way to define the threshold, so an empirically derived heuristic has to be used, and the validation of such an empirically derived value is very tedious. It can also raise false alarms, or hide faults, if the window size is too small or too large. In addition, approximate consensus can be reached only for physical quantities, while most of the traffic has no physical semantics. Thus exact consensus is more favorable than approximate consensus. Under exact consensus, two components will produce identical results if they have:
Identical Initial States: The redundant copies of the hardware must be initialized to the same state.
Identical Inputs: Each hardware copy must be provided with an identical sequence of inputs. Inputs generally include data from sensors and events.
Identical Operation: Each channel must execute the same sequence of operations on the same inputs.
An upper bound on the time skew, t_skew, must be defined so that the completion time of a given instruction sequence on the slowest channel exceeds that of the fastest channel by no more than t_skew. Congruence of processes also signals the presence or absence of faults: if two congruent processes do not produce exactly the same output, there is an error.
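The exact-consensus conditions above can be sketched as a simple check over congruent channels (an illustrative sketch; the function and parameter names are my own):

```python
def exact_consensus(outputs, completion_times, t_skew):
    """Check two exact-consensus conditions among congruent channels.

    outputs: per-channel results for the same job step; they must agree
        bit-for-bit (here: compare equal) in the absence of faults.
    completion_times: per-channel completion times for that job step.
    t_skew: upper bound on the allowed skew between the fastest and the
        slowest channel.
    Returns the agreed value, or raises on a value or timing error.
    """
    if max(completion_times) - min(completion_times) > t_skew:
        raise RuntimeError("timing error: skew bound exceeded")
    if any(o != outputs[0] for o in outputs):
        raise RuntimeError("value error: congruent channels disagree")
    return outputs[0]

# Three channels agree exactly and finish within the skew bound:
assert exact_consensus([0xBEEF] * 3, [10.0, 10.2, 10.1], t_skew=0.5) == 0xBEEF
```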
Synchronization of redundant channels can be done in software or in hardware.
The software approach uses semaphores or locks, but it is inefficient because the application programmer has to take care of it. Most architectures use hardware-implemented synchronization, realized with identical redundant hardware clocks.
In the 1970s NASA sponsored the development of two very advanced machines for commercial aircraft control, called the fault-tolerant multiprocessor (FTMP) and software-implemented fault tolerance (SIFT). Both machines follow the TMR approach: they execute three copies of a program on different components and then vote on the results to mask faults. The next two sections explain the two architectures, their features and their designs.
The fault-tolerant multiprocessor architecture was built for several life-critical aerospace applications. It is also used as a fault-tolerant central computer for civil transport aircraft. FTMP's failure rate is claimed to be less than 10^-9 failures/hr with maintenance within 10 hours of dispatch; generally it works fine for an interval of 200-300 hours without maintenance. The FTMP architecture is built by connecting an arbitrary number of processors through serial buses. Each processor module has its own local memory or cache.
Following the TMR model, three processors are associated in a group, each containing a special circuit to prevent fault propagation. FTMP was a major breakthrough in the development of critical-system architectures: traditional approaches were not able to detect faults, and fault detection is an indispensable part of critical systems. All critical transactions are triplicated to provide error masking. Both FTMP and SIFT (discussed later) follow this approach, along with graceful degradation. In FTMP, voting is implemented in hardware, with a fully synchronous system.
A. Design of FTMP
In an architecture where a single memory is accessed by all processors, resource sharing and conflicts are an obvious problem. FTMP uses a processor cache to serialize access. The performance of the system then depends on the bandwidth of the memory port, which determines how fast requests can be handled by the memory.
FTMP eases this limitation by interpreting high-level statements while they are in the cache, making processing faster. Processors are kept anonymous in this system because they are often reconfigured. The cache memory is kept large enough to enhance the instruction bandwidth with little loss of processor anonymity. Most of the jobs executed by the processors are periodic and are scheduled offline. Whenever a processor finishes its job it gets a new job from the ready queue. I/O management is done by reserving the I/O interface exclusively for a process for a limited time, according to some priority arrangement.
Fig 1. Bus Guardian Connections
In FTMP any three processors and buses are associated to form a triad. A single triad of bus lines is active at a time for each of the memory and interface accesses. Each element can transmit on only one line of the triad. The three lines are connected to a voter in each module that decides the validity of the data, thus forming a TMR element. A control unit called a Bus Guardian is associated with each module and governs the status of that module. A bus guardian monitors the selection of the memory bus triad and manages the self-test configurations. To avoid fault propagation between bus guardians, they are made independent of each other. They are an addressable part of the memory address space and can receive messages from any processor triad via the active memory bus triad. Essentially, the guardians are agents that convey the computer's configuration to all the elements of the computer.
C. Processor and Memory Modules
Fig 2. Physical Diagram of FTMP Architecture
In case of a processor failure its triad attempts to complete the current job step, unless a second failure occurs. Once the job step is over, one of the remaining two processors tries to repair the triad. At this point the bus guardian comes into play, removing the faulty processor and replacing it with a spare. If no spare processor is available, the remaining two processors of the triad are made spares. Reconfiguration of memory modules is more difficult because not every spare can substitute for a faulty module: if the faulty memory module is read-only, only a read-write module can replace it, with all the data copied onto the spare.
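The triad-repair policy just described can be sketched as follows (a simplified sketch of the processor case only; names and data structures are my own, not FTMP's):

```python
def repair_triad(triad, faulty, spares):
    """Reconfigure a processor triad after a fault.

    If a spare is available it replaces the faulty member, restoring
    the TMR triad.  If not, the triad dissolves and its two survivors
    are returned to the spare pool (graceful degradation).
    Returns (new_triad, new_spares).
    """
    survivors = [p for p in triad if p != faulty]
    if spares:
        return survivors + [spares.pop()], spares
    return [], spares + survivors

spares = ["P7", "P8"]
triad, spares = repair_triad(["P1", "P2", "P3"], "P2", spares)
assert triad == ["P1", "P3", "P8"] and spares == ["P7"]
```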
FTMP uses tight synchronization that involves hardware voting. A common time reference, or clock, that supports hardware voting allows instantaneous validation of internal data and configuration control. FTMP implements the timing reference with a fault-tolerant redundant clocking arrangement based on a majority-logic algorithm or, more recently, voltage-controlled crystal oscillators.
E. Malfunction Management
Safety-critical systems require a high level of dependability; thus any fault or malfunction has to be managed efficiently by the architecture. Malfunctions can arise from various physical sources, such as physical penetration, temperature and radiation, and the system should be made robust enough to handle a certain level of these exposures. Circuit failures or deficiencies in the system specification can also cause malfunctions. Malfunctions are treated like any other faults, and reconfiguration is performed to avoid them and prevent their propagation. The bus guardian units and voter units handle fault tolerance in the FTMP architecture. The common clock in FTMP is specially protected to prevent a single point of failure.
SIFT is another ultra-reliable fault-tolerant system architecture designed for aircraft control systems. Its reliability is established by mathematical models and rigorous testing and specification. SIFT claims to fulfill the reliability requirement of 10^-9 failures/hr over a 10-hour flight. SIFT does not care about failure modes, i.e. the manner in which data is corrupted. It uses FMEA at the interfaces between units. SIFT detects errors at the processor, memory and bus level. It does not use error detection and correction codes in memory. In a departure from FTMP, SIFT implements fault tolerance at the software level.
SIFT provides services like error correction, detection, diagnosis and reconfiguration, and prevents fault propagation, all in software. Each processor has a memory associated with it, and together they are called a processing module. Results are stored in the individual processor's memory. The SIFT system executes a set of tasks, each consisting of a sequence of iterations. The same iteration of a task runs simultaneously on three processing modules. A processor that uses the output compares the three values and chooses a value by a two-out-of-three vote. The voting is done at the beginning of each iteration to reduce message traffic in the system. The synchronization in SIFT is loose, i.e. there can be slack between the clocks of two processors. A task called the global executive allocates tasks to the different modules.
All units of the SIFT system are autonomous in order to prevent fault propagation. This is realized by letting a processing module write only to its own memory, while it may read from other processing modules' memories. This prevents a faulty node from writing corrupted data into another unit's memory. Improper control signals are likewise ignored, and timers are used to prevent system halts and infinite waits for a signal from a faulty module. Incorrect results could still be produced if a non-faulty unit read corrupted data from a faulty unit; SIFT prevents this by fault masking, achieved by reading three different values from the memories of different modules over different buses and then voting for the correct data. If a faulty unit is found, the system is reconfigured.
SIFT considered four scheduling algorithms, of which priority scheduling, deadline scheduling and simply periodic scheduling worked, with some constraints on the processor load. In a simply periodic scheduling algorithm, each task's iteration rate is an integral multiple of the next smaller task rate. SIFT uses a slight variation of this technique, assigning a priority to each task based on its iteration rate.
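The simply-periodic condition and the rate-based priority assignment can be sketched briefly (an illustrative sketch; function names are my own):

```python
def is_simply_periodic(rates):
    """True when each distinct task rate is an integral multiple of the
    next smaller rate (the 'simply periodic' condition)."""
    r = sorted(set(rates))
    return all(hi % lo == 0 for lo, hi in zip(r, r[1:]))

def priorities_by_rate(rates):
    """SIFT's variation: higher iteration rate -> higher priority.
    Returns task indices ordered from highest to lowest priority."""
    return sorted(range(len(rates)), key=lambda i: -rates[i])

# 10, 20 and 40 Hz form a simply periodic set; 10 and 25 Hz do not:
assert is_simply_periodic([10, 20, 40]) and not is_simply_periodic([10, 25])
assert priorities_by_rate([10, 40, 20]) == [1, 2, 0]
```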
Processors are periodically synchronized in order to bound the growth of clock skew, called clock drift. SIFT uses a median clock algorithm with three clocks: to synchronize, each processor observes the other processors' clocks and adjusts itself to the median of the three. Proofs show that even if one clock is faulty, the median will be an acceptable value, provided that all the nodes observe the same value for the faulty clock. When that condition does not hold, the problem cannot be solved with three nodes, so SIFT uses an algorithm with four clocks. The algorithm uses a clock vector in which a node, say p, first observes the clock of another node, say q, and then compares this value with the values of q's clock observed by the other nodes. If a majority of the values agree, the value is accepted; otherwise NIL is placed in the qth entry of the vector, called the interactive consistency vector. The median of the vector, adjusted by a fixed drift value, is taken as the synchronized clock time.
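A toy sketch of the four-clock scheme follows (simplified: the drift adjustment is omitted and all names are my own, not SIFT's):

```python
import statistics

NIL = None

def interactive_consistency_clock(reports, own_id):
    """Four-node clock synchronization sketch in the spirit of SIFT.

    reports[p][q] is the clock value of node q as reported by node p.
    For each node q, node own_id accepts its direct reading of q only if
    a majority of all reports of q agree with it; otherwise the qth entry
    of the interactive consistency vector is NIL.  The node then takes
    the median of the accepted entries as its synchronized clock time.
    """
    nodes = sorted(reports)
    icv = {}
    for q in nodes:
        direct = reports[own_id][q]
        agree = sum(1 for p in nodes if reports[p][q] == direct)
        icv[q] = direct if agree > len(nodes) // 2 else NIL
    accepted = [v for v in icv.values() if v is not NIL]
    return statistics.median(accepted), icv

# Node 'd' lies differently to every observer, so its entry becomes NIL
# and it cannot drag the median:
reports = {
    "a": {"a": 100, "b": 102, "c": 101, "d": 250},
    "b": {"a": 100, "b": 102, "c": 101, "d": 7},
    "c": {"a": 100, "b": 102, "c": 101, "d": -3},
    "d": {"a": 100, "b": 102, "c": 101, "d": 999},
}
clock, icv = interactive_consistency_clock(reports, "a")
assert icv["d"] is NIL and clock == 101
```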
D. Operations in SIFT
To access data from location w of memory m via bus b, processor p initiates the READ operation by putting m and w into the register PREQUEST(p, b). Each processor has a separate PREQUEST register for each bus to which it is connected. The BUSREQUEST line is set to request attention, and the processor waits for the bus and memory to become idle. The bus is seized when the scanner finds a processor with its BUSREQUEST line high. The MEMREQUEST line is raised after the value w is transferred from the processor to a register connected to memory m; the bus then waits for the memory access to complete. When the request is detected, the memory is seized; it reads the value at location w, loads it into the MEMDATA register, and raises the MEMREAD line to inform the bus that data is available. After the bus drops the MEMREQUEST line, the data is transferred from MEMDATA to the BUSDATA register and DATAREADY is raised. All this time the processor that initiated the request waits for DATAREADY; once it senses that the line is up, it reads the data from the BUSDATA register and drops the BUSREQUEST line.
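The handshake can be traced step by step (a sketch only: the real mechanism is hardware registers and signal lines, not function calls, and the event strings below are my own labels):

```python
def sift_read(memory, m, w):
    """Trace the SIFT READ handshake for location w of memory m.

    Returns the value read and the ordered list of signal events,
    mirroring the register/line names used in the text.
    """
    trace = []
    trace.append(f"PREQUEST loaded with (m={m}, w={w})")
    trace.append("BUSREQUEST raised")     # request attention, wait for bus
    trace.append("bus seized by scanner")
    trace.append("MEMREQUEST raised")     # address w handed to memory m
    memdata = memory[m][w]                # memory reads w into MEMDATA
    trace.append("MEMREAD raised")        # data available to the bus
    busdata = memdata                     # MEMDATA -> BUSDATA
    trace.append("DATAREADY raised")      # processor may now read
    trace.append("BUSREQUEST dropped")
    return busdata, trace

value, trace = sift_read({0: {5: 0xCAFE}}, m=0, w=5)
assert value == 0xCAFE and trace[-1] == "BUSREQUEST dropped"
```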
E. Software System
The software of SIFT has two parts: Application software and Executive Software.
Application software: It performs the actual computations in the SIFT system. The fact that each iteration is executed on three different modules is not visible to the application software. It asks the executive software for its input and provides the output once execution is over.
Executive Software: This part of the SIFT software does the bulk of the work. It provides the input for each iteration of a task, checks for errors and faults, and reconfigures the system by removing or repairing failed components. The executive software is divided into three parts.
Global Executive: The global executive task runs on several processors. Its job is to allocate tasks, diagnose errors and reconfigure the system.
Local Executive: It runs the assigned task, provides input to the task, receives its output and reports errors to the global executive task. The four routines of the local executive are: error handler, scheduler, buffer interface unit, and voter.
Local-Global Communication task: The local reconfiguration task and the error reporting task communicate with the global task using buffers. Some of the error messages are also used by the local executive to detect possibly faulty nodes before the global executive diagnoses the error. These functions are performed by the local-global communication task.
A highly redundant computing system was used on the space shuttle, with five computers per component. During a critical mission four of them execute identical programs and the fifth is kept as a backup. If any of the computers fails, it is replaced with this backup by copying all the information about the currently running task and restarting it on the backup.
Another distributed approach, called LOCUS, is a network OS that exploits hardware redundancy in a LAN to provide very high availability. The user sees the same address space no matter which machine he is working on. More recently, network file systems like NFS and AFS have gained tremendous attention.
Fig 4. AIPS Configuration
An extension of FTMP called AIPS is also a distributed approach. A group of identical processing sites is connected through switching nodes to a redundant communication structure, which behaves like a triply redundant bus. Each processing site can be an FTMP, SIFT or FTP system. Each site has a local clock, and different sites do not need to synchronize. Voting within a processing site is done in the same manner as in FTMP, but sites with different local clocks require hardware synchronization. Future architectures deal with much more complex distributed systems due to the advent of more powerful processors.
The required failure probability for commercial transport is about 10^-10 failures/hr. We have seen SIFT and FTMP achieving this through modular redundancy and voting on data from replicated tasks. Performance measurements show that the SIFT and FTMP executives consume up to 80% and 60% of system throughput, respectively. The AIPS architecture shows better performance than FTMP but is constrained because it is a single-processor system.
Fig 5. MAFT System Architecture
MAFT is a distributed approach for achieving extremely high reliability and high performance in real-time environments. MAFT was designed to support control frequencies up to 200 Hz, an instruction throughput of 5.5 million instructions/sec, and a failure probability of 10^-10/hr.
A. MAFT Architecture:
A MAFT system is built by connecting several autonomous computers, called nodes, through a broadcast bus network. Each node is partitioned into two separate processors: the Operations Controller (OC) and the Applications Processor (AP). Functions like internode communication and synchronization, data voting, error detection, scheduling and reconfiguration are handled by the OC; the AP only executes the application program. To maintain Byzantine agreement and approximate consensus, interactive consistency and convergent voting are adopted. Unlike in SIFT, FTMP and FTP, non-faulty copies need not be equal in MAFT: they can have different values at the same instant, and the copies do not have to run in exact synchrony.
The OC receives data and performs its functions through several semi-autonomous subsystems. The OC communicates with other OCs and with its own AP. Messages are sent to the OC over dedicated serial links and are checked thoroughly by the message checker. Messages can carry data, synchronization or error-management information. The strict link protocol and error-control checks used to protect internode messages are not enough for fault tolerance in MAFT; it also uses modular redundancy and distributed agreement algorithms to mask the effects of corrupted messages. Communication between the OC and AP is done asynchronously through the task communicator subsystem. When a task is done, the AP asks the OC for another task. While the AP starts executing the new task, a task-completed message is sent to all the OCs. The OC provides voted data to the AP when required.
MAFT has two major synchronization functions: steady-state operation and startup operation. The steady-state algorithm is a loose frame synchronization similar to the one used in SIFT. It involves two steps: first, each node sends a "Presync" system state (SS) message, which is time-stamped by all the nodes; each node then computes its error estimate as the difference between its own Presync and the voted value of all the Presyncs. The second message is the "Sync" SS message. Startup synchronization is done to recover from catastrophic mid-mission events and for the graceful admission of nodes.
D. Data Management and Voting
Each data message is tagged with a data identification descriptor (DID). A reasonableness check is performed by each OC to verify that the data is within predefined limits. The voter subsystem takes only data that passed the reasonableness test. It performs an on-the-fly vote using the new copy and the previously received copies. A dynamic deviance check is then done to verify that all the inputs used in the voting process are within a specified window.
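The three-stage pipeline just described, reasonableness check, vote, deviance check, can be sketched as follows (an illustrative sketch with a median vote; all names and the choice of median are my own assumptions):

```python
import statistics

def maft_vote(samples, limits, window):
    """MAFT-style data voting sketch.

    samples: {node_id: value} for one DID;
    limits:  (lo, hi) reasonableness bounds;
    window:  maximum allowed deviation of any voted-on input.
    Returns (voted_value, suspects) where suspects failed a check.
    """
    lo, hi = limits
    passed = {n: v for n, v in samples.items() if lo <= v <= hi}
    suspects = set(samples) - set(passed)        # reasonableness failures
    voted = statistics.median(passed.values())   # on-the-fly vote
    suspects |= {n for n, v in passed.items()    # dynamic deviance check
                 if abs(v - voted) > window}
    return voted, suspects

voted, suspects = maft_vote({"n1": 9.9, "n2": 10.1, "n3": 55.0, "n4": 10.0},
                            limits=(0.0, 20.0), window=0.5)
assert voted == 10.0 and suspects == {"n3"}
```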
E. Application Task Scheduling
The MAFT scheduler treats all tasks as periodic. It supports tasks with varying frequencies, with the constraint that all frequencies are 2^-j times the highest frequency, for some integer j.
MAFT uses a fault-tolerant variation of the deterministic priority-list scheduling algorithm. To assign a task to an available node, the task list is scanned in order of decreasing priority. An interactive consistency algorithm, which rebroadcasts the data received in the previous round, is used to maintain Byzantine agreement. Synchronization of the rebroadcast messages is achieved by defining a "subatomic" period whose boundaries are marked by the simultaneous transmission of task interactive consistency messages by all the nodes.
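The frequency constraint and the priority-list scan can be sketched briefly (a sketch only; the least-loaded-node rule below is one plausible deterministic variant, not necessarily MAFT's exact rule):

```python
def maft_frequencies_ok(freqs):
    """Check that every task frequency is 2**-j times the highest
    frequency, for some integer j >= 0 (i.e. the ratio to the highest
    frequency is an integer power of two)."""
    f_max = max(freqs)
    def ok(f):
        ratio = f_max / f
        return ratio == int(ratio) and int(ratio) & (int(ratio) - 1) == 0
    return all(ok(f) for f in freqs)

def priority_list_assign(tasks, nodes):
    """Scan the task list in decreasing priority, assigning each task
    to the currently least-loaded node.

    tasks: iterable of (name, priority, cost) tuples.
    """
    load = {n: 0.0 for n in nodes}
    assignment = {}
    for name, priority, cost in sorted(tasks, key=lambda t: -t[1]):
        node = min(load, key=load.get)
        assignment[name] = node
        load[node] += cost
    return assignment

assert maft_frequencies_ok([200, 100, 50, 25])
assert not maft_frequencies_ok([200, 60])
```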
F. Task Reconfiguration
The process of redistributing the application workload to account for changes in the system operating set is called task reconfiguration. It performs graceful degradation and restoration of functions in the system. A global task activation process defines, for a set of relevant nodes, their initial activation status. The activation status is completed when the number of relevant nodes in the operating set reaches a threshold. The task reallocation process reallocates each task to one of its most preferred nodes in the current operating set. Another process, the task-node status-matching process, can prohibit a task from being executed on a particular node.
G. Error Handling
The ERR message has 31 error flags, which can be set by the OC that continuously monitors the behavior of all nodes. A base penalty count (BPC) and an incremental penalty count (IPC) are maintained for each node. At the beginning of every atomic period each node broadcasts its own ERR message carrying its ID, flags, BPC and IPC. Each node votes on the ERR messages and adds the voted IPC to the voted BPC. Byzantine agreement on both the IPC and BPC counts is guaranteed.
Once Byzantine agreement is reached, the BPC is updated and compared to a predefined exclusion threshold. If the value exceeds the threshold, each OC recommends exclusion of that node from the operating set. At the beginning of each master period the BPC of each node is decremented by a specific amount; once the BPC falls below a readmission threshold, the OCs recommend readmission of the node.
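One master-period step of this penalty bookkeeping might look like the following (a sketch under my own assumptions about ordering; parameter names and the clamp at zero are mine):

```python
def update_node_status(bpc, ipc_voted, exclusion_thr, readmission_thr,
                       decrement, in_operating_set):
    """One master-period step of MAFT-style penalty bookkeeping.

    The voted incremental penalty is added to the base penalty count;
    a node whose BPC exceeds the exclusion threshold is voted out, and
    a node whose decayed BPC falls below the readmission threshold is
    voted back in.  Returns (new_bpc, in_operating_set).
    """
    bpc = bpc + ipc_voted
    if in_operating_set and bpc > exclusion_thr:
        in_operating_set = False
    bpc = max(0, bpc - decrement)          # periodic forgiveness
    if not in_operating_set and bpc < readmission_thr:
        in_operating_set = True
    return bpc, in_operating_set

# A burst of errors pushes the node over the exclusion threshold:
bpc, member = update_node_status(90, 20, exclusion_thr=100,
                                 readmission_thr=10, decrement=5,
                                 in_operating_set=True)
assert bpc == 105 and member is False
```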
V. The MARS Approach
MARS, the MAintainable Real-time System, is a distributed fault-tolerant system for process control in industrial applications such as rolling mills and railway control. It was started in 1980 at the Technische Universität Berlin, with its main feature being predictable performance under a specified peak load.
A. Distributed real-time system
Under peak-load conditions all events occur with their maximum possible frequencies. To satisfy the requirements of a critical fault-tolerant architecture, a system should be able to handle all these events without missing any hard deadline. MARS uses a transaction model in which each transaction is a single execution of a specific set of tasks. In a distributed system a transaction can be decomposed into a sequence of subtransactions, consisting of task executions and communication phases between these subtransactions.
The MARS network is divided into clusters, each containing several components interconnected through the synchronous real-time MARS bus. MARS provides a fault-tolerant global time base called system time.
MARS uses state messages for interprocessor communication. State messages follow a single-writer, multiple-reader approach: the writer writes data to its own mailbox, and all the readers read from it without actually consuming the message. The communication protocol used by MARS is an unacknowledged datagram with no-wait send, i.e. the sender does not wait until the receiver receives the message. To achieve predictability and operability in peak-load situations, MARS uses a TDMA approach to avoid unforeseen media-access delays.
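The state-message semantics can be sketched as a tiny mailbox (an illustrative sketch; the class and field names are my own):

```python
class StateMessage:
    """MARS-style state-message mailbox: a single writer updates its own
    mailbox; any number of readers read the latest state without
    consuming it.  New versions simply overwrite old ones, so there is
    no queueing and no acknowledgement."""

    def __init__(self):
        self._value = None
        self._timestamp = None

    def write(self, value, timestamp):      # single writer, no-wait send
        self._value, self._timestamp = value, timestamp

    def read(self):                         # non-consuming read
        return self._value, self._timestamp

box = StateMessage()
box.write({"rpm": 1200}, timestamp=1)
box.write({"rpm": 1250}, timestamp=2)      # overwrites, never queues
v1, _ = box.read()
v2, _ = box.read()                         # repeated reads see the same state
assert v1 == v2 == {"rpm": 1250}
```

This overwrite-on-write design is what makes flow control trivial: a slow reader sees the freshest state rather than a growing backlog of stale messages.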
MARS covers both permanent and transient faults. Fault tolerance is achieved with self-checking components and active redundancy: every message is sent n times, so the loss of n-1 messages is tolerable. The MARS system can be easily maintained and extended due to the clustering of components. An existing component can be expanded into a cluster by converting it into an interface component showing the same behavior.
B. MARS Operating System
The MARS architecture is realized by the MARS operating system; one copy runs on each component in the system. It has a small kernel with a set of tasks to perform. The kernel executes the hard real-time tasks according to the schedule and tries to accommodate soft real-time tasks in idle time slots. The synchronization of local clocks is done in hardware by a clock synchronization unit. Internal synchronization is done by message passing: each message contains the timestamp of the sender; the receiver attaches its own timestamp to the message and records the time difference. The correction term for the local clock is calculated with the Fault-Tolerant Average (FTA) algorithm.
The FTA considers a system of n clocks, of which up to K may be faulty. The measured differences between local clock j and each clock i are sorted by value; the K lowest and K highest values are discarded, and the arithmetic average of the remaining values is the new correction term for local clock j. An upper bound on the internal synchronization is given by:
t / (ε + S) = (n − 2K) / (n − 3K)

where t is the synchronization bound, S the maximum tolerable drift, and ε the clock reading error.
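The FTA computation itself is straightforward; a minimal sketch (function name illustrative) that trims the K extreme readings and averages the rest:

```python
def fta_correction(clock_diffs, k):
    """Fault-Tolerant Average: sort the observed clock differences,
    discard the k lowest and k highest, and average the remainder.
    Requires more than 3k clocks to tolerate k faulty ones."""
    if len(clock_diffs) <= 3 * k:
        raise ValueError("need n > 3k clocks to tolerate k faults")
    trimmed = sorted(clock_diffs)[k:len(clock_diffs) - k]
    return sum(trimmed) / len(trimmed)

# A clock reporting a wild difference (e.g. 1000.0) is trimmed away:
correction = fta_correction([1.0, 2.0, 3.0, 1000.0], k=1)  # averages 2.0, 3.0 -> 2.5
```

The trimming step is what bounds the influence of a Byzantine clock: a faulty reading can shift the average only by displacing a correct value into or out of the trimmed window.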
The real-time database changes with the arrival of each new message. Tasks running on different components receive a consistent view of the data at any particular time. The MARS OS provides a self-checking feature in each component: it performs plausibility tests and time-redundant execution of tasks, and it checks the runtime limits, global time limits, and timing behavior of each task. Flow control follows from the periodicity of messages, with messages being overwritten in case of faults.
DACAPO is a highly dependable architecture designed mainly for vehicle control systems and developed in collaboration with European car manufacturers. Since the application could tolerate high development costs, VLSI solutions were possible, yielding a compact, lightweight system that can be installed in a vehicle. The safety requirements of these systems are comparable to those of fly-by-wire systems.
DACAPO guarantees constant control delays, covering communication delays, computation delays, and delays caused by synchronization between the distributed control and the sampling period. It uses cyclic communication for better timeliness, something that is hard to achieve in traditional event-driven systems. Traditional architectures like Delta-4 and MARS are efficient systems, but they are not designed to satisfy the special demands of a vehicle-borne network and embedded systems. DACAPO nodes are distributed toward the sensor and actuator locations. Each node has two parts, an Embedded Controller (EC) and a Communication Unit (CMU). To make the system FO/FS (fail-operational/fail-silent), both the EC and the CMU have two fail-silent functional units. The communication network has two FO/FO serial buses; FO/FO implies that the buses will continue to function even after two permanent faults.
The system has inherent redundancy, i.e. it will continue to operate with reduced performance if a local control function fails; for example, even if the brake control of one wheel fails, the vehicle can still stop.
A. System operation
Scheduling is done offline and communication between nodes is cyclic. This scheme enables the system to meet its deadlines. The effect of transient errors is also reduced when the schedule is known a priori, because a known schedule facilitates the detection of many errors.
B. Process Scheduling
The compiler generates a block for each node consisting of multiple processes scheduled inside the block. One execution of a block is referred to as a system cycle (SC). The SC is divided into m time slots called communication cycles (CC) for scheduling processes, and each CC is in turn divided into communication frames (CF). During a CF, exactly one node is allowed to send data on the global buses according to a fixed schedule. If the execution time of a process exceeds the time available in one CC, it is split and executed across later CCs.
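The SC/CC/CF hierarchy and the splitting rule can be sketched as follows (function names and time units are illustrative, not from the DACAPO papers):

```python
def build_system_cycle(nodes, m):
    """One system cycle (SC) = m communication cycles (CC); each CC is
    divided into communication frames (CF), one per node in a fixed
    order, so exactly one node sends on the global buses per frame."""
    return [[(cc, cf, node) for cf, node in enumerate(nodes)]
            for cc in range(m)]

def split_process(exec_time, cc_budget):
    """Split a process whose execution time exceeds the time available
    in one CC into chunks executed in successive CCs."""
    chunks = []
    while exec_time > 0:
        chunks.append(min(exec_time, cc_budget))
        exec_time -= chunks[-1]
    return chunks

schedule = build_system_cycle(["brake_FL", "brake_FR", "engine"], m=2)
print(split_process(25, 10))   # a 25-unit process spans three CCs: [10, 10, 5]
```

Because both tables are fixed at compile time, every node can predict exactly when every other node transmits, which is what makes error detection by omission possible.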
As discussed above, a DACAPO node has two logical blocks, the EC and the CMU. The Embedded Controller (EC) is made up of several units: I/O units with two fail-silent (FS) I/O interfaces, and a Computation Unit (CU) with two fault-tolerant state memories, two fail-silent processing elements (PE), and two FO/FS cross-coupling units (CCU). The CMU has two FS global communication interfaces (GCI), and there are two FS power supplies. The two fail-silent halves are connected with an FO/FS cross-coupling.
Fig 6. DACAPO Node Architecture
Each PE has two microprocessors with local memories that operate in full synchronization. Only one PE at a time reads a statement and obtains the data needed to execute the process. Once the process is over, it writes the data into the state memory.
Errors are detected both in the nodes and in the communication between nodes. Due to the periodic nature of the system, a failed process is re-executed at the latest in the next schedule round. If the error repeats, the faulty subsystem is identified and its faults are masked. The architecture also implements fault containment.
The CMU consists of two GCIs running independently on the two buses; during the communication cycles, each GCI broadcasts one message onto its bus. The bus access method used in DACAPO is TDMA, and daisy chaining is used for synchronization. The system enters a degraded-performance mode of execution if the diagnosis finds that one more failure could lead to catastrophic results.
Due to fast-changing technology, most customized architectures become obsolete either during their long design phase or within a few years of deployment. To deal with this problem, a group of European companies developed a Generic Upgradable Architecture for Real-time Dependable Systems, called GUARDS. Most applications today use ultra-reliable architectures, but these do not work as expected in every infrastructure because of the varied requirements and constraints of different application domains. The basic aim of the GUARDS project was to significantly reduce the lifecycle cost of such embedded systems. The project develops instances of a generic architecture that can meet the diverse requirements of these critical applications without much overhead, adopting the principle of satisfying the basic critical components first and then extending the system to reuse various components and to support software components of different criticality.
The key non-functional requirements, such as a failure rate below 10-10/hr, redundancy, error containment, and fault tolerance, are the first design principles of any critical real-time architecture. The architecture aims to tolerate permanent and temporary faults, and also tries to tolerate certain design faults. GUARDS uses a time-triggered, event-triggered, or mixed computational model. Synchronization is done using a preemptive scheduling model. The architecture facilitates the use of commercial off-the-shelf (COTS) hardware and software components. Fault tolerance is transparent to the application and is implemented in software. The architecture defines three dimensions of fault containment: integrity levels, lanes, and channels.
B. The Integrity Dimension
Design faults are caused by improper design or unforeseen requirements. The integrity dimension prevents design faults from propagating from less critical components to highly critical components. This is done by classifying each object within a particular integrity level and prohibiting information flows that violate these levels. When data must be obtained from a lower-integrity source, integrity inheritance is performed, which lowers the integrity of the receiving data. Validation objects are used to apply fault tolerance to the information flow: they take possibly corrupted inputs and output reliable information, thereby raising the integrity of the component.
C. Lane and Channel Dimension
Multiple lanes, or processors, are used to improve fault tolerance through replicated execution; they also add parallelism to the system, which helps meet timing requirements. In channel replication, an application task is replicated over a set of channels. The replicas are provided with the same inputs, through these channels, in the same order. Voting is used to detect errors in the system.
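Error detection over replicated channels reduces to a majority vote on the channel outputs; a minimal sketch (names illustrative):

```python
from collections import Counter

def majority_vote(channel_outputs):
    """Return the majority value among replicated channel outputs,
    plus the set of channels that disagreed (suspected faulty)."""
    value, count = Counter(channel_outputs.values()).most_common(1)[0]
    if count <= len(channel_outputs) // 2:
        raise RuntimeError("no majority: the fault cannot be masked")
    suspects = {ch for ch, out in channel_outputs.items() if out != value}
    return value, suspects

value, suspects = majority_vote({"ch0": 7, "ch1": 7, "ch2": 9})
# value == 7; channel "ch2" is flagged as the dissenter
```

Note that voting only works if the replicas were deterministic and saw identical inputs in identical order, which is exactly why GUARDS invests in input consolidation.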
D. Interchannel Communication Network
Central to the architecture is an inter-channel communication network (ICN) that maintains a global clock for all channels and allows the channels to reach consensus on non-replicated data. An ICN manager is used for each channel and is connected to the others through unidirectional serial links. An ICN manager can thus broadcast data on its serial links and read data from the remote ICN managers over the other links.
GUARDS uses a software-implemented convergence-averaging algorithm for clock synchronization. In a three-channel configuration, Byzantine clocks have to be evaluated carefully. A further issue, exchanging private data between channels and agreeing on a common value in the presence of arbitrary faults, is known as the interactive consistency problem. GUARDS uses the ZA algorithm to solve it, with a keyed-checksum scheme for message authentication.
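The ZA algorithm itself relies on authenticated (checksummed) messages; as a simpler illustration of a two-round interactive consistency exchange, the following sketches the classic oral-messages algorithm OM(1), with one source and three receivers tolerating a single arbitrary fault. This is a textbook stand-in, not the GUARDS implementation, and all names are illustrative.

```python
from collections import Counter

def om1(commander_sends, relays=None):
    """Two-round oral-messages exchange, OM(1).

    commander_sends -- dict lieutenant -> value the commander sent it
    relays -- optional dict (sender, receiver) -> relayed value, so a
              faulty lieutenant can relay a lie; defaults to honest relays
    """
    relays = relays or {}
    lieutenants = list(commander_sends)
    decided = {}
    for lt in lieutenants:
        votes = [commander_sends[lt]]          # round 1: from the commander
        for other in lieutenants:              # round 2: relayed values
            if other != lt:
                votes.append(relays.get((other, lt), commander_sends[other]))
        decided[lt] = Counter(votes).most_common(1)[0][0]
    return decided

# A faulty commander sends inconsistent values, yet the correct
# lieutenants still decide on a single common value:
print(om1({"L1": 0, "L2": 0, "L3": 1}))
```

With four nodes (one commander, three lieutenants) the exchange masks one arbitrary fault, matching the n > 3f bound for unauthenticated messages; authentication, as in ZA, relaxes that bound.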
The ICN uses a table-driven protocol in which the schedule is divided into frames, which are further divided into cycles and slots. In GUARDS, error recovery is achieved by error compensation; in addition, thanks to redundancy, errors can be detected and corrected.
Fig 7. TMR execution of a function by sequential threads
Figure 7 shows a three-channel configuration of GUARDS with three replicated sensors providing input. A two-round interactive consistency exchange is done to consolidate the input across the different channels. The application tasks are executed asynchronously on each channel. State variables are used to store values transferred between iterations; together with the consolidated inputs, they are used to compute the output values.
To diagnose faults in the system, GUARDS collects the error reports generated during the interactive consistency and consolidation exchanges. These reports are then filtered and assessed to estimate how likely each channel is to be the cause. The filtering is done using a distributed version of a software-implemented heuristic called alpha-count. The algorithm diagnoses a channel as faulty if it is accused of being faulty by a majority of the other channels in the error reports, or if it accuses itself of being faulty.
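The intuition behind alpha-count can be sketched as a per-channel score that grows with each accusation and decays when the channel behaves well; the decay factor, threshold, and class name below are illustrative, not the GUARDS parameters:

```python
class AlphaCount:
    """Heuristic alpha-count filter (sketch): transient faults produce
    isolated accusations whose score decays away, while a permanent
    fault accumulates accusations until the threshold is crossed."""

    def __init__(self, decay=0.5, threshold=3.0):
        self.decay = decay
        self.threshold = threshold
        self.score = {}

    def report(self, channel, accused):
        s = self.score.get(channel, 0.0)
        self.score[channel] = s + 1.0 if accused else s * self.decay
        return self.score[channel] >= self.threshold

ac = AlphaCount()
for _ in range(3):
    faulty = ac.report("ch2", accused=True)   # repeated accusations
# faulty becomes True once the score reaches the threshold
```

A single accusation followed by clean rounds decays back toward zero, so a transient upset does not get a healthy channel excluded.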
F. Process Scheduling
GUARDS uses a standard preemptive, priority-based scheduling scheme. If an application task is replicated on multiple processors or channels, care must be taken that all the replicas receive the same input in the same order. The GUARDS approach forces replicated tasks to read the same data: to ensure that each replica reads the same value of an input, GUARDS keeps more than one copy of the data and uses timestamps. The schedule is calculated offline to determine the worst-case response times.
A reader replica simply compares its release time with the data timestamp: it accepts the value if the timestamp is earlier, but rejects the input if the timestamp is later than its release time. The outputs produced by the processors are voted on to find the correct value that can be passed to the actuators. GUARDS uses tools like HRT-HOOD for design and verification.
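The timestamp test, and the reason for keeping more than one copy, can be sketched as follows (function name illustrative): each replica selects the newest copy written no later than its release time, so all replicas of a task read the same value even if a fresher copy arrives in between their reads.

```python
def consistent_read(copies, release_time):
    """copies: list of (timestamp, value) pairs, more than one kept.
    A replica accepts only copies time-stamped no later than its
    release time and takes the newest of those."""
    eligible = [(ts, v) for ts, v in copies if ts <= release_time]
    return max(eligible)[1] if eligible else None

copies = [(90, "old"), (110, "new")]
# Two replicas of the same task, both released at t=100, both read
# "old", even though a fresher value (t=110) already exists:
print(consistent_read(copies, release_time=100))
```

Without the second copy, a replica that happened to run after t=110 would see "new" while its sibling saw "old", and the output vote would wrongly flag a channel as faulty.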
VIII Topology of Architectures
So far we have examined in detail various computer-system architectures for safety-critical real-time applications. The design of these systems must take into account the requirements resulting both from the technical constraints to be met and from the resource elements to be reduced. The technical constraints include functionality, dependability, and timing, while the resource elements include the cost and size of the resulting system. The architecture of a system deals with how its activities are configured and coordinated, and the architectural topology describes how the resources used by the system are organized. There are tradeoffs when one topology is chosen over another: the designer has to consider factors such as services, components, and resources when choosing the topology.
Fig. 8 TMR with Spare Architecture
Let us take the example of TMR, which we have already discussed in detail. The spares used in a TMR architecture can be hot (already running), cold (unpowered), or warm (powered but not running). The figure shows a very simple design of the architecture, in which a disagreement detector determines when a unit becomes faulty. Once the approach is decided, a physical implementation has to be chosen: what resources are needed to implement each component, whether the system will use software, what type of decision mechanism will be used, and so on. There can be many alternative answers to all of these questions. Current industrial practice is to set the topology early in the design process and to build the system to conform to that topology, but this approach does not scale to the complexity and requirements of a safety-critical real-time application. The topology of such a system should instead emerge as the design progresses. A hierarchical model has been proposed to fulfill these requirements; in this model, the set of fixed and variable elements changes with the design.
Fig 9. Hierarchical model of a Topology
In an application such as a control system, the actions are logically divided into a set of tasks: input tasks, processing tasks, and output tasks. As the design progresses, these tasks are decomposed into a number of subtasks. Thus a set of precedence-constrained subtasks is constructed and substituted for each task in such a way that the task still implements the chosen topology. The designer can revisit the topology at any time in the design to check whether the assumptions made in breaking a task into subtasks still hold.
This section discusses a prototype tool, called Topmeter, that helps the designer resolve the architectural topology problem. It employs a string of bits to represent the topology; the values in the bit string determine which topologies are to be investigated. X-Topmeter places three restrictions on the search space:
At least one resource must be employed per task and network.
At most three resources may be employed per task and network.
A maximum of six units per task and network may be employed.
An evaluation function is used to determine the best solution:
P(x) = Kd * Reliability(x) + Kc * Cost(x) + Ks * Size(x)
The reliability is measured in terms of the predicted mean time to first failure (MTFF) of the service. The cost of a hardware resource is calculated from its purchase and ownership costs:
Cost(x) = Cost_hardware + Cost_software
The size of the topology is simply the total predicted number of hardware units employed:
Size(x) = N_sensors + N_processors + N_actuators + N_buses
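The three formulas combine into a single ranking function over candidate topologies; a direct sketch (the weight values and function name are illustrative):

```python
def evaluate_topology(reliability, cost_hw, cost_sw,
                      n_sensors, n_processors, n_actuators, n_buses,
                      kd=1.0, kc=1.0, ks=1.0):
    """P(x) = Kd*Reliability(x) + Kc*Cost(x) + Ks*Size(x), with
    Cost(x) the sum of hardware and software costs and Size(x) the
    total predicted number of hardware units employed. The weights
    Kd, Kc, Ks encode the designer's tradeoff between the terms."""
    cost = cost_hw + cost_sw
    size = n_sensors + n_processors + n_actuators + n_buses
    return kd * reliability + kc * cost + ks * size

score = evaluate_topology(reliability=0.999, cost_hw=10, cost_sw=5,
                          n_sensors=3, n_processors=2,
                          n_actuators=2, n_buses=1)
```

In practice a search over the bit-string encodings would call a function like this on each candidate and keep the best-scoring topologies.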
Tools like Topmeter are a valuable aid to the designer of a topology for a safety-critical control system.
IX Future Work
One of the latest approaches to system architectures for safety-critical systems is the concept of an Integrated Modular Avionics (IMA) system architecture. This type of network has significant advantages over existing architectures, such as reduced deployment cost, reduced cost of spares, and the possibility of sharing resources. In an IMA architecture, groups of modules form a topology for a task, as shown in the figure. A topology can be formed from many groups, and more than one group can use the same module. Each module within a group performs one function and is connected to the other modules by a single data network. In the future, multiple data buses could also be used to increase dependability.
There is also large scope for research into new architectures in distributed environments with better performance and reliability. Most existing architectures are application-specific and do not perform well with other applications; GUARDS provides an innovative approach by using COTS components. These architectures can be enhanced to support easy upgrades and scalability. Much work is ongoing on better fault-tolerance techniques. The problem of common-mode failures remains a big challenge for designers of safety-critical architectures, with vast scope for improvement.
The realm of applicability of safety-critical real-time systems has come a long way in the past 40 years: from the Apollo AG&C, through fly-by-wire and swim-by-wire systems, to nuclear power plants. The concept of fault tolerance has been changed and augmented as technology and requirements have evolved. Architectures like FTMP, SIFT, and MAFT provide a very strong background for newly developed architectures. Both hardware and software techniques have been implemented and tested. Many new architectures are emerging with distributed approaches and generic designs that can be used for a variety of applications; these architectures adapt themselves to the application and can be upgraded using COTS components. Recently, researchers have started paying attention to the topology and bus architectures of these systems. Building such architectures for embedded systems is a very challenging area in itself. This paper has presented a brief view of some of the most important architectures, their advantages, disadvantages, and features; it has also outlined the upcoming architectures and the problems they face, and has described some work in the field of safety-critical real-time system topologies and future directions.
 J. H. Lala et al., “Architectural Principles for Safety-Critical Real-Time Applications”, Proc. IEEE, Vol. 82, Jan. 1994.
 D. A. Rennels, “Fault-Tolerant Computing-Concepts and Examples”, IEEE Trans. on Computers, Vol. C-33, 1984.
 A. L. Hopkins et al., “FTMP—A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft”, Proc. IEEE, Vol. 66, Oct. 1978.
 J. H. Wensley et al., “SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control”, Proc. IEEE, Vol. 66, Oct. 1978.
 C. J. Walter et al., “The MAFT Architecture for Distributed Fault Tolerance”, IEEE Trans. on Computers, Vol. 37, April 1988.
 C. J. Walter et al., “MAFT: A Multicomputer Architecture for Fault-Tolerance in Real-Time Control Systems”, Proc. IEEE Real-Time Systems Symp., Dec. 1985.
 H. Kopetz et al., “Distributed Fault-Tolerant Real-Time Systems: The MARS Approach”, IEEE Micro, Vol. 9, No. 1, Feb. 1989.
 D. Powell et al., “GUARDS: A Generic Upgradable Architecture for Real-Time Dependable Systems”, IEEE Trans. on Parallel and Distributed Systems, Vol. 10, June 1999.
 H. Lonn et al., “DACAPO: A Distributed Computer Architecture for Safety-Critical Control Applications”, Proc. IEEE Int. Symposium on Intelligent Vehicles, Detroit, Michigan, Sept. 1995.
 M. Nicholson et al., “Emergence of an Architectural Topology for Safety-Critical Real-Time Systems”, University of York, England, YO1 5DD, Nov. 1997.
 M. Nicholson et al., “Structuring Architectural Topologies for Real-Time Safety-Critical Systems”, University of York, England, York YCS-97-284, July 1997.
 M. W. Johnson et al., “AIPS System Requirements”, CSDL-C-5738, Charles Stark Draper Lab., Inc., Cambridge, MA, Aug. 1983.
 B. Wittenmark et al., “Timing Problems in Real-Time Control Systems”, Proc. SNART 97, The Swedish Conference on Real-Time Systems, 1997.
 J. Popek et al., “LOCUS: A Network Transparent, High Reliability Distributed System”, Proc. 3rd Int. Conf. on Distributed Systems, Miami, FL, Oct. 1982.
 J. Rushby, “Bus Architectures for Safety-Critical Embedded Systems”, EMSOFT 2001: First Workshop on Embedded Software, Lake Tahoe, CA, Oct. 2001.
 Z. Liu et al., “Specification and Verification of Fault-Tolerance, Timing and Scheduling”, ACM Trans. on Programming Languages and Systems.
 H. Lonn et al., “Clock Synchronization in Distributed Safety-Critical Control Systems”, IEEE First Int. Conf. on Algorithms and Architectures for Parallel Processing, Brisbane, Australia, 1995.
 H. Lawson et al., “Guidelines for Basement: A Real-Time Architecture for Automotive Systems”, Mecel, Goteborg, Sweden, 1992.