Several situations can cause an interface to stop updating its heartbeat point: the interface crashes, the interface cannot write to the data source for any reason, or the interface hangs in a function call that does not return. The following scenario describes the steps the backup copy of the interface takes to determine whether it should assume the primary role.
Every failover update interval, the backup interface reads the value of the heartbeat point for the primary interface and stores the value internally.
If the primary’s heartbeat has not updated between two reads, the backup stops purging data older than two update intervals from its queue.
If the primary’s heartbeat has not updated after two update intervals, the backup transitions into a temporary PrimaryStale state. This state gives the primary two additional update intervals to recover before the backup assumes control. Without this recovery time, latency in the system architecture could cause thrashing.
Upon entering the PrimaryStale state, the backup stops discarding queued data.
The backup interface remains in the PrimaryStale state for two failover update intervals, waiting for the primary interface to recover. If the primary interface has still not updated its heartbeat point by then, the backup sends the data in its queue to PI and transitions to the AssumingControl state.
When the backup enters the AssumingControl state, it immediately writes its failover ID to the active ID point. From this point on, the interface is considered the primary copy. It remains in the AssumingControl state for two update intervals before transitioning to the PrimaryReady state.
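The steps above can be sketched as a small state machine that is advanced once per failover update interval. This is only an illustrative model, not the interface's actual code: the class, method, and attribute names are hypothetical, the name `BACKUP` for the initial state is an assumption, and so is the return to that state when the heartbeat recovers during PrimaryStale. Queue handling and point I/O are left out.

```python
from enum import Enum, auto


class State(Enum):
    BACKUP = auto()            # assumed name for the normal backup role
    PRIMARY_STALE = auto()
    ASSUMING_CONTROL = auto()
    PRIMARY_READY = auto()


class BackupInterface:
    """Hypothetical model of the backup copy's failover decisions."""

    STALE_AFTER = 2  # unchanged reads before entering PrimaryStale
    GRACE = 2        # intervals spent waiting in PrimaryStale
    SETTLE = 2       # intervals spent in AssumingControl

    def __init__(self):
        self.state = State.BACKUP
        self.last_heartbeat = None
        self.unchanged = 0            # consecutive reads with no heartbeat change
        self.in_state = 0             # intervals spent in the current state
        self.wrote_active_id = False  # stands in for writing the active ID point

    def _enter(self, state):
        self.state = state
        self.in_state = 0

    def on_interval(self, heartbeat):
        """Process one update interval, given the primary's heartbeat value."""
        changed = heartbeat != self.last_heartbeat
        self.last_heartbeat = heartbeat

        if self.state is State.BACKUP:
            self.unchanged = 0 if changed else self.unchanged + 1
            if self.unchanged >= self.STALE_AFTER:
                self._enter(State.PRIMARY_STALE)
        elif self.state is State.PRIMARY_STALE:
            if changed:
                # Assumption: a recovered primary sends the backup back to normal.
                self.unchanged = 0
                self._enter(State.BACKUP)
            else:
                self.in_state += 1
                if self.in_state >= self.GRACE:
                    self._enter(State.ASSUMING_CONTROL)
                    self.wrote_active_id = True
        elif self.state is State.ASSUMING_CONTROL:
            self.in_state += 1
            if self.in_state >= self.SETTLE:
                self._enter(State.PRIMARY_READY)
        return self.state


if __name__ == "__main__":
    backup = BackupInterface()
    # The primary's heartbeat stops changing after the value 3.
    for value in [1, 2, 3, 3, 3, 3, 3, 3, 3]:
        print(value, backup.on_interval(value).name)
```

Feeding the model a heartbeat that freezes shows the backup passing through PrimaryStale and AssumingControl before settling in PrimaryReady, while a heartbeat that resumes during PrimaryStale returns it to the backup role under the stated assumption.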
Total failover time for this scenario is between three and five update intervals. The exact time it takes to fail over depends on when the primary failed and when the backup read the control points. If the backup reads the primary’s heartbeat just after the primary has updated it and the primary then halts, the failover time will be closer to three intervals. If the backup reads the primary’s heartbeat just before the primary updates it, and the primary then fails just after updating its heartbeat in the next interval, the failover time will be closer to five intervals. With the default update interval of 1 second, failover will occur between 3 and 5 seconds after the failure. The amount of overlapping data also depends on the timing, but is at most two intervals. With the default update interval of 1 second, there may be up to 2 seconds of overlapping data.
Figure 3 below shows the failover timing chart for the scenario above, in which the primary fails to update its heartbeat point.
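To see how these bounds scale with a non-default update interval, the arithmetic can be written out as below. The function name and return layout are hypothetical. The multipliers (three, five, and two intervals) are the ones stated in the text.

```python
def failover_bounds(update_interval_s=1.0):
    """Return (min_failover, max_failover, max_overlap) in seconds.

    Uses the bounds stated in the text: failover takes between three and
    five update intervals, and overlapping data spans at most two.
    """
    min_failover = 3 * update_interval_s
    max_failover = 5 * update_interval_s
    max_overlap = 2 * update_interval_s
    return min_failover, max_failover, max_overlap


if __name__ == "__main__":
    for interval in (1.0, 5.0):
        lo, hi, overlap = failover_bounds(interval)
        print(f"interval {interval:g} s: failover in {lo:g} to {hi:g} s, "
              f"overlap up to {overlap:g} s")
```

With a 5 second update interval, for instance, the same bounds give a failover time of 15 to 25 seconds and up to 10 seconds of overlapping data.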
Time     Action
T+0      Both interfaces are running with IF1 in the primary role.
T+2.1    Event A: IF1 stops updating its heartbeat point.
T+3.5    Event B: IF2 notices that IF1’s heartbeat has not updated. IF2 will discard data older than Time 1.5.
T+4.5    Event C: IF2 notices that IF1 still has not updated its heartbeat. IF2 stops discarding old queued data.
T+6.5    Event D: IF1 still has not updated its heartbeat so IF2 transitions to Primary sending all data received from Time 1.5.
Two things should be apparent in this figure. The failover takes slightly less than four and one half seconds (from Event A at T+2.1 to Event D at T+6.5), and the amount of overlapping data is just over one half second (from Time 1.5 to T+2.1).
Figure 3: Timing chart when the primary interface stops updating its heartbeat point.
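The two figures quoted for this chart can be checked directly from the event times. The variable names below are illustrative, and the overlap calculation assumes IF1 delivered data right up to the moment its heartbeat stopped.

```python
# Times in seconds, read from the timing chart.
primary_stops = 2.1      # Event A: IF1 stops updating its heartbeat
backup_takes_over = 6.5  # Event D: IF2 transitions to primary
oldest_data_sent = 1.5   # earliest queued data IF2 sends at Event D

# Time between the primary's failure and the backup taking over.
failover_time = backup_takes_over - primary_stops

# Data from 1.5 to 2.1 is assumed to have been sent by both interfaces.
overlap = primary_stops - oldest_data_sent

print(f"failover time: {failover_time:.1f} s")
print(f"overlapping data: {overlap:.1f} s")
```

Both results fall inside the general bounds for this scenario: between three and five update intervals for failover, and at most two intervals of overlapping data.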