12.2. HA Mechanisms

cOS Core 14.00.15 Administration Guide

This section discusses in more depth the mechanisms cOS Core uses to implement the high availability feature.

Basic Principles

Clavister HA provides a redundant, state-synchronized hardware configuration. The state of the active unit, such as the connection table and other vital information, is continuously copied to the inactive unit via the sync interface. When cluster failover occurs, the inactive unit knows which connections are active, and traffic can continue to flow after the failover with negligible disruption.

The inactive system detects that the active system is no longer operational when it no longer detects sufficient Cluster Heartbeats. Heartbeats are sent over the sync interface as well as all other interfaces.

Heartbeat Frequency

cOS Core sends 5 heartbeats per second from the active system, and a failover is initiated when three heartbeats are missed (that is to say, after 0.6 seconds). By sending heartbeats over all interfaces, the inactive unit gets an overall view of the active unit's health. Even if the sync interface is deliberately disconnected, failover may not result if the inactive unit receives enough heartbeats from other interfaces via a shared switch. Note, however, that the sync interface sends twice as many heartbeats as any of the normal interfaces.

Heartbeats are not sent at shorter intervals because delays of that length can occur during normal operation. An operation such as opening a file could cause a delay long enough for the inactive system to go active, even though the other system is still active.
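
The detection time quoted above follows directly from the heartbeat rate and the number of missed heartbeats. The following is a minimal sketch of that arithmetic only; the constant and function names are illustrative assumptions and do not correspond to cOS Core settings.

```python
# Illustrative arithmetic only. These names are assumptions made for this
# example and are not cOS Core configuration settings.
HEARTBEATS_PER_SECOND = 5
MISSED_BEFORE_FAILOVER = 3


def detection_time_seconds(rate_hz: int = HEARTBEATS_PER_SECOND,
                           missed: int = MISSED_BEFORE_FAILOVER) -> float:
    """Return the time taken for `missed` consecutive heartbeats to be lost."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return missed / rate_hz


if __name__ == "__main__":
    # 3 missed heartbeats at 5 per second
    print(detection_time_seconds())
```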

Important: Disabling sending heartbeats on interfaces

The administrator can manually disable heartbeat sending on any interface if that is desired. This is a property of the Ethernet interface configuration object. This is not recommended since the fewer interfaces that send heartbeats, the higher the risk that not enough heartbeats are received to correctly determine system health.

The exception to this recommendation is if an Ethernet interface is not used at all. It is recommended to disable heartbeat sending on unused interfaces. The reason for this is that sending heartbeats on unused interfaces contributes to a false picture of system health since those heartbeats are always lost. A "false" failover could therefore be the result or possibly even both units becoming the active unit.

Heartbeat Characteristics

Cluster heartbeats have the following characteristics:

The source IP is the interface address of the sending firewall.
The destination IP is the broadcast address on the sending interface.
The IP TTL is always 255. If cOS Core receives a cluster heartbeat with any other TTL, it is assumed that the packet has traversed a router and therefore cannot be trusted.
It is a UDP packet, sent from port 999, to port 999.
The destination MAC address is the multicast version of the shared hardware address and if UniqueSharedMac is enabled (the default) this has the form:
```
			11-00-00-mm-mm-nn
```
Where mm is a bit mask made up of the interface bus, slot and port on the master and nn represents the cluster ID. If UniqueSharedMac is not enabled, the form is:
```
			11-00-00-C1-4A-nn
```
Link layer multicasts are used in preference to normal unicast packets for security. With unicast packets, a local attacker could fool switches into forwarding heartbeats somewhere else so that the inactive system never receives them.
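
The characteristics listed above can be expressed as a simple acceptance check. The sketch below is not how cOS Core is implemented; the PacketInfo type and the looks_like_heartbeat function are hypothetical names introduced for illustration. It checks only the UDP ports, the TTL and the multicast bit of the destination MAC address (the least significant bit of the first octet), and leaves out the source and destination IP checks.

```python
from dataclasses import dataclass

HEARTBEAT_PORT = 999   # source and destination UDP port, as listed above
HEARTBEAT_TTL = 255    # any other TTL suggests the packet crossed a router


@dataclass
class PacketInfo:
    """Hypothetical, simplified view of a received packet."""
    protocol: str   # for example "udp"
    src_port: int
    dst_port: int
    ttl: int
    dst_mac: str    # written as "11-00-00-C1-4A-05"


def looks_like_heartbeat(pkt: PacketInfo) -> bool:
    """Return True if the packet matches the heartbeat traits checked here."""
    first_octet = int(pkt.dst_mac.split("-")[0], 16)
    is_multicast = (first_octet & 0x01) == 0x01  # Ethernet group bit
    return (
        pkt.protocol == "udp"
        and pkt.src_port == HEARTBEAT_PORT
        and pkt.dst_port == HEARTBEAT_PORT
        and pkt.ttl == HEARTBEAT_TTL
        and is_multicast
    )
```

A packet that arrives with a TTL of 254, for instance, fails this check because it appears to have passed through a router.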

Failover Time

The time for failover is typically about one second, which means that clients may experience a failover as a slight burst of packet loss. In the case of TCP, the failover time is well within the range of normal retransmit timeouts so TCP will retransmit the lost packets within a very short space of time and continue communication. UDP is inherently an unreliable protocol and does not retransmit, so any UDP packets lost during failover are not resent unless the application itself does so.

Shared IP Addresses and ARP

Both master and slave know about the shared IP address. ARP queries for the shared IP address, or any other IP address published via the ARP configuration section or through Proxy ARP, are answered by the active system.

The hardware address of the shared IP address and other published addresses is not related to the actual hardware addresses of the interfaces. Instead, the MAC address is constructed by cOS Core from the Cluster ID. If UniqueSharedMac is enabled (the default), its form is:

			10-00-00-mm-mm-nn

Where mm is derived from the master node's bus/slot/port combined and nn is the configured Cluster ID. The Cluster ID must be unique for each cOS Core cluster in a network. If UniqueSharedMac is not enabled the form is:

			10-00-00-C1-4A-nn

As the shared IP address always has the same hardware address, there is no delay caused by updating the ARP caches of units attached to the same LAN as the cluster when failover occurs.

When a cluster member discovers that its peer is not operational, it broadcasts gratuitous ARP queries on all interfaces using the shared hardware address as the sender address. This allows switches to re-learn within milliseconds where to send packets destined for the shared address. The only delay in failover, therefore, is the time taken to detect that the active unit is down.
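
The relationship between the shared hardware address and the multicast address used for heartbeats can be illustrated with a short sketch. It makes two assumptions that the text does not confirm: that nn is the Cluster ID written as a single hexadecimal octet, and that mm-mm can be treated as an opaque two-octet value (its derivation from bus, slot and port is not described here). The function names are hypothetical.

```python
def shared_mac(cluster_id: int, mm: int = 0xC14A) -> str:
    """Build an address of the form 10-00-00-mm-mm-nn.

    `mm` is treated as an opaque 16-bit value (an assumption). The default
    0xC14A gives the fixed form used when UniqueSharedMac is not enabled.
    """
    if not 0 <= cluster_id <= 0xFF:
        raise ValueError("cluster_id must fit in one octet")
    if not 0 <= mm <= 0xFFFF:
        raise ValueError("mm must fit in two octets")
    octets = [0x10, 0x00, 0x00, mm >> 8, mm & 0xFF, cluster_id]
    return "-".join(f"{octet:02X}" for octet in octets)


def multicast_form(mac: str) -> str:
    """Set the Ethernet group (multicast) bit in the first octet."""
    octets = [int(part, 16) for part in mac.split("-")]
    octets[0] |= 0x01
    return "-".join(f"{octet:02X}" for octet in octets)


if __name__ == "__main__":
    unicast = shared_mac(5)
    print(unicast)                  # shared hardware address
    print(multicast_form(unicast))  # heartbeat destination address
```

Setting the group bit turns the leading 10 into 11, which is why the heartbeat destination address differs from the shared hardware address only in its first octet.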

ARP queries are also broadcast periodically to ensure that switches do not forget where to send packets destined for the shared hardware address.

Promiscuous Mode

The Ethernet interfaces in an HA cluster must operate in promiscuous mode for HA to function. In this mode, traffic with a destination MAC address that does not match the Ethernet interface's MAC address is passed to cOS Core instead of being discarded by the interface. Promiscuous mode is enabled automatically by cOS Core and no action by the administrator is required.

If the administrator enters the CLI command ifstat <ifname>, the Receive Mode status line will show the value Promiscuous instead of Normal to indicate that the mode has changed. This is discussed further in Section 3.4.2, Ethernet Interfaces.

HA with Anti-Virus and IDP

If an HA cluster has the Anti-Virus or IDP subsystems enabled then updates to the Anti-Virus signature database or IDP pattern database will routinely occur. These updates involve downloads from the external Clavister servers and they require cOS Core reconfiguration to occur for the new database contents to become active.

A database update causes the following sequence of events to occur in an HA cluster:

The active (master) unit downloads the new database files from the Clavister servers. The download is done via the shared IP address of the cluster.
The active (master) unit sends the new database files to the inactive (slave) unit.
The inactive (slave) unit reconfigures to activate the new database files.
The active (master) unit now reconfigures to activate the new database files causing a failover to the slave unit. The slave is now the active unit.
After reconfiguration of the master is complete, failover occurs again so that the master once again becomes the active unit.
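
The sequence above involves two failovers, which can be easy to miss on a first reading. The following sketch simply traces which unit is active after each step. It is a model of the description in this section and not of cOS Core internals; the function name is hypothetical.

```python
def database_update_trace() -> list[tuple[str, str]]:
    """Return (step description, active unit after the step) pairs."""
    active = "master"
    trace = []

    trace.append(("master downloads new database files", active))
    trace.append(("master sends files to slave", active))
    trace.append(("slave reconfigures", active))

    # The master reconfigures, so the slave takes over.
    active = "slave"
    trace.append(("master reconfigures, failover to slave", active))

    # Once the master has finished, it becomes active again.
    active = "master"
    trace.append(("master reconfiguration complete, failover back", active))
    return trace


if __name__ == "__main__":
    for step, active_unit in database_update_trace():
        print(f"{step}: active unit is {active_unit}")
```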

Dealing with Sync Failure

An unusual situation that can occur in an HA cluster is a failure of the sync connection between the master and slave, with the result that heartbeats and state updates are no longer received by the inactive unit over that connection.

Should such a failure occur then the consequence is that both units will continue to function but they will lose their synchronization with each other. In other words, the inactive unit will no longer have a correct copy of the state of the active unit. A failover will not occur in this situation since the inactive unit will realize that synchronization has been lost.

Failure of the sync interface results in the generation of hasync_connection_failed_timeout log messages by the active unit. However, it should be noted that this log message is also generated whenever the inactive unit appears to be not working, such as during a software upgrade.

Failure of the sync interface can be confirmed by comparing the output from certain CLI commands for each unit. The number of connections could be compared with the stats command. If IPsec tunnels are heavily used, the ipsecglobalstat -verbose command could be used instead and significant differences in the numbers of IPsec SAs, IKE SAs, active users and IP pool statistics would indicate a failure to synchronize.
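
Once the relevant figures have been read manually from the CLI output of each unit, the comparison can be sketched as below. This is an illustration only: the counter names, the function name and the 10% tolerance are all assumptions, and the sketch does not parse real CLI output.

```python
def significant_differences(active: dict[str, int],
                            inactive: dict[str, int],
                            tolerance: float = 0.10) -> dict[str, tuple[int, int]]:
    """Return counters whose values differ by more than `tolerance`.

    The difference is measured relative to the larger of the two values.
    The tolerance is an arbitrary example value, not a documented threshold.
    """
    differences = {}
    for name in active.keys() & inactive.keys():
        a, b = active[name], inactive[name]
        largest = max(a, b)
        if largest == 0:
            continue  # both zero, nothing to compare
        if abs(a - b) / largest > tolerance:
            differences[name] = (a, b)
    return differences


if __name__ == "__main__":
    # Hypothetical figures copied by hand from each unit.
    active_unit = {"connections": 1000, "ipsec_sas": 40}
    inactive_unit = {"connections": 990, "ipsec_sas": 3}
    print(significant_differences(active_unit, inactive_unit))
```

Small differences are expected even in a healthy cluster, so only counters that diverge substantially are reported.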

Once the broken sync interface is fixed, perhaps by replacing the connecting cable, resynchronization of the two units will take place automatically. If the sync interface is now functioning correctly, there may still be some small differences in the statistics from each cluster unit but these will be minor compared with the differences seen in the case of failure.

In unusual circumstances, synchronization between the active and inactive unit will not take place automatically. In this case, it may be necessary to manually restart the unsynchronized inactive unit in order to force resynchronization. This can be achieved using the CLI command:

Device:/> shutdown

A restart of the inactive unit will cause the following to take place:

During startup, the inactive unit sends a message to the active unit to flag that its state has been initialized and it requires the entire state of the active unit to be sent.
The active unit then sends a copy of its entire state to the inactive unit.
The inactive unit then becomes synchronized after which a failover can take place successfully if there is a system failure.

A restart of the inactive unit is the only time when the entire state of the active unit is sent to the inactive unit.

Note that the detection by cOS Core of unexpected packets on the sync interface is discussed in Section 12.4, HA Issues and Troubleshooting.