12.4. HA Issues and Troubleshooting

The following points should be kept in mind when configuring and managing an HA Cluster.

Managing a Cluster Via a Single Public IP Address

Management of a cluster involves three IP addresses: the address of each firewall and the shared IP. If management is required over the Internet, it is better to use just a single public IP address instead of three. This can be achieved by using a Server Load Balancing (SLB) rule although load balancing is not actually the purpose in this case. Setting up cluster management using a single public IP address is described further in an article in the Clavister Knowledge Base at the following link:

https://kb.clavister.com/324736255

HA Master/Slave Separation Latency

To ensure maximum uptime, the administrator might decide to place the two HA units in two separate locations so that both will not be affected by the same adverse event, such as a building fire. This is a viable option but it is subject to one limitation: the latency for synchronization data being sent between the units should not be greater than 20 milliseconds.

If synchronization data between the two units are delayed by more than 20 milliseconds, the cluster may not be able to function correctly.

The issues related to physical separation of the HA cluster nodes are discussed further in an article in the Clavister Knowledge Base at the following link:

https://kb.clavister.com/324735752
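
As a rough illustration of the 20 millisecond limit described above, the following Python sketch checks a set of latency measurements against it. The function name and the sample values are hypothetical, and how the measurements are obtained (for example, from tests run across the link between the two sites) is an assumption left outside the sketch. It is not a Clavister tool.

```python
# Hypothetical helper: checks latency samples against the 20 ms limit
# stated in this section. Collecting the samples is outside this sketch.
MAX_SYNC_LATENCY_MS = 20.0


def sync_latency_ok(samples_ms, limit_ms=MAX_SYNC_LATENCY_MS):
    """Return True only if every measured latency is within the limit."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    # The worst sample decides, since a single long delay can matter.
    return max(samples_ms) <= limit_ms


if __name__ == "__main__":
    samples = [4.2, 6.9, 18.5]  # illustrative values in milliseconds
    print(sync_latency_ok(samples))
```

Judging by the worst sample rather than the average is a deliberately cautious choice, since occasional long delays could be hidden by an average.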

ALGs are Not State Synchronized

No aspect of ALGs is state synchronized in an HA cluster. This means that all traffic handled by ALGs will freeze when a cluster fails over to the other peer. However, if the cluster fails back over to the original peer within approximately half a minute, frozen sessions and their associated transfers should begin working again.

Transparent Mode

There is no loop avoidance with HA, so transparent mode should not be configured in an HA cluster. Switch routes (and therefore transparent mode) do not have state synchronization.

VPN Tunnel Synchronization

cOS Core provides complete synchronization for IPsec tunnels in an HA cluster. In the event of a failover occurring, incoming clients should not need to reestablish their tunnels.

However, cOS Core does not provide HA support for the following:

In the event of a failover occurring for these types of tunnel, incoming clients must reestablish their tunnels after the original tunnels are deemed non-functional. The timeout for this varies depending on the client and is typically within the range of a few seconds to a few minutes.

DHCP

Servers for IPv4 DHCP as well as DHCPv6 have full HA synchronization support. However, the clients for both IPv4 DHCP and DHCPv6 are not supported. If either type of client is configured on an interface, this will result in the error message Shared HA IP address not set when trying to commit the configuration.

Real-time Monitoring

The Real-time Monitor will not automatically track the active firewall. If a Real-time Monitor graph shows nothing but the connection count moving, then the cluster has probably failed over to the other unit.

All Cluster Interfaces Need IP Addresses

All interfaces on both HA cluster units should have a valid private IPv4 address object assigned to them. The predefined IP object localhost could be assigned for this purpose. An address must be assigned even if an interface has been disabled.

SNMP

SNMP statistics are not shared between master and slave. SNMP managers have no failover capabilities. Therefore both firewalls in a cluster need to be polled separately.
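
The need to poll each unit separately can be sketched in Python as follows. This is only an outline under stated assumptions: poll_unit is a hypothetical callable supplied by the caller that performs the actual SNMP request, and the addresses shown are placeholder documentation addresses, not values from a real cluster.

```python
def poll_cluster(unit_addresses, poll_unit):
    """Poll every cluster unit on its own address and keep results apart.

    unit_addresses maps a unit name to its individual management address.
    poll_unit is a hypothetical callable taking an address and returning
    that unit's statistics. A unit that cannot be reached is recorded as
    None instead of stopping the poll of the other unit.
    """
    results = {}
    for name, address in unit_addresses.items():
        try:
            results[name] = poll_unit(address)
        except OSError:
            results[name] = None
    return results


if __name__ == "__main__":
    # Placeholder addresses and a stand-in poller for illustration only.
    units = {"master": "192.0.2.1", "slave": "192.0.2.2"}
    print(poll_cluster(units, lambda address: {"polled": address}))
```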

Logging

Log data will be coming from both master and slave. This means that the log receiver will have to be configured to receive logs from both. It also means that all log queries will likely have to include both master and slave as sources which will give all the log data in one result view. Normally, the inactive unit will not be sending log entries about live traffic so the output should look similar to that from one Clavister firewall.
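
Combining the log data from both units into one result view, as described above, might look like the following Python sketch. The tuple layout (timestamp, source, message) and the sample entries are assumptions made for illustration and do not reflect the actual cOS Core log format. Each input list is assumed to be already ordered by time.

```python
import heapq
from datetime import datetime


def merge_logs(master_entries, slave_entries):
    """Merge two time-ordered log lists into one time-ordered view.

    Each entry is a hypothetical (timestamp, source, message) tuple.
    """
    return list(heapq.merge(master_entries, slave_entries,
                            key=lambda entry: entry[0]))


if __name__ == "__main__":
    master = [
        (datetime(2024, 1, 1, 12, 0, 0), "master", "conn_open"),
        (datetime(2024, 1, 1, 12, 0, 5), "master", "conn_close"),
    ]
    slave = [
        (datetime(2024, 1, 1, 12, 0, 2), "slave", "peer_alive"),
    ]
    for timestamp, source, message in merge_logs(master, slave):
        print(timestamp.isoformat(), source, message)
```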

Using Individual IP Addresses

The unique individual IP addresses of the master and slave cannot safely be used for anything but management. Using them for anything else, such as for source IPs in dynamically NATed connections or publishing services on them, will inevitably cause problems since unique IPs will disappear when the firewall they belong to does.

The Shared IP Must Not Be 0.0.0.0

Assigning the IPv4 address 0.0.0.0 as the shared IP address must be avoided. This is not valid and will cause cOS Core to enter lockdown mode, where only management access will be possible.
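
A pre-deployment check for this rule could be sketched in Python as follows. The function name is hypothetical and the check covers only the one rule stated above. It uses the standard ipaddress module and is not part of cOS Core.

```python
import ipaddress


def shared_ip_is_acceptable(address):
    """Return False if the address is the unspecified address 0.0.0.0.

    Raises ValueError if the text is not a valid IP address at all.
    """
    ip = ipaddress.ip_address(address)
    return not ip.is_unspecified


if __name__ == "__main__":
    for candidate in ("0.0.0.0", "192.168.1.1"):
        print(candidate, shared_ip_is_acceptable(candidate))
```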

Failed Interfaces

Failed interfaces will not be detected unless they fail to the point where cOS Core cannot continue to function. This means that failover will not occur if the active unit can still send "I am alive" heartbeats to the inactive unit through any of its interfaces, even though one or more interfaces may be inoperative.

However, by utilizing the cOS Core link monitoring feature, cOS Core can be configured to trigger immediate HA failover on interface failure. This is discussed further in Section 12.6, Link Monitoring and HA.
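
The default behavior described above, before link monitoring is added, can be pictured with the following simplified Python model. It only restates the rule given in the text and is not the actual cOS Core algorithm. The function name and interface names are hypothetical.

```python
def inactive_unit_takes_over(heartbeats_seen):
    """Simplified model of the default failover rule.

    heartbeats_seen maps an interface name to True if heartbeats from
    the active unit are still arriving on that interface. Takeover only
    happens when no interface at all carries heartbeats.
    """
    if not heartbeats_seen:
        raise ValueError("at least one interface is required")
    return not any(heartbeats_seen.values())


if __name__ == "__main__":
    # One interface is down, but heartbeats still arrive on another,
    # so in this model no failover is triggered.
    print(inactive_unit_takes_over({"lan": False, "wan": True, "sync": True}))
```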

Changing the Cluster ID

Changing the cluster ID in a live environment is not recommended for two reasons. First, this will change the hardware address of the shared IPs and will cause problems for all devices attached to the local network, as they will keep the old hardware address in their ARP caches until the entries time out. To avoid waiting for the timeout, such devices would have to have their ARP caches flushed.

Secondly, this breaks the connection between the firewalls in the cluster for as long as they are using different configurations. This will cause both firewalls to go active at the same time.

Invalid Checksums in Heartbeat Packets

Cluster heartbeat packets are deliberately created with invalid checksums. This is done so that they will not be routed. Some routers may flag this invalid checksum in their log messages.

Making OSPF Work

If OSPF is being used to determine routing metrics then a cluster cannot be used as the designated router.

If OSPF is to work then there must be another designated router available in the same OSPF area as the cluster. Ideally, there will also be a second, backup designated router to provide OSPF metrics if the main designated router should fail.

PPPoE Tunnels and DHCP Clients

For reasons connected with the shared IP addresses of an HA cluster, PPPoE tunnels and DHCP clients should not be configured in an HA cluster.

Disabling Heartbeats on Unused Interfaces

It is recommended to disable heartbeats on Ethernet interfaces that are not being used. If this is not done, there is a risk of repeated failovers or even of both units going active, because the HA mechanism will see the unused interface as a failed interface. The higher the proportion of unused interfaces in a cluster, the more pronounced this effect becomes.

Synchronization Problems and Unexplained Failovers

Initial synchronization problems or unexplained failovers could be caused by simple connection problems during the initial cluster setup. However, an HA cluster may also require some fine tuning after a successful setup, which involves changing some of the HA advanced settings that determine how the cluster behaves. This can be particularly relevant where a cluster has to handle large amounts of data and connections.

Such post-setup fine tuning is covered by an article in the Clavister Knowledge Base at the following link:

https://kb.clavister.com/329090437

Both HA Nodes Going Either Active or Passive at the Same Time

The following synchronization issues might infrequently occur in an HA cluster but will need to be addressed if they do:

The above issues are usually caused by any of the following:

More detail about troubleshooting these issues can be found in an article in the Clavister Knowledge Base at the following link:

https://kb.clavister.com/354848074

Unexpected Packets on the Sync Interface

Packets received on the Sync interface should only be synchronization packets from the cluster peer. If cOS Core sees anything unexpected, it will generate disallowed_on_sync_iface log messages for these events and this may indicate a problem in the network topology.

One known example where this can happen is with a VMware ESX host with the vSwitch "Notify Switches" setting enabled. This is discussed further in an article in the Clavister Knowledge Base at the following link:

https://kb.clavister.com/309993643
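
When checking collected logs for this condition, a simple count of the disallowed_on_sync_iface messages can be a useful first step. The Python sketch below assumes only that the message name appears somewhere in each relevant log line. The sample lines are invented for illustration and do not show the real cOS Core log format.

```python
def count_sync_iface_events(log_lines, event="disallowed_on_sync_iface"):
    """Count the log lines that mention the given event name."""
    return sum(1 for line in log_lines if event in line)


if __name__ == "__main__":
    # Invented sample lines; real log lines will look different.
    sample = [
        "example-line event=disallowed_on_sync_iface",
        "example-line event=conn_open",
        "example-line event=disallowed_on_sync_iface",
    ]
    print(count_sync_iface_events(sample))
```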

Avoiding Interruptions During Configuration Deployments

When a new configuration is deployed to an HA cluster there can be issues with traffic interruptions. Avoiding this is discussed in a Clavister Knowledge Base article at the following link:

https://kb.clavister.com/324735697