Aerospike cluster splits or refuses to form although correctly configured

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.


Problem Description

An Aerospike cluster will not form with more than a given number of nodes. When existing nodes are removed, new nodes can be added in their place, so the cluster appears to have a maximum size. Checks of the Aerospike configuration show that the nodes are configured correctly.

Explanation

There are three main reasons for this to happen:

  1. Firewall blocking heartbeat/fabric ports between nodes.
  2. Interface or link overload causing packet loss.
  3. The interfaces have an incorrect MTU, or PMTU (path MTU) discovery is not functioning.

The solutions below show how to diagnose the problem and what steps should be taken to remedy the situation.

Solution

Blocked ports on firewall

Checking for problem 1 is fairly simple: from each node in the cluster, check connectivity to both the heartbeat and fabric ports on all other nodes. Each node must be able to form heartbeat (HB) and fabric connections to every other node.

This can be achieved using netcat in port-probing mode, which is a safer way of checking connectivity than telnet. Adjust the snippet below with the correct heartbeat and fabric ports and the list of IPs of all nodes that should be present in the cluster.

$ PORTS="3001 3002"
$ IPS="172.17.0.2 172.17.0.3 172.17.0.4"
$ for port in ${PORTS}; do for ip in ${IPS}; do nc -znv -w 5 ${ip} ${port}; done; done
(UNKNOWN) [172.17.0.2] 3001 (?) open
(UNKNOWN) [172.17.0.3] 3001 (?) open
(UNKNOWN) [172.17.0.4] 3001 (?) open
(UNKNOWN) [172.17.0.2] 3002 (?) open
(UNKNOWN) [172.17.0.3] 3002 (?) : Connection timed out
(UNKNOWN) [172.17.0.4] 3002 (?) : Connection refused

In this example, run from 172.17.0.2, it is evident that the node cannot connect to the other two nodes on port 3002: one connection times out (packets are being dropped) and the other is refused.

The solution is to change/adjust firewall and/or router settings between the nodes to ensure that the ports are open and can be connected to.
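
The exact method depends on the environment. As an illustration only, opening the default heartbeat and fabric ports with firewalld on a Linux node might look like the following (3001 and 3002 are examples; substitute the ports configured in aerospike.conf, and apply equivalent rules on any intermediate firewalls or cloud security groups):

# Open the heartbeat and fabric ports through firewalld and reload to apply.
$ sudo firewall-cmd --permanent --add-port=3001/tcp
$ sudo firewall-cmd --permanent --add-port=3002/tcp
$ sudo firewall-cmd --reload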

Packet loss

Packet loss on the nodes themselves can be checked using the ifconfig command:

# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.2  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:02  txqueuelen 0  (Ethernet)
        RX packets 124172  bytes 10671324 (10.6 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 126761  bytes 10776856 (10.7 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

If the dropped and/or overrun counters increase rapidly when this command is run at regular intervals, packets are most likely being dropped on the Linux system itself.
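
On systems where ifconfig is not available, the same counters can be read with the ip command. A minimal sketch, assuming the interface of interest is eth0:

# Show per-interface RX/TX statistics, including dropped and overrun counters,
# for the interface carrying heartbeat and fabric traffic.
$ ip -s link show eth0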

Packets can also be dropped on intermediate routers, switches or firewalls between the nodes. Basic tools for checking packet loss from one system to another include a long-running ping, traceroute and tracepath. Ultimately, to be absolutely sure whether packets are being dropped, interface statistics must be checked on all routers, switches and firewalls along the path. iperf can also be used to determine whether the link is already saturated, which may be the cause of the packet loss.
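
As an illustration, a long-running ping and an iperf3 throughput test between two nodes could be run as follows (a sketch only; iperf3 must be installed on both nodes and the IP address is an example):

# Long-running ping; press Ctrl-C to stop and check the packet loss summary.
$ ping -i 0.2 172.17.0.3

# Start an iperf3 server on the receiving node...
$ iperf3 -s

# ...and run a 30 second throughput test towards it from the sending node
# to see how much headroom the link has.
$ iperf3 -c 172.17.0.3 -t 30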

A courtesy check of dmesg is also advised to ensure the kernel is not experiencing other issues which may result in packet loss.

A typical cause of link saturation, particularly in a previously stable cluster, is migration traffic. Under default settings, Aerospike partition migration is a normal background cluster activity, but these settings can be overtuned, causing, among other things, link saturation between cluster nodes.

The solution to this is to slow down migrations by reducing migrate-threads or migrate-max-num-incoming, or by increasing migrate-sleep.
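
As a sketch, and assuming asinfo access to the nodes, migrations could be throttled dynamically along the following lines. The values are illustrative, "test" is a placeholder namespace name, and the context of each parameter should be verified against the configuration reference for the server version in use:

# Reduce the number of migration threads (service context).
$ asinfo -v "set-config:context=service;migrate-threads=1"

# Limit how many partitions may migrate into a node at the same time (service context).
$ asinfo -v "set-config:context=service;migrate-max-num-incoming=2"

# Increase the per-record sleep for migrations in a namespace (namespace context).
$ asinfo -v "set-config:context=namespace;id=test;migrate-sleep=10"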

MTU issues

By design, Aerospike heartbeats must fit in a single packet, and the packet size is dictated by the MTU. Normally this should not be a problem, as interfaces perform PMTU (path MTU) discovery. This means that a TCP connection should automatically be able to work out the maximum packet size it can send (the MSS) based on a simple discovery of the maximum MTU allowed along each hop.
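
The MTU configured on the interface itself can be confirmed on each node, for example with the ip command (eth0 is an example interface name):

# Print the link settings, including the configured MTU, for the interface
# used for heartbeat and fabric traffic.
$ ip link show eth0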

In certain cases, there may exist limiting routers with MTUs smaller than the interface MTU. With PMTU discovery this should still work without issues. Unfortunately, some routers are configured to not respond with ICMP packets, which does not allow for PMTU discovery. The below steps can be taken to identify such a scenario.

One of the first tools to use is tracepath, which reports the allowed PMTU along the path:

# tracepath far.example.com
 1?: [LOCALHOST]                      pmtu 1500
 1:  172.17.0.1                                            0.061ms 
 1:  172.17.0.1                                            0.050ms 
 2:  limiting.router.example.com                           0.620ms pmtu 1452
 3:  no reply
 4:  far.example.com                                       0.721ms

From the above output, it is evident that the local machine has a standard PMTU of 1500 on the local network and that the MTU limit discovered on the second-hop router is 1452. Therefore, the maximum MTU that can be used along this path should be assumed to be 1452.

The third hop unfortunately did not respond to path MTU discovery, so while its MTU may be higher than 1452, this cannot be determined. The machine which failed to respond may be limited to less than 1452, in which case PMTU discovery will break. To test this, a ping can be run using the -s flag to specify a packet of a defined size:

# ping -s 1424 far.example.com

Note that 1424 bytes, rather than 1452, was used because the ICMP and IP headers add 28 bytes, resulting in a 1452-byte packet. If the above ping test does not return a successful reply, then either ICMP is blocked completely on far.example.com (or on a router in between), in which case it will need to be unblocked to perform the MTU test, or the third-hop machine (the one that gave no reply) has a smaller MTU. Testing with a much smaller size should succeed, possibly reported as truncated (an untruncated success is also fine):

# ping -s 200 far.example.com
PING far.example.com (10.0.0.20) 200(228) bytes of data.
76 bytes from far.example.com (10.0.0.20): icmp_seq=1 ttl=37 (truncated)

This indicates that while smaller-sized packets work, MTU size is indeed limited along the path and is not being correctly reported via Path MTU discovery.
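
Where ICMP replies are allowed, the actual path MTU can also be narrowed down by prohibiting fragmentation and bisecting the payload size. A sketch, with example sizes and host name:

# -M do prohibits fragmentation, so a probe larger than the path MTU will fail
# (or simply get no reply), while one that fits will succeed. Remember to
# subtract the 28 bytes of ICMP and IP headers from the candidate MTU.
$ ping -M do -s 1424 far.example.com    # probes with a 1452 byte packet
$ ping -M do -s 1472 far.example.com    # probes with a 1500 byte packet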

The preferred solution is to fix the router/firewall/switch in question so that the MTU is the same size along the whole path.

A secondary step is to always allow ICMP responses, at least types 2 and 3 (ICMPv6 "packet too big" and ICMPv4 "destination unreachable"), on all routers, switches and firewalls. This should be done regardless of whether the MTU along the path can be adjusted. Without ICMP types 2 and 3, certain protocols will not function as intended, for example large-packet UDP and multicast.

Allowing correct path MTU discovery will allow for automatic adjustment of the MSS (maximum segment size) and for correct functioning of the heartbeat protocol.
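
On Linux-based firewalls, for example, allowing these ICMP messages might look like the following sketch (dedicated routers and firewall appliances use their own syntax):

# Allow ICMPv4 "destination unreachable" (which carries "fragmentation needed")
# and ICMPv6 "packet too big", both required for PMTU discovery to work.
$ sudo iptables -A INPUT -p icmp --icmp-type destination-unreachable -j ACCEPT
$ sudo ip6tables -A INPUT -p ipv6-icmp --icmpv6-type packet-too-big -j ACCEPT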

Alternatively, as a workaround, the heartbeat MTU setting can be adjusted to inform Aerospike of the limitation manually, using heartbeat.mtu.

Note that since the heartbeat message must fit in a single packet, the MTU must be large enough to accommodate it. Each node in the cluster requires 30 bytes, so as the cluster grows in size, so must heartbeat.mtu.
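
A minimal sketch of a heartbeat stanza carrying such an override is shown below; the addresses and values are illustrative and should be checked against the configuration reference for the server version in use:

heartbeat {
    mode multicast                  # heartbeats must fit in a single packet
    multicast-group 239.1.99.222    # example multicast group
    port 9918
    interval 150
    timeout 10
    mtu 1452                        # override the auto-detected MTU with the smallest MTU found along the path
}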

Notes

Keywords

FIREWALL CLUSTER HEARTBEAT HB MTU PMTU SPLIT WILL NOT FORM MAXIMUM SIZE

Timestamp

July 2020