Aerospike multisite cluster - resiliency

In a multisite cluster (3 regions, 3 nodes in each region, replication factor 3), one node went offline and caused failures in the cluster's write workload for approximately 12 seconds. Aerospike then created a temporary replica, promoted another replica to master for the affected partitions, and the workload processed successfully.

Please advise how to reduce the number of failures in a multisite cluster.

If a node just dies, not much can be done about the latency involved in a) the cluster detecting that the node went out for good (governed by the heartbeat count and interval) and b) the client rediscovering the partition map (1-second tend interval).

Without going into too much detail: if you used the default heartbeat timeout (count) and interval settings (timeout 10, interval 150 ms), the cluster will detect that a node has left in somewhere around 2.3 to 3.3 seconds. (The cluster does not kick a node out just because a few heartbeats were missed.) Then, worst case, the client will detect the new partition map after 1 second (the tend interval). Having said that, 12 seconds seems very long; I could buy 4.5 seconds worst case. What are your heartbeat settings for interval and timeout?

 heartbeat {
    mode mesh
    port 3002
    address any
    # mesh-seed-address-port 172.xx.yy.zz 3002
    interval 150
    timeout 10
 }

However, if you are bringing a node down for maintenance, you can use the quiesce feature to remove it gracefully without any client-side issues.
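As a sketch, a quiesce flow using the standard `asinfo` tool might look like the following. The host addresses and the namespace name `test` are placeholders, and the exact statistic names to watch may vary by server version:

```shell
# On the node you want to take down: mark it quiesced, so that on the next
# rebalance it hands off its master partitions instead of dropping transactions.
asinfo -h 10.0.0.1 -v 'quiesce:'

# On any node in the cluster: trigger a rebalance so the quiesced node
# actually gives up its master partitions.
asinfo -h 10.0.0.2 -v 'recluster:'

# Watch migrations on the quiesced node; once the partitions-remaining
# counters reach 0, it is safe to stop the Aerospike service for maintenance.
asinfo -h 10.0.0.1 -v 'namespace/test' | tr ';' '\n' | grep partitions_remaining
```

Clients keep talking to the newly promoted masters throughout, which is why they see no transaction timeouts with this method.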

@pgupta currently we have timeout 10 and interval 250. Also, the Aerospike client has a 5000 ms timeout.

A timeout of 10 with an interval of 250 ms puts you at more like 3+ to 4+ seconds for the cluster to detect that a node died. That, combined with the client timeout, perhaps explains the higher number you are seeing, but nodes shouldn't just be dying suddenly.
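As a rough sanity check, the arithmetic above can be sketched in Python. The one-second slack term and the shape of the formula are assumptions for illustration, not Aerospike's actual internal detection logic:

```python
def worst_case_downtime_s(hb_interval_ms, hb_timeout, tend_interval_s=1.0, slack_s=1.0):
    """Back-of-the-envelope worst-case client-visible downtime after a node death.

    Assumes the cluster needs roughly (timeout * interval) of missed heartbeats
    plus some slack before expelling the node, and that the client picks up the
    new partition map within one tend interval.
    """
    detection_s = hb_interval_ms * hb_timeout / 1000.0  # missed-heartbeat window
    return detection_s + slack_s + tend_interval_s

# Default settings (interval 150 ms, timeout 10):
print(worst_case_downtime_s(150, 10))  # 3.5
# The poster's settings (interval 250 ms, timeout 10):
print(worst_case_downtime_s(250, 10))  # 4.5
```

The estimates line up with the 3+ to 4+ second range quoted above; the observed 12 seconds would then come mostly from the 5000 ms client timeout stacking on top of retries.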

If you are bringing nodes down for maintenance or testing, use the quiescing method and clients will not see any transaction timeouts. So which one is your issue: sudden node death, or an operator bringing the node down?

Thanks @pgupta for the explanation. We updated the settings to interval 100 and timeout 10; we will test again and keep you posted.

Please note, making the numbers too tight is also counter-productive, as it increases the probability of the cluster falling apart due to network latency, glitches, etc. in the heartbeat signal.

This is a good FAQ on the topic: FAQ What is the Quantum Interval and how does it affect cluster reformation time?