Investigating cluster integrity faults caused by network stability issues


Summary

This article describes how to investigate Paxos-related cluster integrity issues in Aerospike versions prior to 3.7.x.

Background

There are known issues with the Paxos implementation in Aerospike versions prior to 3.7.x. As a consequence, when a node leaves a cluster but stays running (usually as a result of a network issue), it may not be able to re-join the cluster without external help. The purpose of this article is to describe a method for investigating that scenario.

Identifying the Issue

There are two ways to observe the issue. Firstly, the output of the ‘info’ command in ‘asadm’ will show that the cluster has split, with one node remaining outside the cluster.
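
A quick way to run this check non-interactively, assuming asadm is installed on one of the cluster nodes (the exact output columns vary between tools versions), would be:

$ asadm -e "info"
# Compare the cluster size / visibility reported by each node; a node that has
# dropped out of the cluster will report a smaller cluster size than its peers.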

Secondly, the following string will be found in the aerospike.log of nodes in the cluster.

Dec 07 2015 09:24:41 GMT: INFO (paxos): (paxos.c::2272) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb9e691f4902500,bb982a8f4902500,bb97691f4902500,bb95635f5902500

This is an INFO level message because the cluster is still running despite the problem with the node.
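
A simple way to locate these entries, assuming the default log location, would be:

$ grep "CLUSTER INTEGRITY FAULT" /var/log/aerospike/aerospike.log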

Root Cause Analysis

Around the time of the cluster integrity fault, we look for indications of why the nodes lost contact. If the nodes are all still up and running, the most likely cause is a temporary network issue. We look for messages indicating that inter-node communication has been impacted. Strings to look for here are ‘dead’, ‘fabric’ and, in particular, ‘hb expires’, as we are looking for loss of contact on the fabric layer that nodes use to communicate within the cluster. An example would look as follows:

Dec 07 2015 09:24:40 GMT: INFO (hb): (hb.c::2181) hb expires but fabric says DEAD: node bb9f0a74f902500

The above message lists a node ID. By looking for this message in the aerospike.log of every node in the cluster, we can ascertain whether one particular node has the issue.
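
One way to check this on each node, again assuming the default log location, is to count the heartbeat expiry messages per node ID:

$ grep "hb expires but fabric says DEAD" /var/log/aerospike/aerospike.log | awk '{print $NF}' | sort | uniq -c
# Run this on every node; if the same node ID dominates the output everywhere,
# that node is the one that has lost contact with the rest of the cluster.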

We can then look at latencies to confirm our suspicions. The following log entries from the same node show that transactions start to pile up around the time the fabric starts to report heartbeats expiring (note the increase in wr in the second log entry).

Dec 07 2015 09:24:32 GMT: INFO (info): (thr_info.c::4519)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (2966, 18920448, 18917482) : hb (5, 330, 325) : fab (103, 3661, 3558)
Dec 07 2015 09:24:42 GMT: INFO (info): (thr_info.c::4519)    trans_in_progress: wr 470 prox 0 wait 145 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (2313, 18921624, 18919311) : hb (1, 330, 329) : fab (91, 3705, 3614)
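
These entries can be extracted from the log in the same way, assuming the default log location:

$ grep "trans_in_progress" /var/log/aerospike/aerospike.log
# Watch the 'wr' and 'wait' counters; a sudden jump indicates write transactions
# queuing up while the fabric connection to the missing node is unavailable.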

We can see the increase in latency around the time of the suspected network issue in the writes_master histogram.

               % > (ms)
slice-to (sec)      1      2      4      8     16     32     64    128    256    512  ops/sec
-------------- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ --------
09:24:21    10   0.48   0.18   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00    497.8
09:24:31    10   0.63   0.13   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00    671.1
09:24:41    10   6.79   6.51   6.43   6.40   6.40   6.37   6.35   6.29   6.21   6.12    359.3
09:24:51    10   0.45   0.12   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00    511.3
09:25:01    10  13.57   7.40   2.41   0.26   0.00   0.00   0.00   0.00   0.00   0.00    655.2
09:25:11    10  15.61   7.88   2.40   0.39   0.02   0.00   0.00   0.00   0.00   0.00    516.3
-------------- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ --------
     avg         6.25   3.70   1.87   1.18   1.07   1.06   1.06   1.05   1.03   1.02    535.0
     max        15.61   7.88   6.43   6.40   6.40   6.37   6.35   6.29   6.21   6.12    671.1
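
A histogram such as the one above can be generated from the log with the asloglatency tool from the Aerospike tools package. The flags below are a sketch and may differ slightly between tools versions:

$ asloglatency -l /var/log/aerospike/aerospike.log -h writes_master -f "Dec 07 2015 09:24:00" -d 120 -t 10
# -h selects the histogram, -f the start time, -d the duration in seconds and
# -t the slice width in seconds.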

This will affect all nodes, not just the node that has left the cluster, as proxy writes to the missing node will start piling up.

Reads on the nodes that stayed within the cluster should not be affected, as reads do not involve the prole (replica) side.

The reads histogram from the same node would look as follows.

               % > (ms)
slice-to (sec)      1      2      4      8     16     32     64    128    256    512  ops/sec
-------------- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ --------
09:24:21    10   0.95   0.22   0.15   0.06   0.00   0.00   0.00   0.00   0.00   0.00   1814.9
09:24:31    10   0.53   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   1838.4
09:24:41    10   0.72   0.29   0.17   0.12   0.06   0.00   0.00   0.00   0.00   0.00   1442.4
09:24:51    10   0.48   0.14   0.07   0.03   0.00   0.00   0.00   0.00   0.00   0.00   2346.6
09:25:01    10   2.29   0.39   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   2332.7
09:25:11    10   0.59   0.09   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   2424.9
-------------- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ --------
     avg         0.93   0.19   0.07   0.03   0.01   0.00   0.00   0.00   0.00   0.00   2033.0
     max         2.29   0.39   0.17   0.12   0.06   0.00   0.00   0.00   0.00   0.00   2424.9

Conclusions

From the heartbeat expiry messages, the increase in transactions in progress on a specific node and the general increase in latency, we can identify a temporary network issue in the cluster.

Corrective Actions

Aerospike will normally self-heal cluster integrity faults when all nodes stay running. The problematic node will rejoin the cluster and data will re-balance. This will be transparent from an application point of view. In Aerospike releases prior to version 3.7.0, there is a bug in the Paxos implementation which means that, in certain circumstances, this automatic reformation does not happen. The cluster stays running, but the node that has experienced the issue stays outside the cluster.

The recommendation is for customers to upgrade to an Aerospike 3.7.x release or higher so that no explicit action has to be taken to re-form the cluster. This is of particular importance when Aerospike nodes are cloud based, such as on AWS or GCE. The automatic recovery behaviour is controlled with the following parameter:

paxos-recovery-policy=auto-reset-master

http://www.aerospike.com/docs/reference/configuration/#paxos-recovery-policy
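
As a sketch, this would typically be set in the service context of aerospike.conf (check the configuration reference linked above for the valid values and the default in your version):

service {
    # Allow the cluster to automatically re-form after a node drops out
    # because of a transient network fault.
    paxos-recovery-policy auto-reset-master
}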

If an upgrade is not possible, the dun and undun commands can be used to bring the missing node back into the cluster. These are documented at the following link:

http://www.aerospike.com/docs/tools/asadm/user_guide/
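
As an illustration only, the command suggested in the log message above could be issued on every node with asinfo, followed by the corresponding undun once the cluster has re-formed (the node list is taken from the example log entry; substitute the list reported in your own logs):

$ asinfo -v "dun:nodes=bb9e691f4902500,bb982a8f4902500,bb97691f4902500,bb95635f5902500"
$ asinfo -v "undun:nodes=bb9e691f4902500,bb982a8f4902500,bb97691f4902500,bb95635f5902500"
# Both commands should be run across all nodes, as instructed by the log message;
# refer to the asadm user guide linked above for the full procedure.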

Notes

– If a cluster experiences frequent temporary network issues, the network should be investigated to understand why these are happening.

– Increasing the heartbeat interval and timeout can reduce sensitivity to network glitches; however, these should not be set so high that a real network issue is masked (a configuration sketch follows these notes).

http://www.aerospike.com/docs/operations/configure/network/heartbeat/#multicast-heartbeat
http://www.aerospike.com/docs/operations/configure/network/heartbeat/#mesh-unicast-heartbeat

– When a node leaves a cluster due to a temporary network issue, it is expected that migrations will happen when it re-joins.
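
A hedged example of where the heartbeat interval and timeout live in aerospike.conf, assuming a mesh heartbeat configuration and the default values of interval 150 (milliseconds) and timeout 10 (missed intervals); the address is a placeholder and the heartbeat documentation linked above should be consulted for your topology:

network {
    heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.0.0.1 3002   # placeholder seed node
        interval 150   # milliseconds between heartbeats (default 150)
        timeout 10     # missed heartbeats before a node is declared dead (default 10)
    }
}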

