Case Study - Effect of network instability on partition availabilty in a Strongly Consistent Aerospike namespace

#1

Case Study: Effect of network instability on partition availability in a Strongly Consistent Aerospike cluster

Background

Customer was testing partition availability within a 4 node Aerospike cluster and focusing their attention on a namespace configured for Strong Consistency. The focus of the testing was to determine how node loss within the cluster, both graceful and unexpected, affected partition availability within the namespace. The customer raised a case when they found that with the cluster having a single node out, all partitions were unavailable. The customer was aware that loss of less than the replication factor number of nodes should not cause any loss of availability.

It was clear from the logs that the fourth node in the cluster remained running throughout the testing. Nodes 1-3 were shut down, sometimes by killing the process, sometimes by shutting down cleanly throughout the day. The initial belief of the customer was that a shutdown of the third node of the cluster at around 15:00 caused all partitions to become unavailable, this was not the case. The major question was, why didn’t the partitions become available again when the 3rd node rejoined the cluster.

Detail

The investigation started by looking at the cluster size in the logs for node 4 and from there, it became clear that the issue had, in fact, started at 14:32:

Apr 11 2019 14:32:38 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:32:48 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 2
Apr 11 2019 14:32:58 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1
Apr 11 2019 14:33:08 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1
Apr 11 2019 14:33:18 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1

It was known that node 2 went out of the cluster at 14:30 as the asd process was killed for testing purposes.

The root of the problem was elsewhere, however:

Apr 11 2019 14:32:37 GMT+0530: WARNING (socket): (socket.c:865) Error while connecting: 101 (Network is unreachable)
Apr 11 2019 14:32:37 GMT+0530: WARNING (socket): (socket.c:959) Error while connecting socket to X.X.X.X:3002
Apr 11 2019 14:32:37 GMT+0530: WARNING (hb): (hb.c:4882) could not create heartbeat connection to node {X.X.X.X:3002}
Apr 11 2019 14:32:37 GMT+0530: WARNING (socket): (socket.c:959) Error while connecting socket to X.X.X.Y:3002

Log messages for node 4 indicate that it was having trouble connecting to other nodes in the cluster due to network issues. When there is a clean shutdown of node 1 at 14:32 the expectation is that a 2 node cluster will remain, but instead, it drops to a 1 node cluster for some time.

Apr 11 2019 14:32:42 GMT+0530: INFO (as): (signal.c:194) SIGTERM received, starting normal shutdown

Observe cluster size changing from 3 to 2 to 1.

Apr 11 2019 14:30:38 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:30:48 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:30:58 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:31:08 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:31:18 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:31:28 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:31:38 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:31:48 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:31:58 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:32:08 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:32:18 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:32:28 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:32:38 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3
Apr 11 2019 14:32:48 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 2
Apr 11 2019 14:32:58 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1
Apr 11 2019 14:33:08 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1
Apr 11 2019 14:33:18 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1
Apr 11 2019 14:33:28 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1
Apr 11 2019 14:33:38 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1

Even before node 2 was killed the cluster did not have 4 nodes, though all the nodes would have been up and running. Looking at unavailable partitions from the node 4 log the following can be seen:

Apr 11 2019 14:32:39 GMT+0530: INFO (partition): (partition_balance_ee.c:1179) {customer_namespace} rebalanced: regime 45 expected-migrations (483,519,483) expected-appeals 0 unavailable-partitions 2071
Apr 11 2019 14:32:49 GMT+0530: INFO (partition): (partition_balance_ee.c:1179) {customer_namespace} rebalanced: regime 47 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096
Apr 11 2019 14:47:43 GMT+0530: INFO (partition): (partition_balance_ee.c:1179) {customer_namespace} rebalanced: regime 51 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096
Apr 11 2019 15:03:04 GMT+0530: INFO (partition): (partition_balance_ee.c:1179) {customer_namespace} rebalanced: regime 55 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096

The namespace starts to lose partition availability as soon as 2 nodes are missing. As the cluster has a replication factor of 2 this is expected. It is not possible to have a majority and so a partition can only be available if the roster master is in the remaining sub-cluster. At this point 2071 partitions remain unavailable. Then the network drops and the cluster becomes a 1 node singleton. Node 4 is alone and, of course, everything is now unavailable. If there are less than replication factor of nodes within a sub-cluster, clients cannot write 2 copies of a record as mandated by the replication factor of 2 and so everything is unavailable.

Apr 11 2019 14:33:38 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 1

Leads to:

Apr 11 2019 14:32:49 GMT+0530: INFO (partition): (partition_balance_ee.c:1179) {customer_namespace} rebalanced: regime 47 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096

The question remains, why didn’t the partitions come back when the nodes started to re-join the cluster?

When node 4 became a singleton, something relatively subtle happened. The node cannot know if it is the only node left after a shutdown or whether is part of a split brain. When another node arrives it effectively ‘rejoins’ the cluster. When a node joins a cluster it always comes back with all its partitions marked as subset since it may have missed some updates, so at this point, both nodes 4 and 3 have subset partitions. The problem now is that the cluster needs to get to a super-majority (fewer than replication factor number of nodes missing). In normal circumstances this would all be resolved when node 1 restarts and rejoins the cluster at 14:36, as then the cluster would have super-majority but the network causes a problem and the cluster cannot get to the required 3 nodes, looking at cluster size again:

Apr 11 2019 14:38:14 GMT+0530: INFO (clustering): (clustering.c:5799) applied cluster size 2

At this point there is a node on its own and a 2 node cluster. The cluster is unable to get its partitions back to a full state, so everything is unavailable.

Now node 2 joins back. The cluster size increases to 3. There should now be a super-majority.

Apr 11 2019 14:47:48 GMT+0530: INFO (info): (ticker.c:160) NODE-ID ABC CLUSTER-SIZE 3

The problem is though, nodes 2 and 3 went down without a clean shutdown, so the storage cannot be trusted. The node comes back with the ‘e’ (evade) flag set. As part of a graceful shutdown the last task a node performs is to set a flag on the storage to indicate that the data there is trustworthy (all pending writes have flushed to disk). When a node starts up it looks for this flag and if it does not find it the node can infer that the shutdown was unclean and that, potentially, writes could have been lost. The node is then marked with the ‘e’ flag as the data it holds may not be fully consistent.

Apr 11 2019 10:49:22 GMT+0530: INFO (drv_ssd): (drv_ssd.c:3295) {customer_namespace} device /home/customer_data/aerospike/data/APP_DATA/customer_namespace.dat prior shutdown not clean
Apr 11 2019 10:49:22 GMT+0530: INFO (drv_ssd): (drv_ssd_ee.c:1378) {customer_namespace} setting partition version 'e' flags

To all intents and purposes this means that the node cannot count towards the supermajority and so all partitions are still fully unavailable.

The logs show revive commands going on to try and resolve the situation.

aerospike_NODE_4.log:Apr 11 2019 15:21:35 GMT+0530: INFO (security): (security.c:5518) permitted | client: 10.32.129.69:46654 | authenticated user: superuser | action: revive | detail: namespace=customer_namespace
aerospike_NODE_4.log:Apr 11 2019 15:21:35 GMT+0530: INFO (info): (thr_info.c:3905) revive: complete - issue 'recluster:' command

However as the commands are running ity is evident that the cluster size is very unstable. At this point the partitions are unavailable rather than dead. A dead partition is a partition that is unavailable when the whole roster is present. Due to the network drops, the cluster never returns to the full roster of 4 nodes and so the partitions are unavailable rather than dead.

Apr 11 2019 15:21:41 GMT+0530: INFO (clustering): (clustering.c:5799) applied cluster size 3
Apr 11 2019 15:28:23 GMT+0530: INFO (clustering): (clustering.c:5799) applied cluster size 2
Apr 11 2019 15:33:57 GMT+0530: INFO (clustering): (clustering.c:5799) applied cluster size 3

When node 3 is finally restarted at 16:39 the cluster is stable at 4 nodes without dead partitions and has full availability. This is because node 3 can now count towards a super majority as it was shut down cleanly.

Apr 11 2019 16:39:38 GMT+0530: INFO (partition): (partition_balance_ee.c:1179) {customer_namespace} rebalanced: regime 65 expected-migrations (3256,2875,2527) expected-appeals 2203 unavailable-partitions 0

Essentially, the issue started when a network fluctuation took the cluster down from the expected cluster size of 2 to a singleton. Throughout the testing period the unplanned cluster events continued to demonstrate what, at first, must have seemed like very odd behaviour.

Key Points of Note

  • Aerospike relies on a stable network connection for the cluster to remain properly formed.
  • Nodes with the ‘e’ flag set on start up cannot count towards the ‘rules’ of availability within Strong Consistency
  • Even with a stable network, if more than replication factor of nodes shutdown, partition availability will be compromised
  • The revive command is provided to counter this, if an operator can be certain that the nodes were shutdown within seconds of each other, they can be considered safe to revive partitions on.
  • Strong Consistency will ensure that writes will not be lost, even if it means making partitions unavailable to achieve this.

Timestamp

4/18/2019