Case Study - Using roster changes to protect against potential unavailability during maintenance windows

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Background

This case concerned a rack-aware cluster consisting of 2 racks: rack A, which contained 3 nodes, and rack B, which contained 2 nodes. The cluster hosted 3 strong-consistency enabled namespaces. The events in question took place during a maintenance window in which rack B was taken down; during this window a node was unexpectedly lost from rack A due to an OS issue. This article examines the timeline of events and how the situation could have been properly managed.

Detail

The maintenance window started at 09:05 when the second rack, rack B, was shut down for maintenance. The cluster noticed this as expected and the succession list changed to reflect the new cluster membership. No changes were made to the roster.

Jul 15 2019 09:05:18 GMT: INFO (clustering): (clustering.c:5797) applied new succession list a101 a102 a103
Jul 15 2019 09:05:18 GMT: INFO (clustering): (clustering.c:5799) applied cluster size 3

The logs show that the 3 remaining nodes within rack A are migrating. The cluster has a replication factor of 2. As the cluster is rack aware, even though an entire rack has been shut down, it is certain that either a roster master or a roster replica node for every partition exists in the remaining rack. A roster master or roster replica is defined as a node on which a partition would reside when the roster is fully present. The rack-aware nature of this cluster means that one rack will always have either the roster master or the roster replica for a given partition, but not both. The migrations happening within rack A are creating data replicas to satisfy the replication factor of 2. These data replicas are an alternate copy of the data; however, they do not count towards the main partition availability rules. These rules are documented here; for brevity, the key rules at a high level (not covering special cases) are as follows, with a quick way of observing their effect sketched after the list:

  • If a sub-cluster (a.k.a. split-brain) has both the roster master and the roster replica for a partition, then the partition is active for both reads and writes in that sub-cluster.
  • If a sub-cluster has a majority of nodes and has either the roster master or the roster replica for the partition within its component nodes, the partition is active for both reads and writes in that sub-cluster.
  • If a sub-cluster has exactly half of the nodes in the full cluster (roster) and it has the roster master within its component nodes, the partition is active for both reads and writes.

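The practical effect of these rules can be observed at any time from the per-namespace statistics. The sketch below is a minimal example; it assumes asinfo is run against a node in the (sub-)cluster and uses MAIN_NAMESPACE, one of the namespace names appearing in the log excerpts in this article.

# Report how many partitions are currently unavailable or dead for a strong-consistency namespace.
# asinfo defaults to the local node (localhost:3000); the output is a list of key=value pairs.
asinfo -v "namespace/MAIN_NAMESPACE" | tr ';' '\n' | grep -E '^(unavailable_partitions|dead_partitions)='
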
Migrations finish in short order as shown in the logs:

Jul 15 2019 09:05:18 GMT: INFO (exchange): (exchange.c:3154) data exchange completed with cluster key ce5163af5af8
Jul 15 2019 09:05:20 GMT: INFO (info): (ticker.c:454) {MAIN_NAMESPACE} migrations: remaining (1315,1439,1315) active (0,0,0) complete-pct 0.04
Jul 15 2019 09:05:20 GMT: INFO (info): (ticker.c:454) {COMPUTE_NAMESPACE} migrations: remaining (943,1439,944) active (0,0,0) complete-pct 13.54
Jul 15 2019 09:05:20 GMT: INFO (info): (ticker.c:454) {RESOLVE_NAMESPACE} migrations: remaining (1316,1439,1316) active (0,0,0) complete-pct 0.00
Jul 15 2019 09:05:30 GMT: INFO (info): (ticker.c:454) {MAIN_NAMESPACE} migrations: remaining (680,1439,680) active (0,0,0) complete-pct 23.09
Jul 15 2019 09:05:30 GMT: INFO (info): (ticker.c:454) {COMPUTE_NAMESPACE} migrations: remaining (260,1223,261) active (0,1,0) complete-pct 46.17
Jul 15 2019 09:05:30 GMT: INFO (info): (ticker.c:454) {RESOLVE_NAMESPACE} migrations: remaining (680,1439,680) active (0,0,0) complete-pct 23.09
Jul 15 2019 09:05:40 GMT: INFO (info): (ticker.c:454) {MAIN_NAMESPACE} migrations: remaining (0,168,0) active (0,0,0) complete-pct 93.90
Jul 15 2019 09:05:40 GMT: INFO (info): (ticker.c:457) {COMPUTE_NAMESPACE} migrations: complete
Jul 15 2019 09:05:40 GMT: INFO (info): (ticker.c:457) {RESOLVE_NAMESPACE} migrations: complete
Jul 15 2019 09:05:50 GMT: INFO (info): (ticker.c:457) {MAIN_NAMESPACE} migrations: complete

The sub-cluster in rack A has a majority and contains either a roster master or a roster replica for every partition. All partitions are available. At this point, as migrations have finished, the correct course of action to ensure operational integrity would be to reset the roster so that it is reduced from the original 5 nodes to the 3 nodes in rack A only. This would be done with the roster-set and recluster commands, as sketched below.
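
A minimal sketch of that sequence is shown below. It uses the placeholder node names from the log excerpts (a101, a102, a103); on a real cluster roster-set takes the actual node IDs reported by the roster: info command, and the roster-set command must be issued for each strong-consistency namespace before a single recluster.

# Show the current roster, pending roster and observed nodes for one namespace.
asinfo -v "roster:namespace=MAIN_NAMESPACE"

# Stage a roster containing only the rack A nodes (repeat for each strong-consistency namespace).
asinfo -v "roster-set:namespace=MAIN_NAMESPACE;nodes=a101,a102,a103"

# Apply the pending roster across the cluster.
asinfo -v "recluster:"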

NOTE: Setting the roster to the 3 nodes in rack A could even have been done immediately after rack B was taken down for maintenance, prior to migrations finishing. This would have ensured:

  • full availability in the event of a node loss after migrations complete.
  • partial availability if a node in rack A leaves the cluster prior to migrations completing.

The next event was that a node in rack A, node a102, restarted very suddenly due to an OS issue unconnected to the maintenance being carried out on rack B. The restart was so abrupt that the node did not have time to write any shutdown log messages; only the restart itself is shown. The relevant log messages are shown below. The connection errors are from node a102 attempting to connect to the nodes in rack B.

Jul 15 2019 10:35:11 GMT: WARNING (socket): (socket.c:900) (repeated:134) Error while connecting: 111 (Connection refused)
Jul 15 2019 10:35:11 GMT: WARNING (socket): (socket.c:900) (repeated:3) Error while connecting: 113 (No route to host)
Jul 15 2019 10:35:11 GMT: WARNING (socket): (socket.c:891) (repeated:64) Timeout while connecting
Jul 15 2019 10:38:38 GMT: INFO (as): (as.c:317) <><><><><><><><><><>  Aerospike Enterprise Edition build 4.5.3.2  <><><><><><><><><><>

At this point, all partitions in the rack A sub-cluster became unavailable. This happened because the sub-cluster formed by the nodes remaining in rack A was not able to satisfy the availability rules given above. With a roster of 5 nodes (as the roster had not been reset after rack B was shut down), the 2 nodes in the sub-cluster could only have available partitions if they contained both the roster master and the roster replica for those partitions. This was not the case: if a roster master existed in rack A, then the roster replica would be in rack B. If the roster had been reset to contain only the rack A nodes after migrations had finished, all the partitions would have been fully available at this point.

Node a102 rejoined the sub-cluster very quickly; however, the partitions were still unavailable.

Jul 15 2019 10:35:42 GMT: INFO (partition): (partition_balance_ee.c:1244) {MAIN_NAMESPACE} rebalanced: regime 75 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096
Jul 15 2019 10:35:42 GMT: INFO (partition): (partition_balance_ee.c:1244) {COMPUTE_NAMESPACE} rebalanced: regime 75 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096
Jul 15 2019 10:35:42 GMT: INFO (partition): (partition_balance_ee.c:1244) {RESOLVE_NAMESPACE} rebalanced: regime 75 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096

On checking the logs, one reason for this became apparent. When shutting down a strong-consistency namespace, the last thing Aerospike does is write a flag to the storage indicating that all uncommitted writes have been flushed. On startup, if the node cannot read this flag from the storage, it must assume that the data on the drives cannot be trusted and marks the partition versions with an 'e' (evade) flag. The net effect is that this node cannot count towards availability. The relevant messages are shown below:

Jul 15 2019 10:38:38 GMT: INFO (drv_ssd): (drv_ssd.c:3002) {MAIN_NAMESPACE} device /dev/sda prior shutdown not clean
Jul 15 2019 10:38:38 GMT: INFO (drv_ssd): (drv_ssd.c:3002) {MAIN_NAMESPACE} device /dev/sdb prior shutdown not clean
Jul 15 2019 10:38:38 GMT: INFO (drv_ssd): (drv_ssd_ee.c:1457) {MAIN_NAMESPACE} setting partition version 'e' flags

A point to note here is that none of these partitions are dead. A partition is defined as dead if it is unavailable when the whole roster of nodes is present. The roster in this case still consisted of 5 nodes as it still included the nodes in rack B. The partitions here were therefore unavailable, not dead, and running the revive command would not have restored availability in this scenario.
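
For reference, a sketch of the revive sequence is given below; it is included only to show what would apply if the partitions had actually been reported dead, which, as explained above, was not the case here.

# revive only applies to partitions reported as dead while the full roster is present;
# in this scenario the partitions were merely unavailable, so this would not have helped.
asinfo -v "revive:namespace=MAIN_NAMESPACE"
asinfo -v "recluster:"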

There is a more subtle point at play here: even if node a102 had shut down cleanly, the net effect would have been the same. When node a102 left the sub-cluster, rack A became fully unavailable as it could not satisfy the availability rules. When node a102 returned after a clean shutdown it still could not count towards a majority. Why is this? Because the roster had not been reset to contain only the 3 rack A nodes. Aerospike has a shared-nothing architecture; there is no central place where the status of the cluster is kept. This makes it very flexible, but it also means that a node does not know whether another node is down or is in another sub-cluster taking writes. This is the real root of the issue. Even in the event of a clean shutdown of node a102, there are nodes in the roster (the nodes in rack B) that a102 could form a sub-cluster with and take writes. The operator knows that rack B is shut down, but the remaining nodes in rack A do not know this; all they know is that they cannot reach the nodes in rack B and cannot reach node a102. When a102 rejoins, its partitions are marked as being a subset (as any node starting up would be), and the 2 other nodes in rack A have also been downgraded to subset, which causes this 3-node sub-cluster to be fully unavailable.

The only way to ensure all nodes are consistent at this point is to get a super majority (no more than replication factor - 1 roster nodes missing). There are 2 ways to do this (a verification sketch follows the list):

  1. Start up rack B - which was the action the customer took.
  2. Reset the roster to contain only the rack A nodes. The operator knew node a102 was down and, even though the shutdown was not clean, any writes that had not succeeded on 2 nodes (replication factor 2) would have been returned to the client with an error, so the cluster would have been consistent.

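Whichever option is taken, a quick verification sketch (using the same statistic names as earlier) is to confirm the cluster size and that no partitions remain unavailable in each strong-consistency namespace:

# Expected cluster size: 3 if the roster was reset to the rack A nodes, 5 once rack B has rejoined.
asinfo -v "statistics" | tr ';' '\n' | grep '^cluster_size='

# Should report 0 for every strong-consistency namespace once availability is restored.
asinfo -v "namespace/MAIN_NAMESPACE" | tr ';' '\n' | grep '^unavailable_partitions='
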
The logs show the rack B nodes starting up and joining the cluster:

Jul 15 2019 12:38:06 GMT: INFO (clustering): (clustering.c:5797) applied new succession list b102 a101 a102 a103
Jul 15 2019 12:38:06 GMT: INFO (clustering): (clustering.c:5799) applied cluster size 4
Jul 15 2019 12:38:56 GMT: INFO (clustering): (clustering.c:5795) applied new cluster key a826589053b5
Jul 15 2019 12:38:56 GMT: INFO (clustering): (clustering.c:5797) applied new succession list b101 b102 a101 a102 a103
Jul 15 2019 12:38:56 GMT: INFO (clustering): (clustering.c:5799) applied cluster size 5

As the cluster reforms (first with 4 nodes, then with all 5), availability is restored once the full roster is present:

Jul 15 2019 12:38:07 GMT: INFO (partition): (partition_balance_ee.c:1244) {MAIN_NAMESPACE} rebalanced: regime 177 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096
Jul 15 2019 12:38:07 GMT: INFO (partition): (partition_balance_ee.c:1244) {COMPUTE_NAMESPACE} rebalanced: regime 177 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096
Jul 15 2019 12:38:07 GMT: INFO (partition): (partition_balance_ee.c:1244) {RESOLVE_NAMESPACE} rebalanced: regime 177 expected-migrations (0,0,0) expected-appeals 0 unavailable-partitions 4096
Jul 15 2019 12:38:56 GMT: INFO (partition): (partition_balance_ee.c:1244) {MAIN_NAMESPACE} rebalanced: regime 179 expected-migrations (2755,2113,1974) expected-appeals 0 unavailable-partitions 0
Jul 15 2019 12:38:56 GMT: INFO (partition): (partition_balance_ee.c:1244) {COMPUTE_NAMESPACE} rebalanced: regime 179 expected-migrations (2755,2113,1974) expected-appeals 0 unavailable-partitions 0
Jul 15 2019 12:38:56 GMT: INFO (partition): (partition_balance_ee.c:1244) {RESOLVE_NAMESPACE} rebalanced: regime 179 expected-migrations (2755,2113,1974) expected-appeals 0 unavailable-partitions 0

Key Points of Note

  • Aerospike cannot tell the difference between a node in another sub-cluster and a node that is down.
  • Resetting the roster after shutting down rack B and allowing migrations to finish would have guarded against any unavailability.
  • Resetting the roster after shutting down rack B but prior to migrations finishing would have guarded against unavailability whenever all 3 nodes in rack A were present (availability would have been temporarily and partially lost if one of the rack A nodes departed before migrations completed, until it joined back).
  • Resetting the roster upon node a102 returning to the cluster would have restored availability.
  • The roster could be reset safely even though a102 did not shut down cleanly.
  • When a rack is shut down, migrations will happen to create data replica partitions whether or not the roster is reset.
  • Data replica partitions are copies of partitions created to satisfy the replication factor within a sub-cluster; they do not count towards availability.

Timestamp

October 2019