FAQ: What happens if my cluster suffers a node failure while I have nodes quiesced?
If nodes suffer a failure during the time that other node(s) in the cluster are quiesced, how will the cluster react?
In a small cluster, quiescing a node may appear risky should another node in the cluster suffer an issue (e.g. a hardware failure) while a different node is quiesced. For example, in a 3 node cluster, quiescing one node and then losing another node to a hardware failure will result in a single active node.
Using a 3 node AP cluster with Replication Factor 2 (RF=2), this example shows how the cluster will react to this situation in order to attempt to protect your data and stay available as best it can.
To start with, we quiesce a single node (by issuing the relevant quiesce and recluster commands) and wait for migrations to complete. At this point, as expected, our quiesced node holds 0 partitions as primary or secondary (it still has the data, though) and the other 2 nodes hold half (2048) of the primary partitions and half of the secondary partitions each:
    Admin> show pmap
    Partition Map Analysis (2020-03-20 12:25:23 UTC)
         Cluster   Namespace                Node      Primary    Secondary         Dead   Unavailable
             Key           .                   .   Partitions   Partitions   Partitions    Partitions
    8BD2915100A4        test   4302877c872c:3000            0            0            0             0
    8BD2915100A4        test     172.17.0.5:3000         2048         2048            0             0
    8BD2915100A4        test     172.17.0.4:3000         2048         2048            0             0
If we now lose one of the 2 active nodes, we will be left with one active node and one quiesced node. This means we don’t have enough active nodes to meet our Replication Factor of 2. In this instance the quiesced node effectively moves up the succession list: it remains quiesced, but in order to satisfy RF=2 it is used to store the secondary copy of all partitions:
    Admin> show pmap
    Partition Map Analysis (2020-03-20 12:25:55 UTC)
         Cluster   Namespace                Node      Primary    Secondary         Dead   Unavailable
             Key           .                   .   Partitions   Partitions   Partitions    Partitions
    CF62B8E1F434        test   4302877c872c:3000            0         4096            0             0
    CF62B8E1F434        test     172.17.0.4:3000         4096            0            0             0
Note: In this state, the quiesced node could still serve read transactions if the client policy allows reads from non-master replicas.
If we now lose the remaining active node, leaving us with just the quiesced node, that node will have no choice but to promote its secondary partitions to primary and serve client traffic in order to keep the cluster available. It will, however, remember its quiesced status, and once nodes return to the cluster it will go back to being a quiesced node as it was before:
    Admin> show pmap
    Partition Map Analysis (2020-03-20 12:28:15 UTC)
         Cluster   Namespace                Node      Primary    Secondary         Dead   Unavailable
             Key           .                   .   Partitions   Partitions   Partitions    Partitions
    D80E118573D6        test   4302877c872c:3000         4096            0            0             0
In summary, quiescing a node simply moves that node to the far end of the succession list for all the partitions it owns. Therefore, depending on the effective replication factor and the number of nodes in the cluster, a quiesced node may still end up holding partitions that are ‘active’ in the cluster.
- Not all of the above will apply in a Strong Consistency cluster as the size of the cluster and SC rules will apply first.
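The AP-mode behaviour walked through above can be reproduced with a small toy model. This is only an illustrative sketch, not Aerospike's actual partition-balancing algorithm: the function names `succession` and `partition_map` are hypothetical, and rotating the active nodes by partition ID is an assumption standing in for the real per-partition ordering. The one rule it takes from the article is that quiesced nodes sit at the end of every succession list, and that the first RF nodes in the list hold the primary and secondary copies.

```python
from collections import Counter

N_PARTITIONS = 4096  # partitions per namespace, as in the pmap output above


def succession(pid, nodes, quiesced):
    """Toy succession list for one partition: active nodes first, quiesced last."""
    active = sorted(n for n in nodes if n not in quiesced)
    parked = sorted(n for n in nodes if n in quiesced)
    if active:
        # Assumption: rotate by partition ID so ownership is spread evenly.
        shift = pid % len(active)
        active = active[shift:] + active[:shift]
    return active + parked


def partition_map(nodes, quiesced, rf=2):
    """Return {node: (primary_count, secondary_count)} for nodes in the cluster."""
    primary, secondary = Counter(), Counter()
    for pid in range(N_PARTITIONS):
        # The first `rf` nodes in the list hold copies (fewer if nodes are missing).
        replicas = succession(pid, nodes, quiesced)[:rf]
        primary[replicas[0]] += 1
        for node in replicas[1:]:
            secondary[node] += 1
    return {n: (primary[n], secondary[n]) for n in nodes}


Q, A, B = "4302877c872c:3000", "172.17.0.4:3000", "172.17.0.5:3000"

print(partition_map([Q, A, B], {Q}))  # Q: (0, 0); A and B: (2048, 2048) each
print(partition_map([Q, A], {Q}))     # Q: (0, 4096); A: (4096, 0)
print(partition_map([Q], {Q}))        # Q: (4096, 0)
```

The three calls correspond to the three `show pmap` snapshots: all nodes present, one active node lost, and both active nodes lost. In each case the quiesced node only picks up partitions because there are too few active nodes ahead of it in the list.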
QUIESCE FAILURE SHUTDOWN PARTITIONS