Split Brain detection and impact

How to detect if a cluster did go through a split brain situation

Split brain is a state where a cluster splits into multiple clusters of smaller sizes. This article covers the situation for Available and Strong-Consistency modes in Aerospike.

Detecting a split brain situation

Here are the main symptoms of a cluster that has split:

  1. Check for cluster_size on each node. If it is less than the expected cluster size on some nodes, it would indicate that some nodes have left the cluster or formed a sub-cluster, potentially with other nodes.

Note: A single node departing the cluster is not really considered a split brain. Even if the node is still alive, there are safe guards against the node taking ownership of all the partitions (node will switch to what is called orphan state) and clients should recognize this situation and will not submit transactions to such ‘orphaned’ node.

  1. Grep for the keywords “departed” and/or “applied cluster size” in the logs, this will indicate that a node has departed the cluster and a new cluster size has been applied.

For example:

Sep 26 2018 06:51:20 GMT: INFO (fabric): (fabric.c:2486) fabric: node bb9f2054e2ac362 departed
...
Sep 26 2018 06:51:21 GMT: INFO (clustering): (clustering.c:5808) applied cluster size 2

This indicates which node departed and the effective cluster_size after that.

  1. Check for rebalance and migrations. Any cluster change would trigger a rebalance followed by migrations (redistribution of the partitions across the nodes in the cluster).
{ns_name} rebalanced: expected-migrations (1215,1224) expected-signals 1215 fresh-partitions 397
...
{ns_name} migrations: remaining (654,289,254) active (1,1,0) complete-pct 88.49

Or, for strong-consistency enabled namespaces (notice the unavailable_partitions statistic):

{ns_name} rebalanced: regime 295 expected-migrations (826,826) expected-signals 826 expected-appeals 0 unavailable-partitions 425

Refer to the monitoring migrations doc and FAQ - Monitoring Migrations knowledge base article.

After effects of split brain for AP namespaces (non strongly consistent)

One of the potential issue with a split brain situation in namespaces without strong-consistency enabled is the creation of fresh partitions in sub-clusters. A fresh partition is a partition missing in a sub-cluster (both the master and replica(s) nodes owning such partitions are not present in the sub-cluster), causing a new ‘fresh’ one to be instantiated. This is an issue as it could cause inconsistencies when the cluster fully reforms. This can be checked through this log line:

{ns_name} rebalanced: expected-migrations (1215,1224) expected-signals 1215 fresh-partitions 397

The writes (updates turning into inserts for example) on such fresh partitions would be ‘conflict resolved’ based on the configured conflict-resolution-policy when the cluster reforms.

If the conflict-resolution-policy is set to generation, which is the case by default, the records with the higher generation will win. This of course may not be the most recent version of the record, rather simply the version of the record that has been updated the most times. For keeping the most version of the record, the conflict-resolution-policy must be set to last-update-time.

The use case / application will dictate the best value to use. If historical data (record with multiple bins updated over time) is more important than the most recent update, then generation should be used to keep the version of the record with the most writes. If, on the other hand, records store only current state and no history, the last-update-time policy may be preferable.

After effects of split brain for SC namespaces (strongly consistent)

In SC clusters, based on the SC rules, some partitions may become unavailable in a sub cluster. When the cluster reforms, full availability would be restored and there wouldn’t be any inconsistencies for any of the data written during the cluster split. Having said that, if more than replication-factor number of nodes had their storage replaced at once, the revive commands may be necessary to recover full availability. Consistency guarantees would then depend on how close to each other nodes would have failed for example, and whether the commit-to-device configuration parameter was leveraged.

Impacts to Clients

Aerospike Clients use seed list node IP’s to discover the cluster. It is important to understand that clients simply tend (or ask) each node in a cluster for the partitions it owns, along with the qualification for each partition (master, replica 1, replica 2, etc, based on the replication factor).

Clients tending each node in the cluster (typically at the default 1 second interval) means the clients would have a delay of up to 1 second in terms of the nodes partition ownership. Nodes would proxy transactions to the right node to accommodate this unavoidable delayed view of the partition map. See Client Library Capabilities and Data Replication and Synchronization for more details.

Below are some concrete Q&A examples for a 5-node cluster (A,B,C,D,E) split into a 2-node (A,B) sub-cluster and a 3-node (C,D,E) one, considering the different situations in AP and SC.

Will each sub-cluster now send a new partition map to clients connected to them? Example: nodes A/B send the partition map to some of the clients and nodes C/D/E send the partition map to another set of clients?

AP Mode: Based on the above, each client will independently tend each node it is able to connect to and retrieve the partition ownership of each of those nodes. If the a client was connected to the cluster while it was a full 5 node and can still connect to all 5 nodes, even though the cluster itself has split, the client will continue tending to all 5nodes. This would yield to some sort of partition ownership “flip-flop” as both sub-clusters will have all partitions available. For example, let’s assume that partition #15 was owned by node A as master and node C as replica 1. After the split, node A would still own it as master (and now node B would hold the replica, which would start filling up through migrations) but node C would now also own that partition as master with either node D or E as a replica. As the client successfully and continuously tend all nodes, it will see partition #15 as owned by node A and node C as master successively and based on the time of tending those 2 nodes, read and write transactions would successively be issued against node A and node C. One can then easily understand the potential inconsistencies resulting from such situation. Even if partition #15 had its master and replica copies residing on nodes A and B, a fresh (empty) copy of this partition would be created in the subcluster formed by nodes C, D and E and the same flip-flopping would occur. Of course, in general, when a cluster split happens, it is likely that clients will only be able to connect to one of the sub-clusters only.

SC Mode: In this case, a given partition can only be available on one of the sub-clusters at most. The clients will therefore be able to construct a stable partition map and there wouldn’t be any flip-flopping possible in such case. If the client is able to connect and tend all nodes in the cluster, given that in a 2 way split all partitions will still be available on one of the sub-cluster, the clients will continue to operate just fine (review the SC partition availability rules. There may be a very short time (based on the configured tend interval – defaulting to 1 second) where the client attempt a transaction for a node on which a partition has become unavailable, which would be rectified on the subsequent tend cycle.

If a new client tries to connect to the cluster, would it go to the first node in its seed list (A,B,C,D,E)?

A seed list is used in the order that it is specified and the client will stop after the first node it is able to successfully connect to. It then gets the peer list from this node. Therefore, if a new client starts up while the cluster is split and could theoretically still connect to every node, it will only tend to the sub-cluster the first node it connects to is part of. So if node A is first in its seed list, the client will only connect to sub-cluster A/B, but if the first node in its seed list is one of C, D or E, it will only start tending to the sub-cluster C/D/E.

If a client tries to update or insert a record in SC configuration in the 3-node sub-cluster C/D/E, would the cluster create a copy of partition within the same cluster?

Yes, to retain the configured replication-factor created/updated records would get replicated in the same sub-cluster (assuming that partition was available of course). Migrations would also replicate any available partition in a sub-cluster.

XDR considerations

If a cluster being an XDR destination goes through a split brain situation, the same logic applies. The aftermath of an XDR cluster shipping to a cluster that has gone through a split brain will depend on:

It is recommended to consider configuring namespaces with strong-consistency for any use case sensible to such situation. For namespaces running with the strong-consistency mode, a split brain would never create fresh partitions, and will instead have some partitions potentially unavailable, causing XDR to re-log and try at a later time.

Keywords

SPLITBRAIN SPLIT BRAIN SUBCLUSTER

References

Timestamp

July 2020

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.