FAQ - What is the succession list?
When reading Aerospike documentation it is usual to come across references to the ‘succession list’ in the context of cluster behaviour, what does this term actually refer to?
When Aerospike takes a key it hashes that key into a 20 byte digest using the RIPEMD160 algorithm. 12 bits of the resultant digest will form the partition id. This implies the same key will always go to the same partition. There are two main algorithms used to determine which nodes own each partition. The default, uniform balance seeks to distribute partitions across nodes as evenly as possible to balance cluster traffic. The original algorithm sought to minimise data movement during cluster change however, in practice, most customers find uniform balance the optimum choice. As it is algorithmically derived, the node on which a partition resides is equally deterministic.
Note: the rack-aware feature also affects the partition distribution across nodes, but only for the non master copies.
For each partition there is an ordered list of cluster nodes. The first node in that list will hold the master partition copy, the second, the first replica copy, the third, the second replica copy and so on. This is the succession list as it determines the succession of nodes that would be holding the different copies of a given partition. For the original algorithm optimizing data movements during cluster changes (node addition or removal), the order of the succession list for a given partition never changes. For example, if a node holding a partition is removed from that list the partition will migrate to the next node to the right in that list. But for the default (as of version 4.7)
prefer-uniform-balance algorithm, the order of node may change, causing more data movement, but preserving the balance in terms of number of partition owned across nodes.
Consider the following rolling restart scenario. There are five nodes, A-E and replication factor is 2. Looking at a partition, X, the succession list, for ease, is A,B,C,D,E. In this example write load is constant and fill migrations are delayed. Data is persisted on disk. For the sake of the example, let’s also assume the ordering does not change when nodes leave or join the cluster (basically rack-aware and
prefer-uniform-balance are not used).
Nomenclature note: for brevity, we will simply call the first and second replica copies of a partition master and replica respectively. In proper terms, all copies of a partition are called replicas, which is why calling the specific second replica copy of a partition for a replication factor 2 namespace as replica is some sort of a nomenclature shortcut.
1. Steady state
2. Node A is shutdown
When A is shut down, B becomes master as it is next in the succession list (and already held a full copy of that partition). C becomes a replica holding node as well. It will initially not have any records for that partition, but as write transactions are processed, it will be the recipient of replica writes for that partition. When the configured
migrate-fill-delay expires, C will also start receiving fill migrations (records that were not updated or written since node A went down) in order to eventually get a second full copy of that partition.
3. A returns to the cluster
When A returns to the cluster, as nodes B and C could have taken writes, A does not own full copy of partition X and so comes back as the replica for that partition (we mark partition X as a subset partition on node A). C is no longer considered a replica copy for partition X (its records for that partition are considered non-replica) but the data from that partition on node C is not dropped until we have 2 (replication factor) copies of each records (which will happen when migrations for that partition complete and A reclaims the masterhood ownership for that partition).
4. As migrations are ongoing, B is now shut down
When B is shut down before the migration for partition X completes, both A and C are subset partitions. A did not yet receive full delta migrations and so could be missing some updates for partition X that came in while it was out of the cluster. C has the data A is missing however as
fill-migrations are delayed it does not have the original data that resides on A. Here the succession list becomes very important as it defines which node is the master. As A is first in the succession list, it becomes the master.
In other words, neither A or C has a full copy of that partition, but, between them, they do have all the latest updates for all the records for that partition. (If migration for partition X had completed before taking node B down, node A would have been master and node C would have dropped partition X).
Note: when a transaction for a record belonging to partition X has to be processed in this situation (where neither node A nor C has a full copy of the partition), the transaction will leverage duplicate resolution. For namespaces which are not
strong-consistency enabled, there are a few configuration parameters that dictate the behavior:
5. Node B returns to the cluster
When B returns to the cluster and steady state is restored, we end up with the same picture as initially:
Implications of the succession list
- The succession list is based on
node-idchanges then so will the succession list.
- If a node leaves the cluster and returns with a new
node-id, even if it is the same physical machine, extra migrations may occur as the succession could have changed.
node-idis not explicitly set, it is derived from the fabric port cnofiguration and the MAC address of the first Network Interface.
- When a node is
quiescedthe node is artificially moved to the end of the succession list for all partitions. This causes the node to give up master ownership for the partitions it owns and allows for smooth maintenance as transactions would have moved to the nodes that would then be owning the partitions as master.
- If the cluster size is equal to the replication factor then when
quiesced, a node will retain the (n-1)th replica where n is the replication factor as the partitions will have no other node they can reside upon.
- All nodes returning to a cluster their partitions marked as subset until migrations complete, partition by partition, making sure they each have the latest versions of the records they hold.
- For a scenario where an AP cluster is doing a rolling restart as described above it is possible that a record is present on the replica but not the master, this phenomenon is described on this article
- In depth details on Aerospike data distribution can be found on the Data Distribution documentation page.
NODE-ID SUCCESSION PARTITION OWNERSHIP MIGRATE QUIESCE