How consistency is guaranteed for writes during migration

When a new node is added to an existing cluster, the migration starts. Suppose a record (k1:v1) exists in this cluster and resides on 2 nodes (node A and node B, 2 replicas), and a client is constantly writing records (k1:v2), (k1:v3), … during the migration. Which nodes will the records (k1:v2), (k1:v3), … be written to during the migration? Is it possible that some of the records are still written to node A and node B while others are written to other nodes because of the migration? Suppose another client is reading the record using key k1; what value will it get during the migration?

If your cluster isn’t using strong-consistency, which operates on the CP model of the CAP theorem, then you are using the AP model, which generally cannot guarantee consistency. If your system cannot tolerate consistency violations, I recommend that you look into using strong-consistency.

That said, assuming that some consistency violations are acceptable, and that in your environment network partitions and node outages that meet or exceed the configured replication-factor are rare, AP mode can provide consistency in the absence of these events. You cannot guarantee these events will never happen, though.

So in the scenario you have described, the existing partitions have versions, and those versions may or may not be in sync (meaning there could have been ongoing migrations before the node was added). When the node is added, it may or may not come in as a replica.

Let’s assume the worst case. There are migrations, and the node being added comes in as the master replica.

In this scenario, the new node comes in as the ‘eventual master’, the node that was master in the previous regime continues as the ‘acting master’, and the old prole (non-master replica) moves to a non-replica position.

During this time, only the acting master may receive reads or writes. Because there were migrations going on previously, the versions on the acting master and the non-replica may differ; assuming they do, writes must duplicate resolve with the non-replica (assuming the write-commit-level provided by the config or the client indicates ‘all’, the default). By default, reads will not duplicate resolve, which can cause inconsistency during these events; duplicate resolution for reads can be enabled with read-consistency-level from either the config or the client.
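
To make the duplicate-resolution step concrete, here is a rough Python sketch of the idea (my own illustration, not the server’s actual code, and all names are made up): the node answering a request gathers the copies held by nodes whose partition versions may diverge from its own and keeps the winner according to the conflict-resolution policy (generation here; last-update-time is the other common choice).

# Illustrative sketch only -- not Aerospike source. Models picking the
# authoritative copy of a record among possibly divergent partition versions.
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class RecordCopy:
    value: object
    generation: int        # incremented on every update to the record
    last_update_time: int  # milliseconds since the epoch

def duplicate_resolve(copies: Iterable[Optional[RecordCopy]]) -> Optional[RecordCopy]:
    """Return the winning copy: highest generation, ties broken by update time."""
    present = [c for c in copies if c is not None]
    if not present:
        return None
    return max(present, key=lambda c: (c.generation, c.last_update_time))

# Acting master holds (k1 -> v3, gen 3); the non-replica still holds (k1 -> v1, gen 1).
winner = duplicate_resolve([
    RecordCopy("v3", generation=3, last_update_time=1_700_000_300),
    RecordCopy("v1", generation=1, last_update_time=1_700_000_100),
])
assert winner.value == "v3"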

Once the acting master completes migrations to the final-master, it hands off mastership and becomes a ‘prole’. At that time it still cannot accept reads, since it is not yet in sync with the new master: it still does not have the data held by the non-replica. At this point, the master only needs to duplicate resolve with the non-replica (the versions of the master and prole also reflect this relationship, the prole being a subset).

At some point, the non-replica completes its migrations to the master. At that point the master no longer needs to do duplicate resolution, but the prole still cannot accept reads, since the master still needs to migrate data to it, which it will begin to do once it has received all copies of the partition.

When migrations to the prole complete, it can begin accepting reads. Also (assuming this is the last prole the master needed to migrate to), at this time the master sends a signal to all nodes indicating the completion of all migrations for this partition. Non-replica nodes holding a copy of the partition will drop their copies on receipt of this signal.
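
To keep the hand-off sequence straight, here is a small summary of the phases as I described them above, written as a Python snippet (my own paraphrase for readability, not server code):

# My paraphrase of the phases above for a single partition (illustrative only).
# Columns: phase, which node serves reads/writes, whether the acting/final
# master must duplicate resolve with the non-replica on writes.
PHASES = [
    ("acting master still holds mastership",      "acting master",                            True),
    ("mastership handed off to the final master", "final master",                             True),
    ("non-replica finished migrating to master",  "final master",                             False),
    ("master finished migrating to the prole",    "final master; prole may now serve reads",  False),
]

for phase, serves, dup_resolve_needed in PHASES:
    print(f"{phase:45} | serves: {serves:42} | dup-resolve: {dup_resolve_needed}")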

HTH.

Hi @kporter, thank you for your explanation. But there is still something confusing to me.

Assume node A is the master and node B is the replica before migration starts, and node C is the newly selected master for this partition after nodes are added to the cluster.

You said that the writes will still go to the master node of the previous regime, node A, and node C will not receive writes. So the new writes (k1, v2), (k1, v3), … go to node A and replicate to node B continuously during migration. Since the record k1 is being updated all the time, what values of k1 are expected on node A and node C during the migration?

I don’t understand how the migration can finish if a key is being continuously updated on the master node A during migration, as the state on node A is always newer than that on node C.

New writes go to A and replicate to C (B became a non-replica).

When node A hands off mastership to node C, new writes arriving at A will be proxied to C. Eventually (~2 seconds) the client should discover that node C is now master.

@kporter - I would love to read an edit of the last three paragraphs of your explanation with the nodes A, B and C of the example added in parentheses. I got a bit lost tracking the nodes between their designations: Acting-Master(B), final-master(C), non-replica(?)…

Also, it looks like you are covering the case where node C, the one being added in, had a previous copy of the partition - for example, C was part of the cluster and was temporarily taken down for a rolling upgrade.

Hi @kporter

Thanks for your illustration. So according to your explanation, when the migration starts, the writes will go to the ‘acting master’ A and also be replicated to the ‘eventual master’ C. So C will receive two kinds of writes: one from migration and another from the new writes replicated from A. When two writes for the same key arrive, duplicate resolution will be applied based on the generation of the record. Is this correct?

My other question is about node B. It becomes a ‘non-replica’ once the migration starts, and no new writes (including migration and replication) go to B during the migration. So I’m a bit confused when you say that there will be a later migration from node B to C. Why is this necessary if the migration of data from node A to C has already completed?

I was assuming the worst case, as I mentioned, in which A and B also were not in sync. This is a normal scenario during a rolling upgrade, where migrations have not completed after each node joins the cluster. Consider:

# Initial state - node succession order C,B,A:
Cv1m Bv1p A     # 'v1' indicates some partition version.
                # 'm' indicates the master partition.
                # 'p' indicates a non-master replica.
                # Node A doesn't have this partition at this time so it is just A.

# B drops for upgrade:
Cv1m A          # Versions on arrival will be used to determine migrations for this round.
                # During this round they will advance version information for the subsequent
                # migration round.
Cv2m Av2sp      # The version advances to a new randomly generated version 'v2'
                # C is still master. A becomes a replica where 's' indicates that it only holds a
                # subset of 'v2'.

# B returns:
Cv2m Bv1s Av2sp
Cv3m Bv1sp Av3s # Version on C advances to random version 'v3' and is still master.
                # B returns with v1 and all restarted partitions with data come back as 'subset'.
                # B also becomes a replica of C pushing A to a 'non-replica' position.
                # A is still a subset of C so C only needs to duplicate resolve with B at this
                # time.

# C leaves the cluster (this is where I started my example from):
Bv1sp Av3s
Bv4sm Av3s      # B advances version to random version 'v4'.
                # Since 'v4s' and 'v3s' are different, B will need to duplicate resolve with A.

# C returns to the cluster:
Cv3s Bv4sm Av3s
Cv6s Bv5sm Av3s  # State when C returns.
                 # Note: At this point, you and I can see that A is just a subset of C and B
                 #       but this state cannot be represented in this versioning scheme. Also
                 #       notice that the amount of data held by A is likely much less than either
                 #       C or B since it only received replication during the transient periods when
                 #       either C or B were out of the cluster.

# Now we allow migrations to proceed. Assume (though unlikely) B finished migrating to
# C before A. The new state would be:
Cv6sm Bv6sp Av3s # At this time, C takes over as master but still has to duplicate
                 # resolve from A (though we know that technically it has all the state, the
                 # versioning system does not).

# Now A finishes migrations to C:
Cv6m Bv6sp Av6s # A becomes a subset of v6, C no longer needs to duplicate resolve
                 # with 'A'. Also now that C has received its last immigration, it can remove its
                 # subset flag and become a full 'v6'. Also having finally finished the last
                 # immigration, C can now begin emigrations to B.

# C now finished migrations to B:
Cv6m Bv6p Av6s  # Only change here is that B now has a full copy of 'v6'.

# Now that C has finished all of its emigrations it can now signal A (the non-replica) to drop
# and we reach our final state for this partition:
Cv6m Bv6p A

Note, this example was derived from fallible human memory and is involved enough that there are likely mistakes. That said, it is a very close representation of how the system actually works. I also didn’t mention version ‘families’ and simply generated new random versions where ‘families’ would have been used. This is a simplification for the purpose of discussing the fundamental algorithm - ‘families’ could have been replaced by generating a new random version instead, but we chose ‘families’ to make tracing/debugging the algorithm from logs simpler.

  1. Partition versions can be found here: https://github.com/citrusleaf/aerospike-server/blob/master/as/include/fabric/partition.h#L66
  2. How partitions advance during ‘rebalance’ can be found here: https://github.com/citrusleaf/aerospike-server/blob/master/as/src/fabric/partition_balance.c#L1201
  3. How partitions advance during emigration/immigrations can be found here: https://github.com/citrusleaf/aerospike-server/blob/master/as/src/fabric/partition_balance.c#L1530 and https://github.com/citrusleaf/aerospike-server/blob/master/as/src/fabric/partition_balance.c#L1551
  4. How partition information is exchanged between nodes: https://github.com/citrusleaf/aerospike-server/blob/master/as/src/fabric/exchange.c#L1487
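
The real bookkeeping lives in the sources linked above; as a toy Python model of the idea in the walk-through (my own paraphrase, not the server’s algorithm), each node carries a partition version plus a ‘subset’ flag, and the master only needs to duplicate resolve with copies whose version differs from its own:

# Toy model of the version bookkeeping above -- illustrative only.
from dataclasses import dataclass

@dataclass
class PartitionCopy:
    node: str
    version: str   # e.g. "v6" (ignoring 'families', as in the example)
    subset: bool   # True corresponds to the 's' marker above

def needs_duplicate_resolution(master: PartitionCopy, other: PartitionCopy) -> bool:
    """In this toy model the master skips duplicate resolution exactly when the
    other copy carries the same version string (a full copy or a subset of it);
    any other version may hold records the master lacks."""
    return other.version != master.version

# State right after C takes over as master: Cv6sm Bv6sp Av3s
c = PartitionCopy("C", "v6", subset=True)
b = PartitionCopy("B", "v6", subset=True)
a = PartitionCopy("A", "v3", subset=True)
print(needs_duplicate_resolution(c, b))  # False: B is a subset of C's v6
print(needs_duplicate_resolution(c, a))  # True:  A's v3 is unrelated to v6

# After A finishes migrating to C (Cv6m Bv6sp Av6s), A is re-labelled a subset
# of v6 and C, having received its last immigration, drops its subset flag:
a.version = "v6"
c.subset = False
print(needs_duplicate_resolution(c, a))  # False: C no longer consults A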

Also note that this example assumes ‘AP’ and not ‘Strong Consistency’ mode. These behaviors change in subtle and non-subtle ways for strong consistency.

HTH

Thank you @kporter. Understood.

So what’s the expected behavior for reads during migration? Suppose we’re in ‘strong consistency’ mode, and there are continuous writes during migration. For example:

1. Initial state -> Cv1m Bv1p. In this state, will B serve reads?

2. Adding a node A -> Cv2m Bv1 Av2s. So when adding node A, let’s suppose that the master replica will be migrated to A. So A becomes a subset and B becomes a ‘non-replica’, right? In this phase, when reading a record in ‘strong consistency’ mode, how is the duplicate resolution performed? There may be old data in Bv1 (e.g. k1 -> v1) and new data (e.g. k1 -> v2) in Cv2m and Av2s. How can the client be assured of reading the value v2 instead of v1?

B can serve a read in this state. By default all reads go to the master, but the client can be configured to read from the replicas. Reading from replicas means your reads will not be linearizable; we call this mode ‘relaxed consistency’. Though it is possible to implement these reads to preserve sequential consistency, presently it doesn’t do that either; we have a plan to make these reads sequentially consistent, so expect this to change. If you aren’t familiar with these terms: linearizable basically means that successive reads of a particular record performed by many client processes cannot return an older value than what a previous read from any client has returned. Sequential (I like to think of it as “session”) consistency means that a particular client may not read an older version of a record than it has already read, but that read may be older than another client process’s prior read.
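
For intuition, here is a contrived Python sketch (not Aerospike code; all names invented) of why reading from a lagging replica breaks linearizability, and where a session/sequential guarantee would fit:

# Contrived illustration of the read guarantees above, using record
# generations as the version ordering. Not Aerospike code.
class ReplicatedRecord:
    """One record with a master copy and a possibly lagging replica copy."""
    def __init__(self):
        self.master_gen = 0
        self.replica_gen = 0

    def write(self):
        self.master_gen += 1           # replica lags until replication catches up

    def replicate(self):
        self.replica_gen = self.master_gen

    def linearizable_read(self):
        return self.master_gen         # never older than anything already returned

    def relaxed_read(self, last_seen_gen):
        gen = self.replica_gen         # may be stale
        # A session ("sequential") consistent variant would also ensure this client
        # never goes backwards relative to its own reads, e.g. by re-reading the
        # master when gen < last_seen_gen.
        return gen

r = ReplicatedRecord()
r.write(); r.replicate(); r.write()    # master at gen 2, replica still at gen 1

seen_by_a = r.linearizable_read()      # 2
seen_by_b = r.relaxed_read(0)          # 1: older than what client A already read,
                                       # so this history is not linearizable
print(seen_by_a, seen_by_b)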

The order in which you list the nodes is important. For a particular cluster membership/configuration, each partition generates a particular node succession ordering. The first element is designated ‘final-master’, the following replication-factor - 1 elements are ‘proles’ (aka non-master replicas), and any remaining nodes with data are ‘non-replicas’.
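
As a quick sketch (my own illustration, not server code) of how roles fall out of the succession order and replication-factor:

# Illustrative only: assign roles from a partition's node succession order.
def assign_roles(succession, replication_factor):
    roles = {}
    for i, node in enumerate(succession):
        if i == 0:
            roles[node] = "final-master"
        elif i < replication_factor:
            roles[node] = "prole"
        else:
            roles[node] = "non-replica (only relevant if it still holds data)"
    return roles

# Succession C, A, B with replication-factor 2, matching the example below:
print(assign_roles(["C", "A", "B"], replication_factor=2))
# {'C': 'final-master', 'A': 'prole', 'B': 'non-replica (only relevant if it still holds data)'}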

So I believe you are describing A coming in as a replica:

# The version sequence will enter re-balance as:
Cv1m A Bv1p
# The versions will advance to:
Cv2m Av2s Bv2s

In this state, duplicate resolution is not necessary since both A and B are subsets of C. During this time, any reads that were targeting B will be proxied to C.