Strong Consistency and rack awareness

Consistency | Aerospike Documentation has the following:

A. “… In case of a failure to replicate a write transaction across all replicas, the record will be left in the ‘un-replicated’ state, forcing a ‘re-replication’ transaction prior to any subsequent transaction (read or write) on the record.”

B. Slide 30 on describes how replication works for RF = 2.

(A) suggests that record at the master holds the state of replication. (B) tells that the record written to a replica would not know if the copy it holds has been applied at master or not, since there are no additional CONF packets sent. (Even if they were sent, it could be lost.)

Let’s take the following setup: Replication factor 2 for a rack-aware Strongly consistent Aerospike setup having 2 nodes on each DC - N1, N2 on DC1 and M1, M2 on DC2.

  1. Would it be possible to get a dirty read when reading from local DC using allow_replica setting? From (B) above, how does replica know that a write has been committed?

  2. Assume a network split happens between DC1 and DC2. How do I recluster a site-aware geo-synchronous setup when nodes from an entire Data Center are down? All data should be in both DCs since it is strongly consistent and rack aware. I want to use nodes from DC2 after one site (DC1) has gone down.

  3. Is there any case in the above setup where data has not been replicated across both DCs when the link between them is up?

Re: 1 - Yes but I would call it a stale read instead of a dirty read. Stale read: Reading data that is not the most recently committed value. Dirty Read: Reading data that is not yet committed. ALLOW_REPLICA does not allow a dirty read. You can have that situation on replica if using RF=3 and replica has not received ACK from other replica and is in unreplicated state. (Slide 29 of the presentation you cited.)

The moment you accept ALLOW_REPLICA in rack-aware with preferred rack read in SC mode, you are accepting that loss of strict consistency. Hence, it is not the default. By default, reads always go to the master.

Re 2 - You would re-roster the namespace on DC2 - declaring just the nodes on DC2 as part of the roster for the SC mode namespace. Then issue re-cluster command after re-rostering for the roster to take effect.

Re 3 - If the split happens while master was updated but had not sent the record to replica, replica will have the older copy. So your master will be in an unreplicated state and will not ack a successful write to the client. If client has not received a successful ACK, stale data on the other side has to be deemed acceptable and the client/application should successfully complete the write. Since a client does not know if replica got it or not in case it does not get an ack, in general, SC mode application data models should not use APIs like “increment” which are not idempotent.

Thanks @pgupta for the reply!


When could that occur? Doesnt having a Strongly Consistent setup mean that the record is written at the replica before the master sends a response to the client? (It may not be committed but it should have physically reached the replica iiuc).

An aside - Is the reason CONF is not sent to a replica when RF=2 because the master requires at least one ack from the replica before it responds to the client?

(Pasting picture for those following along and for posterity.)

1st - no need to CONF in RF=2 … If I am replica and a got a record from MASTER, its obvious master has it.

In RF=3, I as a replica need to know if other replica also has it - hence all the CONFs in RF=3.

In terms for failures, each of those writes is a TCP-IP connection. Think lost data / packets / transaction timeout in any segment at any instant.

Client writes to master … what if master does not get it and client times out? Master gets it, switches status to replicating, sends write to replica - which gets corrupted. Client may timeout without getting SUCCESS from master. Client does not know if master has it, or not, or replica has it - but its return ACK to master got corrupted. If Client does not get SUCCESS, the write is IN-DOUBT. If the client was unable to even write to the socket to the master, then client knows for sure - that the write to master never went. In that case IN-DOUBT will be false and client can retry. If client times out waiting on SUCCESS, or gets IN-DOUBT true, you have to build an SC mode data model that can deal with those scenarios and keep your application happy.

1 Like

On re-rostering to a single DC, you must ensure upfront in your capacity design that the DC nodes are adequately sized to carry RF=2 amount of data. Re-rostering, followed by re-clustering will cause the single copy of data to be replicated within the DC.

Thanks again, this really helps.

One follow up question - Is it possible to place the masters in such a way that all masters are in one DC and all replicas are in another? I want to avoid 2 RTT for writes for half the keys since we are on an Active-Passive setup.

This has come up before. Currently, you cannot bias master partition assignment in this way.

Thanks for pointing it out, will keep that in mind while sizing our cluster.

Does this mean that when the other side is back up and we re-roster back to a 2-DC setup, the data which was written on a single DC will be automatically be copied over to the remote DC?

It will indeed.

This is an interesting topic, in terms of most efficient deployment of a multi-site SC mode cluster. I would suggest, you start here and read other referenced blogs on that page. There are pros and cons of each approach and associated deployment costs.

Four options - consider a 6 node deployment for argument sake:

a) Simple rack aware SC cluster: DC-1 (3) ↔ DC-2 (3) ==> any DC goes down, other loses availability of half the partitions per SC rules.

b) Bias towards “active” DC … say DC-1: DC-1(4) <==> DC-2(3) … if DC2 goes down, DC-1 still is the majority cluster and hence has all partitions available.

c) Using “stay-quiesced” configuration feature to add a third minimal node as tie-breaker: DC-1(3) ↔ DC-3 (1) ↔ DC-2(3) - either DC-1 or DC-2 goes down, other DC has all partition availability, DC-3 never holds any data.

d) 3 Rack configuration: DC-1(2) ↔ DC-2 (2) ↔ DC-3(2) … Any one DC goes down, full availability , all DCs hold data.

Please review referenced blogs/links. BTW, are you an EE customer?

Thanks @pgupta , I will review the docs you linked.

Yes, we are an EE customer.

Great, in that case, you also have the option to reach out to your Account Executive and see if the Solutions team can assist with multi-site cluster design and sizing.