Aerospike cluster sync issues

I am running Aerospike Community Edition v6 on three nodes with a replication factor of 3, so each node maintains its own full copy of the data. Two nodes run in one datacenter and the third runs in another. Node 3 serves as a DR site and is normally switched off: every day it is turned on, syncs with the other two nodes, and is then switched off again. That is the expected behavior. The problem is that some of the data written to the other two nodes while node 3 was off goes missing in this process. As a result, the goal of keeping a DR copy is not met, and I end up with inconsistent data in my production environment. How do I handle this with the current setup? I have set max-write-cache 1024M on all nodes, and they run in a mesh configuration. Any solution would be helpful.

I am guessing you turn the third node off because you don’t want clients writing to it? There are different ways to solve this in Enterprise Edition. One easy way is to set stay-quiesced true on this node. It will then hold all of its partitions only as replicas, in your case until the other two runtime nodes die or stop.

But in CE, it is hard to turn off one node while ensuring that every write that landed on the DR node was replicated to the other nodes before the DR node shut down. (You can, if you stop all client writes for a bit.) Any updates that were not replicated could be lost when the DR node rejoins, if the same records were updated on the runtime nodes in the meantime. If you are using APIs like increment, that could produce incorrect data. This is one possible source of data anomalies, but it does not look like your problem.

Quite likely, the issue you are facing occurs when a record's generation rolls over on the runtime nodes: those updates are lost when the DR node rejoins with old data but a “higher” generation. (In AP mode, the generation rolls over at 65,535, back to 1.) You might want to try setting conflict-resolution-policy to last-update-time. It's a dynamic parameter, so you can change it without having to stop the cluster. See if that helps. Also, make sure the clocks on all three nodes are synced to an NTP server so that last-update-time, which is based on the local node clock, doesn’t have excessive skew between nodes.
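
To make the rollover scenario concrete, here is a minimal Python sketch of it. This is a simplified model, not Aerospike code: the RecordCopy class, the bump_generation and resolve helpers, the sample values, and the tie-breaking rule are all assumptions made for illustration. It only shows why a stale copy with a numerically higher generation can win under generation-based resolution, while last-update-time picks the newer copy.

```python
from dataclasses import dataclass

MAX_GENERATION = 65_535  # AP-mode rollover point mentioned above


@dataclass(frozen=True)
class RecordCopy:
    """One node's copy of a record (hypothetical model, not an Aerospike type)."""
    node: str
    value: str
    generation: int
    last_update_time: int  # assumed: a timestamp taken from the writing node's clock


def bump_generation(generation: int, writes: int = 1) -> int:
    """Advance a generation by `writes` updates, wrapping 65,535 back to 1."""
    return (generation - 1 + writes) % MAX_GENERATION + 1


def resolve(a: RecordCopy, b: RecordCopy, policy: str) -> RecordCopy:
    """Pick the winning copy under the named policy.

    The secondary tie-break used here is an assumption of this sketch.
    """
    if policy == "generation":
        return max(a, b, key=lambda r: (r.generation, r.last_update_time))
    if policy == "last-update-time":
        return max(a, b, key=lambda r: (r.last_update_time, r.generation))
    raise ValueError(f"unknown policy: {policy}")


# Copy held by the DR node when it was switched off (sample values).
dr_copy = RecordCopy("dr", "old", generation=65_000, last_update_time=1_000)

# The runtime nodes then apply 1,000 more writes, so the generation wraps.
runtime_copy = RecordCopy(
    "runtime", "new",
    generation=bump_generation(65_000, writes=1_000),
    last_update_time=2_000,
)

print(runtime_copy.generation)                                   # 465
print(resolve(dr_copy, runtime_copy, "generation").value)        # old
print(resolve(dr_copy, runtime_copy, "last-update-time").value)  # new
```

In the model, the runtime copy is newer but its generation has wrapped to 465, so comparing generations alone prefers the DR node's stale copy. The last-update-time comparison is only as good as the clocks behind it, which is why the NTP sync matters. On a live cluster the policy is set per namespace; the asinfo form is along the lines of asinfo -v "set-config:context=namespace;id=<namespace>;conflict-resolution-policy=last-update-time", but check the exact syntax against the configuration reference for your server version.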