Replica and master object counts are inconsistent


#1

The environment is Aerospike CE 3.9.1, with the replication factor set to 2. After migrations completed, the replica and master object counts are still not equal. During and after the migration, other clients were still writing data into Aerospike. Newly arriving data does seem to be committed to the replica objects too, and the difference between the replica and master counts stays the same. I didn't find any relevant error messages in the log file. Is this unexpected behavior? What should I do next?


#2

AMC can sometimes show inaccurate stats. Check through asadm. You can also try restarting AMC after migrations are done to see if it looks different. I have a cron job that restarts AMC every day.


#3

What is your AMC version? It is shown in the footer at the bottom of the browser pane. Also, on your node, what does the command below show?

$ grep objects /var/log/aerospike/aerospike.log

You should see something like this:

INFO (info): (ticker.c:348) {ns1} objects: all 300000 master 300000 prole 0

(My output is from a one-node cluster, so prole is zero.) Do you see a difference from AMC, or are the numbers the same?


#4

The numbers in asadm are the same.


#5

I have a 6-node cluster with replication factor 2.

one node: Jun 08 2017 03:24:06 GMT: INFO (info): (ticker.c:328) {production} objects: all 7693601 master 3919389 prole 3774212

another node: Jun 08 2017 03:41:55 GMT: INFO (info): (ticker.c:328) {production} objects: all 8406338 master 6005579 prole 2399283

In AMC, the numbers are the same; the difference is around 40M records at least.


#6

AMC version?


#7

AMC CE 4.0.12, the latest one


#8

I would expect all = master + replica on each node. Summed over all nodes, sum(all) = sum(master) + sum(replica), and with replication factor 2, sum(master) = sum(replica).
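These expectations can be checked mechanically against the ticker lines. Below is a minimal Python sketch, assuming the "objects" ticker line format quoted earlier in this thread; the function names `parse_ticker_line` and `check_counts` are made up for illustration and are not part of any Aerospike tool. The cluster-wide comparison only means something when one line from every node is supplied.

```python
import re

# Matches the "objects" ticker line format quoted in this thread (assumed stable).
TICKER_RE = re.compile(
    r"\{(?P<ns>[^}]+)\} objects: all (?P<all>\d+) "
    r"master (?P<master>\d+) prole (?P<prole>\d+)"
)


def parse_ticker_line(line):
    """Return (namespace, all, master, prole), or None if the line does not match."""
    m = TICKER_RE.search(line)
    if m is None:
        return None
    return (m.group("ns"), int(m.group("all")),
            int(m.group("master")), int(m.group("prole")))


def check_counts(lines):
    """Compute count gaps from one ticker line per node (same namespace).

    per_node_gap: all - (master + prole) for each node, expected to be 0.
    master_minus_prole: sum(master) - sum(prole), expected to be 0 with
    replication factor 2 once lines from ALL nodes are included.
    """
    parsed = [p for p in map(parse_ticker_line, lines) if p is not None]
    per_node_gap = [all_ - (master + prole) for _, all_, master, prole in parsed]
    total_master = sum(p[2] for p in parsed)
    total_prole = sum(p[3] for p in parsed)
    return {
        "per_node_gap": per_node_gap,
        "master_minus_prole": total_master - total_prole,
    }


# The two lines posted in this thread (only 2 of the 6 nodes).
EXAMPLE_LINES = [
    "Jun 08 2017 03:24:06 GMT: INFO (info): (ticker.c:328) {production} "
    "objects: all 7693601 master 3919389 prole 3774212",
    "Jun 08 2017 03:41:55 GMT: INFO (info): (ticker.c:328) {production} "
    "objects: all 8406338 master 6005579 prole 2399283",
]

if __name__ == "__main__":
    print(check_counts(EXAMPLE_LINES))
```

With only two of the six nodes, the master/prole comparison is incomplete, so collect the latest line from each node before drawing conclusions from it.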


#9

Since your numbers from the logs and AMC match, AMC is not the issue.


#10

Is this after migrations have finished? Is anything else showing in your log? How about asadm -e "show stats like err"?


#11

Yes, the migration is finished. It is strange: there were no related errors. I then did a rolling restart of all nodes, and the issue disappeared. I suspect 3.9.1 is not a stable version, because it introduced the new heartbeat subsystem and some issues were fixed in the next version.


#12

I don’t see anything in the release notes that addresses anything like that. I’m not really sure what’s going on. At this point it may be worth opening a case with Aerospike if you have a support contract.


#13

While working on the partition rebalance algorithm used with paxos-protocol v5, we did discover a way this can happen when recovering from certain split-brain conditions. Basically, it was possible for a partition recovering from a split-brain to determine that no migrations were needed. We were unable to address this issue in the old algorithm, and it shouldn’t exist in the new rebalance algorithm. To switch to paxos-protocol v5, see http://www.aerospike.com/docs/operations/upgrade/cluster_to_3_13.