Rack aware cluster failure causes higher number of objects

durable-deletion

#1

I am testing the rack aware cluster option and ran into something odd that I am struggling to understand. I started 4 nodes in 2 groups (they all use mode dynamic and paxos-protocol v4):

10.31.72.49    self-group-id 201
10.233.122.229 self-group-id 201
10.178.197.183 self-group-id 202
10.153.168.60  self-group-id 202

It looked good:

Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::371) Rack Aware is enabled.  Mode: dynamic.
Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::377) CLUSTER STATE(balanced) SelfNode(bb900ca0ab2c5b7) Group Count(2) Total Node Count(4)
Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::387) Group(00ca) GroupNodeCount(2):: Node(bb900ca0ab2c5b7) Node(bb900ca0a99a83c) 
Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::387) Group(00c9) GroupNodeCount(2):: Node(bb900c90ae97ae5) Node(bb900c90a1f4831) 
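
For anyone matching the log output to the node list: the group IDs in the log are printed in hex, so Group(00c9) is self-group-id 201 and Group(00ca) is 202. Here is a minimal Python sketch of that mapping. The node ID layout it assumes (3 hex digits, then 4 hex digits of group ID, then 8 hex digits of IPv4 address) is only inferred from the four node IDs in this log, not taken from Aerospike documentation, and `decode_node_id` is a hypothetical helper name.

```python
import ipaddress
from typing import Tuple


def decode_node_id(node_id: str) -> Tuple[int, str]:
    """Split a 15-hex-digit node ID into (group id, IPv4 address).

    Assumed layout, inferred only from the log lines in this post:
    chars 0-2 = prefix, chars 3-6 = group id, chars 7-14 = IPv4 address.
    """
    group_id = int(node_id[3:7], 16)
    ip = ipaddress.IPv4Address(int(node_id[7:15], 16))
    return group_id, str(ip)


if __name__ == "__main__":
    for node in ("bb900ca0ab2c5b7", "bb900ca0a99a83c",
                 "bb900c90ae97ae5", "bb900c90a1f4831"):
        print(node, decode_node_id(node))
```

Under that assumption, each node ID decodes to the group and IP address listed at the top of the post, so the rack grouping reported by the log is the one that was configured.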

I used the cluster for a bit and at one point decided to test what happens if I bring one side down. I checked the number of unique objects (2,890,943), then stopped the Aerospike service on both nodes in group 202. The number of unique objects stayed the same. I then started the two services back up. They rejoined the cluster successfully and started migrating objects. Once migration was done, however, the total number of objects in the cluster was almost double what it was before my failover test: 5,485,187! Any idea why that could have happened? My cluster did hold a higher number of objects at one point, but they were deleted, and nothing was reading from or writing to the cluster while I did this. The cluster also appears to be in a good state, with no apparent errors in the log.

Here are snapshots from amc:

Before I brought instances down:

Host                 Master Objects  Replica Objects  Repl'n Factor  Disk Used, HWM, Stop Writes              Disk                  RAM Used, HWM, Stop Writes   RAM                  Expired Objects  Evicted Objects
10.153.168.60:3000   663741          877838           2              HWM: 50%, SW: 90%, Avail: 98%, Used: 2%  376.31 MB / 29.63 GB  HWM: 60%, SW: 90%, Used: 3%  94.09 MB / 3.91 GB   0                0
10.178.197.183:3000  489441          859923           2              HWM: 50%, SW: 90%, Avail: 98%, Used: 2%  329.79 MB / 29.68 GB  HWM: 60%, SW: 90%, Used: 3%  82.36 MB / 3.92 GB   0                0
10.233.122.229:3000  963583          602588           2              HWM: 50%, SW: 90%, Avail: 97%, Used: 2%  382.72 MB / 29.63 GB  HWM: 60%, SW: 90%, Used: 3%  95.59 MB / 3.91 GB   0                0
10.31.72.49:3000     774178          550594           2              HWM: 50%, SW: 90%, Avail: 98%, Used: 2%  323.39 MB / 29.68 GB  HWM: 60%, SW: 90%, Used: 2%  80.86 MB / 3.92 GB   0                0

After I brought the instances (10.153.168.60 and 10.178.197.183) back up and they finished migrating:

Host                 Master Objects  Replica Objects  Repl'n Factor  Disk Used, HWM, Stop Writes              Disk                  RAM Used, HWM, Stop Writes   RAM                  Expired Objects  Evicted Objects
10.153.168.60:3000   663741          877838           2              HWM: 50%, SW: 90%, Avail: 98%, Used: 2%  376.31 MB / 29.63 GB  HWM: 60%, SW: 90%, Used: 3%  94.09 MB / 3.91 GB   0                0
10.178.197.183:3000  1852877         2090731          2              HWM: 50%, SW: 90%, Avail: 96%, Used: 4%  963.08 MB / 29.06 GB  HWM: 60%, SW: 90%, Used: 6%  240.70 MB / 3.76 GB  0                0
10.233.122.229:3000  1534143         1286784          2              HWM: 50%, SW: 90%, Avail: 97%, Used: 3%  689.02 MB / 29.33 GB  HWM: 60%, SW: 90%, Used: 5%  172.18 MB / 3.83 GB  0                0
10.31.72.49:3000     1434426         1229834          2              HWM: 50%, SW: 90%, Avail: 97%, Used: 3%  650.37 MB / 29.36 GB  HWM: 60%, SW: 90%, Used: 4%  162.61 MB / 3.84 GB  0                0
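
The two totals quoted earlier can be cross-checked against the Master Objects column of the snapshots. A small Python sketch of that arithmetic follows; the dictionaries are copied by hand from the AMC output above, and the names `BEFORE`, `AFTER`, `total` and `growth` are just illustrative.

```python
# Master object counts per host, copied from the two AMC snapshots in this post.
BEFORE = {
    "10.153.168.60:3000": 663741,
    "10.178.197.183:3000": 489441,
    "10.233.122.229:3000": 963583,
    "10.31.72.49:3000": 774178,
}
AFTER = {
    "10.153.168.60:3000": 663741,
    "10.178.197.183:3000": 1852877,
    "10.233.122.229:3000": 1534143,
    "10.31.72.49:3000": 1434426,
}


def total(snapshot):
    """Cluster-wide unique objects = sum of master objects over all hosts."""
    return sum(snapshot.values())


def growth(before, after):
    """Per-host change in master objects between the two snapshots."""
    return {host: after[host] - before[host] for host in before}


if __name__ == "__main__":
    print("before:", total(BEFORE))
    print("after: ", total(AFTER))
    for host, delta in growth(BEFORE, AFTER).items():
        print(host, delta)
```

The sums come out to 2,890,943 and 5,485,187, matching the numbers I quoted, for a net gain of 2,594,244 master objects. The row for 10.153.168.60 is identical in both snapshots, so all of the growth shows up on the other three hosts.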

This behavior is very bizarre and I want to understand it before using the rack feature in production.

Thanks!


#2

At this point I am fairly sure this is caused by deleted records on the restarted servers coming back, due to this