I am testing the rack-aware cluster option and ran into something odd that I am struggling to understand. I started 4 nodes in 2 groups (all of them use mode dynamic and paxos-protocol v4):
10.31.72.49 self-group-id 201
10.233.122.229 self-group-id 201
10.178.197.183 self-group-id 202
10.153.168.60 self-group-id 202
It looked good:
Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::371) Rack Aware is enabled. Mode: dynamic.
Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::377) CLUSTER STATE(balanced) SelfNode(bb900ca0ab2c5b7) Group Count(2) Total Node Count(4)
Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::387) Group(00ca) GroupNodeCount(2):: Node(bb900ca0ab2c5b7) Node(bb900ca0a99a83c)
Aug 14 2015 19:00:40 GMT: INFO (partition): (cluster_config.c::387) Group(00c9) GroupNodeCount(2):: Node(bb900c90ae97ae5) Node(bb900c90a1f4831)
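To double-check that each node landed in the group I configured, I decoded the node IDs from the log lines above. This is only a sketch based on my reading of these IDs: I am assuming the 15 hex digits break down as a 3-digit port, a 4-digit group ID, and an 8-digit IPv4 address. That layout is inferred from the output, not taken from documentation, and `decode_node_id` is a helper name I made up.

```python
import ipaddress


def decode_node_id(node_id: str):
    """Split a 15-hex-digit node ID into (port, group_id, ip).

    Assumed layout (inferred from the log lines, not documented):
    3 hex digits of port, 4 hex digits of group ID, 8 hex digits of IPv4.
    """
    port = int(node_id[:3], 16)
    group_id = int(node_id[3:7], 16)
    ip = str(ipaddress.IPv4Address(int(node_id[7:], 16)))
    return port, group_id, ip


# Node IDs copied from the CLUSTER STATE log lines.
node_ids = [
    "bb900ca0ab2c5b7",
    "bb900ca0a99a83c",
    "bb900c90ae97ae5",
    "bb900c90a1f4831",
]

decoded = [decode_node_id(n) for n in node_ids]
for node_id, (port, group_id, ip) in zip(node_ids, decoded):
    print(f"{node_id}: port={port} group={group_id} ip={ip}")
```

Under that assumption, Group(00ca) is group 202 and Group(00c9) is group 201, and the decoded addresses match the four hosts and group assignments listed at the top.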
I used the cluster for a bit, and at one point I decided to test what happens if I bring one side down. I checked the number of unique objects (2,890,943), then stopped the aerospike service on both nodes in group 202. The number of unique objects stayed the same. I then started the 2 services back up. They rejoined the cluster successfully and started migrating objects.
Once migration finished, however, the total number of objects in the cluster was almost double what it was before my failover test: 5,485,187! Any idea why that could have happened? My cluster did have a higher number of objects at one point, but they were deleted, and nothing was reading from or writing to the cluster while I did this. The cluster also appears to be in a good state, with no apparent errors in the log.
Here are snapshots from amc:
Before I brought instances down:
| Host | Master Objects | Replica Objects | Repl'n Factor | Disk Used, HWM, Stop Writes | Disk | RAM Used, HWM, Stop Writes | RAM | Expired Objects | Evicted Objects |
|---|---|---|---|---|---|---|---|---|---|
| 10.153.168.60:3000 | 663741 | 877838 | 2 | HWM: 50%, SW: 90%, Avail: 98%, Used: 2% | 376.31 MB, 29.63 GB | HWM: 60%, SW: 90%, Used: 3% | 94.09 MB, 3.91 GB | 0 | 0 |
| 10.178.197.183:3000 | 489441 | 859923 | 2 | HWM: 50%, SW: 90%, Avail: 98%, Used: 2% | 329.79 MB, 29.68 GB | HWM: 60%, SW: 90%, Used: 3% | 82.36 MB, 3.92 GB | 0 | 0 |
| 10.233.122.229:3000 | 963583 | 602588 | 2 | HWM: 50%, SW: 90%, Avail: 97%, Used: 2% | 382.72 MB, 29.63 GB | HWM: 60%, SW: 90%, Used: 3% | 95.59 MB, 3.91 GB | 0 | 0 |
| 10.31.72.49:3000 | 774178 | 550594 | 2 | HWM: 50%, SW: 90%, Avail: 98%, Used: 2% | 323.39 MB, 29.68 GB | HWM: 60%, SW: 90%, Used: 2% | 80.86 MB, 3.92 GB | 0 | 0 |
After I brought the instances (10.153.168.60 and 10.178.197.183) back up and they finished migrating:
| Host | Master Objects | Replica Objects | Repl'n Factor | Disk Used, HWM, Stop Writes | Disk | RAM Used, HWM, Stop Writes | RAM | Expired Objects | Evicted Objects |
|---|---|---|---|---|---|---|---|---|---|
| 10.153.168.60:3000 | 663741 | 877838 | 2 | HWM: 50%, SW: 90%, Avail: 98%, Used: 2% | 376.31 MB, 29.63 GB | HWM: 60%, SW: 90%, Used: 3% | 94.09 MB, 3.91 GB | 0 | 0 |
| 10.178.197.183:3000 | 1852877 | 2090731 | 2 | HWM: 50%, SW: 90%, Avail: 96%, Used: 4% | 963.08 MB, 29.06 GB | HWM: 60%, SW: 90%, Used: 6% | 240.70 MB, 3.76 GB | 0 | 0 |
| 10.233.122.229:3000 | 1534143 | 1286784 | 2 | HWM: 50%, SW: 90%, Avail: 97%, Used: 3% | 689.02 MB, 29.33 GB | HWM: 60%, SW: 90%, Used: 5% | 172.18 MB, 3.83 GB | 0 | 0 |
| 10.31.72.49:3000 | 1434426 | 1229834 | 2 | HWM: 50%, SW: 90%, Avail: 97%, Used: 3% | 650.37 MB, 29.36 GB | HWM: 60%, SW: 90%, Used: 4% | 162.61 MB, 3.84 GB | 0 | 0 |
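For anyone who wants to check my arithmetic, here is a small sketch that sums the Master Objects column from the two AMC snapshots. I am treating the sum of master objects across nodes as the unique object count. The dictionary and variable names are mine, and the numbers are copied straight from the tables.

```python
# Master Objects per node, copied from the AMC snapshots.
master_before = {
    "10.153.168.60": 663741,
    "10.178.197.183": 489441,
    "10.233.122.229": 963583,
    "10.31.72.49": 774178,
}
master_after = {
    "10.153.168.60": 663741,
    "10.178.197.183": 1852877,
    "10.233.122.229": 1534143,
    "10.31.72.49": 1434426,
}

total_before = sum(master_before.values())
total_after = sum(master_after.values())

print(f"before: {total_before:,}")
print(f"after:  {total_after:,}")
print(f"ratio:  {total_after / total_before:.2f}")

# Per-node change in master objects between the two snapshots.
for host in master_before:
    delta = master_after[host] - master_before[host]
    print(f"{host}: {delta:+,}")
```

The sums come out to 2,890,943 before and 5,485,187 after, matching the totals I quoted. The row for 10.153.168.60 is identical in both snapshots, so all of the growth shows up on the other three nodes.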
This behavior is very bizarre, and I want to understand it before using the rack-aware feature in production.
Thanks!