Hi all
I have a four-node Aerospike cluster running version 3.9.0 Community Edition, with the replication factor set to 2. One node crashed accidentally and disconnected from the cluster. The remaining 3 nodes are working well, so I assumed the cluster could still work and serve requests properly. An AMC screenshot is posted below:
But when I fetch a key with the Java client, sometimes I get the record and sometimes it returns null. I then used the aql tool instead of the Java client. It is also weird that on one node I mostly cannot get this record (the query returns “AEROSPIKE_ERR_RECORD_NOT_FOUND”), but occasionally the read succeeds.
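For illustration, the read looks roughly like this in aql (the namespace comes from the warnings below; the set name, key, and node address are placeholders for my real values):

# Read one record directly from a specific node.
aql -h 192.168.0.11 -c "SELECT * FROM product.myset WHERE PK = 'user1001'"
# Sometimes this returns the record, sometimes:
# Error: (2) AEROSPIKE_ERR_RECORD_NOT_FOUND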
In aerospike.log I found the following warnings. Are they the root cause?
Oct 11 2016 05:00:08 GMT: WARNING (migrate): (migrate.c:1002) imbalance: dest refused migrate with ACK_FAIL
Oct 11 2016 05:00:08 GMT: WARNING (partition): (partition.c:1455) {product:537} emigrate done: failed with error and cluster key is current
Yes, those warnings have something to do with it. Make sure your product namespace configuration is the same on all nodes (specifically replication-factor). Also ensure that paxos-single-replica-limit is configured identically across all nodes.
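One quick way to compare those settings node by node is with asinfo; the following is only a sketch (node addresses are placeholders, and the namespace name is taken from the warning above):

for node in 192.168.0.11 192.168.0.12 192.168.0.13 192.168.0.14; do
  echo "=== $node ==="
  # Namespace-level config, which includes the replication factor.
  asinfo -h "$node" -v "get-config:context=namespace;id=product" | tr ';' '\n' | grep -i repl
  # Service-level config, which includes paxos-single-replica-limit.
  asinfo -h "$node" -v "get-config:context=service" | tr ';' '\n' | grep -i paxos-single-replica-limit
done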
If you see a line different from “1 ALLOW MIGRATIONS” (here we see “2 ALLOW MIGRATIONS”), then this would be the root cause.
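If it helps, you can check which variant each node logged by grepping the server log on every node (the log path below is just the common default and may differ on your install):

grep "ALLOW MIGRATIONS" /var/log/aerospike/aerospike.log | tail -5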
This race is expected to be rare; we haven’t yet seen it outside a test environment (and only once there).
If you hit this, restarting a node will likely resolve the issue. If you decide to upgrade to 3.10.0, start with the principal node; after the principal is upgraded, it shouldn’t be possible for this issue to recur.
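A rough way to identify the principal on a 3.9-era cluster: the principal is the node with the highest node ID, and each node reports its ID via asinfo (addresses below are placeholders):

for node in 192.168.0.11 192.168.0.12 192.168.0.13 192.168.0.14; do
  # Print each node's 64-bit node ID; the highest one belongs to the principal.
  echo "$node: $(asinfo -h "$node" -v "node")"
done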
I’m out of ideas now. Is there any workaround?
For example, can I run asbackup on all the data without the no-cluster-change parameter, delete all the original data files, and then asrestore from the backup?
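To make the question concrete, something like the following is what I have in mind (hosts and paths are placeholders, and the flags reflect my understanding of the asbackup/asrestore options):

# Back up the namespace to a local directory, without --no-cluster-change.
asbackup --host 192.168.0.11 --namespace product --directory /backup/product
# ...stop the nodes, remove the original data files, bring the cluster back up empty...
# Restore everything from the backup.
asrestore --host 192.168.0.11 --directory /backup/product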
Based on the original symptoms in this ticket, we will need to force all partitions to resync. There are 2 methods in 3.9.0:
1. Perform a rolling restart of the cluster.
2. I believe 3.9.0 still has the dun commands, and you can force all partitions to resync by running them (see the sketch after the note below).
(Note: this will cause the cluster to split into single node clusters and then reform.)
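The exact commands weren’t captured above; on 3.9-era servers the dun/undun info commands are normally sent with asinfo, roughly as follows (treat the syntax as an assumption and check it against your version’s documentation before running it):

# Make every node distrust ("dun") all other nodes, splitting the cluster apart...
asinfo -h <node-ip> -v "dun:nodes=all"
# ...then "undun" so the nodes rejoin and all partitions resync.
asinfo -h <node-ip> -v "undun:nodes=all"
# Repeat against each node in the cluster, then wait for migrations to finish.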
The partition versions have diverged, so the rebalance layer doesn’t know how many records each node has, and it couldn’t really use that information even if it did.