Getting lots of err_sync_copy_null_node errors in Aerospike


#1

Hi

One of my cluster nodes has suddenly started reporting this error in large numbers. The count shot from 0 to 4477690 in around 15 minutes, and it is still growing.

I cannot work out the meaning or significance of this from the metrics page, which describes it as “Number of errors during cluster state exchange because of missing general node information”.

This node is reachable from the other nodes in the cluster (checked with ping), there are no migrations happening (all are 0,0 in the logs), and the foreign heartbeat count is the same as on the other nodes. Nothing in the logs looks suspicious or points to a problem either.

Can someone please help me understand this metric and whether I should be worried about it? At the moment we have an alert on all err* metrics (through the Aerospike collectd plugin), and if this is not serious we will probably remove that alert. Thanks,


#2

Just adding to the above, for more information: there are error lines in the log as well:

Aug 04 2015 16:24:34 GMT: INFO (info): (thr_info.c::4766)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 37247268 :: master 0

And they are on all nodes now, not just one as I stated earlier. The metrics page says they should be zero. Can someone help me understand how critical this situation is and how we can recover from it? Thanks.
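
For anyone who wants to track this counter from the server log rather than through collectd, here is a minimal Python sketch that pulls the numbers out of the INFO line quoted above. The regular expression is inferred from that single sample line, so the field layout is an assumption and may not match other server versions. The function name and dictionary keys are made up for illustration.

```python
import re

# Pattern inferred from the one INFO line quoted in this thread; the layout
# is an assumption and may differ between Aerospike server versions.
SYNC_COPY_RE = re.compile(
    r"replica errs :: null (?P<replica_null>\d+) non-null (?P<replica_non_null>\d+)"
    r" ::: sync copy errs :: node (?P<sync_node>\d+) :: master (?P<sync_master>\d+)"
)


def parse_sync_copy_errs(line):
    """Return the four counters as ints, or None if the line does not match."""
    match = SYNC_COPY_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "Aug 04 2015 16:24:34 GMT: INFO (info): (thr_info.c::4766)  "
    "replica errs :: null 0 non-null 0 ::: sync copy errs :: node 37247268 :: master 0"
)
counters = parse_sync_copy_errs(sample)
print(counters)
```

Comparing the "node" value between two consecutive log lines gives the growth rate, which is probably a more useful thing to alert on than the absolute count.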


#3

Further information. Changing to debug mode shows these log lines:

“Aug 04 2015 17:13:31 GMT: DEBUG (partition): (partition.c:find_sync_copy:924) {NAMESPACE_NAME:2144} Returning null node, could not find sync copy of this partition my_index -1, master bb92cad4a12bcf8 replica bb998a74a12bcf8”
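
To see how many partitions are affected, the DEBUG lines can be tallied with a short Python sketch like the one below. It assumes that the number after the namespace name in braces (2144 here) is the partition ID and that the two hex strings are node IDs; both readings are guesses from this one sample line. The helper names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the single DEBUG line above. Treating the number in
# braces as a partition ID is an assumption, not something the log states.
NULL_NODE_RE = re.compile(
    r"\{(?P<namespace>[^:}]+):(?P<partition>\d+)\} Returning null node, "
    r"could not find sync copy of this partition my_index (?P<my_index>-?\d+), "
    r"master (?P<master>[0-9a-fA-F]+) replica (?P<replica>[0-9a-fA-F]+)"
)


def parse_null_node(line):
    """Return the fields of one DEBUG line as a dict, or None if no match."""
    match = NULL_NODE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["partition"] = int(fields["partition"])
    fields["my_index"] = int(fields["my_index"])
    return fields


def count_by_partition(lines):
    """Count matching lines per (namespace, partition) pair."""
    counts = Counter()
    for line in lines:
        fields = parse_null_node(line)
        if fields is not None:
            counts[(fields["namespace"], fields["partition"])] += 1
    return counts


debug_sample = (
    "Aug 04 2015 17:13:31 GMT: DEBUG (partition): (partition.c:find_sync_copy:924) "
    "{NAMESPACE_NAME:2144} Returning null node, could not find sync copy of this "
    "partition my_index -1, master bb92cad4a12bcf8 replica bb998a74a12bcf8"
)
print(parse_null_node(debug_sample))
```

If the tally shows only a handful of partitions, that would narrow down which master and replica pairs to look at.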


#4

Any pointers on this would be really helpful. Am I in trouble because of this error, and what is the way to recover from it? Thanks.


#5

Hi Ashish,

Basically, something is wrong internally with how the server expects the partitions to be laid out. A workaround is to restart the principal node so that the partitions are rebalanced and everything converges again.

The above is only a workaround. This needs a deeper investigation on our side to see why things went wrong in the first place. Let's discuss this internally.


#6

Hello,

Just observed the same errors on our Aerospike cluster:

Aerospike version 3.5.3.

As suggested in this thread, I have advised rebooting the principal server as a workaround.

Is there a gentler way to solve this?

Marko Vrgotic