Getting lots of err_sync_copy_null_node Error in Aerospike

ashishbhutani · August 4, 2015, 10:25am

Hi

One of my cluster nodes has suddenly started reporting this error in large numbers. The count shot from 0 to 4477690 in around 15 minutes, and it is still growing.

I cannot work out the meaning or significance of this from the metrics page, which only says: “Number of errors during cluster state exchange because of missing general node information”.

This node is reachable from the other nodes in the cluster (checked with ping), there are no migrations happening (all are 0,0 in the logs), and the foreign heartbeat count is the same as on the other nodes. There is also nothing suspicious in the logs that points to a problem.

Can someone please help me understand this metric and whether I should be worried about it? As of now, we have an alert on all err* metrics (through the Aerospike collectd plugin), and if this is not serious, we will probably remove this alert. Thanks,

ashishbhutani · August 4, 2015, 4:26pm

Just adding to the above, for more information: there are error lines in the log as well:

Aug 04 2015 16:24:34 GMT: INFO (info): (thr_info.c::4766)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 37247268 :: master 0

And they are on all nodes now, not just one as I stated earlier. The metrics page says these counters should be zero. Can someone help me understand how critical this situation is and how we can recover from it? Thanks.
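For anyone who wants to track this counter outside of collectd, here is a minimal Python sketch that pulls the two "sync copy errs" values out of the info log line quoted above and estimates how fast the counter is growing. The regular expression is written against that one quoted line only, and the helper names (parse_sync_copy_errs, growth_per_minute) are made up for illustration; they are not part of any Aerospike tool.

```python
import re

# Matches the "sync copy errs" portion of the info log line quoted in this thread.
# Assumption: other server versions may format this line differently.
SYNC_COPY_RE = re.compile(r"sync copy errs :: node (\d+) :: master (\d+)")


def parse_sync_copy_errs(line):
    """Return (node_errs, master_errs) from a log line, or None if not present."""
    match = SYNC_COPY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def growth_per_minute(previous, current, minutes):
    """Hypothetical helper: average counter growth per minute between two samples."""
    return (current - previous) / minutes


if __name__ == "__main__":
    line = (
        "Aug 04 2015 16:24:34 GMT: INFO (info): (thr_info.c::4766)  "
        "replica errs :: null 0 non-null 0 ::: "
        "sync copy errs :: node 37247268 :: master 0"
    )
    print(parse_sync_copy_errs(line))
    # Growth reported in the first post: 0 to 4477690 in about 15 minutes.
    print(growth_per_minute(0, 4477690, 15))
```

Alerting on the growth rate rather than on the raw value may be less noisy, since the counter only ever increases until the node restarts.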

ashishbhutani · August 4, 2015, 5:19pm

Further information: switching the log level to debug shows these log lines:

“Aug 04 2015 17:13:31 GMT: DEBUG (partition): (partition.c:find_sync_copy:924) {NAMESPACE_NAME:2144} Returning null node, could not find sync copy of this partition my_index -1, master bb92cad4a12bcf8 replica bb998a74a12bcf8”
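To see which partitions are affected, the debug lines can be summarized with a short script. This is only a sketch based on the single line quoted above: reading the braces as {namespace:partition id} is my assumption, and the function name parse_null_sync_copy is hypothetical.

```python
import re

# Pattern built from the one debug line quoted in this thread.
# Assumption: the braces hold "{namespace:partition_id}".
NULL_NODE_RE = re.compile(
    r"\{(?P<namespace>[^:}]+):(?P<partition>\d+)\} Returning null node, "
    r"could not find sync copy of this partition my_index (?P<my_index>-?\d+), "
    r"master (?P<master>[0-9a-fA-F]+) replica (?P<replica>[0-9a-fA-F]+)"
)


def parse_null_sync_copy(line):
    """Return a dict of fields from a 'Returning null node' debug line, or None."""
    match = NULL_NODE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["partition"] = int(fields["partition"])
    fields["my_index"] = int(fields["my_index"])
    return fields


if __name__ == "__main__":
    line = (
        "Aug 04 2015 17:13:31 GMT: DEBUG (partition): "
        "(partition.c:find_sync_copy:924) {NAMESPACE_NAME:2144} Returning null node, "
        "could not find sync copy of this partition my_index -1, "
        "master bb92cad4a12bcf8 replica bb998a74a12bcf8"
    )
    print(parse_null_sync_copy(line))
```

Counting the distinct (namespace, partition) pairs across a log file would show whether the errors are confined to a few partitions or spread over many.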

ashishbhutani · August 5, 2015, 4:00pm

Any pointers on this would be really helpful. Am I in trouble because of this error, and what is the way to recover from it? Thanks.

sunil · August 5, 2015, 5:03pm

Hi Ashish,

Basically, something is wrong internally: the partitions are not in the state the server expects them to be. A workaround is to restart the principal node so that the partitions are rebalanced and everything converges again.

This is only a workaround. It will need a deeper investigation from our side to see why things went wrong in the first place. Let's discuss this internally.

mvrgotic · March 30, 2017, 11:10am

Hello,

We just observed the same errors on our Aerospike cluster:

Aerospike version 3.5.3.

As suggested in this thread, I have advised rebooting the principal server as a workaround.

Is there a gentler way to solve this?

Marko Vrgotic
