One of my cluster nodes has suddenly started reporting this error in large numbers. The count shot from 0 to 4477690.0 in around 15 minutes, and it is still growing.
I cannot work out the meaning or significance of this metric from its description on the metrics page: “Number of errors during cluster state exchange because of missing general node information”.
This node is reachable from the other nodes in the cluster (checked with ping), no migrations are in progress (all are 0,0 in the logs), and the foreign heartbeat count is the same as on the other nodes. Nothing in the logs looks suspicious or points to a problem.
Can someone please help me understand this metric and whether I should be worried about it? As of now, we have an alert on all err* metrics (through the Aerospike collectd plugin), and if this is not serious, we will probably remove this alert. Thanks.
The errors are now on all nodes, not just one as I stated earlier. This page says they should be zero. Can someone explain how critical this situation is and how we can recover from it? Thanks.
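On the question of alerting on all err* metrics: one option, rather than removing the alert outright, is to alert on how fast a counter grows instead of on its raw value. The following is a minimal sketch, not the collectd plugin's actual logic. It assumes the statistics arrive as a `name=value;name=value` string; the function names and the metric name `err_example_counter` are hypothetical and chosen only for illustration.

```python
def parse_stats(raw):
    """Parse a 'name=value;name=value' string into a dict of numeric counters."""
    stats = {}
    for pair in raw.strip().split(";"):
        name, sep, value = pair.partition("=")
        if not sep:
            continue  # skip empty or malformed entries
        try:
            stats[name] = float(value)
        except ValueError:
            continue  # skip non-numeric values such as version strings
    return stats


def err_rates(prev, curr, interval_s, prefix="err"):
    """Return the per-second growth of every counter whose name starts with prefix."""
    rates = {}
    for name, value in curr.items():
        if not name.startswith(prefix):
            continue
        delta = value - prev.get(name, 0.0)
        if delta < 0:
            # Counter went backwards (for example after a node restart):
            # treat the current value as the growth since the reset.
            delta = value
        rates[name] = delta / interval_s
    return rates


def alerts(rates, threshold_per_s):
    """Return the sorted names of counters growing faster than the threshold."""
    return sorted(name for name, rate in rates.items() if rate > threshold_per_s)


# Two samples taken 15 minutes (900 s) apart, using the numbers from this thread.
# 'err_example_counter' is a hypothetical metric name.
before = parse_stats("err_example_counter=0;cluster_size=4")
after = parse_stats("err_example_counter=4477690.0;cluster_size=4")
rates = err_rates(before, after, interval_s=900)
print(rates)                 # growth per second for each err* counter
print(alerts(rates, 1.0))    # counters growing faster than 1 error per second
```

With the numbers reported above, the counter is growing at roughly 4,975 errors per second, so a rate-based alert would still fire here while staying quiet for counters that are non-zero but stable.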
Basically, something is wrong internally with how the partitions are expected to be laid out. A workaround is to restart the principal node so that the partitions are rebalanced and everything converges again.
This is only a workaround. It needs a deeper investigation on our side to see why things went wrong in the first place. Let's discuss this internally.