We have a 4-node Aerospike Community Edition cluster set up on Azure, and we are facing a recurring issue.
Sometimes one of the nodes in the cluster starts to misbehave, which causes the whole cluster to misbehave.
What we noticed is that during this time the ping latency to the node increases to around 1000 ms, and mtr shows around 50% packet loss. We also pinged the node's own IP from the node itself, and there the latency is reasonable, but from or to anywhere else it is around 1000 ms. Because the drop rate is this high, heartbeats start to get missed and the node is thrown out of the cluster. This causes more issues, as the remaining nodes have to take over the traffic while this node keeps leaving and rejoining. The number of connections is not the trigger, because it stays in check. QPS does not seem to be a factor either: the cluster sometimes works fine during peaks and sometimes degrades during valleys.
If we stop Aerospike on that node, the ping latency drops back to normal right away. This happens about twice a day. Sometimes it comes back to normal on its own, and sometimes it needs a restart.
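In case it helps to line the network symptoms up against the Aerospike logs, here is a minimal sketch of how the ping loss and latency could be logged over time from another machine. It assumes a Linux `ping` (iputils) whose summary lines look like `50% packet loss` and `rtt min/avg/max/mdev = ...`. The node address and the 30-second interval are made-up placeholders, and the function names are just for illustration.

```python
import re
import subprocess
import time


def parse_ping_summary(output):
    """Return (loss_percent, avg_rtt_ms) parsed from Linux ping summary text.

    Either value is None when the matching summary line is absent,
    e.g. there is no rtt line when no replies were received.
    """
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    rtt = re.search(r"=\s*[\d.]+/([\d.]+)/", output)  # second field is the average
    loss_pct = float(loss.group(1)) if loss else None
    avg_ms = float(rtt.group(1)) if rtt else None
    return loss_pct, avg_ms


def probe(host, count=5):
    """Send `count` echo requests to `host` and return the parsed summary."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return parse_ping_summary(result.stdout)


if __name__ == "__main__":
    NODE = "10.0.0.4"  # hypothetical address of the affected node
    INTERVAL_S = 30    # hypothetical sampling interval
    while True:
        loss_pct, avg_ms = probe(NODE)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} loss={loss_pct}% avg_rtt={avg_ms}ms", flush=True)
        time.sleep(INTERVAL_S)
```

Running this from one of the healthy nodes and redirecting the output to a file would give timestamps that can be compared with the moments the node drops out of the cluster.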
Any solution to this?