We have a 4-node Community Edition cluster set up on Azure, and we are facing a recurring issue.
Sometimes one of the nodes in the cluster starts to misbehave, causing the whole cluster to misbehave.
What we noticed is that during this time the ping latency to the node increases to around 1000ms. Running mtr shows around 50% packet loss. We also pinged the node's own IP from the node itself, and those latencies are reasonable, but from/to anywhere else it is around 1000ms. Since the drop rate is high at this point, heartbeats start to get missed and the node is thrown out of the cluster. This causes more issues, as the remaining nodes have to take over the traffic while this node keeps rejoining and dropping out. The number of connections is not the trigger here, because it stays in check. QPS is not a factor either: the node sometimes works fine during peaks and sometimes degrades during valleys.
If we stop Aerospike on that node, the ping latencies drop back to normal right away. This happens about twice a day. Sometimes it comes back to normal on its own, and sometimes it needs a restart.
Are you saturating the NIC? This sounds like you’re pushing the instance harder than it can go, or you have a bad instance. Either way, I think you should probably reach out to Azure support on this.
Thanks for the reply. I checked the network bytes on the host, and they were the same as on the other boxes.
Also, if the instance is bad, wouldn’t that still be the case once I switch off Aerospike? But as soon as I switch off Aerospike, the ping latencies drop back to normal.
Lots of problems only manifest while a system is under load. Just because packet drops stop occurring when you have ~1/1000th of the traffic doesn’t mean it’s not having a problem, IMO. I’ve seen it plenty of times… One thing you can use to ‘make your case’ is showing the ops/s across all the nodes, and showing that this one node is behaving differently. This is why it’s so easy to find infra problems on cloud with Aerospike.
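To make the "this one node is behaving differently" comparison concrete, here is a rough Python sketch. It assumes you have already pulled a per-node ops/s figure from your monitoring. The function name, the 30% tolerance, and the sample numbers are all made up for illustration and are not taken from this cluster.

```python
from statistics import median


def find_outlier_nodes(ops_per_sec, tolerance=0.3):
    """Return the nodes whose ops/s differs from the cluster median
    by more than `tolerance` (a fraction of the median)."""
    mid = median(ops_per_sec.values())
    if mid == 0:
        # No traffic at all, so there is nothing to compare against.
        return []
    return sorted(
        node
        for node, ops in ops_per_sec.items()
        if abs(ops - mid) / mid > tolerance
    )


# Hypothetical values for a 4-node cluster, for illustration only.
sample = {
    "node-a": 10200.0,
    "node-b": 9900.0,
    "node-c": 10050.0,
    "node-d": 4100.0,
}

outliers = find_outlier_nodes(sample)
print(outliers)
```

Using the median rather than the mean keeps a single bad node from dragging the baseline toward itself. The tolerance would need tuning to how evenly your traffic is normally spread.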
I am not saying it is Aerospike’s fault per se. I am just asking whether there is anything I can check to figure out the exact problem, or whether anyone has faced a similar issue and can share their experience.
Check out all the good node_exporter metrics, since it looks like you’re using Prometheus. You seem to have already diagnosed it, though. If you have packet drops, tell your provider to get you a new instance. In AWS it’s as easy as just stopping the instance and starting it again, but I’m not sure about Azure.
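As a starting point for the node_exporter side, here is a small Python sketch that compares the interface drop counters between two scrapes. It assumes the standard `node_network_receive_drop_total` and `node_network_transmit_drop_total` counters are exposed, and that you have saved the text of the exporter's `/metrics` page twice (by default node_exporter listens on port 9100). The helper names and the sample scrape text are hypothetical.

```python
import re

# Matches lines such as: node_network_receive_drop_total{device="eth0"} 1234
_DROP_LINE = re.compile(
    r'^(node_network_(?:receive|transmit)_drop_total)'
    r'\{device="([^"]+)"\}\s+(\S+)$'
)


def parse_drop_counters(metrics_text):
    """Return {(device, metric_name): value} for the drop counters."""
    counters = {}
    for line in metrics_text.splitlines():
        match = _DROP_LINE.match(line.strip())
        if match:
            name, device, value = match.groups()
            counters[(device, name)] = float(value)
    return counters


def drop_deltas(before_text, after_text):
    """Return how much each drop counter grew between two scrapes."""
    before = parse_drop_counters(before_text)
    after = parse_drop_counters(after_text)
    return {key: after[key] - before.get(key, 0.0) for key in after}


# Hypothetical scrape output, for illustration only.
scrape_1 = """
# TYPE node_network_receive_drop_total counter
node_network_receive_drop_total{device="eth0"} 100
node_network_transmit_drop_total{device="eth0"} 5
node_network_receive_drop_total{device="lo"} 0
"""
scrape_2 = """
# TYPE node_network_receive_drop_total counter
node_network_receive_drop_total{device="eth0"} 350
node_network_transmit_drop_total{device="eth0"} 5
node_network_receive_drop_total{device="lo"} 0
"""

deltas = drop_deltas(scrape_1, scrape_2)
for (device, name), delta in sorted(deltas.items()):
    print(device, name, delta)
```

A counter that keeps climbing on the affected node while staying flat on the others would be the kind of evidence worth attaching to a support ticket. Note that drops inside the provider's network may not show up in these host-level counters at all, so a flat counter does not rule out a bad instance.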