Packet drops very high on one node

Gunjan_Sharma · July 7, 2019, 2:15pm

Hello All

We have a 4-node Community Edition cluster set up on Azure, and we are facing a recurring issue.

Sometimes one of the nodes in the cluster starts to misbehave, causing the whole cluster to misbehave.

What we noticed is that during this time the ping latencies to the node increase to around 1000ms, and mtr shows around 50% packet loss. We also pinged the node's own IP from the node itself, and those latencies are reasonable, but from/to anywhere else they are around 1000ms. Since the drop rate is high, heartbeats start to be missed and the node is thrown out of the cluster. This causes more issues, as the remaining nodes have to take over the traffic while this node keeps leaving and rejoining. The number of connections is not the trigger, because it stays within limits. QPS is also not a factor, since the node works fine at times during peaks and degrades at times during valleys.

If we stop Aerospike on that node, the ping latencies drop back to normal right away. This happens almost twice a day; sometimes it comes back to normal on its own, and sometimes it needs a restart.

Any solution to this?
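
To put rough numbers on the heartbeat reasoning above, here is a minimal Python sketch that estimates how likely a node is to miss a full run of consecutive heartbeats at a given packet loss rate. It assumes each heartbeat is lost independently (real loss tends to be bursty, which makes long runs more likely), and the interval of 150ms and timeout of 10 missed intervals are assumed values, so substitute whatever your own configuration uses. At 50% loss, any single window of 10 intervals is missed entirely with probability 0.5^10, roughly 0.1%, but over many intervals such a run becomes much more likely.

```python
def prob_run_of_losses(p_loss: float, run_len: int, n_intervals: int) -> float:
    """Probability of at least one run of `run_len` consecutive lost
    heartbeats within `n_intervals`, assuming independent losses."""
    # state[j] = probability that the trailing run of losses has length j
    # and no run of length run_len has happened yet.
    state = [0.0] * run_len
    state[0] = 1.0
    hit = 0.0  # probability that a full run has already occurred
    for _ in range(n_intervals):
        nxt = [0.0] * run_len
        for j, pr in enumerate(state):
            if pr == 0.0:
                continue
            nxt[0] += pr * (1.0 - p_loss)      # heartbeat arrives, run resets
            if j + 1 == run_len:
                hit += pr * p_loss             # run completed: node times out
            else:
                nxt[j + 1] += pr * p_loss      # run grows by one
        state = nxt
    return hit


if __name__ == "__main__":
    INTERVAL_MS = 150   # assumed heartbeat interval
    TIMEOUT = 10        # assumed number of missed intervals before eviction
    intervals_per_hour = 3600 * 1000 // INTERVAL_MS
    p = prob_run_of_losses(0.5, TIMEOUT, intervals_per_hour)
    print(f"P(at least one timeout in an hour at 50% loss) = {p:.4f}")
```

Under these assumptions the function shows why a node at 50% loss would keep dropping out and rejoining rather than failing once.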

Albot · July 7, 2019, 6:07pm

Are you saturating the NIC? This sounds like you’re pushing the instance harder than it can go, or you have a bad instance. Either way, I think you should probably reach out to Azure support on this.

Gunjan_Sharma · July 8, 2019, 5:12am

Hey @Albot

Thanks for the reply. I checked the network bytes on the host, and they were the same as on the other boxes.

Also, if the instance is bad, wouldn’t the problem still be there once I switch off Aerospike? But I see that as soon as I switch off Aerospike, the ping latencies drop back to normal.
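
Byte counts alone do not show drops, so a check of the per-interface drop and error counters may be worth adding. The following is a minimal Python sketch that parses the text of `/proc/net/dev` on a Linux host. The function name `parse_net_dev` and the sample data are made up for illustration, and drops that happen outside the guest (for example in the cloud provider's network) would not appear in these counters.

```python
RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed")


def parse_net_dev(text: str) -> dict:
    """Parse /proc/net/dev content into {iface: {"rx": {...}, "tx": {...}}}."""
    stats = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        values = [int(v) for v in rest.split()]
        if len(values) < 16:
            continue  # unexpected layout, skip rather than guess
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:16])),
        }
    return stats


# Illustrative sample only; these numbers are not from the cluster in question.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 5000000    4000    2  150    0     0          0         0  3000000    3500    0    7    0     0       0          0
"""

if __name__ == "__main__":
    for iface, c in parse_net_dev(SAMPLE).items():
        print(iface, "rx drop:", c["rx"]["drop"], "tx drop:", c["tx"]["drop"])
```

On a real host, pass `open("/proc/net/dev").read()` instead of `SAMPLE`, take two readings a few seconds apart, and compare the difference in the `drop` and `errs` counters between the bad node and a healthy one.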

pgupta · July 8, 2019, 6:02am

With all nodes running, try:

$ asadm 
Admin> show config diff

… do you see something different about this one node, configuration wise?

Gunjan_Sharma · July 8, 2019, 6:05am

Only the usual things (heartbeat.address, node-id, service.access-address, etc.).

Albot · July 8, 2019, 11:15pm

Lots of problems only manifest while a system is under load. Just because packet drops stop occurring when you have ~1/1000th of the traffic doesn’t mean it’s not having a problem, IMO. I’ve seen it plenty of times… One thing you can use to ‘make your case’ is showing the ops/s across all the nodes, and showing that this one node is behaving differently. This is why it’s so easy to find infra problems on cloud with Aerospike.
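
As a rough illustration of the per-node comparison suggested here, the Python sketch below flags nodes whose ops/s deviate from the cluster median by more than a chosen fraction. The node names, the numbers, and the 30% tolerance are all hypothetical; in practice the values would come from whatever metrics system is already collecting them.

```python
from statistics import median


def flag_outliers(ops_by_node: dict, tolerance: float = 0.3) -> list:
    """Return nodes whose ops/s differ from the median by more than
    `tolerance` (a fraction of the median), sorted by name."""
    if not ops_by_node:
        return []
    mid = median(ops_by_node.values())
    if mid == 0:
        return []  # no meaningful baseline to compare against
    return sorted(
        node for node, ops in ops_by_node.items()
        if abs(ops - mid) / mid > tolerance
    )


if __name__ == "__main__":
    # Hypothetical ops/s readings for a 4-node cluster.
    sample = {"node-a": 1000, "node-b": 1020, "node-c": 980, "node-d": 400}
    print(flag_outliers(sample))
```

The median is used rather than the mean so that a single badly behaving node does not drag the baseline toward itself.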

Gunjan_Sharma · July 9, 2019, 5:48am

Hey @Albot

Attaching the QPS graph. Orange is the misbehaving node.

I am not saying it is Aerospike’s fault per se. I am just asking whether there is anything I can check to figure out the exact problem, or whether anyone has faced a similar issue and can share their experience.

Albot · July 9, 2019, 4:38pm

Check out all the good node_exporter metrics, since it looks like you’re using Prometheus. You seem to have already diagnosed it, though. If you have packet drops, tell your provider to get you a new instance. In AWS it’s as easy as stopping the instance and starting it again, but I’m not sure about Azure.
