Higher Latencies on a Few Particular Nodes

Hi,

We are running a 4-node Aerospike cluster in AWS. We use the i3.2xlarge instance type, which comes with 32 GB of RAM and a 900 GB ephemeral volume, and we also use a shadow EBS volume to replicate the writes. All of these instances are in the same AWS region and availability zone. We had been running this cluster without any issues for around a year.

But over the past month, we have seen higher latencies on one particular instance. The read, write, and UDF latencies of the other 3 nodes remain normal, while that one node alone shows abrupt peaks many times the normal average. We thought this might be a local hardware issue with the EC2 instance, so we tried replacing the node, but the newly replaced node showed the same higher latencies. We compared configs, and they were identical across all 4 nodes. We also compared benchmarks, microbenchmarks, etc., and in all of them only that one node showed an anomaly while the other nodes looked fine.
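For reference, that cross-node comparison can be done with the bundled tools roughly as below; the exact command modifiers may differ slightly between tools versions, and the node address is a placeholder.

    # Show only the config values that differ between nodes
    asadm -e "show config diff"

    # Per-node read/write/udf latency histograms, side by side
    asadm -e "show latency"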

These higher latencies haven’t caused any issues as such, but we really want to understand the reason for this unexplained behaviour on that one node. We have gone through various blogs, suggestions, and documentation, and have exhausted the resources we could find. We also tried replacing one of the 3 nodes that had lower latencies, and the newly introduced node is now showing higher latency as well.

Hence, we are looking for ideas on how to debug this, resolution steps, things to watch out for, etc., to find the reason for the increased latencies on these particular nodes.

We are using Aerospike version 3.13, and our system has services using the Go, PHP, and Java clients.


Hey Janardhanan,

The sunset for v3.13 was over 2 years ago. I’d recommend upgrading your version.

And some additional recommendations:


Turning on microbenchmarks would indeed help narrow down where the latency is coming from on that node. Other than that, I would recommend a thorough analysis. The best approach is to graph all metrics and look for patterns, specifically around workloads (reads, writes, and their success/not-found/failure rates) and around background tasks that could be weighted differently on the node behaving differently, such as defragmentation and nsup cycles. It’s hard to provide more input without analyzing the full logs.
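To make that concrete, here is a rough sketch of enabling the per-transaction benchmarks dynamically on the suspect node and pulling the resulting histograms out of its log on a 3.x server. The host and namespace names are placeholders, so double-check the exact parameter and histogram names against the docs for your build.

    # Enable detailed read/write benchmarks on the suspect node only
    # (namespace "test" and the host names are placeholders)
    asinfo -h <suspect-node> -v "set-config:context=namespace;id=test;enable-benchmarks-read=true"
    asinfo -h <suspect-node> -v "set-config:context=namespace;id=test;enable-benchmarks-write=true"

    # Slice the benchmark histograms out of the suspect node's log over time
    asloglatency -l /var/log/aerospike/aerospike.log -h "{test}-read"
    asloglatency -l /var/log/aerospike/aerospike.log -h "{test}-write"

    # Compare defrag/nsup-related namespace statistics between a healthy node and the suspect one
    asinfo -h <healthy-node> -v "namespace/test"
    asinfo -h <suspect-node> -v "namespace/test"

Graphing these per-node over time (rather than comparing single snapshots) is usually what exposes a pattern such as defrag load or nsup cycles lining up with the latency spikes.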

@Janardhanan_V_S You have described the problem well. I never expected this work to be so easy. Thanks for doing this. :slightly_smiling_face:
