We are running a 4-node Aerospike cluster in AWS on i3.2xlarge instances (32GB RAM and a 900GB ephemeral volume each), with a shadow EBS volume on each node to replicate the writes as well. All instances are in the same AWS region and availability zone. The cluster ran without any issues for around a year, but in the past month one particular node has been showing higher latencies. The read, write, and UDF latencies of the other 3 nodes remain normal, while that one node alone shows abrupt peaks many times the normal average. We suspected a local hardware issue with the underlying EC2 instance and tried replacing the node, but the newly replaced node also showed the same higher latencies. We compared configs, and they are identical across all 4 nodes. We also compared benchmarks, microbenchmarks, etc., and in all of them only that one node shows the anomaly; all the other nodes are fine.
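For reference, these are roughly the commands we used to compare latencies and enable microbenchmarks per node (the namespace name `test` and `<node-ip>` are placeholders for our actual values; exact `asinfo` output format may vary by version):

```shell
# Per-node latency histograms (read/write/udf) across the whole cluster
asadm -e "show latency"

# Raw read-latency histogram for one namespace on a specific node
# ("test" is a placeholder namespace name)
asinfo -h <node-ip> -v "latency:hist={test}-read"

# Enable storage microbenchmarks on the slow node for deeper inspection
asinfo -h <node-ip> -v "set-config:context=namespace;id=test;enable-benchmarks-storage=true"
```

In every one of these views, the histograms of the three healthy nodes track each other closely, and only the one node shows the elevated buckets.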
These higher latencies haven't caused any issues as such, but we really want to understand the unexplained behaviour of this one node. We have tried various blogs, suggestions, and documentation, and have exhausted our resources. We even tried replacing one of the 3 nodes that had lower latencies, and that newly introduced node is now also showing higher latency.
Hence, we are looking for ideas on how to debug this, resolution steps, and things to check, so we can find the reason for the increased latencies on particular nodes.
We are using Aerospike version 3.13, and our services use the Go, PHP, and Java clients.