I have a cluster of 4x r3.8xl machines and all worked nicely for a month or two.
Suddenly today I started getting huge latencies on put transactions: they jumped from 1 ms to 40-60 seconds!
The Aerospike servers are using 4-5% CPU and configured for in-memory only.
Can you check your network utilization? Do you by any chance have the cluster configured with nodes across multiple availability zones (this is not recommended)? If yes, you may want to check the inter-node network latency.
Is there any persistence, or is the data purely in memory with no persistence?
I would check the following:
- Server Side latency as seen in the histogram
- Enable microbenchmarks and see which stage is taking the time. Please check Reading Microbenchmarks on how to read them.
- If the data is purely in memory, check the network behavior of the Amazon boxes.
- If there is persistence with EBS, check how your disks are behaving using iostat.
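
For the network check, something like the following could be run from a client box (or from one node against the others) to get a rough picture of node-to-node latency. It is only a sketch: it times plain TCP connects using the Python standard library, not real Aerospike transactions. It assumes the nodes listen on the default service port 3000, and the function names `tcp_connect_ms` and `probe` are made up for illustration.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port=3000, timeout=2.0):
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    # create_connection raises OSError (including timeouts) on failure
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def probe(host, port=3000, attempts=20, timeout=2.0):
    """Sample connect latency several times and summarize the results."""
    samples, errors = [], 0
    for _ in range(attempts):
        try:
            samples.append(tcp_connect_ms(host, port, timeout))
        except OSError:
            # Refused, unreachable, or timed out: count it, keep going
            errors += 1
    return {
        "host": host,
        "errors": errors,
        "min_ms": min(samples) if samples else None,
        "median_ms": statistics.median(samples) if samples else None,
        "max_ms": max(samples) if samples else None,
    }
```

Calling `probe("10.0.0.11")` for each node address (that IP is a placeholder) and comparing the medians should show whether one node, or one availability zone, stands out. Connect times in the tens of milliseconds or a non-zero error count inside a single zone would point at the network rather than at the server.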