We currently have a cluster of 9 nodes running build 18.104.22.168. We are observing a sharp increase in Aerospike latencies starting at 11:30 PM daily. The latency stays elevated for a few hours even though throughput is decreasing during this time. Attaching a few screenshots of latency and throughput during this interval.
Aerospike’s share of the overall web latency increases during this time, as you can see in the screenshot below.
Overall throughput on the website.
Aerospike Reads throughput.
Aerospike Writes throughput.
We aren’t sure what causes this rise in latency and want to dig into it to find and fix the root cause. Specifically, we are looking for a way to see which queries are reaching the server, and from which IP addresses, during this time. We tried changing the log level of several contexts but still couldn’t find this information.
The rw-client logging context was introduced in 3.16, so changing it has no effect on our build. Is there an alternative for it in our version?
Backup starts at 3:30 AM, so we have ruled it out as a probable cause of this slowness, since the latency starts increasing long before that. But we would like to know exactly what is happening during this entire interval.
How should we go about debugging this? What metrics should we look into which can give us more insight? Thanks in advance.