Latency spike daily at same time. Aerospike build 3.13.0.10

Hi,

We run a 9-node cluster on build 3.13.0.10. Every day, starting at 11:30pm, Aerospike latencies rise sharply. Latency stays elevated for a few hours even though throughput is decreasing during that window. I am attaching a few screenshots of latency and throughput for this interval.

Aerospike’s share of the overall web latency increases during this time, as you can see in the screenshot below.

Overall throughput on the website.

Aerospike Reads throughput.

Aerospike Writes throughput.

We aren’t sure what causes this rise in latency and want to dig in to find and fix the root cause. We were hoping to see which queries reach the server during this window and from which IP addresses, so we tried changing the log level of several contexts, but still couldn’t find this information. The rw-client context was introduced in 3.16, so changing it does nothing on our build. Is there an alternative for this version?

Backup starts at 3:30am, well after the latency begins to rise, so we have ruled it out as a probable cause. But we would like to know exactly what is happening during this entire interval.

How should we go about debugging this? Which metrics should we look at to get more insight? Thanks in advance.

Interesting. This is a pretty old version, and newer versions do log some potentially useful information.

Personally, I would start by checking the latency details on the server side, and maybe turn on some of the benchmarks, to see whether the latency increase shows up on the server side and, if so, in which slice(s). (See the Monitoring Latency doc.)
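
To make the server-side check concrete, here is a minimal Python sketch that parses a latency summary and flags the histograms where a noticeable share of operations is slow. It assumes the semicolon-separated response format that older servers return for the `latency:` info command (for example via `asinfo -v "latency:"`); please verify that format against your own 3.13 output. The function names, the `>8ms` column choice, the 0.5% threshold, and the sample string are all made up for illustration and are not real measurements.

```python
def parse_latency(response):
    """Parse an assumed 'latency:' info response into {histogram: {column: value}}."""
    # Assumed layout: "name:time,col,col,...;time,val,val,...;name:..." in pairs.
    # Entries starting with "error" (no data yet) are skipped.
    parts = [p for p in response.strip().split(";") if p and not p.startswith("error")]
    result = {}
    for header, values in zip(parts[0::2], parts[1::2]):
        name, _, columns = header.partition(":")   # split on the first colon only
        col_names = columns.split(",")[1:]         # drop the start-time field
        col_values = values.split(",")[1:]         # drop the end-time field
        result[name] = dict(zip(col_names, map(float, col_values)))
    return result


def slow_histograms(parsed, column=">8ms", threshold_pct=0.5):
    """Return histogram names whose share of ops in `column` exceeds the threshold."""
    return sorted(
        name for name, cols in parsed.items()
        if cols.get(column, 0.0) > threshold_pct
    )


# Hypothetical sample response, for illustration only.
sample = (
    "{test}-read:23:30:00-GMT,ops/sec,>1ms,>8ms,>64ms;"
    "23:30:10,5234.4,12.50,0.94,0.00;"
    "{test}-write:23:30:00-GMT,ops/sec,>1ms,>8ms,>64ms;"
    "23:30:10,1200.0,3.10,0.20,0.00"
)

if __name__ == "__main__":
    print(slow_histograms(parse_latency(sample)))
```

Running something like this on each node shortly before and during the 11:30pm window would show whether the server itself reports slower reads or writes, or whether the extra latency is only visible from the client side.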
