The problem: we started getting occasional timeouts on more or less random operations with our Aerospike cluster. The timeouts are server-side timeouts (error code 9).
The setup: 4-node cluster, AS version 184.108.40.206, data not in memory. Gigs and gigs of free RAM, terabytes of free disk space on NVMe drives, CPU under 1.5% all the time (we might have slightly overprovisioned here). Typical load is only around 200-400 reads per second and the same range for writes.
I have already checked everything that I can check in the logs. Nothing seems to be getting backed up (all queues are empty in every report), and this does not seem to be a network problem, since the server gets the request and sends back the response with the error code.
The histograms say that the majority of reads and writes are sub-ms, and the longest ones take 12 ms (rarely), but the timeout counters go up from time to time and there are no entries in the logs that correspond to that.
The issue seems to manifest only in operations on the sets where records expire. We used to have only one such set: it's rather large, but everything in it expires about a year in the future. But we started storing a lot of short-lived records that are often getting deleted before they even get a chance to expire.
We also started getting timeouts from executing UDFs, but I have not dug deeper into these yet.
How should I proceed? In particular, how can I get insight into why the server reports non-zero client_tsvc_timeout and client_write_timeout?
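One way to narrow this down might be to snapshot the namespace statistics at short intervals and diff the timeout counters, so that each increment can be matched to a time window in the logs and histograms. Below is a minimal sketch, assuming the statistics arrive as a semicolon-separated key=value string (as returned by something like `asinfo -v "namespace/<ns>"`). The helper names and the sample snapshot strings are made up for illustration and are not real output from this cluster.

```python
def parse_info_stats(raw):
    """Parse 'k1=v1;k2=v2' into a dict, keeping only integer-valued stats."""
    stats = {}
    for pair in raw.strip().split(";"):
        key, sep, value = pair.partition("=")
        if sep and value.lstrip("-").isdigit():
            stats[key] = int(value)
    return stats


def timeout_deltas(before, after, suffix="_timeout"):
    """Return the counters ending in `suffix` that grew between two snapshots."""
    deltas = {}
    for key, new_value in after.items():
        if key.endswith(suffix):
            diff = new_value - before.get(key, 0)
            if diff > 0:
                deltas[key] = diff
    return deltas


# Hypothetical snapshots taken a few seconds apart (values are invented).
snap1 = "client_tsvc_timeout=10;client_write_timeout=4;client_read_success=1000"
snap2 = "client_tsvc_timeout=12;client_write_timeout=4;client_read_success=1500"

result = timeout_deltas(parse_info_stats(snap1), parse_info_stats(snap2))
print(result)
```

Logging the non-empty deltas with a timestamp would at least show whether the increments cluster around particular moments, such as bursts of deletes on the short-lived set.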
Thank you, the ideas are much appreciated. Dmytro