Looking for ideas in troubleshooting occasional server timeouts

Dmytro_Zakharov · April 11, 2018, 2:03am

Hi,

The problem: we started getting occasional timeouts on more or less random operations with our Aerospike cluster. The timeouts are server-side timeouts (error code 9).

The setup: 4-node cluster, AS version 3.13.0.7, data not in memory. Gigs and gigs of free RAM, terabytes of free disk space on NVMe drives, CPU under 1.5% all the time (we might have slightly overprovisioned here). Typical load is only around 200-400 reads per second and the same range for writes.

I have already checked everything that I can check in the logs. Nothing seems to be getting backed up (all queues are empty in every report). This does not seem to be a network problem either, since the server gets the request and sends back the response with the error code.

The histograms say that the majority of reads and writes are sub-ms and that the longest ones take 12 ms (rarely), but the timeout counters go up from time to time and there are no entries in the logs that correspond to that.

The issue seems to manifest only in operations on the sets where records expire. We used to have only one such set. It's rather large, but everything in it expires about a year in the future. However, we have started storing a lot of short-lived records that often get deleted before they even get a chance to expire.

We also started getting timeouts from executing UDFs, but I have not dug deeper into these yet.

How should I proceed? In particular, how can I get insight into why the server reports non-zero client_tsvc_timeout and client_write_timeout?
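One way to start narrowing this down might be to sample the two counters at a fixed interval and line up the moments they increase against what the application was doing at the time. Below is a minimal sketch of the diffing step, assuming the statistics arrive as semicolon-separated `key=value` text (the format the info protocol uses for namespace statistics). The helper names `parse_stats` and `counter_deltas` are hypothetical, and the sample values are made up for illustration.

```python
from typing import Dict, Iterable

# Counter names taken from the question above.
TIMEOUT_COUNTERS = ("client_tsvc_timeout", "client_write_timeout")


def parse_stats(raw: str) -> Dict[str, str]:
    """Parse a 'key=value;key=value' statistics string into a dict."""
    stats: Dict[str, str] = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key.strip()] = value.strip()
    return stats


def counter_deltas(
    before: str, after: str, names: Iterable[str] = TIMEOUT_COUNTERS
) -> Dict[str, int]:
    """Return how much each named counter grew between two samples."""
    old, new = parse_stats(before), parse_stats(after)
    # A counter missing from a sample is treated as 0.
    return {n: int(new.get(n, 0)) - int(old.get(n, 0)) for n in names}


if __name__ == "__main__":
    # Made-up sample values, not real cluster output.
    sample_1 = "client_tsvc_timeout=10;client_write_timeout=4;client_write_success=1000"
    sample_2 = "client_tsvc_timeout=12;client_write_timeout=4;client_write_success=1350"
    print(counter_deltas(sample_1, sample_2))
```

Logging the non-zero deltas with a timestamp on each node would at least show whether the timeouts cluster around particular nodes or particular times, such as bursts of deletes on the short-lived set.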

Thank you, the ideas are much appreciated. Dmytro

pgupta · April 11, 2018, 2:45am

Version 3.13.4?

Dmytro_Zakharov · April 11, 2018, 4:30pm

I apologize, I missed one .0, it was supposed to be 3.13.0.4. But it also turned out that we are already on 3.13.0.7. Thanks
