Timeouts ivestigation

Johanan · October 19, 2018, 1:32pm

I know there were questions about timeouts on the forum, like this one:

But I want to know server timeouts reason. How can I understand, what is the bottleneck of my system? Is it high load on the server, that makes it drop some requests or is it a net problem? Does Aerospike contains metrics that would help me to answer this questions?

Albot · October 20, 2018, 2:16am

If the problem is suspected to be on the Aerospike system, this is a good place to start. https://www.aerospike.com/docs/operations/monitor/latency/

Johanan · October 24, 2018, 12:13pm

Thank you for the answer! It has taken some time for me to investigate logs. They contain write and read histogram only by now, the description of those two is

{ns}-read Time taken for read requests from the time they are received at the node to when the response leaves the node. {ns}-write Time taken for writes from end-to-end (includes the time taken for replica write). Does not include deletes.

And I see that ‘total time’ is growing from start up to the end. Am I right that ‘total time’ is a sum of all write-operation, not the value for the distinct one?

kporter · October 24, 2018, 11:09pm

I’m not sure what you mean by ‘toal-time’. Each histogram includes a ‘total’ which is the number of data-points the histogram has collected.

Albot · October 26, 2018, 4:20am

Can you paste which commands you’re running and their output? We’re not sure what you’re using

Johanan · October 26, 2018, 8:28am

Right, now I can provide you with more information. Thank you for helping!

I have about 1.000 - 2.000 writes request per second for 5 nodes
On a node, using

asloglatency -h {user-profiles}-write

i see very low latency

slice-to (sec)      1      8     64    ops/sec 
-------------- ------ ------ ------ ----------
 07:58:36    10   2.58   0.00   0.00      221.2  
 07:58:46    10   3.25   0.00   0.00      215.5  
 07:58:56    10   3.08   0.00   0.00      217.2  
 07:59:06    10   3.28   0.00   0.00      216.5  
 07:59:16    10   3.35   0.00   0.00      223.9

If I get Aerospike logs from

/var/log/aerospike/aerospike.log

and extract ‘histogram’ records from it, I see following:

Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (82 total) msec
Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (214 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (160 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (414 total) msec
Oct 04 2018 10:21:33 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (276 total) msec

I’ve got an impression that msec is standing for milliseconds and the number with ‘total’ is total time. Probably I’m wrong, as you say. Please, explain me then what the metrics meaning is in such a case.

I have about 250-300 sessions per node
I have to set up timeouts for my write operation to 5 seconds, otherwise I would have a few timeouts exceptions. Actually 5s. is not a silver bullet for I still have several timeouts. According to (2) I accuse client in it.

kporter · October 26, 2018, 8:23pm

The total is the number of datapoints the histogram has collected, the ‘msec’ label is indicating the unit that is being collected, the actual histogram follows the ‘histogram dump’ line.

Could you run asloglatency with:

asloglatency -e1 -n8 -h '{user-profiles}-write'

Johanan · October 29, 2018, 11:50am

I’ve calculated aggregated metrics for all 5 of my machines with following

asloglatency -e1 -n16 -h '{user-profiles}-write' -f 'Aug 30 2018 12:00:00' -t 24:00:00

And that’s what I get

 avg         0.86   0.13   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      106.0
 max        39.76  13.11   9.47   5.27   4.28   4.22   4.10   3.90   3.67   3.40   2.42   0.56   0.03   0.00   0.00   0.00      612.3
 avg         0.76   0.13   0.02   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      100.0
 max        49.68  30.32  12.90   3.57   3.57   3.53   3.46   3.42   3.34   2.92   2.15   0.58   0.00   0.00   0.00   0.00      576.7
 avg         0.81   0.13   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00       99.0
 max        37.87  15.95  12.16  12.10   7.80   7.70   7.54   7.25   6.92   5.95   3.91   0.75   0.02   0.00   0.00   0.00      583.8
 avg         0.88   0.15   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00       96.0
 max        15.72   5.92   3.19   2.95   2.95   2.88   2.84   2.84   2.69   2.55   1.70   0.54   0.00   0.00   0.00   0.00      554.4
 avg         0.83   0.15   0.02   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00       98.0
 max        40.04  16.98   5.31   2.43   2.34   2.34   2.34   2.30   2.23   2.03   1.56   0.24   0.00   0.00   0.00   0.00      577.9

So, as you see I have several spikes above 4096 ms. What should I do to understand the reason of them? In average the most of requests are under 8 ms and that is great, but I have to know whether those spikes (which lead to timeouts) are going to grow when the cluster have to stand 10 ts. requests instead of current 2 ts.?
Now, having about 1500 requests/sec I have about 1500 timeouts in 7 days and probably we could live with it, but we have to understand reasons.

Albot · November 1, 2018, 11:50am

In the article https://www.aerospike.com/docs/operations/monitor/latency/ it explains that you can enable more detailed benchmarks to further get down to the cause of latency - now that you know there is latency… like this:

asinfo -v 'set-config:context=namespace;id=<namespaceName>;enable-benchmarks-write=true'

It would be good for us to know, too, if the latency lasted minutes or seconds or how does the issue appear? Is it on all nodes or only one? Is there any errors reported from the client side? (Hotkey?)

Topic		Replies	Views
Handling timeout in case of counter bin	5	3302	May 22, 2018
Looking for ideas in troubleshooting occasional server timeouts Tuning	2	1244	April 11, 2018
Java client return timeout once one of the nodes is down Client Libraries java	2	862	May 4, 2022
Observing steady increase timeouts from aerospike server error	0	1735	March 10, 2016
Getting Aerospike timeout when posting 1M records Java Client	3	3450	February 25, 2015

Timeouts ivestigation

Related topics