Timeouts ivestigation


#1

I know there were questions about timeouts on the forum, like this one:

But I want to know server timeouts reason. How can I understand, what is the bottleneck of my system? Is it high load on the server, that makes it drop some requests or is it a net problem? Does Aerospike contains metrics that would help me to answer this questions?


#2

If the problem is suspected to be on the Aerospike system, this is a good place to start. https://www.aerospike.com/docs/operations/monitor/latency/


#3

Thank you for the answer! It has taken some time for me to investigate logs. They contain write and read histogram only by now, the description of those two is

{ns}-read Time taken for read requests from the time they are received at the node to when the response leaves the node. {ns}-write Time taken for writes from end-to-end (includes the time taken for replica write). Does not include deletes.

And I see that ‘total time’ is growing from start up to the end. Am I right that ‘total time’ is a sum of all write-operation, not the value for the distinct one?


#4

I’m not sure what you mean by ‘toal-time’. Each histogram includes a ‘total’ which is the number of data-points the histogram has collected.


#5

Can you paste which commands you’re running and their output? We’re not sure what you’re using


#6

Right, now I can provide you with more information. Thank you for helping!

  1. I have about 1.000 - 2.000 writes request per second for 5 nodes

  2. On a node, using

asloglatency -h {user-profiles}-write

i see very low latency

slice-to (sec)      1      8     64    ops/sec 
-------------- ------ ------ ------ ----------
 07:58:36    10   2.58   0.00   0.00      221.2  
 07:58:46    10   3.25   0.00   0.00      215.5  
 07:58:56    10   3.08   0.00   0.00      217.2  
 07:59:06    10   3.28   0.00   0.00      216.5  
 07:59:16    10   3.35   0.00   0.00      223.9  
  1. If I get Aerospike logs from

/var/log/aerospike/aerospike.log

and extract ‘histogram’ records from it, I see following:

Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (82 total) msec
Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (214 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (160 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (414 total) msec
Oct 04 2018 10:21:33 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (276 total) msec

I’ve got an impression that msec is standing for milliseconds and the number with ‘total’ is total time. Probably I’m wrong, as you say. Please, explain me then what the metrics meaning is in such a case.

  1. I have about 250-300 sessions per node

  2. I have to set up timeouts for my write operation to 5 seconds, otherwise I would have a few timeouts exceptions. Actually 5s. is not a silver bullet for I still have several timeouts. According to (2) I accuse client in it.


#7

The total is the number of datapoints the histogram has collected, the ‘msec’ label is indicating the unit that is being collected, the actual histogram follows the ‘histogram dump’ line.

Could you run asloglatency with:

asloglatency -e1 -n8 -h '{user-profiles}-write'

#8

I’ve calculated aggregated metrics for all 5 of my machines with following

asloglatency -e1 -n16 -h '{user-profiles}-write' -f 'Aug 30 2018 12:00:00' -t 24:00:00

And that’s what I get

 avg         0.86   0.13   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      106.0
 max        39.76  13.11   9.47   5.27   4.28   4.22   4.10   3.90   3.67   3.40   2.42   0.56   0.03   0.00   0.00   0.00      612.3
 avg         0.76   0.13   0.02   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00      100.0
 max        49.68  30.32  12.90   3.57   3.57   3.53   3.46   3.42   3.34   2.92   2.15   0.58   0.00   0.00   0.00   0.00      576.7
 avg         0.81   0.13   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00       99.0
 max        37.87  15.95  12.16  12.10   7.80   7.70   7.54   7.25   6.92   5.95   3.91   0.75   0.02   0.00   0.00   0.00      583.8
 avg         0.88   0.15   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00       96.0
 max        15.72   5.92   3.19   2.95   2.95   2.88   2.84   2.84   2.69   2.55   1.70   0.54   0.00   0.00   0.00   0.00      554.4
 avg         0.83   0.15   0.02   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00       98.0
 max        40.04  16.98   5.31   2.43   2.34   2.34   2.34   2.30   2.23   2.03   1.56   0.24   0.00   0.00   0.00   0.00      577.9

So, as you see I have several spikes above 4096 ms. What should I do to understand the reason of them? In average the most of requests are under 8 ms and that is great, but I have to know whether those spikes (which lead to timeouts) are going to grow when the cluster have to stand 10 ts. requests instead of current 2 ts.?
Now, having about 1500 requests/sec I have about 1500 timeouts in 7 days and probably we could live with it, but we have to understand reasons.


#9

In the article https://www.aerospike.com/docs/operations/monitor/latency/ it explains that you can enable more detailed benchmarks to further get down to the cause of latency - now that you know there is latency… like this:

asinfo -v 'set-config:context=namespace;id=<namespaceName>;enable-benchmarks-write=true'

It would be good for us to know, too, if the latency lasted minutes or seconds or how does the issue appear? Is it on all nodes or only one? Is there any errors reported from the client side? (Hotkey?)