I know there were questions about timeouts on the forum, like this one:
But I want to know the reason for the server timeouts. How can I find out what the bottleneck of my system is? Is it high load on the server that makes it drop some requests, or is it a network problem?
Does Aerospike provide metrics that would help me answer these questions?
Thank you for the answer! It took me some time to investigate the logs. For now they contain only the write and read histograms; the descriptions of those two are:
{ns}-read
Time taken for read requests from the time they are received at the node to when the response leaves the node.
{ns}-write
Time taken for writes from end-to-end (includes the time taken for replica write). Does not include deletes.
And I see that ‘total time’ keeps growing from startup onward. Am I right that ‘total time’ is the sum over all write operations, not the value for a single one?
and extract ‘histogram’ records from it, I see the following:
Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (82 total) msec
Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (214 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (160 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (414 total) msec
Oct 04 2018 10:21:33 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (276 total) msec
I got the impression that msec stands for milliseconds and that the number next to ‘total’ is the total time. I’m probably wrong, as you say. Please explain what these metrics mean in that case.
I have about 250-300 sessions per node.
I have to set the timeout for my write operations to 5 seconds; otherwise I get a few timeout exceptions. Actually, 5 s is not a silver bullet, because I still get several timeouts. According to (2), I blame the client for them.
The total is the number of data points the histogram has collected, and the ‘msec’ label indicates the unit being measured. The actual histogram follows the ‘histogram dump’ line.
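To make that reading concrete, here is a minimal Python sketch that takes the ‘histogram dump’ lines quoted earlier in this thread and turns the totals into per-interval counts. It assumes the totals are cumulative since the node started (which matches the growth seen in the thread), so the difference between two consecutive dumps is the number of operations recorded in that interval. The function name `interval_counts` and the regular expression are illustrative, not part of any Aerospike tool.

```python
import re

# Matches the summary line of a histogram dump, e.g.
# "... histogram dump: {user-profiles}-read (82 total) msec"
DUMP_RE = re.compile(r"histogram dump: (\{[^}]+\}-\w+) \((\d+) total\) (\w+)")


def interval_counts(log_lines):
    """Return {histogram name: [data points added between consecutive dumps]}.

    Assumption: each total is cumulative, so consecutive differences give
    the number of operations recorded during each logging interval.
    """
    last_total = {}
    deltas = {}
    for line in log_lines:
        match = DUMP_RE.search(line)
        if not match:
            continue  # bucket lines and unrelated log lines are skipped
        name, total = match.group(1), int(match.group(2))
        if name in last_total:
            deltas.setdefault(name, []).append(total - last_total[name])
        last_total[name] = total
    return deltas


# The log lines quoted in this thread.
LOG = """\
Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (82 total) msec
Oct 04 2018 10:21:13 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (214 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (160 total) msec
Oct 04 2018 10:21:23 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-write (414 total) msec
Oct 04 2018 10:21:33 GMT: INFO (info): (hist.c:139) histogram dump: {user-profiles}-read (276 total) msec
"""

if __name__ == "__main__":
    for name, counts in interval_counts(LOG.splitlines()).items():
        print(name, counts)
```

For the quoted lines this gives 78 and 116 reads and 200 writes per 10-second interval, i.e. counts of operations, not milliseconds.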
So, as you can see, I have several spikes above 4096 ms. What should I do to understand the reason for them? On average, most requests take under 8 ms, and that is great, but I need to know whether those spikes (which lead to timeouts) are going to grow when the cluster has to handle 10 thousand requests instead of the current 2 thousand.
Now, at about 1500 requests/sec, I have had about 1500 timeouts in 7 days. We could probably live with that, but we have to understand the reasons.
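To put those numbers in proportion, here is a small Python sketch of the arithmetic. It assumes the 1500 requests/sec figure is a sustained average over the whole 7 days, which the post does not state explicitly; the function name `timeout_ratio` is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def timeout_ratio(requests_per_sec, days, timeouts):
    """Return (total requests, requests per timeout) for a steady load.

    Assumption: requests_per_sec is a constant average over the period.
    """
    total_requests = requests_per_sec * SECONDS_PER_DAY * days
    return total_requests, total_requests / timeouts


if __name__ == "__main__":
    total, per_timeout = timeout_ratio(requests_per_sec=1500, days=7, timeouts=1500)
    print(f"total requests: {total}")
    print(f"about one timeout per {per_timeout:.0f} requests")
    print(f"timeout rate: {100 / per_timeout:.6f}%")
```

Under that assumption the cluster served about 907 million requests, so the observed rate is roughly one timeout per 604,800 requests, or about one timeout every 403 seconds on average.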
It would also be good for us to know whether the latency lasted minutes or seconds, and how the issue appears. Is it on all nodes or only one? Are there any errors reported from the client side? (Hotkey?)