I am working on a Java/Spring REST webservice that makes several synchronous and async calls to Aerospike while serving POST HTTP requests. We have Datadog monitoring enabled.
Hardware: AWS EC2 - m5a.4xlarge,
Aerospike version: Aerospike Enterprise Edition build 184.108.40.206,
Aerospike client: Java version 5.1.11.
About 1% of HTTP requests take a long time to respond (> 1 minute). Per the Datadog stack trace viewer, all of these requests are stuck on Aerospike Timeout errors with the following stack trace. The errors increase as the load on the server increases. This large delay affects the server’s overall performance (GC, etc).
Here is an example:
The exception stacktrace in full:
com.aerospike.client.AerospikeException$Timeout: Client timeout: iteration=0 connect=0 socket=30000 total=30 maxRetries=0 node=null inDoubt=false
The Aerospike client has the following Event policy (with timeout 30 msec and max commands in queue at 200).
EventPolicy eventPolicy = new EventPolicy();
eventPolicy.minTimeout = 30;
eventPolicy.maxCommandsInQueue = 200;
eventPolicy.maxCommandsInProcess = 20;
How can these delays be avoided? Can anyone give me some recommendations on how to handle these large response times? Shouldn’t the client throw the timeout error immediately after the timeout limit (30 msec) is reached? Why does the async request get delayed?
Thanks for reading.
Something certainly seems wrong. What have you done to troubleshoot so far? Do the Aerospike server histograms show something similar? What is the underlying call doing: is it a single Get or a large
If you allow transactions to sit in the queue, they will not time out at the configured 30 ms, because they may already have sat in the queue for longer than that. So if you want your transactions to time out faster, you would want to configure the queue to have just 1 element max (0 would make it unbounded). The queue alone shouldn’t cause transactions to sit for that long, though; there may be other things compounding upstream.
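To make the queueing effect above concrete, here is a rough back-of-the-envelope model. It is not the client's actual scheduler: it assumes a burst of commands lands on one event loop at the same instant, that at most maxCommandsInProcess of them run at a time, and that each takes a fixed service time. The 5 ms service time and the class and method names are made up for illustration.

```java
// Hypothetical model of a delay queue; names and the fixed service time are assumptions.
final class DelayQueueModel {

    /** Time (ms) that command `index` of a burst waits before a processing slot frees up. */
    static long queueWaitMs(int index, int maxCommandsInProcess, long serviceMs) {
        // Commands are served in waves of maxCommandsInProcess, one wave per serviceMs.
        return (long) (index / maxCommandsInProcess) * serviceMs;
    }

    /** Number of commands whose queue wait alone already reaches the total timeout. */
    static int countExpiredInQueue(int burst, int maxCommandsInProcess,
                                   long serviceMs, long totalTimeoutMs) {
        int expired = 0;
        for (int i = 0; i < burst; i++) {
            if (queueWaitMs(i, maxCommandsInProcess, serviceMs) >= totalTimeoutMs) {
                expired++;
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // 20 in process + 200 queued, as in the EventPolicy from the question.
        int burst = 220;
        System.out.println("wait of last command (ms): " + queueWaitMs(burst - 1, 20, 5));
        System.out.println("expired before being sent: " + countExpiredInQueue(burst, 20, 5, 30));
    }
}
```

Under these assumptions, a sizeable share of the burst has used up its 30 ms budget before it is ever put on the wire, which is consistent with seeing iteration=0 and node=null in the exception.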
Java client async event loops use a HashedWheel Timer to identify timed out transactions.
You are setting socketTimeout to 30,000 ms and totalTimeout for the transaction to 30 ms. That means the client library will actually set socketTimeout to 30 ms before sending the transaction to the server. The server will then set its transaction timeout to 30 ms as well.
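The capping rule described above can be sketched as a small helper. This mirrors the rule as I understand it from the client's Policy documentation; it is not the client's own code, and the class and method names are hypothetical.

```java
// Illustrative only: reproduces the described socketTimeout/totalTimeout rule.
final class EffectiveTimeouts {

    /**
     * Returns the socket timeout that would effectively apply, given that a
     * non-zero totalTimeout caps (or replaces a zero) socketTimeout.
     */
    static int effectiveSocketTimeout(int socketTimeoutMs, int totalTimeoutMs) {
        if (totalTimeoutMs > 0
                && (socketTimeoutMs == 0 || socketTimeoutMs > totalTimeoutMs)) {
            return totalTimeoutMs;
        }
        return socketTimeoutMs;
    }

    public static void main(String[] args) {
        // Values from the exception message: socket=30000 total=30
        System.out.println(effectiveSocketTimeout(30000, 30));
    }
}
```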
However on the Event Policy side, you are setting minTimeout to 30 ms also. The hashedwheel timer will place this transaction in slot_id = (socketTimeout / minTimeout) % 256 (…default ticksPerWheel) - and count will be (socketTimeout/minTimeout)/256 (256 is… ticksPerWheel).
At first I was not sure whether this part of the computation uses the socketTimeout of 30,000 instead of 30. If it did, that would put you in slot_id 232 with count = 3, so the wheel would go around 256 × 30 ms × 3 + 232 × 30 ms = 30,000 ms before the timeout is detected. I would have to look at the Java client library code to see if this is actually the case. (Update: Checked, and it does use 30 ms in this case, so this is not the issue.)
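The wheel arithmetic above can be checked with a few lines of Java. This follows the simplified formula from this reply (slot = ticks % ticksPerWheel, count = ticks / ticksPerWheel) and ignores the wheel's current position, so it is a sketch of the reasoning rather than the client's timer implementation. The class and method names are made up.

```java
// Simplified hashed-wheel arithmetic; not the Aerospike client's HashedWheelTimer.
final class WheelSlot {

    /** Number of wheel ticks needed to cover the timeout. */
    static long ticks(long timeoutMs, long tickMs) {
        return timeoutMs / tickMs;
    }

    /** Slot the timeout lands in. */
    static int slot(long timeoutMs, long tickMs, int ticksPerWheel) {
        return (int) (ticks(timeoutMs, tickMs) % ticksPerWheel);
    }

    /** Full wheel revolutions before the slot is due. */
    static long rounds(long timeoutMs, long tickMs, int ticksPerWheel) {
        return ticks(timeoutMs, tickMs) / ticksPerWheel;
    }

    /** Time until the wheel reaches the slot on the final revolution. */
    static long detectionDelayMs(long timeoutMs, long tickMs, int ticksPerWheel) {
        long r = rounds(timeoutMs, tickMs, ticksPerWheel);
        int s = slot(timeoutMs, tickMs, ticksPerWheel);
        return (r * ticksPerWheel + s) * tickMs;
    }

    public static void main(String[] args) {
        // Hypothetical case: the wheel is fed socketTimeout = 30000 ms with a 30 ms tick.
        System.out.println("slot=" + slot(30000, 30, 256)
                + " rounds=" + rounds(30000, 30, 256)
                + " delayMs=" + detectionDelayMs(30000, 30, 256));
        // Case confirmed in the update: the wheel is fed 30 ms.
        System.out.println("slot=" + slot(30, 30, 256)
                + " rounds=" + rounds(30, 30, 256)
                + " delayMs=" + detectionDelayMs(30, 30, 256));
    }
}
```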
But you can easily test this by setting your socketTimeout to 30 ms instead of 30,000, since you are setting the total timeout to 30 ms anyway.
@meher If the queue is set at 1 element max, wouldn’t it behave like a synchronous call? Or do you suggest keeping the queue length low and error out faster?
@pgupta the exception having ‘iteration=0’ would typically indicate that the transaction was not put on the wire and timed out while in the delay queue.
@shan.p I guess that it would make it more similar to synchronous… I thought you wanted it to time out faster for your use case. I am not sure that timing out transactions while they are in the delay queue can be done efficiently, which is probably why it is not done; transactions instead time out when they are picked up from the delay queue.
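If the goal is only to stop the HTTP request from waiting on a command that is still sitting in the delay queue, one possible workaround is an application-level deadline. The sketch below assumes the async result has already been bridged into a CompletableFuture (that bridging from the client's listener is not shown and is an assumption), and uses the JDK's `orTimeout` (Java 9+). It does not cancel or remove the queued Aerospike command; it only lets the caller give up. The class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Application-level deadline around an already-created future; not an Aerospike API.
final class DeadlineGuard {

    /** Fails the future with TimeoutException if it is not completed within deadlineMs. */
    static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> pending, long deadlineMs) {
        return pending.orTimeout(deadlineMs, TimeUnit.MILLISECONDS);
    }

    /** Blocks until the future settles and reports whether it failed due to the deadline. */
    static boolean timedOut(CompletableFuture<?> future) {
        try {
            future.join();
            return false;
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a command that never leaves the delay queue.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println("timed out: " + timedOut(withDeadline(stuck, 30)));
    }
}
```

The trade-off is that the command may still execute later, so this helps request latency but not load on the event loop or the server.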
As @Albot is suggesting, though, if your main question is not why the transactions are not timing out faster, but rather why there are timeouts to start with, that would require first checking in detail what is going on on the server. As you indicated you are using an Enterprise Edition version, the best thing would be to raise a case with Aerospike Support.
Regarding iteration=0, node=null: Why do I see this timeout exception with "node=null" on Async Java Application? (Also, inDoubt=false which also means this transaction never made it to the server.)