How to troubleshoot async delay queue full errors
When using the Java client in asynchronous mode the following message might be observed as client load increases:
Async delay queue is full at com.aerospike.client.async.NioCommand.run(NioCommand.java:157)
With any of the Aerospike asynchronous clients, commands are executed immediately using event loops. As load increases, the event loops can become overwhelmed. The effect is that client socket usage increases, potentially leading to performance deterioration.
The basic assumption is that commands going into the event loop are throttled by the application, but this is not always the case. Here are the relevant policy parameters (using the Java API, but equivalents exist for the other asynchronous clients):
ClientPolicy.asyncMaxConnsPerNode: Maximum number of asynchronous connections allowed per cluster (server) node.
EventPolicy.maxCommandsInProcess: Maximum number of async commands that each event loop processes concurrently, which allows throttling at the event loop level.
EventPolicy.maxCommandsInQueue: Maximum number of async commands that can be stored in each event loop’s delay queue for later execution.
And an EventLoop implementation (
The asyncMaxConnsPerNode connections are distributed across event loops. For example, with an asyncMaxConnsPerNode of 300 and 10 event loops, each event loop will be able to process up to 30 transactions per node concurrently.
maxCommandsInProcess defines the maximum number of commands (transactions) a single event loop can process concurrently. It is good practice to set asyncMaxConnsPerNode to at least maxCommandsInProcess * (number of event loops). This prevents running out of connections for a given node in the cluster. The delay queue length is not relevant here, as commands in the queue are not associated with any connection.
A value of 0 makes maxCommandsInProcess unbounded, which then makes asyncMaxConnsPerNode the de facto boundary.
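The relationship between these settings can be sketched as plain arithmetic. The class below is an illustrative helper only, not part of the Aerospike client; the class and method names are hypothetical, and it assumes connections are split evenly across event loops with any remainder ignored.

```java
/** Toy arithmetic for illustration only; not part of the Aerospike client API. */
final class LoopSizing {
    /** Connections to one node available to each event loop (any remainder is ignored here). */
    static int connsPerLoopPerNode(int asyncMaxConnsPerNode, int eventLoops) {
        return asyncMaxConnsPerNode / eventLoops;
    }

    /**
     * Commands one event loop can have in flight against a single node.
     * maxCommandsInProcess == 0 means unbounded, so the connection share is the only limit.
     */
    static int perLoopLimitPerNode(int asyncMaxConnsPerNode, int eventLoops, int maxCommandsInProcess) {
        int share = connsPerLoopPerNode(asyncMaxConnsPerNode, eventLoops);
        return maxCommandsInProcess == 0 ? share : Math.min(share, maxCommandsInProcess);
    }

    public static void main(String[] args) {
        System.out.println(connsPerLoopPerNode(300, 10));      // 30
        System.out.println(perLoopLimitPerNode(300, 10, 0));   // 30
        System.out.println(perLoopLimitPerNode(300, 10, 20));  // 20
    }
}
```

With 300 connections and 10 event loops, leaving maxCommandsInProcess at 0 means each loop is limited only by its share of 30 connections per node.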
When the event loop's limit on in-process commands (maxCommandsInProcess) is reached, excess commands are placed in a delay queue until a slot becomes available in the corresponding event loop (a slot in this case refers to a socket). This delay queue is bounded by the maxCommandsInQueue policy configuration. The default of 0 makes it unlimited.
If maxCommandsInQueue is configured to a non-zero value and the threshold is reached, the error above will appear. This article discusses two basic methods that can be used to troubleshoot the error.
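The admission behaviour described above can be illustrated with a small toy model. This is a sketch of the described semantics only and is not the Aerospike client's implementation: the class ToyEventLoop and its methods are hypothetical, commands are modelled as Runnable objects that are "started" by calling run(), and completion is signalled manually through complete().

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of one event loop's admission logic; not the Aerospike client's code. */
final class ToyEventLoop {
    private final int maxCommandsInProcess; // 0 = unbounded
    private final int maxCommandsInQueue;   // 0 = unbounded
    private final Deque<Runnable> delayQueue = new ArrayDeque<>();
    private int inProcess;

    ToyEventLoop(int maxCommandsInProcess, int maxCommandsInQueue) {
        this.maxCommandsInProcess = maxCommandsInProcess;
        this.maxCommandsInQueue = maxCommandsInQueue;
    }

    /** Start the command if a slot is free, otherwise queue it, otherwise reject it. */
    void submit(Runnable command) {
        if (maxCommandsInProcess == 0 || inProcess < maxCommandsInProcess) {
            inProcess++;
            command.run();
            return;
        }
        if (maxCommandsInQueue != 0 && delayQueue.size() >= maxCommandsInQueue) {
            throw new IllegalStateException("Async delay queue is full");
        }
        delayQueue.addLast(command);
    }

    /** A running command finished: free its slot and start the oldest queued command, if any. */
    void complete() {
        if (inProcess == 0) {
            throw new IllegalStateException("no command in process");
        }
        inProcess--;
        Runnable next = delayQueue.pollFirst();
        if (next != null) {
            inProcess++;
            next.run();
        }
    }

    int inProcess() { return inProcess; }

    int queued() { return delayQueue.size(); }
}
```

In this model, a loop created with new ToyEventLoop(2, 1) starts the first two commands, queues the third and rejects the fourth, unless complete() has been called in between to free a slot.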
1. Investigate server side latency
The cause of the delay queue filling up is an imbalance between the number of commands incoming from the application and the speed at which these commands can be executed. A common reason for this is that one or more nodes within the cluster are slower than usual. To investigate this, use the histograms displayed in Aerospike logs as documented here. If latency is observed in any of the default histograms (usually the read or write histograms, but not exclusively), it may be necessary to turn on specific benchmarks, which offer additional histograms breaking the transactions down into stages. This helps identify where in the transaction the latency is being generated.
- It is usually preferable to check multiple logs using asadm in log analyser mode as shown below (the example command assumes that all log files are in the current directory):
    $ asadm -lf .
    Seed: [('127.0.0.1', 3000, None)]
    Config_file: /Users/benbates/.aerospike/astools.conf, /etc/aerospike/astools.conf
    Aerospike Log Analyzer Shell, version 0.4.2
    INFO: Added Log File ./node80.log.
    INFO: Added Log File ./node81.log.
    INFO: Added Log File ./node79.log.
    Log-analyzer>
- The asadm pager on command is useful for viewing multiple log files concurrently without lines wrapping.
- The list command can be used to view the list of logs available and choose one or several logs specifically for viewing.
- The histogram command allows default histograms and microbenchmarks to be displayed for analysis.
- A good starting point is to use 2 for the -e bucket size and 4 for the -b number of buckets to display. This gives a nice overview for a quick visual scan. Resolution can be adjusted later to zero in on a specific area of the histogram.
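To see what these two values select, the sketch below computes the latency thresholds that would be displayed. It assumes that -e is an exponential increment, so that the displayed thresholds are 2^0, 2^e, 2^(2e) and so on, in milliseconds; this interpretation is an assumption and should be checked against the asadm help output. The class name is hypothetical.

```java
import java.util.Arrays;

/** Illustrative only: assumes thresholds of 2^(e*i) ms for i = 0 .. b-1. */
final class LatencyBuckets {
    static long[] thresholdsMs(int exponentIncrement, int buckets) {
        long[] thresholds = new long[buckets];
        for (int i = 0; i < buckets; i++) {
            thresholds[i] = 1L << (exponentIncrement * i);
        }
        return thresholds;
    }

    public static void main(String[] args) {
        // -e 2 -b 4 under the assumption above
        System.out.println(Arrays.toString(thresholdsMs(2, 4))); // [1, 4, 16, 64]
    }
}
```

Under that assumption, -e 2 and -b 4 show the share of transactions slower than 1 ms, 4 ms, 16 ms and 64 ms, which is why it works well for a first visual scan.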
It is of course also possible and fairly common for the latency to be induced by the network layer between the client application and the cluster.
2. Increase client ability to push commands
Though it is not desirable to run out of sockets, it is also not desirable to back up the delay queue by restricting the client's ability to push commands. If server side analysis has not shown latency and there are no known connectivity problems, it may be reasonable to allow the client to parallelise to a greater degree. This is done by increasing the size of the event loops, which is set at the policy level by increasing EventPolicy.maxCommandsInProcess. The impact on the network and the number of sockets consumed should then be monitored; however, in the absence of server latency, this will allow the async client to push more commands.
There is usually no real utility in increasing the size of the delay queue as this will simply back up again if the event loop is not sized to handle the incoming commands.
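A rough way to reason about this is Little's law: sustained throughput is approximately concurrency divided by average latency. The sketch below is not taken from the client. It uses hypothetical numbers, and it assumes every event loop is fully used and that latency does not change with concurrency.

```java
/** Back-of-the-envelope model; all names and numbers are illustrative assumptions. */
final class QueueBackpressure {
    /** Approximate maximum sustained commands per second across all event loops. */
    static double capacityPerSecond(int eventLoops, int maxCommandsInProcess, double avgLatencyMs) {
        return (double) eventLoops * maxCommandsInProcess * 1000.0 / avgLatencyMs;
    }

    /** Seconds until bounded delay queues overflow; infinite if the loops keep up. */
    static double secondsUntilQueueFull(double arrivalPerSecond, double capacityPerSecond, int totalQueueSlots) {
        double surplus = arrivalPerSecond - capacityPerSecond;
        return surplus <= 0 ? Double.POSITIVE_INFINITY : totalQueueSlots / surplus;
    }

    public static void main(String[] args) {
        double capacity = capacityPerSecond(4, 40, 2.0); // 4 loops, 40 in process, 2 ms latency
        System.out.println(secondsUntilQueueFull(100_000, capacity, 4_000)); // queue of 1000 per loop
        System.out.println(secondsUntilQueueFull(100_000, capacity, 8_000)); // queue doubled
        System.out.println(secondsUntilQueueFull(100_000, capacityPerSecond(4, 50, 2.0), 4_000));
    }
}
```

In this example, doubling the delay queue only moves the overflow from roughly 0.2 to 0.4 seconds, whereas raising maxCommandsInProcess from 40 to 50 lets the loops keep up with the incoming rate.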
When increasing the ability of the client to push commands, there are considerations with regard to sizing the various components correctly. A client can have multiple event loops specified at the policy level. In addition to the number of loops being configurable, the size of each event loop is defined by the number of commands it can hold, which is controlled by the maxCommandsInProcess policy item, as described at the beginning of this article.
The maximum number of connections to each node in the cluster can be controlled using asyncMaxConnsPerNode.
When sizing event loops it is useful to bear in mind that, potentially, all commands in all loops could go to a single node. If a client had 16 event loops configured, each with a value of 16 for maxCommandsInProcess, then it is possible that all of the resulting 256 commands could go to a single node. Therefore it is usually suggested to make sure that asyncMaxConnsPerNode is set sufficiently high to accommodate this. It may be advisable to use the following rule of thumb:
asyncMaxConnsPerNode = maxCommandsInProcess * number of event loops
asyncMaxCommandsInProcess the configured retry policy should also play into the value chosen for
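The rule of thumb can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration, not a client API. It covers only the worst case described above, where every in-process command targets the same node, and it does not account for retries.

```java
/** Illustrative check of the sizing rule of thumb; names are hypothetical. */
final class ConnRuleOfThumb {
    /** Worst case: every in-process command in every event loop targets the same node. */
    static int worstCaseCommandsOnOneNode(int eventLoops, int maxCommandsInProcess) {
        return eventLoops * maxCommandsInProcess;
    }

    /** True if the worst case needs more connections than asyncMaxConnsPerNode allows. */
    static boolean canExhaustConnections(int asyncMaxConnsPerNode, int eventLoops, int maxCommandsInProcess) {
        return worstCaseCommandsOnOneNode(eventLoops, maxCommandsInProcess) > asyncMaxConnsPerNode;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseCommandsOnOneNode(16, 16));  // 256
        System.out.println(canExhaustConnections(100, 16, 16));  // true
        System.out.println(canExhaustConnections(256, 16, 16));  // false
    }
}
```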
- A simple way to avoid the error above is to keep EventPolicy.maxCommandsInQueue at the default value of 0, which is unlimited, unless memory usage is a concern.
- The ideal approach to manage the delay queue and event loop is to size these according to the system capacity and use application level throttling to limit the amount of concurrent commands passed into the loop. The two classes used to do this in the Java client are
- The clients have some statistics that could be used when benchmarking and tuning the client for optimal settings:
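Application level throttling, as recommended above, can be sketched generically with java.util.concurrent.Semaphore. This is a stand-in written for illustration and is not one of the client's own throttling classes. The class AppThrottle is hypothetical, and it assumes each async command can be represented as a CompletableFuture that completes when the command finishes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Generic application-level throttle; a stand-in, not an Aerospike client class. */
final class AppThrottle {
    private final Semaphore permits;

    AppThrottle(int maxConcurrentCommands) {
        this.permits = new Semaphore(maxConcurrentCommands);
    }

    /** Blocks the caller until a slot is free, then starts the async command. */
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> asyncCommand) {
        permits.acquireUninterruptibly();
        try {
            // Give the slot back when the command completes, successfully or not.
            return asyncCommand.get().whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release(); // the command never started, so return the slot
            throw e;
        }
    }

    int availableSlots() {
        return permits.availablePermits();
    }
}
```

Sizing the throttle at or below the total number of in-process slots across the event loops keeps the application from handing the loops more commands than they can run, so the delay queue stays short.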