How to troubleshoot `async delay queue full` errors

Context

When using the Java client in asynchronous mode the following message might be observed as client load increases:

Async delay queue is full
    at com.aerospike.client.async.NioCommand.run(NioCommand.java:157)

When using one of the Aerospike asynchronous clients, commands are executed immediately using event loops. As load increases, the event loops can become overwhelmed. The effect is that client socket usage increases, potentially leading to performance deterioration.

The basic assumption is that commands going into the event loop are throttled by the application, but this is not always the case. Here are the relevant policy parameters (using the Java API, but equivalents exist for the other asynchronous clients):

The connections allowed by asyncMaxConnsPerNode are distributed across event loops. For example, with an asyncMaxConnsPerNode of 300 and 10 event loops, each event loop will be able to process up to 30 transactions per node concurrently.

The maxCommandsInProcess defines the maximum number of commands (transactions) a single event loop can process concurrently. It is a good practice to set the asyncMaxConnsPerNode to maxCommandsInProcess * (number of event loops). This will prevent running out of connections for a given node in the cluster. The delay queue length is not relevant here as commands in the queue are not associated with any connection.

Setting maxCommandsInProcess to 0 makes it unbounded, in which case asyncMaxConnsPerNode becomes the de facto boundary.
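
The arithmetic behind these parameters can be sketched in a few lines of plain Java. AsyncSizing and its method names are hypothetical helpers written for this article, not part of the Aerospike client, and the use of integer division for the per-loop share is an assumption about how any remainder is handled.

```java
/** Illustrative sizing arithmetic; a hypothetical helper, not part of the Aerospike client. */
final class AsyncSizing {
    private AsyncSizing() {}

    /** Per-node connections available to each event loop (assumes an even split, remainder ignored). */
    static int connsPerEventLoop(int asyncMaxConnsPerNode, int eventLoops) {
        if (eventLoops <= 0) {
            throw new IllegalArgumentException("eventLoops must be positive");
        }
        return asyncMaxConnsPerNode / eventLoops;
    }

    /** Good-practice value from the text: maxCommandsInProcess * (number of event loops). */
    static int suggestedAsyncMaxConnsPerNode(int maxCommandsInProcess, int eventLoops) {
        return Math.multiplyExact(maxCommandsInProcess, eventLoops);
    }

    public static void main(String[] args) {
        // 300 connections per node over 10 event loops.
        System.out.println(connsPerEventLoop(300, 10));
        // 30 commands in process per loop, 10 loops.
        System.out.println(suggestedAsyncMaxConnsPerNode(30, 10));
    }
}
```

With the values from the example above, the first call yields 30 connections per event loop and the second yields 300 connections per node.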

When an event loop reaches its limit of concurrent commands (maxCommandsInProcess), excess commands are placed in a delay queue until a slot becomes available in the corresponding event loop (a slot in this case refers to a socket). This delay queue is bounded by the maxCommandsInQueue policy configuration. The default of 0 makes it unlimited.

If the maxCommandsInQueue is configured to a non-zero value and the threshold is reached then the error above will appear. This article discusses two basic methods that can be used to troubleshoot the error.
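
The interaction between maxCommandsInProcess and maxCommandsInQueue can be illustrated with a toy model. This is a simplified sketch of the behaviour described above, not the client's implementation: EventLoopModel and its methods are hypothetical, and the choice of IllegalStateException for the rejection is an assumption made to keep the example free of dependencies.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of one event loop's admission logic; not the Aerospike client's code. */
final class EventLoopModel {
    private final int maxCommandsInProcess; // 0 means unbounded
    private final int maxCommandsInQueue;   // 0 means unbounded
    private final Deque<String> delayQueue = new ArrayDeque<>();
    private int inProcess;

    EventLoopModel(int maxCommandsInProcess, int maxCommandsInQueue) {
        this.maxCommandsInProcess = maxCommandsInProcess;
        this.maxCommandsInQueue = maxCommandsInQueue;
    }

    /** Returns true if the command starts at once, false if it is parked in the delay queue. */
    boolean submit(String command) {
        if (maxCommandsInProcess == 0 || inProcess < maxCommandsInProcess) {
            inProcess++;
            return true;
        }
        if (maxCommandsInQueue > 0 && delayQueue.size() >= maxCommandsInQueue) {
            // Corresponds to the error discussed in this article.
            throw new IllegalStateException("Async delay queue is full");
        }
        delayQueue.addLast(command);
        return false;
    }

    /** One running command finished; the freed slot goes to the oldest queued command, if any. */
    void complete() {
        if (inProcess == 0) {
            throw new IllegalStateException("no command in process");
        }
        if (delayQueue.isEmpty()) {
            inProcess--;
        } else {
            delayQueue.removeFirst(); // slot is reused, so inProcess stays the same
        }
    }

    int inProcessCount() { return inProcess; }

    int queuedCount() { return delayQueue.size(); }
}
```

With new EventLoopModel(2, 1), the first two submissions start immediately, the third is queued, and a fourth is rejected until complete() frees a slot. Setting the second argument to 0 removes the rejection, at the cost of an unbounded queue.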

Methods

1. Investigate server side latency

The cause of the delay queue filling up is an imbalance between the number of commands incoming from the application and the speed at which these commands can be executed. A common reason for this is that one or more nodes within the cluster are slower than usual. To investigate this, use the histograms displayed in Aerospike logs as documented here. If latency is observed in any of the default histograms (usually, but not exclusively, the read or write histograms), it may be necessary to turn on specific benchmarks, which offer additional histograms breaking the transactions down into stages. This helps identify where in the transaction the latency is being generated.

  • It is usually preferable to check multiple logs using asadm in log analyser mode as shown below (the example command assumes that all log files are in the current directory):
$ asadm -lf .
Seed:        [('127.0.0.1', 3000, None)]
Config_file: /Users/benbates/.aerospike/astools.conf, /etc/aerospike/astools.conf
Aerospike Log Analyzer Shell, version 0.4.2

INFO: Added Log File ./node80.log.
INFO: Added Log File ./node81.log.
INFO: Added Log File ./node79.log.
Log-analyzer>
  • The asadm pager on command is useful for viewing multiple log files concurrently without lines wrapping.
  • The list command can be used to view the list of logs available and choose one or several logs specifically for viewing.
  • The histogram command allows default histograms and microbenchmarks to be displayed for analysis.
  • A good starting point is to use 2 for the -e bucket size and 4 for the -b number of buckets to display. This gives a nice overview for a quick visual scan. Resolution can be adjusted later to zero in on a specific area of the histogram.

It is of course also possible and fairly common for the latency to be induced by the network layer between the client application and the cluster.

2. Increase client ability to push commands

Though it is not desirable to run out of sockets, it is also not desirable to back up the delay queue by restricting the ability of the client to push commands. If server side analysis has not shown latency and there are no known connectivity problems, it may be reasonable to allow the client to parallelise to a greater degree. This is done by increasing the size of the event loops, which is set at the policy level through EventPolicy.maxCommandsInProcess. After the change, the impact on the network and the number of sockets consumed should be monitored; in the absence of server latency, however, the change will allow the async client to push more commands.

There is usually no real utility in increasing the size of the delay queue as this will simply back up again if the event loop is not sized to handle the incoming commands.

When increasing the ability of the client to push commands, there are considerations in regard to sizing the various components correctly. A client can have multiple event loops specified at the policy level. In addition to the number of loops being configurable, the size of each event loop is defined by the number of commands it can hold, which is controlled by the maxCommandsInProcess policy item, as described at the beginning of this article.

The maximum number of connections to each node in the cluster can be controlled using asyncMaxConnsPerNode.

When sizing event loops it is useful to bear in mind that, potentially, all commands in all loops could go to a single node. If a client had 16 event loops configured, each of which had a value of 16 for maxCommandsInProcess, then it is possible that all of the resultant 256 commands could go to a single node. Therefore it is usually suggested to make sure that asyncMaxConnsPerNode is set sufficiently high to accommodate this. It may be advisable to use the following rule of thumb:

asyncMaxConnsPerNode = maxCommandsInProcess * number of eventloops

When setting maxCommandsInProcess, the configured retry policy should also be taken into account in the value chosen for asyncMaxConnsPerNode.
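
As a configuration fragment, the rule of thumb might be applied as follows. This is a sketch that assumes the Aerospike Java client library is on the classpath and that the client version in use exposes the policy fields named in this article. The class name AsyncClientConfig, the values of 16, and the host and port parameters are illustrative only, and no allowance for retries is included.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Host;
import com.aerospike.client.async.EventLoops;
import com.aerospike.client.async.EventPolicy;
import com.aerospike.client.async.NioEventLoops;
import com.aerospike.client.policy.ClientPolicy;

/** Hypothetical helper showing where each policy value is set. */
final class AsyncClientConfig {
    private AsyncClientConfig() {}

    static AerospikeClient connect(String host, int port) {
        int eventLoopCount = 16;   // illustrative value
        int commandsPerLoop = 16;  // illustrative value

        EventPolicy eventPolicy = new EventPolicy();
        eventPolicy.maxCommandsInProcess = commandsPerLoop;
        eventPolicy.maxCommandsInQueue = 0; // default: unbounded delay queue

        EventLoops eventLoops = new NioEventLoops(eventPolicy, eventLoopCount);

        ClientPolicy clientPolicy = new ClientPolicy();
        clientPolicy.eventLoops = eventLoops;
        // Rule of thumb: maxCommandsInProcess * number of event loops (256 here).
        clientPolicy.asyncMaxConnsPerNode = commandsPerLoop * eventLoopCount;

        return new AerospikeClient(clientPolicy, new Host(host, port));
    }
}
```

The caller remains responsible for closing both the client and the event loops on shutdown.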

Notes

  • A simple way to avoid the error above is to keep EventPolicy.maxCommandsInQueue at the default value of 0 which is unlimited, unless memory usage is a concern.
  • The ideal approach to managing the delay queue and event loop is to size these according to the system capacity and use application level throttling to limit the number of concurrent commands passed into the loop. The two classes used to do this in the Java client are Throttle and Throttles.
  • The clients have some statistics that could be used when benchmarking and tuning the client for optimal settings.
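
The application level throttling mentioned in the notes above can be sketched with a JDK semaphore. This is a minimal illustration of the idea and does not reproduce the client's own throttling classes: InFlightLimiter and its method names are hypothetical, and the sketch assumes the application calls end() from both the success and failure callbacks of every command.

```java
import java.util.concurrent.Semaphore;

/** Caps the number of in-flight async commands; a hypothetical JDK-only helper. */
final class InFlightLimiter {
    private final Semaphore permits;

    InFlightLimiter(int maxInFlight) {
        if (maxInFlight <= 0) {
            throw new IllegalArgumentException("maxInFlight must be positive");
        }
        this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks the calling (application) thread until a slot is free. */
    void begin() {
        permits.acquireUninterruptibly();
    }

    /** Non-blocking variant: returns false when the limit is already reached. */
    boolean tryBegin() {
        return permits.tryAcquire();
    }

    /** Releases a slot; call once per completed command, on success or failure. */
    void end() {
        permits.release();
    }

    int availableSlots() {
        return permits.availablePermits();
    }
}
```

An application would call begin() (or tryBegin()) before issuing each asynchronous command and end() in the listener that receives the result. Choosing a limit at or below maxCommandsInProcess multiplied by the number of event loops keeps the delay queue from growing under steady load.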

Keywords

JAVA ASYNC DELAY QUEUE FULL EVENT LOOP LATENCY

Timestamp

July 2020
