Why do I see a warning - "write fail - queue too deep"?



Warning: write fail: queue too deep

Problem Description

When an Aerospike database has one or more namespaces using device as the storage-engine, the following message may be seen in the aerospike.log:

WARNING (drv_ssd): (drv_ssd.c:4153) {test} write fail: queue too deep: exceeds max 256
WARNING (drv_ssd): (drv_ssd.c:4153) (repeated:1096) {test} write fail: queue too deep: exceeds max 256

And the client may also report errors in the following form:

com.aerospike.client.AerospikeException: Error Code 18: Device overload.

For cloud deployments in AWS, if you are using shadow devices, the queue being breached could be the one for the shadow device rather than the primary device. In that case, the warnings would look like this:

WARNING (drv_ssd): (drv_ssd.c:4153) (repeated:1096) {test} write fail: shadow queue too deep: exceeds max 256

Note: In versions earlier than 3.13, these warnings are much more verbose. In newer versions, repeated messages are collapsed into a single line with the ‘repeated’ prefix instead.

Explanation

This error indicates that, although the disks themselves are not necessarily faulty or nearing end of life, they are not keeping up with the load placed upon them. When Aerospike processes a write, it returns success to the client once the record has been written to a storage write block; these blocks are flushed to disk asynchronously. If the storage device cannot keep up with the write load, the storage write blocks accumulate in the write cache, and when this cache is full the error above is reported. Because the cache is not flushed to disk synchronously and blocks queue up, the errors may appear suddenly. In this scenario latencies may also be visible in the reads histogram, as reads will involve disk access in most circumstances (excluding data-in-memory namespaces and reads served from the post-write queue).
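
To confirm when the warnings started and which namespaces are affected, the aerospike.log can be searched directly. The following is a minimal sketch; the log path and the timestamp format (matching the log excerpts later in this article) are assumptions and may differ on your system.

# How many 'queue too deep' warnings were logged per minute, to see when the issue started
grep "queue too deep" /var/log/aerospike/aerospike.log | awk '{ print $1, $2, $3, substr($4, 1, 5) }' | sort | uniq -c

# Which namespaces are reporting the warning
grep -o "{[a-zA-Z0-9_-]*} write fail" /var/log/aerospike/aerospike.log | sort | uniq -c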

Cause and Solution

1- Check whether the write load/client traffic has increased

This can be checked using the asloglatency tool on the server logs over the time period when the error started occurring. http://www.aerospike.com/docs/tools/asloglatency
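
For example, the write and read histograms for the affected namespace can be replayed from the log with commands along the following lines. This is a sketch only: histogram names and flags vary by server and tools version, and the namespace name 'test' and log path are assumptions.

asloglatency -l /var/log/aerospike/aerospike.log -h '{test}-write'
asloglatency -l /var/log/aerospike/aerospike.log -h '{test}-read'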

Solution: If the increased load is coming primarily from client traffic, it may be necessary to set up a sizing conversation with an Aerospike Solutions Architect to check that the system configuration matches the current use case. For Enterprise Edition users, this can be initiated by contacting support. As a temporary mitigation, reduce the traffic load.

2- Figure out the device and the namespace that are impacted

The following log line indicates how many storage write blocks are currently in use and the trend of the write queue (write-q). If the issue is happening on only one device, this can potentially point to an existing hot-key situation.

INFO (drv_ssd): (drv_ssd.c:2115) {test} /dev/xvdb: used-bytes 1626239616 free-wblocks 28505 write-q 39 write (8203,23.0) defrag-q 0 defrag-read (7981,21.7) defrag-write (1490,3.0) shadow-write-q 0

http://www.aerospike.com/docs/reference/serverlogmessages/
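
To follow the write-q trend per device from the log, something like the sketch below can be used; the namespace name 'test' and the log path are assumptions.

grep -F "{test} /dev/" /var/log/aerospike/aerospike.log \
  | awk '{ dev = ""; wq = "";
           for (i = 1; i <= NF; i++) {
             if ($i ~ /^\/dev\//) dev = $i;        # device name
             if ($i == "write-q") wq = $(i + 1);   # current write queue depth
           }
           print $1, $2, $3, $4, dev, wq }'

A steadily climbing write-q on a single device, while the others stay at or near 0, points to that device rather than to a cluster-wide overload.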

Solution: If a hot-key situation is suspected, confirm whether the client is also receiving “Error code 14” (key busy) errors, and whether the throughput varies between the different nodes in the cluster for the same namespace. To mitigate a hot-key situation, refer to the knowledge-base article on handling hot keys.

3- Check whether migrations are tuned too aggressively

Check if the following vary too much from their defaults: migrate-sleep, migrate-threads, migrate-max-num-incoming.

Note that incoming migration data goes through the write queues as well and can push the write-q to a very high value.

Please note that the write queue is only tracked on a per-device basis; there is no direct cluster-wide statistic.
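
The relevant settings can be inspected and, if needed, relaxed dynamically. A minimal sketch, assuming a namespace named 'test' (migrate-threads and migrate-max-num-incoming live in the service context, migrate-sleep in the namespace context):

asinfo -v 'get-config:context=service' -l | grep -E 'migrate-threads|migrate-max-num-incoming'
asinfo -v 'get-config:context=namespace;id=test' -l | grep 'migrate-sleep'

# e.g. bring migrate-threads back towards its default of 1
asinfo -v 'set-config:context=service;migrate-threads=1'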

4- Check whether defragmentation is tuned too aggressively

Check if the following vary too much from their defaults: defrag-sleep, defrag-lwm-pct.

Aggressive defragmentation can cause extra large-block reads and writes that hurt disk performance (extra defrag reads / defrag writes), and it further increases the write queue through defragmentation writes (defrag-write in the log line shown in step 2).
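
As with migrations, the defragmentation settings can be checked against the defaults and relaxed dynamically. A minimal sketch, assuming a namespace named 'test':

asinfo -v 'get-config:context=namespace;id=test' -l | grep -E 'defrag-sleep|defrag-lwm-pct'

# e.g. return defrag-sleep to its default of 1000 microseconds between large-block reads
asinfo -v 'set-config:context=namespace;id=test;defrag-sleep=1000'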

Solution for 3 and 4: In the long term, the load on the cluster should be reviewed to see whether the disks are able to keep up with the throughput. Factors such as the level of defragmentation and whether migrations are ongoing should be considered, as these can contribute to pressure on the disks. Consider keeping the configuration at its defaults to lower the load on the disks.

5- Confirm the health of the device

Solution: Even though the error itself does not imply a faulty device, it is good to confirm that the disks are not nearing end of life and that they pass S.M.A.R.T. tests when these are run.
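
For example, the smartmontools package (not part of Aerospike) can be used to run and review such checks; the device name /dev/sda below is an assumption.

sudo smartctl -H /dev/sda    # overall health self-assessment
sudo smartctl -a /dev/sda    # full S.M.A.R.T. attributes (wear level, reallocated sectors, etc.)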

Workaround

In case of short bursts of writes, increasing the write cache will provide more room for the write buffers (w-q) and potentially allow client writes to proceed. If the disks are not capable of handling the write load this will not be a permanent solution. The size of the write cache can be increased using the following command:

asinfo -v 'set-config:context=namespace;id=test;max-write-cache=128M'

The default size for the write cache is 64MB, and the maximum number of queued write blocks is max-write-cache divided by write-block-size: with a write-block-size of 256K, a 64MB cache comprises 256 storage write blocks, hence the 'exceeds max 256' portion of the warning above (with 128K blocks it would be 512). When increasing the cache, it should always be increased to a multiple of the write-block-size.

Note: Ensure that you have sufficient memory capacity before increasing the max-write-cache to a high value to avoid running out of memory. The above setting is for each device configured for the namespace.
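
To verify the values in play and the resulting queue limit, the current configuration can be checked as in the sketch below; the namespace name 'test' is an assumption.

asinfo -v 'get-config:context=namespace;id=test' -l | grep -E 'write-block-size|max-write-cache'

# queue limit = max-write-cache / write-block-size
# e.g. 67108864 / 262144 = 256 blocks, matching the 'exceeds max 256' warning above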

For more detail on exactly which device could be causing the issue, log messages like the ones below can help.

Oct 08 2017 06:46:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70507569408 free-wblocks 3269912 write-q 0 write (80395102,140.9) defrag-q 0 defrag-read (79565128,178.9) defrag-write (36171273,82.2) shadow-write-q 0
Oct 08 2017 06:46:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70515256192 free-wblocks 3269869 write-q 0 write (80397215,105.7) defrag-q 0 defrag-read (79567198,103.5) defrag-write (36172224,47.5) shadow-write-q 1
Oct 08 2017 06:47:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70517832704 free-wblocks 3269777 write-q 35 write (80397876,33.0) defrag-q 29 defrag-read (79567797,30.0) defrag-write (36172491,13.4) shadow-write-q 0
Oct 08 2017 06:47:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70516947328 free-wblocks 3269817 write-q 56 write (80398042,8.3) defrag-q 63 defrag-read (79568037,12.0) defrag-write (36172589,4.9) shadow-write-q 0
Oct 08 2017 06:47:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70515790720 free-wblocks 3269841 write-q 106 write (80398246,10.2) defrag-q 177 defrag-read (79568379,17.1) defrag-write (36172696,5.3) shadow-write-q 0
Oct 08 2017 06:48:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70516167424 free-wblocks 3269857 write-q 161 write (80398456,10.5) defrag-q 240 defrag-read (79568668,14.4) defrag-write (36172801,5.2) shadow-write-q 0
...
Oct 08 2017 06:56:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70499140480 free-wblocks 3270541 write-q 1086 write (80403407,9.7) defrag-q 1125 defrag-read (79575188,12.1) defrag-write (36175452,5.2) shadow-write-q 0
Oct 08 2017 06:56:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70499059968 free-wblocks 3270559 write-q 1134 write (80403602,9.8) defrag-q 1186 defrag-read (79575462,13.7) defrag-write (36175551,4.9) shadow-write-q 0
Oct 08 2017 06:57:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70497662208 free-wblocks 3270572 write-q 1184 write (80403802,10.0) defrag-q 1263 defrag-read (79575752,14.5) defrag-write (36175652,5.1) shadow-write-q 0
Oct 08 2017 06:57:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70496528768 free-wblocks 3270604 write-q 1212 write (80404000,9.9) defrag-q 1325 defrag-read (79576044,14.6) defrag-write (36175761,5.4) shadow-write-q 0
Oct 08 2017 06:57:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70494422912 free-wblocks 3270633 write-q 1255 write (80404192,9.6) defrag-q 1386 defrag-read (79576326,14.1) defrag-write (36175865,5.2) shadow-write-q 0
Oct 08 2017 06:58:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70493322368 free-wblocks 3270675 write-q 1277 write (80404359,8.4) defrag-q 1370 defrag-read (79576519,9.6) defrag-write (36175964,4.9) shadow-write-q 0
Oct 08 2017 06:58:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70491328128 free-wblocks 3270709 write-q 1310 write (80404539,9.0) defrag-q 1437 defrag-read (79576800,14.1) defrag-write (36176066,5.1) shadow-write-q 0

Here we can see that initially we were writing over 100 blocks per second (the second number in the write (x,y) pair), each block being the configured write-block-size. Refer to the following log reference for details on these log messages: http://www.aerospike.com/docs/reference/serverlogmessages

During this time the write-q is at 0, which is what we would expect: the write queue should always be at 0, or very close to it. Any time it goes higher than that, it means that blocks are not being flushed as fast as they are filled.

Now, as the write-q starts increasing, the write rate decreases to around 10 blocks per second (10 times less). So, even though the write load dropped by an order of magnitude, the disk I/O subsystem is still not able to keep up.

This can lead to high CPU load, which can cause the whole system to slow down; for example, the nsup cycle can become very long. Here, ns1 is the namespace name.
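
To extract this trend for a single device without reading every log line, the write-q and the combined block write rate (client writes plus defragmentation writes) can be pulled out with a small script such as the sketch below; the device name, log path and field layout (matching the excerpt above) are assumptions.

grep -F "/dev/sda:" /var/log/aerospike/aerospike.log \
  | awk '{ wq = ""; w = ""; dw = "";
           for (i = 1; i <= NF; i++) {
             if ($i == "write-q")      wq = $(i + 1);
             if ($i == "write")        w  = $(i + 1);
             if ($i == "defrag-write") dw = $(i + 1);
           }
           # the second value inside the parentheses is the per-second rate
           gsub(/[()]/, "", w); gsub(/[()]/, "", dw);
           split(w, a, ","); split(dw, b, ",");
           rate = a[2] + b[2];
           print $4, "write-q=" wq, "blocks/s=" rate }'

Multiplying the blocks-per-second figure by the configured write-block-size gives the raw write throughput the device is being asked to sustain, which can then be compared against what the device is rated for.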

Notes

  • As indicated in this article, migration writes will continue to be processed even after the max-write-cache limit has been hit. Prole (replica) writes and defragmentation writes will continue as well. This could lead to very high memory usage and its potential consequences.

  • Details on write-block-size and max-write-cache can be found at the following links:

http://www.aerospike.com/docs/reference/configuration#max-write-cache
http://www.aerospike.com/docs/reference/configuration/#write-block-size

  • The SAR tool may be used to check the load on the various devices. It collects and retains data related to network and device usage for a rolling 7-day period. The default SAR collection interval is 10 minutes; it may be useful to reduce this to 5 or 2 minutes depending on the duration of the 'queue too deep' errors when they occur.

  • iostat can also be used to check disk performance, but it must be collected while the issue is happening (see the example after these notes).

  • Configuration reference page

http://www.aerospike.com/docs/reference/configuration

  • There should not be any data loss expected in this case, unless 2 nodes in a replication-factor 2 setup go down in parallel with these ongoing warnings. The clients can potentially be configured to retry in such an event.

  • If the issue is seen only on the shadow device, it could be specific to the sizing of the shadow device (for example, if the shadow device is sized or provisioned differently than the primary device).
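
A minimal sketch of both approaches, assuming the sysstat package is installed and using illustrative device names and times:

# point-in-time view of device utilisation and latency, refreshed every 5 seconds
iostat -x 5

# historical view of block-device activity for a given window on a given day
# (the saNN file name and location depend on your sysstat configuration)
sar -d -s 06:40:00 -e 07:00:00 -f /var/log/sa/sa08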

Keywords

QUEUE TOO DEEP DEVICE OVERLOAD DEFRAG WRITE READ FAIL CLIENT

Timestamp

10/10/17

