Why do I see a warning - "write fail - queue too deep" on server and "Error code 18 Device Overload" on the client

Warning: write fail: queue too deep

Problem Description

When an Aerospike database has one or many namespaces using device as storage-engine the following message may be seen in the aerospike.log.

WARNING (drv_ssd): (drv_ssd.c:4153) {test} write fail: queue too deep: exceeds max 256
WARNING (drv_ssd): (drv_ssd.c:4153) (repeated:1096) {test} write fail: queue too deep: exceeds max 256

The client reports those errors in the following form:

com.aerospike.client.AerospikeException: Error Code 18: Device overload.

For cloud deployments in AWS, if you are using shadow device, you could be breaching the queue for the shadow device and not the primary device. In that case, warnings would look like this:

WARNING (drv_ssd): (drv_ssd.c:4153) (repeated:1096) {test} write fail: shadow queue too deep: exceeds max 256

Note: In versions earlier than 3.13, you might see these warnings much more verbose. In newer versions, repeated messages are collapsed into a single line with the ‘repeated’ flag instead.

Explanation

This error indicates that although the disks themselves are not necessarily faulty or nearing end of life, they are not keeping up with the load placed upon them. When Aerospike writes it returns success to the client when the record has been written to storage write blocks which are flushed to disk asynchronously. If the storage device is not able to keep up with the write load the storage write blocks are cached in the write cache. When this cache is full the error above is reported. As the cache is not flushing to disk synchronously and get queued up, the errors may appear suddenly. In this scenario latencies may be visible in the reads histogram, as reads will involve disk access in most circumstances (excluding data in memory and reading from the post write queue).

Note: If the commit-to-device configuration parameter is set to true, the Aerospike server flush each write transaction to disk prior to returning to the client.

Cause and Solution

1- Check whether the write load/client traffic has increased

Can check this using asloglatency tool on the server logs over the time period when the error started occurring.

Solution: If the increased load is coming primarily from client traffic it may be necessary to set up a sizing conversation with an Aerospike Solutions Architect to check the system configuration matches the current use case. For Enterprise Edition user, this can be initiated by contacting support. Or as a temporary mitigation, reduce the traffic load to mitigate the error.

2- Figure out the device and the namespace that are impacted

The following log line will indicate how many storage-write_blocks are currently in use and the trend for (write-q), (shadow-write-q). If the issue is happening only on one device, this can potentially infer to an existing hot-key situation.

{namespace} /dev/sda: used-bytes 296160983424 free-wblocks 885103 write-q 0 write (12659541,43.3) defrag-q 0 defrag-read (11936852,39.1) defrag-write (3586533,10.2) shadow-write-q 0 tomb-raider-read (13758,598.0)

Server Log Reference: https://www.aerospike.com/docs/reference/serverlogmessages/

Solution: If a hot-key situation, confirm if client is also receiving “Error code 14” error. Confirm if the throughput varies between the different nodes in the cluster for the same namespace. To mitigate hot-key issue, see knowledge-base article on Hot Key Error Code 14.

3- Check whether migrations are tuned too aggressively

Check the following configuration parameters and whether they have been increased from their default values:

Note that incoming migration data will go through the write queues as well and can put the write-q at a really high value.

Please note this is limited to tracking write-queue on a per-device basis, we do not have a direct cluster statistic.

4- Check whether defragmentation is tuned too aggressively

Check the following configuration parameters and whether they have been increased from their default values:

This will potentially cause extra large block reads and writes that would hurt disk performance (extra defrag reads / defrag writes) as well as increase further the write queue due to defrag writes (defrag-write in the log line from 2).

Solution for 3 and 4: In the long term the load on the cluster should be considered to see if the disks are able to keep up with the throughput. Factors such as level of defragmentation and whether migration is ongoing should be considered as these can contribute to pressure on the disks. Consider keeping configuration at default to lower the load on the disk. As a temporary option, consider lowering defrag-lwm-pct or increasing defrag-sleep if your cluster has the capacity to handle a drop in available percentage. More details on Knowledge-Base article on Defragmentation.

Decreasing the write-block-size may reduce defragmentation as a delay of defrag-sleep is performed after every large block is read. So for smaller write-block-size, filling up write blocks through defragmentation activity may be a bit smoother. It is nevertheless always recommended to test with relevant workloads. More details on the write-block-size FAQ Article.

5- Confirm the health of the device

Solution: Even though the error itself does not imply a faulty device, it’s good to confirm if disks are not nearing end of life and pass S.M.A.R.T tests when these are run.

Workaround

Option 1 - Tune Max-write-cache

In case of short bursts of writes, increasing the write cache will provide more room for the write buffers (w-q) and potentially allow client writes to proceed. If the disks are not capable of handling the write load this will not be a permanent solution. The size of the write cache can be increased using the following command:

asinfo -v 'set-config:context=namespace;id=test;max-write-cache=128M'

The default size for the write cache is 64MB (with a write-block-size of 128K a 64MB cache would comprise 512 storage-write-blocks hence the q 513, max 512 portion of the error). When increasing the cache, it should always be increased to a multiple of the write-block-size.

Note: Ensure that you have sufficient memory capacity before increasing the max-write-cache to a high value to avoid running out of memory. The above setting is for each device configured for the namespace.

More details to see exactly which device could be causing the issue, the logs messages like below could help.

Oct 08 2019 06:46:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70507569408 free-wblocks 3269912 write-q 0 write (80395102,140.9) defrag-q 0 defrag-read (79565128,178.9) defrag-write (36171273,82.2) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:46:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70515256192 free-wblocks 3269869 write-q 0 write (80397215,105.7) defrag-q 0 defrag-read (79567198,103.5) defrag-write (36172224,47.5) shadow-write-q 1 tomb-raider-read (13758,598.0)
Oct 08 2019 06:47:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70517832704 free-wblocks 3269777 write-q 35 write (80397876,33.0) defrag-q 29 defrag-read (79567797,30.0) defrag-write (36172491,13.4) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:47:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70516947328 free-wblocks 3269817 write-q 56 write (80398042,8.3) defrag-q 63 defrag-read (79568037,12.0) defrag-write (36172589,4.9) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:47:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70515790720 free-wblocks 3269841 write-q 106 write (80398246,10.2) defrag-q 177 defrag-read (79568379,17.1) defrag-write (36172696,5.3) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:48:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70516167424 free-wblocks 3269857 write-q 161 write (80398456,10.5) defrag-q 240 defrag-read (79568668,14.4) defrag-write (36172801,5.2) shadow-write-q 0 tomb-raider-read (13758,598.0)
...
Oct 08 2019 06:56:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70499140480 free-wblocks 3270541 write-q 1086 write (80403407,9.7) defrag-q 1125 defrag-read (79575188,12.1) defrag-write (36175452,5.2) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:56:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70499059968 free-wblocks 3270559 write-q 1134 write (80403602,9.8) defrag-q 1186 defrag-read (79575462,13.7) defrag-write (36175551,4.9) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:57:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70497662208 free-wblocks 3270572 write-q 1184 write (80403802,10.0) defrag-q 1263 defrag-read (79575752,14.5) defrag-write (36175652,5.1) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:57:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70496528768 free-wblocks 3270604 write-q 1212 write (80404000,9.9) defrag-q 1325 defrag-read (79576044,14.6) defrag-write (36175761,5.4) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:57:55 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70494422912 free-wblocks 3270633 write-q 1255 write (80404192,9.6) defrag-q 1386 defrag-read (79576326,14.1) defrag-write (36175865,5.2) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:58:15 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70493322368 free-wblocks 3270675 write-q 1277 write (80404359,8.4) defrag-q 1370 defrag-read (79576519,9.6) defrag-write (36175964,4.9) shadow-write-q 0 tomb-raider-read (13758,598.0)
Oct 08 2019 06:58:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {ns1} /dev/sda: used-bytes 70491328128 free-wblocks 3270709 write-q 1310 write (80404539,9.0) defrag-q 1437 defrag-read (79576800,14.1) defrag-write (36176066,5.1) shadow-write-q 0 tomb-raider-read (13758,598.0)

Here we can see that initially, we were writing over 100 blocks per second (the second number after write ( ,xxx) ), each block being what you have configured as write-block-size we can refer to the log reference for details on the log messages.

During this time the write-q is at 0, which is what we would expect… the write queue should always be at 0, or very very close. Anytime it goes higher than that, it means that we are not able to flush the blocks as fast as they are filled up.

Now as the write-q starts increasing, the write per seconds decreases to around 10 blocks per seconds (10 times less). So, even though the write load reduced by an order of magnitude, the disk-io subsystem is still not able to handle it.

This can lead to high CPU load, which can cause the whole system to slow down, for example, it can also let the nsup cycle being very very long.

Option 2 - Consider enabling the read page cache to have reads serve from cache rather than disk

This option is only applicable for situations for read heavy use-cases where read performance impact due to write device overload can be tuned.

Check for the log line that shows the reads that are hitting cache vs disk.

{ns_name} device-usage: used-bytes 2054187648 avail-pct 92 cache-read-pct 12.35

More details on the log line can be seen in the Log Reference.

The value ‘cache-read-pct’ can be increased as a temporary workaround to help lighten the load on the devices. To tune that, post-write-queue may be increased. Note that this can impact the memory utilization, thus increase in small steps and with monitoring.

To enable replica reads to go to post-write-queue, [cache-replica-writes] may be enabled.

Aerospike offers another configuration that allows the OS to leverage page cache and can help with latencies for some workload types. More details on this in our knowledge-base article on Buffering and Caching in Aerospike.

Notes

  • As indicated in this article, migration writes will continue to be processed even after the max-write-cache limit has been hit. Prole writes and defragmentation writes will continue as well. This could lead to very high memory usage and its potential consequences.

  • Note the change in which the max-write-cache parameter is interpreted in versions 5.1 and above in the details on the configuration reference manual.

  • The SAR tool may be used to check the load on the various devices. This will collect and retain data related to network and device usage for a rolling one month period (depending on the day of the current month). The default SAR interval for collecting data is 10 minutes. It may be useful to reduce this to 5 or 2 minutes depending on the duration of queue too deep errors when they occur. More details on this in our knowledge-base article on Configuring SAR.

  • iostat can also be used to check disk performance but this must be collected when the issue is happening. More details on this in our knowledge-base article on Interpreting iostat.

  • There should not be any data loss expected in this case unless nodes greater than or equal to the configured replication-factor go down in parallel with these ongoing warnings. The write-q will continue to grow (passing the limit of max-write-cache) but will eventually be flushed to the storage sub-system. The client applications can be coded to retry and/or throttle in such an event.

  • If issue is seen only on the shadow device, issue could be specific to the sizing (if shadow device is sized different than the main device).

Keywords

QUEUE TOO DEEP DEVICE OVERLOAD DEFRAG WRITE READ FAIL CLIENT

Timestamp

July 2020

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.