Understanding device write queue impact on Aerospike memory consumption

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Abstract

As part of normal operation, Aerospike uses memory in various ways. Some of this usage is tracked and reported explicitly (primary and secondary index usage, and data in memory where applicable). Other memory is reported only as part of heap_allocated_kbytes, which appears in the following log line (the three heap-kbytes figures are allocated, active and mapped heap memory, in KiB):

process: cpu-pct 28 heap-kbytes (71477,72292,118784) heap-efficiency-pct 60.2

On an Aerospike cluster using disk storage for data, each device has a write queue whose size is governed by the max-write-cache parameter. Since this size is configurable, it may seem reasonable to assume that the queue cannot grow past it; however, this is not the case. This article discusses how the write queue can exceed the value configured by max-write-cache, how this can affect the database, and what can be done to mitigate it, both implicitly and explicitly.

Factors that can influence write queue size

Nominally, the write queue is limited by max-write-cache, which defines a maximum size (default 64M) for each device's write queue. Writes in Aerospike are placed in a streaming write buffer which is flushed to disk asynchronously, either when a buffer is full or every second (configurable via flush-max-ms). The write queue is the queue of buffers waiting to be flushed to disk. Its size only matters at the moment an incoming write checks whether there is space available on the queue. All client writes perform this check, and if any device has reached the value set for max-write-cache, the write is refused and the following message is written to the Aerospike log:

{namespace_name} write fail: queue too deep: exceeds max 512

On the client the corresponding message will be Device Overload. These errors are discussed in more detail in the Queue Too Deep - Error Code 18 article.
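
For reference, the parameters mentioned above live in the storage-engine block of the namespace configuration. The following is a minimal fragment, not a complete configuration; the device path and values are illustrative rather than recommendations:

namespace test {
    storage-engine device {
        device /dev/nvme0n1
        write-block-size 128K
        max-write-cache 64M   # per-device write queue limit
        flush-max-ms 1000     # flush partial buffers every second
        defrag-sleep 1000     # microseconds to sleep after each block defragmented
    }
}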

Certain types of write do not perform this check on the write queue size: migration, replication and defragmentation writes. This means that a node could be in stop-writes and those three types of write would still be permitted. Most importantly, as the write queues are per device, memory consumption can become significant under a heavy influx of such writes. In extreme circumstances the write queues may grow so large that the OS kills the Aerospike process due to excessive memory consumption (an OOM kill).
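
The depth of each device's queue can be observed in the per-device health line that Aerospike logs periodically; write-q and defrag-q show the number of blocks pending write and pending defragmentation respectively (the figures below are illustrative):

{namespace_name} /dev/nvme0n1: used-bytes 296160983424 free-wblocks 885103 write-q 0 write (12659541,43.3) defrag-q 0 defrag-read (11936852,39.1) defrag-write (3586533,10.2)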

IMPORTANT NOTE: Refer to the bottom part of this article for important changes introduced in version 5.1.

Explicit methods of controlling influx to write queues

The key to controlling write queue memory usage is awareness of the state of the system and of how its components are interlinked. With this understanding the write queue size can be managed and, by implication, the associated memory usage. The first and easiest factor to control is replication writes. Given that all current versions of Aerospike default to prefer-uniform-balance to distribute partitions across cluster nodes, if a node is experiencing write queue problems as a result of replica writes, it is most likely that master writes elsewhere are causing the problem. This implies that the issue at hand is one of sizing: the cluster cannot keep up with peak write load. The long-term solution is to conduct a sizing exercise with an Aerospike Solutions Architect. The short-term solution is simply to add cluster nodes until the peak write load can be sustained without error. This logic assumes a normal load pattern. It is possible that factors such as a hot key are skewing load distribution. Hot keys are discussed in detail in the Hot-Key - Error Code 14 article.

Replica writes may become problematic for other reasons. If prefer-uniform-balance is not in place, a single node may take a disproportionate amount of traffic; the solution here is to enable prefer-uniform-balance, upgrading if necessary. Other reasons why replica writes could flood the write queue include a failing disk on the node in question, a non-homogeneous disk layout across the cluster or a non-uniform fragmentation profile caused by the addition or removal of storage devices (which would flatten out over time).
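
To confirm whether prefer-uniform-balance is in effect, the namespace configuration can be queried with asinfo; the namespace name test below is illustrative:

asinfo -l -v "get-config:context=namespace;id=test" | grep prefer-uniform-balance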

Large-scale migration has a twofold effect on the write queue. Inbound migration writes are added directly to the write queue. Aerospike's default settings allow migration to proceed at a measured pace, so explicit control over inbound migration can be as simple as retaining default migration settings such as migrate-threads and migrate-max-num-incoming. If the cluster is already close to capacity, the ability to delay fill migrations allows nodes to be brought into the cluster before fill migrations are started.
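
Both settings are dynamic, so migration pressure can also be reduced on a live node with asinfo; the values shown are illustrative, not recommendations:

asinfo -v "set-config:context=service;migrate-threads=1"
asinfo -v "set-config:context=service;migrate-max-num-incoming=2"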

In addition to the direct writes associated with migration, there is also the impact on defragmentation. When a partition has been migrated from one node to another within the cluster, the entire partition is then dropped. This has the net effect of loading the defragmentation queue which, in turn, can flood the write queue. Other operations involving large-scale deletion (such as truncate) will also cause defragmentation to flood the write queue if left unchecked. Thankfully, the rate at which defragmentation takes place is easy to control using the defrag-sleep parameter, which can be set dynamically. In simple terms, defrag-sleep throttles defragmentation, allowing its load on the write queue to be reduced to a trickle. The corollary is that space will take longer to recover after deletions, so this should be balanced against write queue space usage.
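
As a sketch, the defrag-sleep throttle can be applied dynamically with asinfo; the namespace name test and the value of 5000 microseconds are illustrative (the default is 1000):

asinfo -v "set-config:context=namespace;id=test;defrag-sleep=5000"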

Ultimately, in addition to understanding how to mitigate behaviours that may affect the write queue, it is important to understand when those behaviours may be triggered. It may be operationally sensible, for example, to issue truncate commands when write load is low, negating the need to throttle defragmentation at all.

Changes to introduce implicit methods of controlling influx to write queues

From Aerospike 5.1 there are further changes that increase the granularity with which the system can manage the write queues. The first, and arguably most important, change is in when the write queue is considered full. Each device has its own write queue. Prior to Aerospike 5.1, if any single write queue reached capacity, all client writes were refused with queue too deep. In Aerospike 5.1 and higher, client writes continue until all possible streaming write buffers are used, where the total is max-write-cache multiplied by the number of devices. On a system with 10 devices and max-write-cache at its default value of 64M, this means 640MiB is available across all write queues. Pre-5.1, a single queue reaching 64MiB of streaming write buffers meant every client write received queue too deep, regardless of whether the device it targeted had a full queue. Post-5.1, the 640MiB could be queued against a single device or split across all 10 devices before queue too deep is returned. The net effect is that the memory consumed queuing incoming client writes is still bounded, but the system is more tolerant of a single device occasionally slowing down.

Another major change throttles the influx of defragmentation writes onto the write queue, preventing the system being OOM killed in the event of a sudden spike in defragmentation activity. In Aerospike 5.1 and higher, defragmentation writes continue as normal until the write queues are 100 write blocks greater than the maximum overall queue size (max-write-cache x number of devices); beyond that they pause, and once the write queue is back under that 100-block overspill, defragmentation is allowed to continue. Client writes will not resume until the overall write queue is below the configured maximum. This behaviour gives a measure of priority to defragmentation, preventing the system running out of space, while avoiding situations where nodes could be OOM killed due to defragmentation spikes. There is no intended change to replica or migration write behaviour.
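
As a worked example, assuming the 10-device system above with the default max-write-cache of 64M and a write-block-size of 1M (all three values depend on configuration):

client write cutoff: 10 devices x 64MiB = 640MiB across all write queues
defrag write cutoff: 640MiB + (100 wblocks x 1MiB) = 740MiB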

Summary

  • There are a number of write types that will proceed even when the write queue is full.
  • If the write queue gets too full the node can be OOM killed.
  • There are a number of controls that can be used to moderate these non-client writes.
  • Above all else, an understanding of the activities that can trigger spikes in non-client writes is the best approach to mitigating any impact they may have.
  • In Aerospike 5.1, changes are introduced to take a holistic view of all write queues before issuing ‘Queue too deep’.
  • As part of the same change defragmentation will be allowed to continue until the write queue is 100 blocks past the configured maximum but no further.
  • Behaviour of replica and migration writes will not change.

Keywords

DEFRAGMENTATION NON-CLIENT OOM KILL MIGRATION REPLICA WRITE QUEUE OVERFLOW SWB

Timestamp

June 2020
