How to recover contiguous free blocks aka available percent


What is available percent?

Available Percent (device_available_pct) is a key Aerospike metric measuring the minimum contiguous free disk space (in blocks of size write-block-size) across all the devices in a namespace:

avail_pct = min (contig_disk for all disks in a namespace)
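The formula above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and expressing each device's contiguous free space as free write blocks over total write blocks with integer rounding is an assumption, not a statement of how the server computes it.

```python
def device_avail_pct(free_wblocks, total_wblocks):
    """Percentage of a device's write blocks that are free (assumed definition)."""
    if total_wblocks <= 0:
        raise ValueError("total_wblocks must be positive")
    return 100 * free_wblocks // total_wblocks


def namespace_avail_pct(devices):
    """devices: iterable of (free_wblocks, total_wblocks), one pair per device.

    The namespace value is the minimum across its devices, so a single
    depleted device drags the whole namespace down.
    """
    devices = list(devices)
    if not devices:
        raise ValueError("a namespace needs at least one device")
    return min(device_avail_pct(free, total) for free, total in devices)
```

For instance, a namespace with one device at 50% and another at 8% reports 8%.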

It is important to not confuse device_available_pct and device_free_pct. Refer to the article explaining the differences between device_available_pct and device_free_pct for further details.

What happens when a namespace is low on available percent?

From the application’s perspective, the main indication that available percent is low is that writes start failing on any node where a namespace does not have enough contiguous free disk space on one of its devices. The error returned to the client in such cases is:

com.aerospike.client.AerospikeException: Error Code 8: Server memory error

The server will log the following WARNING:

WARNING (rw): (write.c:770) {namespaceid}: write_master: drives full

To be more precise, application writes will start failing when the device_available_pct falls below the min-avail-pct configured threshold (default 5%) on any of the namespace devices.
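As a rough sketch of that rule, the helper below lists the devices whose available percent has fallen below the threshold. The function name and the input dictionary are hypothetical; the per-device values would have to be collected separately, for example from monitoring.

```python
def devices_below_threshold(avail_pct_by_device, min_avail_pct=5):
    """Return the devices whose available percent is below min_avail_pct.

    avail_pct_by_device: hypothetical mapping of device path -> available percent.
    A non-empty result means writes to the namespace on this node would fail.
    """
    return sorted(
        device
        for device, pct in avail_pct_by_device.items()
        if pct < min_avail_pct
    )
```

A single device in the returned list is enough to block writes, even if every other device has ample contiguous space.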

Most common situations and remediations

1. Capacity over use (or the disks are really “full”)

This can happen if there are no blocks eligible for defragmentation because each block’s used percentage is above the defrag-lwm-pct threshold.

Validate this is the case:

The device log line can be checked for each device:

INFO (drv_ssd): (drv_ssd.c:2115) {namespaceid} /dev/xvdb: used-bytes 1626239616 free-wblocks 28505 write-q 0 write (8203,23.0) defrag-q 0 defrag-read (7981,21.7) defrag-write (1490,3.0) shadow-write-q 0 tomb-raider-read (1615,59.6)

Refer to the log reference manual for details on each parameter.

  • If the defrag-q is low or at 0, and the defrag-write rate is also low or at 0.0, it is an indication that there are no blocks eligible to be defragmented.
  • If the disk used percentage (device_used_bytes / device_total_bytes x 100) is greater than the configured defrag-lwm-pct, the disk is above the safe operating threshold for healthy defragmentation. (Refer to the article explaining the default high water disk pct).
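The two checks above can be scripted against the per-device log line. The following is a minimal sketch, assuming the line format shown in the example; the function names and the cut-offs for what counts as "low" (max_defrag_q, max_defrag_write_rate) are hypothetical, and the device size has to be supplied separately because it is not part of the log line.

```python
import re

# Pattern for the per-device log line shown above (format taken from the example).
DEVICE_LINE = re.compile(
    r"\{(?P<namespace>[^}]+)\} (?P<device>\S+): "
    r"used-bytes (?P<used_bytes>\d+) free-wblocks (?P<free_wblocks>\d+) "
    r"write-q (?P<write_q>\d+) write \((?P<write_total>\d+),(?P<write_rate>[\d.]+)\) "
    r"defrag-q (?P<defrag_q>\d+) "
    r"defrag-read \((?P<defrag_read_total>\d+),(?P<defrag_read_rate>[\d.]+)\) "
    r"defrag-write \((?P<defrag_write_total>\d+),(?P<defrag_write_rate>[\d.]+)\)"
)


def parse_device_line(line):
    """Return the statistics of one device log line as a dict, or None if no match."""
    match = DEVICE_LINE.search(line)
    if match is None:
        return None
    stats = {}
    for key, value in match.groupdict().items():
        if key in ("namespace", "device"):
            stats[key] = value
        elif key.endswith("_rate"):
            stats[key] = float(value)
        else:
            stats[key] = int(value)
    return stats


def looks_capacity_bound(stats, device_total_bytes, defrag_lwm_pct,
                         max_defrag_q=10, max_defrag_write_rate=5.0):
    """True when defrag is idle and the device is fuller than defrag-lwm-pct.

    max_defrag_q and max_defrag_write_rate are illustrative cut-offs only.
    """
    defrag_idle = (stats["defrag_q"] <= max_defrag_q
                   and stats["defrag_write_rate"] <= max_defrag_write_rate)
    used_pct = 100.0 * stats["used_bytes"] / device_total_bytes
    return defrag_idle and used_pct > defrag_lwm_pct
```

The cut-offs should be adjusted to the workload; the point is only to combine the two signals from the list above.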

Here are potential remediations:

  1. Delete records (using the truncate command for example).
  2. Force evictions if possible and acceptable. (Relevant configuration parameters are evict-tenths-pct (increase), high-water-disk-pct (decrease), high-water-memory-pct (decrease), nsup-delete-sleep (decrease), nsup-period (decrease). Refer to the eviction mechanism article for further details).
  3. Add extra capacity (more nodes or more devices per namespace).
  4. Gradually increase the defrag-lwm-pct threshold. Monitor the performance impact, given the non-linear increase in write amplification this will induce.

2. Devices with mismatching sizes

Given the definition of the overall available percent, avail_pct = min (contig_disk for all disks in a namespace), having one device much smaller than the rest will cause the whole namespace to hit stop_writes even when there appears to be plenty of free space across all devices.

Inspecting the device specific log lines for used-bytes and free-wblocks should quickly determine if that was the case. Of course, verifying the physical size of each device or partition would also yield the information.

Refer to the SSD Setup page for best practices for partitioning disks.

3. Defragmentation is not keeping up

In some cases, there are blocks to be defragmented, but defragmentation is not able to keep up. Comparing the defragmentation rate with the write rate on the per-device log line, and checking whether the defragmentation queue (defrag-q) keeps increasing, shows whether defragmentation is falling behind:

INFO (drv_ssd): (drv_ssd.c:2143) {namespaceid} /dev/nvme0n1: used-bytes 1182271397376 free-wblocks 1212517 write-q 0 write (1304972843,497.1) defrag-q 6743042 defrag-read (1309698931,609.4) defrag-write (639136010,191.8)

Refer to the log reference manual for details on the different statistics provided in the above logline.
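One way to spot a continuously growing defrag-q is to pull the value out of successive log lines for the same device and check that it rises from one sample to the next. This is a sketch only: the function names and the minimum number of samples are hypothetical choices.

```python
import re

# Matches the defrag-q field of the per-device log line.
DEFRAG_Q = re.compile(r"defrag-q (\d+)")


def defrag_q_samples(lines):
    """Extract defrag-q values, in order, from log lines of a single device."""
    samples = []
    for line in lines:
        match = DEFRAG_Q.search(line)
        if match:
            samples.append(int(match.group(1)))
    return samples


def queue_keeps_growing(samples, min_samples=3):
    """True if every sample is larger than the previous one.

    min_samples is an arbitrary guard against judging from too little data.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

Feed it the lines for one device at a time, since each device has its own queue.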

In such cases, the idea is to increase the pace of the defragmentation thread. By default, the defragmentation thread sleeps 1000µs between block reads. This can be tuned through the defrag-sleep configuration option. It is recommended to decrease this value gradually and observe the impact on the storage subsystem (using iostat for example) and on application performance. The following command halves the default sleep, which doubles the default speed of reading blocks to be defragmented.

$ asinfo -v "set-config:context=namespace;id=<namespace name>;defrag-sleep=500"
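To lower the value gradually, the commands for a halving schedule can be generated ahead of time and applied one at a time, observing the system between steps. The sketch below only builds command strings in the format shown above and does not run them; the function names and the floor_us lower bound are hypothetical.

```python
def defrag_sleep_steps(start_us=1000, floor_us=125):
    """Halve the sleep repeatedly, stopping before it drops below floor_us.

    start_us is the current defrag-sleep value; floor_us is an arbitrary
    illustrative lower bound, not a recommended setting.
    """
    if floor_us < 1:
        raise ValueError("floor_us must be at least 1")
    steps = []
    value = start_us // 2
    while value >= floor_us:
        steps.append(value)
        value //= 2
    return steps


def set_defrag_sleep_command(namespace, sleep_us):
    """Build the asinfo command that sets defrag-sleep for a namespace."""
    return (
        f'asinfo -v "set-config:context=namespace;'
        f'id={namespace};defrag-sleep={sleep_us}"'
    )
```

Each step should be followed by a check of iostat and application latency before moving on to the next one.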

For further details on this process, refer to the Defragmentation Knowledge Base article.

Having more partitions would also help in such situations, as it would provide more defragmentation threads. Refer to the SSD Setup page for best practices for partitioning disks.

4. Large post-write-queue

Blocks that are in the post-write-queue (default is 256 blocks per device) are not eligible to be defragmented. For namespaces with small devices and large write-block-size it is possible for the post-write-queue to be a significant portion of the device itself or even larger than it. This will then obviously cause low available percent situations very quickly.

For example, on a device of 4GiB in size with a post-write-queue of 512 and a write-block-size of 8M (8MiB), the total size occupied by this queue would be 512 * 8MiB = 4096MiB = 4GiB, which represents the entire device and would prevent any blocks from being defragmented.
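The arithmetic in this example is easy to repeat for other configurations. A small sketch, with hypothetical function names:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def post_write_queue_bytes(post_write_queue, write_block_size):
    """Bytes held by the post-write-queue of one device."""
    return post_write_queue * write_block_size


def post_write_queue_share(device_size, post_write_queue, write_block_size):
    """Fraction of the device taken up by its post-write-queue."""
    return post_write_queue_bytes(post_write_queue, write_block_size) / device_size
```

With the values above, post_write_queue_share(4 * GIB, 512, 8 * MIB) gives 1.0, meaning the queue spans the whole device.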

Refer to the Avail pct drops without defragmentation starting knowledge base article for further details on this situation.

5. Defragmented blocks are not released soon enough (this is not common)

As of version 3.16.0.1 defragmented blocks are not freed until the data that has been re-written on a new block has been flushed:

  • [AER-5776] - (STORAGE) Don’t free a defragmented write block before all data evacuated from it has been flushed.

In some extreme cases, typically on devices having very few records that are continuously updated, it is possible that many defragmented blocks are not released because the new block that the defragmentation thread is writing to takes a long time to fill up. This was addressed in version 4.3.1.5:

  • [AER-5950] - (STORAGE) When defrag load is extremely low, periodically flush defrag buffer in order to free source write blocks.

Notes

Recovering from available percent 0

In older versions, it is possible to get all the way to no blocks available at all, which would prevent the system from ‘defragmenting its way out’ even after tuning some parameters. This situation is covered in the Recovering from Available Percent Zero article.

Keywords

DEFRAGMENTATION DEFRAG-LWM-PCT AVAIL-PCT STOP-WRITES

Timestamp

11/08/2018