Understanding when a server no longer accepts writes

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Summary

This article explains scenarios which would lead to an Aerospike cluster node rejecting write operations (insert or update).

Resolution

The Aerospike database server has mechanisms to protect against running out of memory or disk space.

For example, the eviction algorithm evicts data (i.e. accelerates the expiration of expirable records) once one of the optionally configurable high-water marks is breached. Defragmentation also runs constantly to reclaim contiguous free storage blocks as records are updated or deleted. As a last resort, though, a namespace becomes read-only to prevent running out of capacity.

For more information on configuring the different limits refer to the Namespace Retention configuration documentation.

Note: Aerospike also prevents write transactions from being processed when the underlying storage sub-system is not keeping up. Refer to the max-write-cache configuration and the Device Overload article for further details.

Situation for stop writes on a cluster

In case of a multi-node cluster, for a given namespace, if stop_writes triggers on the node with the master copy of the data/partition, the writes will fail.

If the node on which a namespace has hit stop_writes holds a non-master replica of the partition being written to, the replica write is still allowed. Incoming migration writes are also allowed, as are writes resulting from duplicate resolution. In other words, only direct client writes against a master partition are rejected when stop_writes is true.
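The rule above can be summarized as a small decision function. This is only an illustrative sketch of the behavior described in this article, not the server's implementation; the WriteOrigin enum and the write_allowed function are hypothetical names introduced for the example.

```python
from enum import Enum


class WriteOrigin(Enum):
    """Hypothetical labels for where a write arriving at a node comes from."""
    CLIENT = "client"                      # direct client write to the master partition
    REPLICA = "replica"                    # replica write sent by the master
    MIGRATION = "migration"                # incoming migration write
    DUPLICATE_RESOLUTION = "duplicate_resolution"


def write_allowed(stop_writes: bool, origin: WriteOrigin) -> bool:
    """Return True if the node accepts the write under the rule described above."""
    if not stop_writes:
        return True
    # With stop_writes set, only direct client writes are rejected.
    return origin is not WriteOrigin.CLIENT


if __name__ == "__main__":
    for origin in WriteOrigin:
        print(origin.value, write_allowed(True, origin))
```

With stop_writes set to true, only the CLIENT case returns False; every origin is accepted when stop_writes is false.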

For server versions prior to 3.15, stop_writes is only set to true once the ongoing Namespace Supervisor (nsup) cycle completes. It may therefore take longer to take effect, depending on how long the nsup cycle takes, which in turn depends on the number of records, the number of namespaces, the number of records eligible for expiration or eviction, system performance and other related factors.

Situations for a namespace to get into stop_writes

The server is designed to stop writes on the disk (and the memory) if any of the following are breached:

  • Memory utilization is above a certain threshold (stop-writes-pct).
  • Available Percentage on the disk goes below a certain threshold (min-avail-pct). Situations leading to such low available percent include:
    • Defragmentation not keeping up with the number of objects evicted.
    • Eviction is not able to keep up.
  • For strong-consistency enabled namespaces, clock_skew_stop_writes is set to true when cluster_clock_skew_ms is above the cluster_clock_skew_stop_writes_sec threshold.
  • As of Aerospike Server 4.5.1, for each Available mode (AP) namespace where nsup is enabled (i.e. nsup-period is not zero) writes will be suspended if the cluster clock skew exceeds 40 seconds.
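As a rough illustration, the conditions listed above can be expressed as a check over a handful of values. This is a simplified sketch under stated assumptions, not the server's actual logic: the NamespaceState class, its field names and the stop_writes_reasons function are hypothetical, the comparisons are taken literally from the wording above ("above" and "below"), and the strong-consistency threshold is assumed to be converted from seconds to milliseconds before comparison.

```python
from dataclasses import dataclass

AP_CLOCK_SKEW_LIMIT_MS = 40_000  # 40 seconds, as stated in the list above


@dataclass
class NamespaceState:
    """Hypothetical snapshot of the values relevant to stop_writes."""
    memory_used_pct: float
    stop_writes_pct: float
    device_available_pct: float
    min_avail_pct: float
    cluster_clock_skew_ms: int = 0
    strong_consistency: bool = False
    skew_stop_writes_sec: int = 0  # threshold, only used for strong consistency
    nsup_period: int = 0           # zero means nsup is disabled


def stop_writes_reasons(ns: NamespaceState) -> list:
    """Return the list of breached conditions (empty if writes are allowed)."""
    reasons = []
    if ns.memory_used_pct > ns.stop_writes_pct:
        reasons.append("memory")
    if ns.device_available_pct < ns.min_avail_pct:
        reasons.append("device")
    if ns.strong_consistency:
        if ns.cluster_clock_skew_ms > ns.skew_stop_writes_sec * 1000:
            reasons.append("clock-skew")
    elif ns.nsup_period > 0 and ns.cluster_clock_skew_ms > AP_CLOCK_SKEW_LIMIT_MS:
        reasons.append("clock-skew")
    return reasons


if __name__ == "__main__":
    print(stop_writes_reasons(NamespaceState(95, 90, 3, 5)))
```

Returning the list of reasons, rather than a single boolean, makes it easier to see which of the recovery steps applies.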

References

Steps to recover from stop_writes

  • To recover from breaching the memory threshold, increase capacity by raising memory-size if possible or by adding nodes to the cluster, or delete data (for example using the truncate method).

  • To recover from minimum available percentage going to 0, refer to the Recovering from low Available Percent article.

  • To understand the defragmentation configuration parameters, refer to the Defragmentation article.

  • To recover from evictions not keeping up, refer to the evict-tenth-pct configuration parameter. It can be tuned so that it is large enough for eviction to keep pace with the rate at which new data is added to the namespace.
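After applying one of the steps above, it can help to confirm that the namespace has actually left stop_writes. The sketch below parses a namespace statistics string of semicolon-separated key=value pairs, which is the shape of an info response for a namespace, and reports which of the two flags discussed in this article are still true. The parse_info and writes_blocked functions are hypothetical helpers, and the sample string is made up for illustration rather than captured from a real node.

```python
def parse_info(response: str) -> dict:
    """Split 'key=value;key=value' into a dict of strings."""
    stats = {}
    for pair in response.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key] = value
    return stats


def writes_blocked(stats: dict) -> list:
    """Return the names of the stop-writes flags that are currently true."""
    flags = ("stop_writes", "clock_skew_stop_writes")
    return [flag for flag in flags if stats.get(flag) == "true"]


if __name__ == "__main__":
    # Illustrative sample only, not output from a real node.
    sample = "objects=1200;stop_writes=true;clock_skew_stop_writes=false"
    print(writes_blocked(parse_info(sample)))
```

An empty list means neither flag is set in the parsed statistics.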

Keywords

STOP WRITE STOP_WRITES CLOCK SKEW CLOCK_SKEW_STOP_WRITES

Timestamp

October 2020