Some aerospike.conf defaults explained


#1
Synopsis:

Here we look at some of the default values used by Aerospike and explain them further.

high-water-memory-pct – this is set at 60% by default. Is there a reason why this is so low and are we not wasting disk space? Would not 85% or 90% be better?

The recommended setting for high-water-memory-pct is 60%. This is a safe setting for clusters with fewer than 6 nodes. If a node is down for maintenance or other reasons, the cluster redistributes the data. During this redistribution phase, memory usage on a particular node can fluctuate quickly, especially on smaller clusters. In a 3-node cluster, if a single node drops, the amount of data on the remaining two will increase by 50% after migrations. During migrations the two nodes may, in the worst case, grow to double their original size. A 60% high-water mark is meant to protect the system from running out of memory in the near-worst-case scenario of a 3-node cluster.
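To make the 50% figure concrete, here is a minimal sketch of how much each surviving node's share grows once migrations finish. It assumes data is evenly balanced across nodes and fully redistributed, which is a simplification; the function name `growth_after_node_loss` is illustrative and not part of Aerospike.

```python
def growth_after_node_loss(nodes: int, down: int = 1) -> float:
    """Fractional increase in data per surviving node after migrations.

    Assumption: data is evenly balanced and fully redistributed
    across the surviving nodes.
    """
    if not 0 <= down < nodes:
        raise ValueError("need 0 <= down < nodes")
    return nodes / (nodes - down) - 1.0


# Smaller clusters see a much larger jump per surviving node.
for n in (3, 4, 6, 10):
    print(f"{n} nodes: +{growth_after_node_loss(n):.0%} per surviving node")
```

For a 3-node cluster this gives +50%, matching the figure above, while a 6-node cluster sees only about +20%.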

For larger clusters you can push this setting higher, but be cautious about going above 80%. An added benefit of our recommendation is that it provides protection against memory fragmentation. Pushing past 80% doesn’t leave much space for fragmentation. In Aerospike 3 we have changed memory allocation algorithms and introduced strategies to reduce the rate of memory fragmentation.

Lastly, if a node has a hardware failure and a replacement cannot be provisioned for an extended period, you can temporarily increase the high-water mark until the node can be replaced.

high-water-disk-pct – should this always be the same as high-water-memory-pct, or do we need to keep some free space on disk to allow for defrag?

The recommended setting for high-water-disk-pct is 50%, so it does not need to match high-water-memory-pct. As with memory, we recommend 50% because of disk fragmentation and sudden write-load spikes, such as those caused by a downed node.

This default value is loosely coupled to the defrag-lwm-pct setting and has to do with write amplification, which can impact performance. For further details, refer to the following article: FAQ - Why is high-water-disk-pct set to 50%?.

Should memory-size and file-size be the same if using a file to persist? What should the relationship between these values be?

As a rule of thumb (since 2.7.0 and 3.1.3), file-size should be a factor of 4 larger than memory-size. The additional overhead comes from record sizes rounding up to the nearest record block (rblock) size, as well as the additional space required for defrag. Lastly, the larger your records, the closer the in-memory size will be to the size of the data on disk, which reduces the amount of disk storage required. If your average record falls in the range of 512-1024 bytes, then disk storage would be 12-24% larger than memory storage. In that case file-size should be at least 2.12 to 2.24 times larger than your memory-size.

File persistence is not recommended. What is the expected downside of using a file? (We are not planning on doing so for the final production system, but would like to understand the downside here.)

As long as data-in-memory is set, the downside would be slower writes. If data-in-memory is not set, reads will also be slower. You can reduce this effect somewhat by mounting the disk with noatime.


#2

Sorry, but I am finding it hard to understand this seemingly contradictory behaviour. Let's say we have a 3-node cluster where each node has 32GB of RAM and the namespace is in memory with 30GB on each node. With the default HWM of 60%, eviction starts when the data reaches 18GB. How does that protect us when one node goes down? Say that in the worst case each node holds 15GB: the 15GB from the failed node is redistributed across the 2 remaining nodes, which is 7.5GB each. But the HWM on those 2 nodes is 60%, i.e. 18GB, which leaves only 3GB of available memory on each, whereas we now have to fit 7.5GB on each. What will happen in this situation?


#3

Here’s the math: with n nodes in the cluster and k of them down, post-condition = pre-condition * n / (n - k). In your case n=3 and k=1.

When a node goes down, post DRAM = 60 * 3 / 2 = 90%. You hit stop-writes immediately if you were previously at 60%.

You had memory consumption of 18G * 3 = 54G across the cluster. When your node went down that needs to be redistributed across 2 nodes. Now each node is consuming 27G out of 30G, which is 90%, and your namespace hit stop-writes (default 90%).
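The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula from this reply, assuming evenly balanced data and the numbers from the question (30G namespace per node, 60% HWM, 90% stop-writes); the names are hypothetical and not Aerospike configuration keys.

```python
def post_failure_pct(pre_pct: float, n: int, k: int) -> float:
    """Per-node utilization after k of n nodes go down.

    Implements post = pre * n / (n - k), assuming balanced data.
    """
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < n")
    return pre_pct * n / (n - k)


NODES = 3             # n
NAMESPACE_GB = 30     # memory configured for the namespace on each node
HWM_PCT = 60          # utilization before the failure
STOP_WRITES_PCT = 90  # default stop-writes threshold quoted above

used_per_node_gb = NAMESPACE_GB * HWM_PCT / 100     # 18G per node
cluster_total_gb = used_per_node_gb * NODES         # 54G across the cluster
after_gb = cluster_total_gb / (NODES - 1)           # 27G on each survivor
after_pct = post_failure_pct(HWM_PCT, NODES, k=1)   # 90% of 30G

print(f"per node after failure: {after_gb:.0f}G ({after_pct:.0f}%)")
print("stop-writes reached:", after_pct >= STOP_WRITES_PCT)
```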

You should leave at least 5G for the OS on that node, by the way.


#4

Agreed. But I am not talking about hitting stop-writes; I am talking about somewhere between the HWM and stop-writes. Consider my question above. With the default 60% HWM, suppose I keep usage at 50% on each node, which is below both the HWM and stop-writes. If one node goes down, I will run into the HWM on the 2 remaining nodes. I feel the HWM could be removed, or its default lowered to 48%, since a 3-node cluster is the minimum we would create by default, and in the worst case of 1 node down it could still accept writes for a while. A 60% HWM is misleading as a safeguard for a cluster when one node goes down. Please correct me if I am wrong.


#5

The HWM for memory is there to assist you so that you don’t run out of memory. It is not a requirement for you to set it to 60% or any other value - you can set it higher than stop-writes, for example, and then evictions will never happen. You can also give your objects a ‘never-expire’ TTL, and then too, evictions will not happen, even when the HWM is breached. See:

Here are a couple of scenarios HWM helps you avoid:

  • You write objects with a very short TTL at a faster rate than expired objects are being cleaned up. Triggering evictions when the HWM for memory is crossed gives a buffer between this event and stop-writes for more aggressive cleanup via evictions.
  • You need enough space for the cluster to operate in a node-down situation.

I suggest using the equations above and aiming (as you suggested) for 80-85% DRAM utilization with k nodes down. When you have more nodes in the cluster you can set the HWM for memory higher, as each node going down takes out a smaller fraction of the total capacity. So you’re right to suggest that the HWM for memory on a 3-node cluster should actually be around 56% (making it about 85% after the node goes down).

On a 5-node cluster (n = 5, k = 1): 80% * 4 / 5 = 64% and 85% * 4 / 5 = 68%. Setting high-water-memory-pct within that range would be safe.
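As a rough sketch, the same calculation can be inverted to find an HWM range for any cluster size. It assumes balanced data and the 80-85% post-failure target discussed in this reply; `safe_hwm_pct` is an illustrative helper, not an Aerospike API.

```python
def safe_hwm_pct(target_pct: float, n: int, k: int = 1) -> float:
    """HWM that keeps per-node utilization at target_pct after k of n nodes fail.

    Inverse of post = pre * n / (n - k), assuming balanced data.
    """
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < n")
    return target_pct * (n - k) / n


# HWM range for a few cluster sizes with one node down.
for nodes in (3, 5, 8):
    low = safe_hwm_pct(80, nodes)
    high = safe_hwm_pct(85, nodes)
    print(f"{nodes} nodes: set high-water-memory-pct between {low:.0f}% and {high:.0f}%")
```

For 5 nodes this reproduces the 64-68% range, and for 3 nodes the upper bound comes out just under 57%.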

Default settings are there to give you good performance and decent operational margins out of the box, but you should tune things to your particular cluster (hardware and number of nodes).