How to tune the Linux kernel for memory performance

Context

The Linux kernel attempts to optimize RAM utilization by filling otherwise unused RAM with caches, on the basis that unused RAM is wasted RAM.

Over time the kernel fills the RAM with cache. When applications or buffers require more memory, the kernel goes through the cached memory pages and finds a block large enough to satisfy the request. It then frees that memory and allocates it to the calling application.

Under some circumstances, this can affect the general performance of the system as cache de-allocation is time-consuming in comparison with access to unused RAM. Higher latency could therefore sometimes be observed.

This latency arises purely because RAM is being used to its full potential. As such, there may be no symptoms other than general and potentially sporadic latency increases, similar to those observed when the hard disks are not keeping up with reads and writes. The latency may affect either Aerospike or operating system components, such as memory allocations made for the network card, iptables, ebtables or iproute2, in which case it may show up as network latency instead. This article discusses the issue further and provides steps to minimize the impact on the system.

Explanation

The kernel memory cache contains the following:

  • dirty cache - Data blocks not yet committed to file systems which support caching (e.g. ext4). This can be emptied by issuing the sync command, although doing so periodically may incur a performance penalty. This is not advised for normal usage unless it is extremely important to commit data to the hard drive (for example when expecting a failure).
  • clean cache - Data blocks which are on the hard drive but are also retained in memory for fast access. Dropping the clean cache can result in a performance deficit as all data will read from disk, whereas beforehand, the frequently used data would be fetched directly from RAM.
  • inode cache - Cache of the inode location information. This can be dropped as with clean cache but with the attendant performance penalty.
  • slab cache - This type of cache stores objects allocated by the kernel so that they can be reused in future allocations with the object data already populated, resulting in a speed gain during memory allocation.

While not much can be done with dirty cache, the other cached objects can be cleared. This has two potential outcomes. Latency in high-malloc applications, such as Aerospike when storing data in memory, will be reduced. On the other hand, disk access may slow down, as all data will have to be read from disk.

Clearing slab cache on a server can potentially introduce a temporary speed penalty (spike). For this reason, it is not advised to clear caches. Instead, it is preferred to inform the system that a certain amount of RAM should never be occupied by cache.

If necessary, clearing the cache can be performed as follows:

# clear page cache (type 2 above)
$ echo 1 > /proc/sys/vm/drop_caches

# clear reclaimable slab objects, including dentries and inodes (types 3 and 4 above)
$ echo 2 > /proc/sys/vm/drop_caches

# clear page cache and slab objects (types 2, 3 and 4)
$ echo 3 > /proc/sys/vm/drop_caches

Most of the space will be occupied by page cache, not slab cache. When clearing cache, it is recommended to drop only the page cache (echo 1).
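
Before dropping anything, it can help to see how much memory each kind of cache actually holds. The following is a minimal sketch, assuming the standard /proc/meminfo field names (Cached, Dirty, Slab, SReclaimable). The function name cache_summary and its optional file argument are illustrative only and not part of any existing tool.

```shell
# Print page cache, dirty and slab sizes (in kB) from a meminfo-style file.
# Usage: cache_summary [MEMINFO_FILE]   (defaults to /proc/meminfo)
# The optional argument is an assumption added to make the function easy to test.
cache_summary() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        BEGIN { cached = dirty = slab = srecl = 0 }
        $1 == "Cached:"       { cached = $2 }
        $1 == "Dirty:"        { dirty = $2 }
        $1 == "Slab:"         { slab = $2 }
        $1 == "SReclaimable:" { srecl = $2 }
        END {
            print "page_cache_kb=" cached " dirty_kb=" dirty \
                  " slab_kb=" slab " slab_reclaimable_kb=" srecl
        }
    ' "$meminfo"
}
```

Running cache_summary before and after a drop_caches write shows how much memory was actually released.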

For a more permanent fix, a minimum amount of free RAM can be set for the kernel. Consider the following example:

Total RAM: 100GB
Used: 10GB
Buffers: 40GB
Minimum free: 10GB
Cache: 40GB

In this example, 10GB of memory is kept free by the minimum free setting. In such a case, if a further 5GB of memory is allocated for buffers, the kernel allows the allocation to happen instantly. It then de-allocates some cache to restore 10GB of free memory. Allocations therefore happen instantly and the cache is shrunk dynamically so that 10GB remains free at all times. The new allocation would look as follows:

Total RAM: 100GB
Used: 10GB
Buffers: 45GB
Minimum free: 10GB
Cache: 35GB

Fine-tuning these parameters is dependent upon the current utilization. For Aerospike, min_free_kbytes should be set to at least 1.1GB, if the available system memory allows. This means that caches will still operate sufficiently, while leaving a margin for applications to allocate into.

$ cat /proc/sys/vm/min_free_kbytes
67584

Tuning is performed by running echo NUMBER > /proc/sys/vm/min_free_kbytes, where NUMBER is the number of kilobytes required to be free in the system. To leave 3% of memory on a 100GB RAM machine unoccupied, the command would be:

echo 3145728 > /proc/sys/vm/min_free_kbytes
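
The percentage calculation can be scripted rather than worked out by hand. The following is a minimal sketch, assuming MemTotal in /proc/meminfo is reported in kB. The function name min_free_kbytes_for_pct and its optional file argument are hypothetical and used here for illustration only.

```shell
# Print the min_free_kbytes value equal to PCT percent of total memory.
# Usage: min_free_kbytes_for_pct PCT [MEMINFO_FILE]   (defaults to /proc/meminfo)
min_free_kbytes_for_pct() {
    pct="$1"
    meminfo="${2:-/proc/meminfo}"
    # MemTotal is reported in kB, the same unit that min_free_kbytes expects.
    total_kb=$(awk '$1 == "MemTotal:" { print $2; exit }' "$meminfo")
    # Integer arithmetic: the result is rounded down to a whole kilobyte.
    echo $(( ${total_kb:-0} * pct / 100 ))
}
```

On a machine where MemTotal is exactly 104857600 kB, min_free_kbytes_for_pct 3 prints 3145728, matching the value used above. As root, the output can be written to /proc/sys/vm/min_free_kbytes.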

Caution should be exercised when setting this parameter, as both too low and too high values can have an adverse effect upon system performance. Setting min_free_kbytes too low prevents the system from reclaiming memory. This can result in system hangs and OOM kills of processes.

Setting this parameter to a value that is too high (5-10% of total system memory) will cause the system to run out of memory immediately. Linux is designed to use all available RAM to cache file system data. Setting a high min_free_kbytes value results in the system spending too much time reclaiming memory.

The standard recommendation would be to keep min_free_kbytes at 1-3% of the total memory on the system.

It is advised to either reduce swappiness to 0 or not use swap. For low-latency operations, using swap to any extent will drastically slow down performance.

To set swappiness to 0 and reduce potential latency:

echo 0 > /proc/sys/vm/swappiness
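
To check whether swap is actually in use on a node, the SwapTotal and SwapFree fields of /proc/meminfo can be compared. The following is a minimal sketch. The function name swap_used_kb and its optional file argument are assumptions made for illustration.

```shell
# Print the amount of swap currently in use, in kB.
# Usage: swap_used_kb [MEMINFO_FILE]   (defaults to /proc/meminfo)
swap_used_kb() {
    meminfo="${1:-/proc/meminfo}"
    total=$(awk '$1 == "SwapTotal:" { print $2; exit }' "$meminfo")
    free=$(awk '$1 == "SwapFree:" { print $2; exit }' "$meminfo")
    # Missing fields are treated as 0 so the arithmetic never fails.
    echo $(( ${total:-0} - ${free:-0} ))
}
```

A result of 0 means that no swap is in use, either because none is configured or because nothing has been swapped out.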

Notes

IMPORTANT: Any and all changes above are NOT permanent. They only happen during machine runtime. To make the changes permanent, additions must be made to /etc/sysctl.conf.

The following lines make the changes shown above permanent:

vm.min_free_kbytes = 3145728
vm.swappiness = 0
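
One way to generate these lines and load them without a reboot is sketched below. The helper name sysctl_lines is hypothetical, and the values passed to it should be the ones already validated at runtime.

```shell
# Emit sysctl.conf-style lines for the two settings discussed in this article.
# Usage: sysctl_lines MIN_FREE_KBYTES SWAPPINESS
sysctl_lines() {
    printf 'vm.min_free_kbytes = %s\n' "$1"
    printf 'vm.swappiness = %s\n' "$2"
}
```

As root, sysctl_lines 3145728 0 >> /etc/sysctl.conf appends the settings for the 3% example, and sysctl -p then loads /etc/sysctl.conf so that they take effect immediately.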

As always, editing such parameters can be destructive if done incorrectly. It is recommended to test the changes in a lab environment before moving to production. Making the changes dynamically before making them permanent helps to mitigate any potential side effects.

There is another parameter aimed at a similar outcome to the above, called zone_reclaim_mode. Unfortunately, this parameter causes aggressive reclaims and scans and should therefore be disabled. It is disabled as standard on all newer kernels and distributions.

The following command can be used to ensure that zone_reclaim is disabled:

$ sysctl -a |grep zone_reclaim_mode
vm.zone_reclaim_mode = 0
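
The same check can be wrapped in a small function, for instance for use in a node health script. The following is a minimal sketch that reads the value directly from /proc/sys/vm/zone_reclaim_mode. The function name zone_reclaim_status and its optional file argument are illustrative assumptions.

```shell
# Report whether zone reclaim is disabled (value 0) or enabled (any other value).
# Usage: zone_reclaim_status [FILE]   (defaults to /proc/sys/vm/zone_reclaim_mode)
zone_reclaim_status() {
    f="${1:-/proc/sys/vm/zone_reclaim_mode}"
    mode=$(cat "$f")
    if [ "$mode" = "0" ]; then
        echo "zone_reclaim_mode is disabled"
    else
        echo "zone_reclaim_mode is enabled (value $mode)"
    fi
}
```

If the function reports that the mode is enabled, it can be switched off at runtime, as root, with sysctl -w vm.zone_reclaim_mode=0.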

Keywords

KERNEL MEMORY CACHE DROP_CACHES MIN_FREE_KBYTES SWAPPINESS

Timestamp

June 2 2019