Tuning Kernel Memory for Performance


Problem Description

The Linux kernel tries to make full use of RAM by filling otherwise unused memory with caches, on the basis that unused RAM is wasted RAM.

Over time, the kernel fills RAM with cache. When applications or buffers need more memory, the kernel walks the cached pages, finds a block large enough to satisfy the request, frees it, and allocates it to the calling application.

Unfortunately, under certain conditions this can result in latency. Freeing cache is time-consuming in comparison with direct access to unused RAM. The latency arises purely because RAM is already being used to its full speed potential, so there may be no symptoms other than a general increase in latency. This is similar to what you may see when your hard disks cannot keep up with reads and writes. The latency may affect Aerospike itself or operating system components that allocate memory, such as the network card driver, iptables, ebtables, or iproute2. In that case it may show up as network latency instead.


The kernel memory cache contains:

  1. dirty cache <- data blocks not yet committed to file systems that support caching (e.g. ext4). It can be emptied by issuing the sync command, at times with a rather large performance penalty. This is not advised in normal use unless it is extremely important to commit data to disk (for example, when expecting a failure).
  2. clean cache <- data blocks that are on disk but are also retained in memory for fast access. Dropping the clean cache can cause a performance hit, because all data must then be fetched from disk, whereas the most frequently used data was previously served directly from RAM.
  3. inode cache <- cache of inode location information. The same rules apply as for point 2 above.
  4. slab cache <- this cache stores objects allocated by the kernel (for example inode and directory entry structures) so that they can be reused with their data already populated, which speeds up later memory allocations.
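
The current size of these caches can be read from /proc/meminfo. The following is a minimal sketch only: the helper name meminfo_kb is hypothetical, and the mapping of meminfo fields to the cache types above is approximate (Dirty for type 1, Cached for the page cache, Slab and SReclaimable for kernel object caches).

```shell
# Hypothetical helper: print the value (in kB) of one /proc/meminfo field.
# The optional second argument points it at a saved copy of meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

# Print the fields that roughly correspond to the caches listed above.
if [ -r /proc/meminfo ]; then
    for field in Dirty Cached Slab SReclaimable; do
        printf '%-13s %s kB\n' "$field" "$(meminfo_kb "$field")"
    done
fi
```

Running this before and after a change gives a rough view of which cache type is actually occupying memory.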

While not much can be done about cache type 1, the other caches can be cleared. This has two potential outcomes. Latency in allocation-heavy applications, such as the Aerospike in-memory database, will be reduced. On the other hand, disk access will become slow, as ALL data will have to be read from disk. Furthermore, clearing the slab cache on a server can introduce a temporary speed penalty (spike).

As such, clearing caches is not advised. Instead, it is advisable to tell the system that a certain amount of RAM should never be occupied by cache.

If a quick temporary fix is required, the caches can be cleared as follows (as root):

# clear page cache (type 2 above)
$ echo 1 > /proc/sys/vm/drop_caches

# clear reclaimable slab objects, including dentries and inodes (types 3 and 4 above)
$ echo 2 > /proc/sys/vm/drop_caches

# clear page cache and reclaimable slab objects (types 2, 3, 4)
$ echo 3 > /proc/sys/vm/drop_caches

Most of the space will be occupied by page cache, not slab cache. As such, if you are going to clear cache, it is advisable to only drop the page cache (echo 1).


For a more permanent fix, a minimum amount of free RAM can be set for the kernel. Consider the following example:

  Total RAM:    100GB
  Used:         10GB
  Buffers:      40GB
  Minimum free: 10GB
  Cache:        40GB

In the above example, 10GB is kept free through the minimum free option. If 5GB of memory is then suddenly allocated for buffers, the kernel allows the allocation to happen instantly and afterwards de-allocates some cache to restore 10GB of free memory. Allocations therefore happen instantly, and the cache shrinks dynamically so that 10GB remains free at all times. After the new allocation, the example looks as follows:

  Total RAM:    100GB
  Used:         10GB
  Buffers:      45GB
  Minimum free: 10GB
  Cache:        35GB
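
The arithmetic behind the example can be sketched as follows. This mirrors the simplified model used in this article (cache is whatever remains after used memory, buffers, and the minimum free reserve), not the kernel's actual reclaim logic, and the function name cache_after is hypothetical.

```shell
# Illustrative arithmetic only; all values are in GB and come from the example.
# cache_after TOTAL USED BUFFERS MIN_FREE -> GB left over for cache
cache_after() {
    echo $(( $1 - $2 - $3 - $4 ))
}

cache_after 100 10 40 10   # before the allocation: prints 40
cache_after 100 10 45 10   # after 5GB more goes to buffers: prints 35
```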

Fine-tuning the parameter depends on your current utilization. Due to the way Aerospike operates, you should preferably keep at least 1.1GB free in min_free_kbytes, if the available system memory allows. Caches will still operate effectively, while applications keep a margin to allocate into. To check the current value:

$ cat /proc/sys/vm/min_free_kbytes

The tuning is performed with "echo NUMBER > /proc/sys/vm/min_free_kbytes", where NUMBER is the number of kilobytes you want to keep free in the system. So, to leave roughly 5% of memory unoccupied on a 100GB RAM machine, you would run:

echo 5248000 > /proc/sys/vm/min_free_kbytes
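
As a rough sketch of how such a number can be derived, the snippet below computes a percentage of total RAM in kilobytes. The helper name pct_of_ram_kb is hypothetical, and 1GB is taken to be 1024 * 1024 kB; under that assumption 5% of 100GB comes to 5242880 kB, close to the rounded figure used above.

```shell
# Hypothetical helper: kilobytes equal to PCT percent of TOTAL_KB.
# Usage: pct_of_ram_kb PCT TOTAL_KB
pct_of_ram_kb() {
    echo $(( $2 * $1 / 100 ))
}

# 100GB expressed in kB, taking 1GB = 1024 * 1024 kB: prints 5242880.
pct_of_ram_kb 5 $(( 100 * 1024 * 1024 ))

# On a live system, the same calculation based on MemTotal in /proc/meminfo.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '$1 == "MemTotal:" { print $2; exit }' /proc/meminfo)
    if [ -n "$total_kb" ]; then
        pct_of_ram_kb 5 "$total_kb"
    fi
fi
```

The result is only a starting point; the value should still be validated against the actual workload before it is written to min_free_kbytes.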

At the same time, we advise either reducing swappiness to 0 or not using swap at all. For low-latency operations, using swap to any extent will slow performance drastically, to a near-halt. To set the swappiness to 0:

echo 0 > /proc/sys/vm/swappiness
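
To see whether swap is currently in use at all, the SwapTotal and SwapFree fields of /proc/meminfo can be compared. The following is a small sketch; the helper name swap_used_kb is hypothetical.

```shell
# Hypothetical helper: kB of swap currently in use, computed as
# SwapTotal - SwapFree from a meminfo-format file.
swap_used_kb() {
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}

# Report the current swappiness and swap usage on a live Linux system.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/swappiness ]; then
    echo "swappiness: $(cat /proc/sys/vm/swappiness)"
    echo "swap in use: $(swap_used_kb) kB"
fi
```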


IMPORTANT: none of the changes above are permanent. They last only until the machine is rebooted. To make the changes permanent, edit your /etc/sysctl.conf. The following lines make the above example changes permanent:

vm.min_free_kbytes = 5248000
vm.swappiness = 0
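
One way to confirm that the file and the running kernel agree is sketched below. The helper name sysctl_conf_get is hypothetical, and the parsing is deliberately simple (plain "key = value" lines, with comment lines starting with # or ; skipped).

```shell
# Hypothetical helper: print the value assigned to KEY in a sysctl.conf-style
# file. If KEY appears more than once, the last occurrence is reported.
sysctl_conf_get() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) val = v
        }
        END { if (val != "") print val }' "${2:-/etc/sysctl.conf}"
}

# Compare the configured values with what the kernel is currently using.
if [ -r /etc/sysctl.conf ] && command -v sysctl >/dev/null 2>&1; then
    for key in vm.min_free_kbytes vm.swappiness; do
        echo "$key: configured=$(sysctl_conf_get "$key") running=$(sysctl -n "$key" 2>/dev/null)"
    done
fi
```

After editing the file, running "sysctl -p" as root loads the settings from /etc/sysctl.conf without a reboot.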

As always, editing such parameters can be destructive if done incorrectly. We therefore recommend testing the changes in a lab environment before moving to production. Making the changes dynamically before making them permanent also helps to mitigate any side effects that may occur.

Another parameter, zone_reclaim_mode, is aimed at a similar outcome. Unfortunately, it causes aggressive reclaims and scans and should therefore be disabled. It is disabled by default on all newer kernels and distros. To confirm that it IS disabled, run the following:

$ sysctl -a |grep zone_reclaim_mode
vm.zone_reclaim_mode = 0

If it says 0, keep it that way!
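
The same check can be scripted, for example as part of a host validation step. This is a sketch only: the helper name zone_reclaim_disabled is hypothetical, and treating a missing file as "disabled" is an assumption (the file may be absent on kernels built without NUMA support).

```shell
# Hypothetical helper: succeed only if the zone_reclaim_mode file contains 0.
# Assumption of this sketch: a missing file is treated as "disabled".
zone_reclaim_disabled() {
    [ ! -e "${1:-/proc/sys/vm/zone_reclaim_mode}" ] ||
        [ "$(cat "${1:-/proc/sys/vm/zone_reclaim_mode}")" = "0" ]
}

if zone_reclaim_disabled; then
    echo "zone_reclaim_mode is disabled"
else
    echo "zone_reclaim_mode is enabled; consider setting it to 0"
fi
```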

Related articles:

Understanding linux memory usage reporting
Vmxnet3 page allocation failure