How to troubleshoot OOM kills on systems with a large amount of cache memory allocated

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Context

When investigating out of memory (OOM) kills of the Aerospike daemon (asd) process, it is important to determine whether the issue is genuinely due to a lack of RAM or whether the system has adequate RAM but the memory is being held in cache. It is also necessary to understand which type of cache is holding the memory, as this informs the approach that should be taken to resolve the issue. This article assumes that the system in question is properly sized in terms of RAM and proposes an approach for the latter situation, where overall RAM is adequate but that RAM is held in cache.

Usually, the first symptom of an OOM kill is an abrupt loss of the Aerospike node with little or no warning in the aerospike.log. On looking at the system in question, the free -m command shows a large amount of memory as cached.

['free -m']
             total       used       free     shared    buffers     cached
Mem:        122878     114948        929      94433         71     109744
-/+ buffers/cache:      13132     109745
Swap:            0          0          0

The following symptoms may be observed on the system:

  • OOM kills in the output of dmesg (an example check is shown after this list)
  • network card driver messages indicating that it is not possible to allocate memory. Multiple types of messages may be present, all of which will mention memory allocation failure and the network interface name.
  • asd or other processes dying without any information in the logs (abrupt logging stop)
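
As an example, the kernel ring buffer can be searched for these messages with a check such as the following (the exact wording varies by kernel version and driver):

# Search the kernel ring buffer for OOM kills and allocation failures,
# including network driver allocation failures.
$ dmesg | grep -i -E "out of memory|oom|page allocation failure"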

An example of a network card OOM trace can be found in this article.

If the system has a lot of cached memory, it may not have run out of RAM. Instead, what may have happened is that the kernel tried to allocate memory, but because that memory is held in cache, it was not able to do so.

In certain situations - like inside a network hard interrupt - the kernel cannot stop the interrupt to free up memory, and this results in OOM issues, even though there may be RAM to spare.

There are two types of cache that could be responsible for the RAM being held in cache:

  1. clean cache - Cache for filesystem writes which have already been committed to a device
  2. dirty cache - Cache for filesystem writes which have not yet been committed to a device (a quick check for the amount of dirty cache is shown after this list)
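
As a quick way of distinguishing the two, the kernel reports the amount of cached, dirty and writeback memory in /proc/meminfo, for example:

# Show how much memory is cached overall versus dirty (not yet written back).
$ grep -E "^(Cached|Dirty|Writeback):" /proc/meminfo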

Method

The first step when working with a system where there is a large amount of cached memory is to determine which type of cache is holding the memory. If the RAM is being held by clean cache, then dropping the caches via the Linux drop_caches tunable will free it up.

$ echo 3 > /proc/sys/vm/drop_caches

After running this command, the free -m command should be executed again. If the amount of RAM shown as cached has reduced significantly and the amount of free memory has increased, then the issue was predominantly with clean cache.
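
A minimal before/after sequence, run from a root shell, might look like the following (a large drop in the cached column confirms the memory was held in clean cache):

# As root: record memory usage, drop the clean caches, then compare.
$ free -m
$ echo 3 > /proc/sys/vm/drop_caches
$ free -m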

To prevent this from happening in the future, the suggestions in this article can be used to set the min_free_kbytes kernel parameter, preferably to at least 1153434 (kB), which allows for 1GB for Aerospike large allocations and 100MB for all other small process allocations.
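
As an example, using the value of 1153434 above, the parameter can be set at runtime and persisted across reboots with sysctl (run as root):

# Apply the setting to the running kernel (value is in kB).
$ sysctl -w vm.min_free_kbytes=1153434

# Persist it across reboots, then reload the sysctl settings.
$ echo "vm.min_free_kbytes = 1153434" >> /etc/sysctl.conf
$ sysctl -p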

This is also explained further in this article.

If, on the other hand, running drop_caches has not cleared the cache to any significant degree (i.e. most RAM is still shown under the cached section), the issue is dirty cache. This poses a more serious problem, as it means that the system is failing to commit data to the device fast enough, causing RAM to fill with dirty cache. This will be caused by one of the following issues with a disk, or disks:

  • disk too slow - This can be checked by reading through multiple runs of the iostat command. If some of the disks are either at 100% utilization or are constantly showing high values in w_await, then the disk is not keeping up with writes. In general, w_await values of over 1 (millisecond) suggest the disk may not be keeping up. Values that repeatedly fall behind and climb into the hundreds are a further indication that the system is using up RAM for the dirty cache of the disk in question.
  • disk faulty - It is possible that the disk shows all the signs of slowness but, according to the system sizing, should in theory be able to handle the writes adequately. In this case, it is possible the disk is faulty. To investigate this, follow the manufacturer’s guidance. Aerospike advises checking the SMART status of the disk (package smartmontools on Linux) as a general indication, where this is supported by the disk (an example check is shown after this list).
  • noisy neighbour - In cloud and virtual environments, a noisy neighbour is another system which is using up all the resources (disk or network) of the underlying hardware. In this case, the Aerospike virtual machine may suffer. If this is the case, the matter should be raised with the Cloud/Virtualisation provider.
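
As an example of the SMART check mentioned above, with /dev/sdX as a placeholder for the disk under investigation:

# Quick SMART health verdict (requires the smartmontools package and a
# device that exposes SMART data; replace /dev/sdX with the suspect disk).
$ smartctl -H /dev/sdX

# Full SMART attributes and error log for deeper inspection.
$ smartctl -a /dev/sdX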

The following example iostat output shows device xvdf falling behind, which will then cause dirty cache to be consumed:

['iostat -y -x 5 4']
Linux 4.4.0-92-generic (test-slow-disk)  06/10/2019      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.05    0.00   52.62    7.50    1.30    8.53

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    19.00    0.00    9.40     0.00   180.00    38.30     0.00    0.51    0.00    0.51   0.51   0.48
xvdf              0.00     1.60  198.40   13.80   820.80  1068.00    17.80     5.50   27.01    0.67  405.74   3.17  67.36
nvme1n1           0.00     0.00 8502.40  854.40 124980.90 109363.20    50.09     6.72    0.72    0.75    0.42   0.09  81.60
nvme0n1           0.00     0.00 8059.20  683.20 102825.00 87449.60    43.53     5.00    0.57    0.58    0.45   0.09  76.32

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.70    0.00   51.14    6.62    1.16    8.37

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    10.20    0.00    4.60     0.00   108.00    46.96     0.00    0.35    0.00    0.35   0.35   0.16
xvdf              0.00     0.60  195.40   19.00   781.60  2043.20    26.35     3.31   15.31    0.56  167.03   2.00  42.96
nvme1n1           0.00     0.00 8236.00  878.40 124959.90 112435.20    52.09     6.14    0.67    0.70    0.41   0.09  81.12
nvme0n1           0.00     0.00 7783.00  628.80 94047.50 80486.40    41.50     4.34    0.51    0.52    0.37   0.09  73.04

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.15    0.00   54.83    6.24    1.13    7.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    11.00    0.00    8.20     0.00    76.80    18.73     0.00    0.49    0.00    0.49   0.49   0.40
xvdf              0.00     1.20  192.00   19.20   768.00  1440.80    20.92     8.50   39.46    0.65  427.54   4.73 100.00
nvme1n1           0.00     0.00 8184.60  872.00 123810.10 111616.00    51.99     5.98    0.66    0.69    0.35   0.09  80.64
nvme0n1           0.00     0.00 7764.60  644.80 96174.10 82534.40    42.50     4.84    0.57    0.58    0.44   0.09  74.56

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.45    0.00   54.67    6.74    1.21    7.92

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    10.40    0.00    4.80     0.00    86.40    36.00     0.00    0.17    0.00    0.17   0.17   0.08
xvdf              0.00     2.40  190.80   19.20   763.20  1474.40    21.31     8.65   40.66    0.51  439.62   4.76 100.00
nvme1n1           0.00     0.00 8051.80  884.80 128865.90 113254.40    54.19     6.72    0.75    0.79    0.37   0.09  82.16
nvme0n1           0.00     0.00 7658.00  646.40 97255.70 82739.20    43.35     4.42    0.53    0.54    0.42   0.08  70.24

Notes

The disk that falls behind does not need to be an Aerospike data disk. It can be an XDR partition disk, or any other disk on the system, including the root partition disk. Either way, the dirty page cache allocations will drain RAM.
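
To identify what a suspect device is actually used for, the device can be mapped to its mount points, for example:

# Map block devices to mount points to see what a slow device reported by
# iostat (for example xvdf above) is actually used for.
$ lsblk -o NAME,MOUNTPOINT,SIZE,TYPE
$ df -h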

Keywords

DROP_CACHES OOM PAGE ALLOCATION FAILURE DIRTY CACHE VMXNET3

Timestamp

24 June 2019