Case Study: OOM Kills on Aerospike cluster with new nodes

Background

This case presented as sudden Out Of Memory (OOM) kills of Aerospike nodes in a newly formed cluster. The OOM kills were surprising, as the cluster had just undergone a vertical expansion onto nodes with considerably more compute power and greater storage, and data volume had not increased to any significant degree. The nodes had been restarted by the customer and the cluster was available, but the question remained: why had a previously stable cluster experienced OOM kills, and what could be done to make sure this did not happen again?

Detail

When the case first presented, the stability of the system could not be assured. Nodes had been OOM killed and the reason for this was not clear. The primary task at this stage was to look at the system and make sure that no other nodes were about to be OOM killed. This was done by checking the state of the cluster with asadm. The asadm tool shows the memory allocated on all cluster nodes, both the memory Aerospike itemises and the memory allocated on the heap.

Checking the state of the cluster in asadm shows that there is no data-in-memory and there are no secondary indexes, so all RAM usage itemised by Aerospike is due to the primary index. Total RAM usage is therefore the Aerospike itemised RAM plus whatever is allocated on the heap.

Admin> summary -l
Cluster
=======
   1.   Server Version     :  E-4.5.3.3
   2.   OS Version         :  Ubuntu 18.04.2 LTS (4.15.0-1043-aws) (4.15.0-1043-aws)
   3.   Cluster Size       :  9
   4.   Devices            :  Total 288, per-node 32
   5.   Memory             :  Total 2.004 TB, 37.78% used (775.200 GB), 62.22% available (1.247 TB)
   6.   Disk               :  Total 58.772 TB, 20.31% used (11.939 TB), 67.89% available contiguous space (39.899 TB)
   7.   Usage (Unique Data):  0.000 B  in-memory, 5.660 TB on-disk
   8.   Active Namespaces  :  1 of 1
   9.   Features           :  KVS, XDR Destination

The itemised RAM usage is detailed below.

Admin> info namespace usage


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2019-07-04 09:58:08 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           Namespace                 Node      Total   Expirations,Evictions     Stop        Disk    Disk     HWM   Avail%          Mem     Mem    HWM      Stop      PI          PI      PI     PI   
                   .                    .    Records                       .   Writes        Used   Used%   Disk%        .         Used   Used%   Mem%   Writes%    Type        Used   Used%   HWM%   
MainNamespace          10.200.10.11:3000     1.385 B   (20.866 M, 0.000)       false     1.293 TB   20      50      69        82.578 GB   37      60     90        shmem   82.578 GB   N/E     N/E    
MainNamespace          10.200.10.22:3000     1.474 B   (21.897 M, 0.000)       false     1.363 TB   21      50      67        87.852 GB   39      60     90        shmem   87.852 GB   N/E     N/E    
MainNamespace          10.200.10.33:3000     1.519 B   (21.699 M, 0.000)       false     1.402 TB   22      50      66        90.559 GB   40      60     90        shmem   90.559 GB   N/E     N/E    
MainNamespace          10.200.10.44:3000     1.428 B   (20.833 M, 0.000)       false     1.328 TB   21      50      68        85.118 GB   38      60     90        shmem   85.118 GB   N/E     N/E    
MainNamespace          10.200.10.55:3000     1.402 B   (20.290 M, 0.000)       false     1.308 TB   21      50      68        83.547 GB   37      60     90        shmem   83.547 GB   N/E     N/E    
MainNamespace          10.200.10.66:3000     1.396 B   (20.816 M, 0.000)       false     1.309 TB   21      50      68        83.207 GB   37      60     90        shmem   83.207 GB   N/E     N/E    
MainNamespace          10.200.10.77:3000     1.410 B   (20.091 M, 0.000)       false     1.301 TB   20      50      69        84.049 GB   37      60     90        shmem   84.049 GB   N/E     N/E    
MainNamespace          10.200.10.88:3000     1.472 B   (21.586 M, 0.000)       false     1.367 TB   21      50      67        87.758 GB   39      60     90        shmem   87.758 GB   N/E     N/E    
MainNamespace          10.200.10.99:3000     1.343 B   (20.629 M, 0.000)       false     1.268 TB   20      50      69        80.058 GB   36      60     90        shmem   80.058 GB   N/E     N/E    
MainNamespace                               12.830 B   (188.707 M, 0.000)               11.939 TB                            764.727 GB                                     0.000 B                   
Number of rows: 10


Heap allocation is shown as follows:

Admin> show stat like heap

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-07-04 09:58:08 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                 :   10.200.10.11:3000    10.200.10.22:3000   10.200.10.33:3000    10.200.10.44:3000   10.200.10.55:3000   10.200.10.66:3000  10.200.10.77:3000  10.200.10.88:3000  10.200.10.99:3000   
heap_active_kbytes   :   9695420              9696480             9699336              9692416             9699064             9696932            9694396            9697216            9695920            
heap_allocated_kbytes:   9609368              9609912             9611698              9606385             9612060             9610520            9609239            9610972            9608406            
heap_efficiency_pct  :   48                   48                  49                   48                  49                  48                 48                 48                 48                 
heap_mapped_kbytes   :   19986432             20006912            19695616             19884032            19742720            19853312           20025344           20039680           20035584           
heap_site_count      :   76                   76                  76                   76                  76                  76                 76                 76                 76             

System-wide free memory can be shown as follows:

Collectinfo-analyzer> show stat like system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE               :   10.200.10.11:3000    10.200.10.22:3000   10.200.10.33:3000    10.200.10.44:3000   10.200.10.55:3000   10.200.10.66:3000  10.200.10.77:3000  10.200.10.88:3000  10.200.10.99:3000   
system_free_mem_pct:   95                   95                  95                   95                  95                  95                 96                 95                 95         


Looking at the live cluster, it is evident that, at this point, no nodes are about to be OOM killed. Total itemised memory usage across the cluster is 765 GB, which is less than half of the total RAM available. Likewise, the heap allocation is not excessive. At this point it is important to check whether there are any obvious outliers in the cluster that could be having a specific issue; here it is clear that there are none.

The nodes have not been up very long. This is shown either with the info command in asadm or as follows:

Admin> show stat like uptime
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE  :   10.200.10.11:3000    10.200.10.22:3000   10.200.10.33:3000    10.200.10.44:3000   10.200.10.55:3000   10.200.10.66:3000  10.200.10.77:3000  10.200.10.88:3000  10.200.10.99:3000   
uptime:   57                   58                  58                   58                  58                  58                 58                 58                 58                 

Admin> 

With the information available thus far it is difficult to see why OOM kills are happening within the cluster, so the next step is to analyse a log from a crashed node. The end of the log is unlikely to show anything of real interest, as an OOM kill is often characterised by an abrupt cessation in logging; however, looking at the memory usage log lines directly before the end of the log reveals more. The following log line describes what is allocated on the heap and tells a different story.

Jul 03 2019 10:52:56 GMT: INFO (info): (ticker.c:270)    system-memory: free-kbytes 5016532 free-pct 1 heap-kbytes (135113302,135687488,269752320) heap-efficiency-pct 50.1

At this point the system has less than 5 GB free (free-pct 1), while the heap figures on the same line show roughly 129 GB allocated and over 257 GB mapped, far beyond what the itemised memory accounts for.
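These ticker lines can be located quickly in a node's log with a simple grep. The path below assumes a commonly used default log location, so adjust it for the deployment in question:

grep "system-memory" /var/log/aerospike/aerospike.log | tail -n 5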

All nodes are the same type of AWS instance, so this is not a case of smaller nodes carrying the same load as larger ones, and the record counts indicate there should be sufficient space. Given that dmesg showed the Aerospike nodes were going down due to OOM kills, this created a quandary: memory was being used in excess, yet the itemised memory was not the cause. How could this be happening?

One possibility is a memory leak: a process allocating memory but not releasing it, causing gradual bloat over time that eventually results in an OOM kill. This can happen with applications from time to time, but there were no known memory-related issues in this particular version of Aerospike.

Another possibility is that the memory was being used by Aerospike but in areas that are not itemised. Aerospike itemises and reports on memory that must be actively monitored and managed by the customer. A good example of this is memory used by the primary index, which grows or shrinks according to the number of records held within the cluster at the time. Even if a cluster has been properly sized, it is conceivable that a drastic and sudden increase in load could swamp the system. For this reason metrics and levers are provided to manage these types of resource consumption; in the case of memory on a hybrid-storage system such as this, the dominant levers would be high-water-memory-pct and stop-writes-pct.
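These levers can be inspected on a live cluster in asadm using the same pattern-matching style of command shown elsewhere in this case (output omitted here):

Admin> show config like high-water-memory-pct
Admin> show config like stop-writes-pct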

There are also consumers of memory that do not need to be itemised, as their size is fixed by the user. Typically these are caches and queues. A cache or queue does not need to be itemised in the same way as the primary index because it has a fixed limit beyond which it cannot expand. Two such caches are the post-write-queue and the write-cache. The write-cache holds records waiting to be flushed to the underlying device. The post-write-queue holds recently written records so that, for efficiency, they can be read without accessing the device. Both of these have a configurable fixed size. More crucially for this case, both of these exist on a per-device basis.
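As a rough sizing sketch (not an official formula; the figures below are placeholders rather than the values from this case), the RAM a namespace sets aside for the post-write-queue can be estimated as devices × queue depth × write-block-size:

# Placeholder values; substitute the namespace's actual settings.
devices=8; post_write_queue_blocks=256; write_block_size_mib=1
echo "post-write-queue RAM: $((devices * post_write_queue_blocks * write_block_size_mib)) MiB"   # 2048 MiB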

Upon checking collectinfo it was immediately obvious that the problem lay with the sizing of the post-write-queue.

Collectinfo-analyzer> show config like post-write
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MainNamespace Namespace Configuration (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   10.200.10.11:3000    10.200.10.22:3000   10.200.10.33:3000    10.200.10.44:3000   10.200.10.55:3000   10.200.10.66:3000  10.200.10.77:3000  10.200.10.88:3000  10.200.10.99:3000   
storage-engine.post-write-queue:   4096                 4096                4096                 4096                4096                4096               4096               4096               4096               
Collectinfo-analyzer> 

The post-write-queue was configured to 4096 write blocks.

Collectinfo-analyzer> show config like write-block
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MainNamespace Namespace Configuration (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   10.200.10.11:3000    10.200.10.22:3000   10.200.10.33:3000    10.200.10.44:3000   10.200.10.55:3000   10.200.10.66:3000  10.200.10.77:3000  10.200.10.88:3000  10.200.10.99:3000   
storage-engine.write-block-size:   1048576              1048576             1048576              1048576             1048576             1048576            1048576            1048576            1048576            
Collectinfo-analyzer>

The system is set to use the default write-block-size of 1 MB, and here was the answer. This implied a post-write-queue of 4 GB per device; however, there is a post-write-queue for every device, and as there were 32 devices on the new higher-capacity nodes in this cluster, this implied a total post-write-queue allocation of 128 GB. With an index allocation of 104 GB (on this node), it is evident how easily the system could run out of the 200 GB of available memory once the post-write-queue filled up, as it did, very quickly.
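The arithmetic, repeated as a quick shell check:

# 4096 blocks x 1 MiB write-block-size x 32 devices, per node
echo "$(( 4096 * 1 * 32 / 1024 )) GiB"   # 128 GiB for the post-write-queue alone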

The issue was then simple to resolve. The post-write-queue is a dynamic parameter that can be changed without a node restart. Moderating it to a more appropriate value, in this case the default of 256 blocks (256 MB per device with a 1 MB block size), resolved the issue immediately.
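A change of this kind can be pushed to all nodes from asadm using the generic set-config info command; the line below is a sketch using the namespace name from this case, and the exact parameter path should be verified against the documentation for the version in use:

Admin> asinfo -v "set-config:context=namespace;id=MainNamespace;post-write-queue=256"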

The post-write-queue had originally been increased because the operations team had noticed that cache_read_pct was very high:

Collectinfo-analyzer> show stat like 'cache_read_pct'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MainNamespace Namespace Statistics (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE          :   10.200.10.11:3000    10.200.10.22:3000   10.200.10.33:3000    10.200.10.44:3000   10.200.10.55:3000   10.200.10.66:3000  10.200.10.77:3000  10.200.10.88:3000  10.200.10.99:3000   
cache_read_pct:   100                  100                 100                  100                 100                 100                100                100                100                
Collectinfo-analyzer> 

Increasing the post-write-queue had been a logical tuning step; however, the consequence of moving to a system with 32 storage devices per namespace, as opposed to 2, had not been considered.

Key Points of Note

  • The assumption that moving to larger nodes will alleviate capacity-related issues is sound; however, a number of configuration parameters are contingent on the number of devices. The effect of these parameters on memory consumption should be considered when changing the size of the nodes in a given cluster, whether or not load has changed.
  • When parameters have been changed to provide temporary relief for sizing issues, these should be re-evaluated when the sizing issue is resolved.
  • When considering memory that is itemised directly against memory that is not, a handy rule of thumb is that memory counters that can be associated directly with a namespace are accounted for, whereas memory counters that are associated with an entire cluster are not itemised.
  • If heap allocation is excessive and cannot easily be attributed to a wrongly configured cache, the next investigative step would be to look into jemalloc debugging information (see the note below).
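For reference, a jemalloc heap report can be requested from a running Enterprise node via the jem-stats info command; availability and output location vary by version, so treat the line below as a sketch to be checked against the documentation for the version in use:

Admin> asinfo -v "jem-stats:"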

Timestamp

July 2019
