Case Study: OOM Kills on Aerospike cluster with new nodes
This case presented as sudden Out Of Memory (OOM) kills of Aerospike nodes in a newly formed cluster. The OOM kills were surprising as the cluster had undergone a vertical expansion into nodes with much more power and greater storage. Data volume had not increased by any significant degree. Nodes had been restarted by the customer and the cluster was available but the question was, why had a previously stable cluster experienced OOM kills and what could be done to make sure this did not happen again?
When the case first presented, the stability of the systen could not be assured. Nodes had been OOM killed and the reason for this was not clear. In this instance the primary task was to look at the system and make sure that no other nodes were about to OOM kill. This was done by checking the state of the cluster with
asadm tool will show the memory allocated on all cluster nodes, whether itemised or on the heap.
Checking the state of the cluster in
asadm shows that there is no data-in-memory or secondary-indexes and so all RAM usage itemised by Aerospike is due to the Primary index and the total RAM usage is Aerospike itemised RAM and whatever is allocated on the heap.
Admin> summary -l Cluster ======= 1. Server Version : E-184.108.40.206 2. OS Version : Ubuntu 18.04.2 LTS (4.15.0-1043-aws) (4.15.0-1043-aws) 3. Cluster Size : 9 4. Devices : Total 288, per-node 32 5. Memory : Total 2.004 TB, 37.78% used (775.200 GB), 62.22% available (1.247 TB) 6. Disk : Total 58.772 TB, 20.31% used (11.939 TB), 67.89% available contiguous space (39.899 TB) 7. Usage (Unique Data): 0.000 B in-memory, 5.660 TB on-disk 8. Active Namespaces : 1 of 1 9. Features : KVS, XDR Destination
The itemised RAM usage is detailed below.
Admin> info namespace usage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2019-07-04 09:58:08 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Namespace Node Total Expirations,Evictions Stop Disk Disk HWM Avail% Mem Mem HWM Stop PI PI PI PI . . Records . Writes Used Used% Disk% . Used Used% Mem% Writes% Type Used Used% HWM% MainNamespace 10.200.10.11:3000 1.385 B (20.866 M, 0.000) false 1.293 TB 20 50 69 82.578 GB 37 60 90 shmem 82.578 GB N/E N/E MainNamespace 10.200.10.22:3000 1.474 B (21.897 M, 0.000) false 1.363 TB 21 50 67 87.852 GB 39 60 90 shmem 87.852 GB N/E N/E MainNamespace 10.200.10.33:3000 1.519 B (21.699 M, 0.000) false 1.402 TB 22 50 66 90.559 GB 40 60 90 shmem 90.559 GB N/E N/E MainNamespace 10.200.10.44:3000 1.428 B (20.833 M, 0.000) false 1.328 TB 21 50 68 85.118 GB 38 60 90 shmem 85.118 GB N/E N/E MainNamespace 10.200.10.55:3000 1.402 B (20.290 M, 0.000) false 1.308 TB 21 50 68 83.547 GB 37 60 90 shmem 83.547 GB N/E N/E MainNamespace 10.200.10.66:3000 1.396 B (20.816 M, 0.000) false 1.309 TB 21 50 68 83.207 GB 37 60 90 shmem 83.207 GB N/E N/E MainNamespace 10.200.10.77:3000 1.410 B (20.091 M, 0.000) false 1.301 TB 20 50 69 84.049 GB 37 60 90 shmem 84.049 GB N/E N/E MainNamespace 10.200.10.88:3000 1.472 B (21.586 M, 0.000) false 1.367 TB 21 50 67 87.758 GB 39 60 90 shmem 87.758 GB N/E N/E MainNamespace 10.200.10.99:3000 1.343 B (20.629 M, 0.000) false 1.268 TB 20 50 69 80.058 GB 36 60 90 shmem 80.058 GB N/E N/E MainNamespace 12.830 B (188.707 M, 0.000) 11.939 TB 764.727 GB 0.000 B Number of rows: 10
Heap allocation is shown as follows:
Admin> show stat like heap ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-07-04 09:58:08 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NODE : 10.200.10.11:3000 10.200.10.22:3000 10.200.10.33:3000 10.200.10.44:3000 10.200.10.55:3000 10.200.10.66:3000 10.200.10.77:3000 10.200.10.88:3000 10.200.10.99:3000 heap_active_kbytes : 9695420 9696480 9699336 9692416 9699064 9696932 9694396 9697216 9695920 heap_allocated_kbytes: 9609368 9609912 9611698 9606385 9612060 9610520 9609239 9610972 9608406 heap_efficiency_pct : 48 48 49 48 49 48 48 48 48 heap_mapped_kbytes : 19986432 20006912 19695616 19884032 19742720 19853312 20025344 20039680 20035584 heap_site_count : 76 76 76 76 76 76 76 76 76
The whole system free memory can be shown as follows:
Collectinfo-analyzer> show stat like system ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NODE : 10.200.10.11:3000 10.200.10.22:3000 10.200.10.33:3000 10.200.10.44:3000 10.200.10.55:3000 10.200.10.66:3000 10.200.10.77:3000 10.200.10.88:3000 10.200.10.99:3000 system_free_mem_pct: 95 95 95 95 95 95 96 95 95
It is evident from looking at the live cluster that, at this point, there are no nodes about to OOM kill. Total itemised memory usage across the cluster is 765GB which is less than half of the total RAM available. Likewise the heap allocation is not excessive. At this point it is important to check if there are any obvious outliers in the cluster that could be having a specific issue, here it is clear that there are not.
The nodes have not been up very long. This is shown either with the
info command in
asadm or as follows:
Admin> show stat like uptime ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NODE : 10.200.10.11:3000 10.200.10.22:3000 10.200.10.33:3000 10.200.10.44:3000 10.200.10.55:3000 10.200.10.66:3000 10.200.10.77:3000 10.200.10.88:3000 10.200.10.99:3000 uptime: 57 58 58 58 58 58 58 58 58 Admin>
With the information available thus far it is difficult to see why there are OOM kills happening within the cluster and so the next step is to analyse a log from a crashed node. The end of the log is unlikely to show anything of real interest as an OOM kill is often characterised by an abrupt cessation in logging however, by looking into the memory usage log limes directly before the end of the log, more is revealed. The following log line describes what is allocated on the heap and tells a different story.
Jul 03 2019 10:52:56 GMT: INFO (info): (ticker.c:270) system-memory: free-kbytes 5016532 free-pct 1 heap-kbytes (135113302,135687488,269752320) heap-efficiency-pct 50.1
At this point the system has less than 5GB free.
All nodes are of the same type of AWS instance so it is not due to smaller nodes carrying the same load as larger nodes and the record numbers are showing there should be sufficient space. Given that DMESG was showing that the reason Aerospike nodes were going down was due to OOM kills this created a quandary, memory was being used in excess however the memory that was itemised for was not causing this. How could this be happening?
One possibility is a memory leak, a process allocating memory but not releasing it causing a gradual bloat over time that results in an OOM kill. This can happen with applications from time to time but there were no known memory related issues in this particular version of Aerospike.
Another possibility is that the memory was being used in Aerospike but in areas that are not itemised. Aerospike will itemise for and report on memory that must be actively monitored and managed by the customer. A good example of this is memory used by the primary index. This can grow or shrink according to the number of records held within the cluster at the time. Even if a cluster has been properly sized it is conceivable that a drastic and sudden increase in load could swamp the system. For this reason metrics and levers are provided to manage these types of resource consumption (in the case of memory in a hybrid system such as this the dominant levers would be
There are also consumers of memory that do not need to be itemised for as they are fixed by the user. Typically these would be caches and queues. A cache or queue does not need to be itemised in the same way as primary index as there is a fixed limit beyond which it cannot expand. Two such caches are the
post-write-queue and the
write-cache holds records waiting to be flushed to the underlying device. The
post-write-queue holds recently written records so that they can be read without acccessing the device for efficiency reasons. Both of these have a configurable fixed size. More crucially for this case, both of these exist on a per device basis.
collectinfo it was immediately obvious that the problem lay with the sizing of the
Collectinfo-analyzer> show config like post-write ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MainNamespace Namespace Configuration (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NODE : 10.200.10.11:3000 10.200.10.22:3000 10.200.10.33:3000 10.200.10.44:3000 10.200.10.55:3000 10.200.10.66:3000 10.200.10.77:3000 10.200.10.88:3000 10.200.10.99:3000 storage-engine.post-write-queue: 4096 4096 4096 4096 4096 4096 4096 4096 4096 Collectinfo-analyzer>
post-write-queue was configured to 4096 write blocks.
Collectinfo-analyzer> show config like write-block ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MainNamespace Namespace Configuration (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NODE : 10.200.10.11:3000 10.200.10.22:3000 10.200.10.33:3000 10.200.10.44:3000 10.200.10.55:3000 10.200.10.66:3000 10.200.10.77:3000 10.200.10.88:3000 10.200.10.99:3000 storage-engine.write-block-size: 1048576 1048576 1048576 1048576 1048576 1048576 1048576 1048576 1048576 Collectinfo-analyzer>
The system is set to use the default
write-block-size of 1MB and here was the answer. This implied a
post-write-queue of 4GB however there is a
post-write-queue for every device and as there were 32 devices on the new higher capacity nodes in this cluster, this implied a total allocation to the post write queue of 128GB. With an index allocation of 104GB (on this node) it is evident how easily the system could run out of the 200GB of available memory when the
post-write-queue filled up, as it did, very quickly.
The issue was then simple to resolve. The
post-write-queue is a dynamic parameter that can be changed without a node restart. Moderating this to a more appropriate value, in this case the 256 block (256MB with 1MB block size) default resolved the issue immediately.
Collectinfo-analyzer> show stat like 'cache_read_pct' ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MainNamespace Namespace Statistics (2019-07-03 11:57:46 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NODE : 10.200.10.11:3000 10.200.10.22:3000 10.200.10.33:3000 10.200.10.44:3000 10.200.10.55:3000 10.200.10.66:3000 10.200.10.77:3000 10.200.10.88:3000 10.200.10.99:3000 cache_read_pct: 100 100 100 100 100 100 100 100 100 Collectinfo-analyzer>
The tuning change that had been made in terms of
post-write-queue was a logical step however the consequence of moving to a system with 32 storage devices per namespace as opposed to 2 had not been considered.
Key Points of Note
- The assumption that moving to larger nodes will alleviate capacity related issues is sound, however there are a number of configurations contingent on number of devices, the effect of these parameters on memory consumption should be considered when changing the size of the nodes in a given cluster, whether or not load has changed.
- When parameters have been changed to provide temporary relief for sizing issues, these should be re-evaluated when the sizing issue is resolved.
- When considering memory that is itemised directly against that which is not, a handy rule of thumb is that memory counters that can be associated directly with a namespace are accounted for, memory counters that are associated with an entire cluster are not itemised.
- If heap allocation is excessive and this cannot easily be attributed to a wrongly configured cache, the next investigative step would have been to look into jemalloc debugging information