Asd invoked oom-killer

Hi, We are using Aerospike 3.16.0.6-1 with the following namespace configuration

namespace inference {
	replication-factor 2
	memory-size 36G
	default-ttl 0
	storage-engine device {
		device /dev/vdb1
		write-block-size 256K
		**data-in-memory true**
	}
	stop-writes-pct 90
	high-water-memory-pct 80
	high-water-disk-pct 80
}

Total memory available in the VM is 40 GBs. The data resides both on the device (SSD) and in memory.

Frequently, we are seeing the asd process getting killed by OOM killer. I could not find anything in the aerospike logs. But this is what I see in the syslog

Sep 23 18:18:30 kernel: [3132374.436096] asd invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
Sep 23 18:18:30 kernel: [3132374.436097] asd cpuset=/ mems_allowed=0
Sep 23 18:18:30 kernel: [3132374.436101] CPU: 3 PID: 31632 Comm: asd Not tainted 4.9.0-12-amd64 #1 Debian 4.9.210-1
Sep 23 18:18:30 kernel: [3132374.436101] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1 04/01/2014
Sep 23 18:18:30 kernel: [3132374.436102]  0000000000000000 ffffffff9cb36bfe ffffb7efcfdd3c10 ffff92901073a200
Sep 23 18:18:30 kernel: [3132374.436105]  ffffffff9ca09adb 0000000000000000 0000000000000000 0000000c00000001
Sep 23 18:18:30 kernel: [3132374.436106]  ffff928fd9f7c500 ffffffff9c98d587 ffffffffc0464765 ffffffff9caad670
Sep 23 18:18:30 kernel: [3132374.436108] Call Trace:
Sep 23 18:18:30 kernel: [3132374.436113]  [<ffffffff9cb36bfe>] ? dump_stack+0x66/0x88
Sep 23 18:18:30 kernel: [3132374.436116]  [<ffffffff9ca09adb>] ? dump_header+0x78/0x1fd
Sep 23 18:18:30 kernel: [3132374.436119]  [<ffffffff9c98d587>] ? get_page_from_freelist+0x3f7/0xb20
Sep 23 18:18:30 kernel: [3132374.436122]  [<ffffffffc0464765>] ? virtballoon_oom_notify+0x25/0x70 [virtio_balloon]
Sep 23 18:18:30 kernel: [3132374.436125]  [<ffffffff9caad670>] ? cap_inode_killpriv+0x20/0x20
Sep 23 18:18:30 kernel: [3132374.436126]  [<ffffffff9c9894da>] ? oom_kill_process+0x22a/0x3f0
Sep 23 18:18:30 kernel: [3132374.436128]  [<ffffffff9c989971>] ? out_of_memory+0x111/0x470
Sep 23 18:18:30 kernel: [3132374.436129]  [<ffffffff9c98ea6f>] ? __alloc_pages_slowpath+0xa1f/0xb30
Sep 23 18:18:30 kernel: [3132374.436131]  [<ffffffff9c98ed81>] ? __alloc_pages_nodemask+0x201/0x260
Sep 23 18:18:30 kernel: [3132374.436134]  [<ffffffff9c9e0871>] ? alloc_pages_current+0x91/0x140
Sep 23 18:18:30 kernel: [3132374.436135]  [<ffffffff9c987906>] ? filemap_fault+0x326/0x5d0
Sep 23 18:18:30 kernel: [3132374.436149]  [<ffffffffc0327bc1>] ? ext4_filemap_fault+0x31/0x50 [ext4]
Sep 23 18:18:30 kernel: [3132374.436151]  [<ffffffff9c9b8cd9>] ? __do_fault+0xa9/0x1c0
Sep 23 18:18:30 kernel: [3132374.436153]  [<ffffffff9c9bced4>] ? handle_mm_fault+0xda4/0x1350
Sep 23 18:18:30 kernel: [3132374.436156]  [<ffffffff9ca5816a>] ? SyS_epoll_ctl+0x6fa/0xe20
Sep 23 18:18:30 kernel: [3132374.436158]  [<ffffffff9ca57b6f>] ? SyS_epoll_ctl+0xff/0xe20
Sep 23 18:18:30 kernel: [3132374.436160]  [<ffffffff9c863475>] ? __do_page_fault+0x255/0x4f0
Sep 23 18:18:30 kernel: [3132374.436162]  [<ffffffff9ce1f618>] ? async_page_fault+0x28/0x30
Sep 23 18:18:30 kernel: [3132374.436163] Mem-Info:
Sep 23 18:18:30 kernel: [3132374.436166] active_anon:9898799 inactive_anon:4328 isolated_anon:0
Sep 23 18:18:30 kernel: [3132374.436166]  active_file:301 inactive_file:388 isolated_file:0
Sep 23 18:18:30 kernel: [3132374.436166]  unevictable:0 dirty:0 writeback:0 unstable:0
Sep 23 18:18:30 kernel: [3132374.436166]  slab_reclaimable:2622 slab_unreclaimable:5795
Sep 23 18:18:30 kernel: [3132374.436166]  mapped:137 shmem:7009 pagetables:24611 bounce:0
Sep 23 18:18:30 kernel: [3132374.436166]  free:330538 free_pcp:30 free_cma:0
Sep 23 18:18:30 kernel: [3132374.436169] Node 0 active_anon:39595196kB inactive_anon:17312kB active_file:1204kB inactive_file:1552kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:548kB dirty:0kB writeback:0kB shmem:28036kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:5970 all_unreclaimable? no
: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:5970 all_unreclaimable? no
Sep 23 18:18:30 kernel: [3132374.436169] Node 0 DMA free:15908kB min:444kB low:552kB high:660kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Sep 23 18:18:30 kernel: [3132374.436172] lowmem_reserve[]: 0 1950 40205 40205 40205
Sep 23 18:18:30 kernel: [3132374.436174] Node 0 DMA32 free:209720kB min:56400kB low:70500kB high:84600kB active_anon:1798752kB inactive_anon:8kB active_file:432kB inactive_file:844kB unevictable:0kB writepending:0kB present:2080604kB managed:2015036kB mlocked:0kB slab_reclaimable:60kB slab_unreclaimable:612kB kernel_stack:240kB pagetables:4016kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Sep 23 18:18:30 kernel: [3132374.436177] lowmem_reserve[]: 0 0 38255 38255 38255
Sep 23 18:18:30 kernel: [3132374.436179] Node 0 Normal free:1096524kB min:1096584kB low:1370728kB high:1644872kB active_anon:37796444kB inactive_anon:17304kB active_file:772kB inactive_file:536kB unevictable:0kB writepending:0kB present:39845888kB managed:39177832kB mlocked:0kB slab_reclaimable:10428kB slab_unreclaimable:22568kB kernel_stack:6624kB pagetables:94428kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
Sep 23 18:18:30 kernel: [3132374.436182] lowmem_reserve[]: 0 0 0 0 0
Sep 23 18:18:30 kernel: [3132374.436184] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Sep 23 18:18:30 kernel: [3132374.436191] Node 0 DMA32: 424*4kB (UME) 246*8kB (UME) 92*16kB (UME) 46*32kB (UME) 18*64kB (UME) 7*128kB (UE) 12*256kB (U) 23*512kB (UE) 18*1024kB (UME) 10*2048kB (UME) 36*4096kB (UM) = 209872kB
Sep 23 18:18:30 kernel: [3132374.436199] Node 0 Normal: 379*4kB (UME) 323*8kB (UME) 197*16kB (UME) 42*32kB (ME) 46*64kB (ME) 22*128kB (UME) 30*256kB (UME) 21*512kB (UME) 13*1024kB (UME) 13*2048kB (UME) 250*4096kB (UME) = 1096724kB
Sep 23 18:18:30 kernel: [3132374.436207] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Sep 23 18:18:30 kernel: [3132374.436208] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 23 18:18:30 kernel: [3132374.436209] 7722 total pagecache pages
Sep 23 18:18:30 kernel: [3132374.436209] 0 pages in swap cache
Sep 23 18:18:30 kernel: [3132374.436210] Swap cache stats: add 0, delete 0, find 0/0
Sep 23 18:18:30 kernel: [3132374.436211] Free swap  = 0kB
Sep 23 18:18:30 kernel: [3132374.436211] Total swap = 0kB
Sep 23 18:18:30 kernel: [3132374.436211] 10485621 pages RAM
Sep 23 18:18:30 kernel: [3132374.436212] 0 pages HighMem/MovableOnly
Sep 23 18:18:30 kernel: [3132374.436212] 183427 pages reserved
Sep 23 18:18:30 kernel: [3132374.436212] 0 pages hwpoisoned

Just before the aerospike process was killed, this was the ticker output in aerospike log.

Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:166) NODE-ID bb9e9c2330a0102 CLUSTER-SIZE 19
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:245)    system-memory: free-kbytes 1892632 free-pct 4 heap-kbytes (28735028,38935172,41752576) heap-efficiency-pct 68.8
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:259)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:281)    fds: proto (200,21882,21682) heartbeat (19,84712,84693) fabric (456,496863,496407)
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:290)    heartbeat-received: self 2 foreign 2691157
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:320)    fabric-bytes-per-second: bulk (782897,137018) ctrl (43,70) meta (43,45) rw (4887,9159)
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:344)    early-fail: demarshal 0 tsvc-client 107 tsvc-batch-sub 292 tsvc-udf-sub 0
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:361)    batch-index: batches (10023620,295,599)
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:139) histogram dump: batch-index (10023915 total) msec
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:156)  (00: 0009998625) (01: 0000013817) (02: 0000007516) (03: 0000002549)
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:156)  (04: 0000000737) (05: 0000000488) (06: 0000000102) (07: 0000000039)
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:156)  (08: 0000000025) (09: 0000000005) (11: 0000000010) (12: 0000000002)
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:375) {inference} objects: all 147129300 master 15012518 prole 132116731 non-replica 51
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:417) {inference} migrations: remaining (458,523,470) active (3,0,0) complete-pct 0.51
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:439) {inference} memory-usage: total-bytes 21389802639 index-bytes 9416275200 sindex-bytes 0 data-bytes 11973527439 used-pct 55.34
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:469) {inference} device-usage: used-bytes 35198849664 avail-pct 67
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:538) {inference} client: tsvc (0,0) proxy (167,0,2) read (0,0,0,0) write (37706909,0,54) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:590) {inference} batch-sub: tsvc (0,148) proxy (370,0,476) read (10528669,0,0,2169059)
Sep 23 2020 12:27:31 GMT: INFO (info): (ticker.c:709) {inference} retransmits: migration 6825 client-read 0 client-write (306,96) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:139) histogram dump: {inference}-write (37706909 total) msec
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:156)  (00: 0034994825) (01: 0000622370) (02: 0000699394) (03: 0000821947)
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:156)  (04: 0000476140) (05: 0000084414) (06: 0000002897) (07: 0000000738)
Sep 23 2020 12:27:31 GMT: INFO (info): (hist.c:156)  (08: 0000004053) (09: 0000000057) (10: 0000000035) (11: 0000000039)

Few observations,

  1. At the time of failure, there is some free space left in the system-memory
  2. At the time of failure, Memory used percentage = ~ 55 %
  3. heap_active_kbytes slowly increases and when it is near 36 GB (memory size configured in the namespace), the aerospike process is killed. heap_allocated_kbytes remains more or less constant.

There is no specific pattern for the failure. It happens during the time when there is little load in the cluster. I tried searching for other articles on the forum but they were not of much help. Reallly appreciate if I get some leads here.

So far, it looks like increase in heap_active_kbytes is causing this issue. What could be the root cause and fix for this?

Hi, Any leads on this would really be grateful. Thanks, Akshay

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.