When using the auto-pin numa parameter, the OOM killer kills the asd process. The process restarts and, after reading its data back from disk, is killed by the OOM killer again …
About the hardware: Dell PowerEdge R440, 2 x Intel Xeon Silver 4114, 96 GB RAM, 2 x 800 GB Intel SSD DC S3610.
lspci | egrep -i --color 'network|ethernet'
...
3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
3b:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
af:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
af:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
Ubuntu 18.04.4 LTS.
$ cat /etc/default/irqbalance | grep IRQBALANCE_ARGS
IRQBALANCE_ARGS="--policyscript=/etc/aerospike/irqbalance-ban.sh"
Aerospike Server version 4.8.0.6.
service {
    paxos-single-replica-limit 1
    proto-fd-max 30000
    transaction-max-ms 10000
    # default 1
    migrate-threads 10
    auto-pin numa
}
logging {
    file /var/log/aerospike/aerospike.log {
        context any warning
    }
}
network {
    service {
        address aggi
        access-address aggi
        port 3000
    }
    heartbeat {
        address aggi
        port 9918
        interval 150
        timeout 10
        mode mesh
        mesh-seed-address-port host1 9918
        mesh-seed-address-port host2 9918
        mesh-seed-address-port host3 9918
        mesh-seed-address-port host4 9918
    }
    fabric {
        address aggi
        port 3001
    }
    info {
        address aggi
        port 3003
    }
}
namespace testNS {
    replication-factor 2
    memory-size 89G
    single-bin false
    high-water-disk-pct 90
    high-water-memory-pct 90
    stop-writes-pct 100
    evict-tenths-pct 1000
    default-ttl 45d
    migrate-sleep 0
    transaction-pending-limit 50
    nsup-period 120
    nsup-threads 1
    storage-engine device {
        device /dev/sdb
        device /dev/sdc
        scheduler-mode noop
        write-block-size 128K
        post-write-queue 4096
        read-page-cache true
        defrag-sleep 0
    }
}
I am loading data into the cluster. On the host where the auto-pin numa parameter is set, the asd process is killed as soon as its memory usage reaches 45-49 GB.
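The 45-49 GB threshold is close to half of the 96 GB installed. A rough sizing check, as a sketch only: it assumes that RAM is split evenly across the two sockets and that auto-pin numa confines the asd process to the memory of a single NUMA node (the nodemask=0 in the kernel log that follows hints at this, but I have not confirmed it). The helper names parse_size and per_node_bytes are made up for this illustration.

```python
# Rough sizing check: does the namespace's memory-size fit on one NUMA node?
# Assumption: RAM is split evenly across the NUMA nodes (not verified here).

UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def parse_size(text: str) -> int:
    """Convert a size such as '89G' or '128K' to bytes."""
    text = text.strip().upper()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def per_node_bytes(total_ram: str, numa_nodes: int) -> int:
    """Even share of RAM per NUMA node, in bytes."""
    return parse_size(total_ram) // numa_nodes


total_ram = "96G"    # from the hardware description
numa_nodes = 2       # two sockets; the OOM report lists Node 0 and Node 1
memory_size = "89G"  # memory-size of namespace testNS

node_share = per_node_bytes(total_ram, numa_nodes)
fits = parse_size(memory_size) <= node_share

print(f"per-node share: {node_share / 2**30:.0f} GiB")
print(f"memory-size:    {parse_size(memory_size) / 2**30:.0f} GiB")
print("fits on one node" if fits else "exceeds one node's share")
```

Under those assumptions the configured memory-size is nearly twice what one node can hold, which would be consistent with the process dying at roughly half of total RAM.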
Mar 11 09:35:09 testHost kernel: [1809036.283244] asd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
Mar 11 09:35:09 testHost kernel: [1809036.283245] asd cpuset=/ mems_allowed=0-1
Mar 11 09:35:09 testHost kernel: [1809036.283249] CPU: 22 PID: 28532 Comm: asd Not tainted 4.15.0-88-generic #88-Ubuntu
Mar 11 09:35:09 testHost kernel: [1809036.283249] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 2.2.11 06/14/2019
Mar 11 09:35:09 testHost kernel: [1809036.283250] Call Trace:
Mar 11 09:35:09 testHost kernel: [1809036.283256] dump_stack+0x6d/0x8e
Mar 11 09:35:09 testHost kernel: [1809036.283260] dump_header+0x71/0x285
Mar 11 09:35:09 testHost kernel: [1809036.283264] ? security_capable_noaudit+0x4b/0x70
Mar 11 09:35:09 testHost kernel: [1809036.283266] oom_kill_process+0x21f/0x420
Mar 11 09:35:09 testHost kernel: [1809036.283268] out_of_memory+0x116/0x4e0
Mar 11 09:35:09 testHost kernel: [1809036.283270] __alloc_pages_slowpath+0xa53/0xe00
Mar 11 09:35:09 testHost kernel: [1809036.283273] ? alloc_pages_current+0x6a/0xe0
Mar 11 09:35:09 testHost kernel: [1809036.283275] __alloc_pages_nodemask+0x29a/0x2c0
Mar 11 09:35:09 testHost kernel: [1809036.283277] alloc_pages_current+0x6a/0xe0
Mar 11 09:35:09 testHost kernel: [1809036.283280] __page_cache_alloc+0x81/0xa0
Mar 11 09:35:09 testHost kernel: [1809036.283282] filemap_fault+0x3ea/0x6f0
Mar 11 09:35:09 testHost kernel: [1809036.283284] ? page_add_file_rmap+0x134/0x180
Mar 11 09:35:09 testHost kernel: [1809036.283285] ? filemap_map_pages+0x181/0x390
Mar 11 09:35:09 testHost kernel: [1809036.283287] ext4_filemap_fault+0x31/0x44
Mar 11 09:35:09 testHost kernel: [1809036.283290] __do_fault+0x5b/0x115
Mar 11 09:35:09 testHost kernel: [1809036.283292] __handle_mm_fault+0xdef/0x1290
Mar 11 09:35:09 testHost kernel: [1809036.283294] handle_mm_fault+0xb1/0x210
Mar 11 09:35:09 testHost kernel: [1809036.283298] __do_page_fault+0x281/0x4b0
Mar 11 09:35:09 testHost kernel: [1809036.283300] ? SyS_futex+0x13b/0x180
Mar 11 09:35:09 testHost kernel: [1809036.283302] do_page_fault+0x2e/0xe0
Mar 11 09:35:09 testHost kernel: [1809036.283305] ? page_fault+0x2f/0x50
Mar 11 09:35:09 testHost kernel: [1809036.283307] page_fault+0x45/0x50
Mar 11 09:35:09 testHost kernel: [1809036.283309] RIP: 0033:0x55c229d03b69
Mar 11 09:35:09 testHost kernel: [1809036.283309] RSP: 002b:00007f546c7c0e80 EFLAGS: 00010202
Mar 11 09:35:09 testHost kernel: [1809036.283311] RAX: 0000000000000001 RBX: 000000003050708b RCX: 0000000000000002
Mar 11 09:35:09 testHost kernel: [1809036.283311] RDX: 00000000000000eb RSI: 00007f5ece5c22c4 RDI: 00007f584cea3844
Mar 11 09:35:09 testHost kernel: [1809036.283312] RBP: 00007f546c7c0f50 R08: 000000000000004e R09: 00007f5400000000
Mar 11 09:35:09 testHost kernel: [1809036.283313] R10: 0000000000020000 R11: 00000015b21e2490 R12: 00007f546c7c1b40
Mar 11 09:35:09 testHost kernel: [1809036.283313] R13: 00007f5ece5c22c0 R14: 00007f546c7c0f38 R15: 0000000000000001
Mar 11 09:35:09 testHost kernel: [1809036.283315] Mem-Info:
Mar 11 09:35:09 testHost kernel: [1809036.283319] active_anon:11240291 inactive_anon:561035 isolated_anon:0
Mar 11 09:35:09 testHost kernel: [1809036.283319] active_file:282989 inactive_file:72380 isolated_file:0
Mar 11 09:35:09 testHost kernel: [1809036.283319] unevictable:2882 dirty:0 writeback:0 unstable:0
Mar 11 09:35:09 testHost kernel: [1809036.283319] slab_reclaimable:28252 slab_unreclaimable:28946
Mar 11 09:35:09 testHost kernel: [1809036.283319] mapped:30424 shmem:2304 pagetables:24694 bounce:0
Mar 11 09:35:09 testHost kernel: [1809036.283319] free:11954177 free_pcp:93 free_cma:0
Mar 11 09:35:09 testHost kernel: [1809036.283321] Node 0 active_anon:44889448kB inactive_anon:2239724kB active_file:0kB inactive_file:0kB unevictable:4348kB isolated(anon):0kB isolated(file):0kB mapped:4000kB dirty:0kB writeback:0kB shmem:4148kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
Mar 11 09:35:09 testHost kernel: [1809036.283322] Node 0 DMA free:15896kB min:12kB low:24kB high:36kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15980kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 11 09:35:09 testHost kernel: [1809036.283325] lowmem_reserve[]: 0 1381 46684 46684 46684
Mar 11 09:35:09 testHost kernel: [1809036.283327] Node 0 DMA32 free:182464kB min:1308kB low:2720kB high:4132kB active_anon:1251972kB inactive_anon:40kB active_file:8kB inactive_file:64kB unevictable:0kB writepending:0kB present:1521664kB managed:1456096kB mlocked:0kB kernel_stack:3936kB pagetables:3824kB bounce:0kB free_pcp:372kB local_pcp:16kB free_cma:0kB
Mar 11 09:35:09 testHost kernel: [1809036.283329] lowmem_reserve[]: 0 0 45302 45302 45302
Mar 11 09:35:09 testHost kernel: [1809036.283331] Node 0 Normal free:42912kB min:42932kB low:89320kB high:135708kB active_anon:43637796kB inactive_anon:2240928kB active_file:0kB inactive_file:0kB unevictable:4348kB writepending:0kB present:47185920kB managed:46396296kB mlocked:4348kB kernel_stack:5480kB pagetables:92120kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 11 09:35:09 testHost kernel: [1809036.283333] lowmem_reserve[]: 0 0 0 0 0
Mar 11 09:35:09 testHost kernel: [1809036.283335] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
Mar 11 09:35:09 testHost kernel: [1809036.283340] Node 0 DMA32: 293*4kB (UM) 215*8kB (U) 101*16kB (UM) 55*32kB (UME) 25*64kB (UME) 9*128kB (UE) 1*256kB (E) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ME) 41*4096kB (M) = 183356kB
Mar 11 09:35:09 testHost kernel: [1809036.283345] Node 0 Normal: 269*4kB (UME) 221*8kB (UME) 973*16kB (UME) 637*32kB (UME) 77*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43724kB
Mar 11 09:35:09 testHost kernel: [1809036.283351] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 11 09:35:09 testHost kernel: [1809036.283352] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 11 09:35:09 testHost kernel: [1809036.283352] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 11 09:35:09 testHost kernel: [1809036.283353] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 11 09:35:09 testHost kernel: [1809036.283353] 355820 total pagecache pages
Mar 11 09:35:09 testHost kernel: [1809036.283355] 0 pages in swap cache
Mar 11 09:35:09 testHost kernel: [1809036.283355] Swap cache stats: add 0, delete 0, find 0/0
Mar 11 09:35:09 testHost kernel: [1809036.283356] Free swap = 1998844kB
Mar 11 09:35:09 testHost kernel: [1809036.283356] Total swap = 1998844kB
Mar 11 09:35:09 testHost kernel: [1809036.283357] 24763803 pages RAM
Mar 11 09:35:09 testHost kernel: [1809036.283357] 0 pages HighMem/MovableOnly
Mar 11 09:35:09 testHost kernel: [1809036.283358] 411614 pages reserved
Mar 11 09:35:09 testHost kernel: [1809036.283358] 0 pages cma reserved
Mar 11 09:35:09 testHost kernel: [1809036.283358] 0 pages hwpoisoned
Mar 11 09:35:09 testHost kernel: [1809036.283359] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Mar 11 09:35:09 testHost kernel: [1809036.283390] [ 707] 0 707 57293 27820 458752 0 0 systemd-journal
Mar 11 09:35:09 testHost kernel: [1809036.283392] [ 742] 0 742 26476 306 98304 0 0 lvmetad
Mar 11 09:35:09 testHost kernel: [1809036.283393] [ 765] 0 765 12072 1417 122880 0 -1000 systemd-udevd
Mar 11 09:35:09 testHost kernel: [1809036.283396] [ 1030] 62583 1030 35484 798 180224 0 0 systemd-timesyn
Mar 11 09:35:09 testHost kernel: [1809036.283397] [ 1102] 0 1102 42599 3021 229376 0 0 networkd-dispat
Mar 11 09:35:09 testHost kernel: [1809036.283399] [ 1103] 0 1103 17664 1100 180224 0 0 systemd-logind
Mar 11 09:35:09 testHost kernel: [1809036.283400] [ 1108] 0 1108 7830 582 106496 0 0 cron
Mar 11 09:35:09 testHost kernel: [1809036.283401] [ 1112] 102 1112 65766 1008 163840 0 0 rsyslogd
Mar 11 09:35:09 testHost kernel: [1809036.283402] [ 1115] 0 1115 71906 1372 188416 0 0 accounts-daemon
Mar 11 09:35:09 testHost kernel: [1809036.283403] [ 1116] 103 1116 12529 907 143360 0 -900 dbus-daemon
Mar 11 09:35:09 testHost kernel: [1809036.283405] [ 1119] 0 1119 1130 384 57344 0 0 atopacctd
Mar 11 09:35:09 testHost kernel: [1809036.283406] [ 1126] 100 1126 17996 1082 172032 0 0 systemd-network
Mar 11 09:35:09 testHost kernel: [1809036.283408] [ 1177] 101 1177 17689 778 176128 0 0 systemd-resolve
Mar 11 09:35:09 testHost kernel: [1809036.283409] [ 1210] 0 1210 46809 3416 262144 0 0 unattended-upgr
Mar 11 09:35:09 testHost kernel: [1809036.283410] [ 1216] 0 1216 4045 372 73728 0 0 agetty
Mar 11 09:35:09 testHost kernel: [1809036.283411] [ 1469] 0 1469 18075 1393 180224 0 -1000 sshd
Mar 11 09:35:09 testHost kernel: [1809036.283413] [ 1738] 998 1738 30758 4176 131072 0 0 node_exporter
Mar 11 09:35:09 testHost kernel: [1809036.283416] [15026] 996 15026 657594 6887 425984 0 0 asprom
Mar 11 09:35:09 testHost kernel: [1809036.283417] [11753] 0 11753 27692 859 131072 0 0 irqbalance
Mar 11 09:35:09 testHost kernel: [1809036.283419] [21042] 0 21042 7119 2884 110592 0 0 atop
Mar 11 09:35:09 testHost kernel: [1809036.283421] [25494] 0 25494 26997 1797 249856 0 0 sshd
Mar 11 09:35:09 testHost kernel: [1809036.283422] [25496] 1100 25496 19192 1900 196608 0 0 systemd
Mar 11 09:35:09 testHost kernel: [1809036.283423] [25497] 1100 25497 64854 659 258048 0 0 (sd-pam)
Mar 11 09:35:09 testHost kernel: [1809036.283425] [25619] 1100 25619 26997 858 241664 0 0 sshd
Mar 11 09:35:09 testHost kernel: [1809036.283426] [25625] 1100 25625 5688 1330 81920 0 0 bash
Mar 11 09:35:09 testHost kernel: [1809036.283427] [26445] 0 26445 15878 1066 172032 0 0 sudo
Mar 11 09:35:09 testHost kernel: [1809036.283428] [26446] 1100 26446 3640 645 73728 0 0 grep
Mar 11 09:35:09 testHost kernel: [1809036.283429] [26447] 0 26447 1876 194 57344 0 0 tail
Mar 11 09:35:09 testHost kernel: [1809036.283431] [26494] 0 26494 26997 1853 249856 0 0 sshd
Mar 11 09:35:09 testHost kernel: [1809036.283432] [26600] 1100 26600 26997 877 241664 0 0 sshd
Mar 11 09:35:09 testHost kernel: [1809036.283433] [26601] 1100 26601 5685 1334 90112 0 0 bash
Mar 11 09:35:09 testHost kernel: [1809036.283434] [27351] 997 27351 16223 4713 159744 0 0 python3
Mar 11 09:35:09 testHost kernel: [1809036.283435] [27352] 0 27352 12548093 11771869 96215040 0 0 asd
Mar 11 09:35:09 testHost kernel: [1809036.283437] Out of memory: Kill process 27352 (asd) score 857 or sacrifice child
Mar 11 09:35:09 testHost kernel: [1809036.284199] Killed process 27352 (asd) total-vm:50192372kB, anon-rss:47087476kB, file-rss:0kB, shmem-rss:0kB
Mar 11 09:35:11 testHost kernel: [1809037.862931] oom_reaper: reaped process 27352 (asd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Mar 11 09:35:11 testHost systemd[1]: aerospike.service: Main process exited, code=killed, status=9/KILL
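To make the numbers in the report easier to compare, here is a small sketch that converts them to common units. All figures are copied from the log above; the 4 KiB page size is my assumption (the report shows no huge pages in use), and the variable names are only for illustration.

```python
# Figures copied from the OOM report above.
# Assumption: 4 KiB pages (the report shows hugepages_total=0 on both nodes).
PAGE_KB = 4

asd_rss_pages = 11771869      # rss column for asd (pid 27352)
asd_anon_rss_kb = 47087476    # anon-rss on the "Killed process" line
free_pages_total = 11954177   # Mem-Info "free:" (system-wide)
node0_managed_kb = {"DMA": 15896, "DMA32": 1456096, "Normal": 46396296}
node0_normal_free_kb = 42912  # Node 0 Normal free
node0_normal_min_kb = 42932   # Node 0 Normal min watermark


def kb_to_gib(kb: int) -> float:
    """Convert kB (as printed by the kernel, i.e. KiB) to GiB."""
    return kb / 2**20


asd_rss_kb = asd_rss_pages * PAGE_KB
node0_total_kb = sum(node0_managed_kb.values())
free_total_kb = free_pages_total * PAGE_KB
below_min = node0_normal_free_kb < node0_normal_min_kb

print(f"asd RSS:           {kb_to_gib(asd_rss_kb):.1f} GiB")
print(f"node 0 managed:    {kb_to_gib(node0_total_kb):.1f} GiB")
print(f"asd / node 0:      {asd_rss_kb / node0_total_kb:.1%}")
print(f"free, all nodes:   {kb_to_gib(free_total_kb):.1f} GiB")
print(f"node 0 Normal below min watermark: {below_min}")
```

The rss column times the page size matches the reported anon-rss exactly, so the page-size assumption looks right. Read this way, asd alone fills almost all of node 0's managed memory and the Normal zone there is below its min watermark, while a large amount of memory is still free system-wide (presumably on node 1). The allocation that triggered the OOM killer was restricted with nodemask=0.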
What else needs to be configured?