SIGSEGV received, aborting Aerospike Community Edition build 3.5.15

Hi,

We have a cluster of 4 servers with 2 sets. The OS is CentOS 6.7 x64. The first set contains about 2 billion records; peak load is 2k/20k read/write. No indexes, no scans. The second set contains about 5 million records; peak load is 60k writes, and it is read periodically, only through secondary indexes.

Once or twice a week one or more of the Aerospike servers goes down. The Aerospike log shows:

Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.15
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: found 7 frames
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46f718]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f7231aed6a0]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 2: /usr/bin/asd(as_partition_reinit+0x2c8) [0x4e0bbd]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 3: /usr/bin/asd(as_partition_balance+0x1e9a) [0x4e68ff]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 4: /usr/bin/asd(as_paxos_thr+0xf12) [0x4f035c]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 5: /lib64/libpthread.so.0(+0x79d1) [0x7f72329129d1]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 6: /lib64/libc.so.6(clone+0x6d) [0x7f7231ba38fd]

/var/log/messages contains either nothing relevant or the following:

Sep 15 07:54:44 cache104 kernel: asd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Sep 15 07:54:44 cache104 kernel: asd cpuset=/ mems_allowed=0
Sep 15 07:54:44 cache104 kernel: Pid: 18206, comm: asd Not tainted 2.6.32-504.8.1.el6.x86_64 #1
Sep 15 07:54:44 cache104 kernel: Call Trace:
Sep 15 07:54:44 cache104 kernel: [<ffffffff810d40c1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Sep 15 07:54:44 cache104 kernel: [<ffffffff81127300>] ? dump_header+0x90/0x1b0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8122eb0c>] ? security_real_capable_noaudit+0x3c/0x70
Sep 15 07:54:44 cache104 kernel: [<ffffffff81127782>] ? oom_kill_process+0x82/0x2a0
Sep 15 07:54:44 cache104 kernel: [<ffffffff811276c1>] ? select_bad_process+0xe1/0x120
Sep 15 07:54:44 cache104 kernel: [<ffffffff81127bc0>] ? out_of_memory+0x220/0x3c0
Sep 15 07:54:44 cache104 kernel: [<ffffffff811344df>] ? __alloc_pages_nodemask+0x89f/0x8d0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8116c79a>] ? alloc_pages_vma+0x9a/0x150
Sep 15 07:54:44 cache104 kernel: [<ffffffff8114f6fd>] ? handle_pte_fault+0x73d/0xb00
Sep 15 07:54:44 cache104 kernel: [<ffffffff811585f4>] ? page_remove_rmap+0x54/0xa0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8113ae0c>] ? release_pages+0x21c/0x250
Sep 15 07:54:44 cache104 kernel: [<ffffffff8114fcea>] ? handle_mm_fault+0x22a/0x300
Sep 15 07:54:44 cache104 kernel: [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
Sep 15 07:54:44 cache104 kernel: [<ffffffff811497b0>] ? sys_madvise+0x350/0x790
Sep 15 07:54:44 cache104 kernel: [<ffffffff8152ffde>] ? do_page_fault+0x3e/0xa0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8152d395>] ? page_fault+0x25/0x30

We cannot associate this problem with any particular action (read, write, scan, etc.).

Would it be possible to install gdb and provide the output for each frame of the stack trace by doing the following:

sudo gdb asd

At the gdb prompt type

info line *0x46f718

and hit [Enter]

Continue with the other asd frames of the stack trace, one command at a time:

info line *0x4e0bbd
info line *0x4e68ff
info line *0x4f035c
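
If this happens again with different addresses, the commands can be generated from the log instead of being typed by hand. The following is only a rough sketch: the function name gdb_commands and the regular expression are my own, and it assumes the frames are logged exactly in the format shown in the original post. It keeps only frames that belong to /usr/bin/asd, since the libc and libpthread addresses cannot be resolved against the asd binary.

```python
import re

# Matches frames such as "frame 2: /usr/bin/asd(as_partition_reinit+0x2c8) [0x4e0bbd]".
# The pattern is an assumption based on the log lines quoted in this thread.
FRAME_RE = re.compile(r"frame \d+: (\S+?)\(.*?\) \[(0x[0-9a-f]+)\]")


def gdb_commands(log_text, binary="/usr/bin/asd"):
    """Return one 'info line' command per stack frame located in `binary`."""
    commands = []
    for path, address in FRAME_RE.findall(log_text):
        if path == binary:  # skip libc/libpthread frames
            commands.append(f"info line *{address}")
    return commands


if __name__ == "__main__":
    sample = (
        "stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46f718]\n"
        "stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f7231aed6a0]\n"
        "stacktrace: frame 2: /usr/bin/asd(as_partition_reinit+0x2c8) [0x4e0bbd]\n"
    )
    for command in gdb_commands(sample):
        print(command)
```

Each printed line can be pasted at the gdb prompt as described above.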

Your /var/log/messages shows the OOM killer being triggered, so it looks like the server may have run out of memory.

Could you provide the sar memory utilization output for that day:

sar -r 

Please also look at the Aerospike capacity planning guide for memory:

http://www.aerospike.com/docs/operations/plan/capacity/#memory-required
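
As a rough illustration of that calculation, here is a back-of-the-envelope estimate of the primary index memory per node. It assumes 64 bytes of primary index per record (my understanding of the capacity planning page), a replication factor of 2 (not stated in the post, so treat it as a guess), and records spread evenly over the 4 nodes. It does not include secondary indexes or any data kept in memory, so the real requirement will be higher.

```python
# Assumed per-record primary index cost; check the capacity planning page.
PRIMARY_INDEX_BYTES_PER_RECORD = 64


def index_memory_per_node(records, replication_factor, nodes):
    """Estimate primary index bytes per node, assuming an even distribution."""
    total_bytes = records * replication_factor * PRIMARY_INDEX_BYTES_PER_RECORD
    return total_bytes / nodes


if __name__ == "__main__":
    # Record counts from the original post: about 2 billion plus about 5 million.
    records = 2_000_000_000 + 5_000_000
    # Replication factor 2 is an assumption, not something stated in the thread.
    estimate = index_memory_per_node(records, replication_factor=2, nodes=4)
    print(f"{estimate / 2**30:.1f} GiB of primary index per node")
```

Comparing this figure with the RAM on each server and with the sar output should show how close the nodes are to running out of memory.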