SIGSEGV received, aborting Aerospike Community Edition build 3.5.15


#1

Hi,

We have a cluster of 4 servers with 2 sets. The OS is CentOS 6.7 x64. The first set contains about 2 billion records; peak load is 2k/20k read/write. No indexes, no scans. The second set contains about 5 million records; peak load is 60k writes, and it is read periodically, only through secondary indexes.

Once or twice a week one or more of the Aerospike servers goes down. Logs:

Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::160) SIGSEGV received, aborting Aerospike Community Edition build 3.5.15
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: found 7 frames
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46f718]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f7231aed6a0]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 2: /usr/bin/asd(as_partition_reinit+0x2c8) [0x4e0bbd]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 3: /usr/bin/asd(as_partition_balance+0x1e9a) [0x4e68ff]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 4: /usr/bin/asd(as_paxos_thr+0xf12) [0x4f035c]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 5: /lib64/libpthread.so.0(+0x79d1) [0x7f72329129d1]
Sep 15 2015 04:10:23 GMT: WARNING (as): (signal.c::162) stacktrace: frame 6: /lib64/libc.so.6(clone+0x6d) [0x7f7231ba38fd]

/var/log/messages contains either no related entries, or:

Sep 15 07:54:44 cache104 kernel: asd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Sep 15 07:54:44 cache104 kernel: asd cpuset=/ mems_allowed=0
Sep 15 07:54:44 cache104 kernel: Pid: 18206, comm: asd Not tainted 2.6.32-504.8.1.el6.x86_64 #1
Sep 15 07:54:44 cache104 kernel: Call Trace:
Sep 15 07:54:44 cache104 kernel: [<ffffffff810d40c1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Sep 15 07:54:44 cache104 kernel: [<ffffffff81127300>] ? dump_header+0x90/0x1b0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8122eb0c>] ? security_real_capable_noaudit+0x3c/0x70
Sep 15 07:54:44 cache104 kernel: [<ffffffff81127782>] ? oom_kill_process+0x82/0x2a0
Sep 15 07:54:44 cache104 kernel: [<ffffffff811276c1>] ? select_bad_process+0xe1/0x120
Sep 15 07:54:44 cache104 kernel: [<ffffffff81127bc0>] ? out_of_memory+0x220/0x3c0
Sep 15 07:54:44 cache104 kernel: [<ffffffff811344df>] ? __alloc_pages_nodemask+0x89f/0x8d0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8116c79a>] ? alloc_pages_vma+0x9a/0x150
Sep 15 07:54:44 cache104 kernel: [<ffffffff8114f6fd>] ? handle_pte_fault+0x73d/0xb00
Sep 15 07:54:44 cache104 kernel: [<ffffffff811585f4>] ? page_remove_rmap+0x54/0xa0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8113ae0c>] ? release_pages+0x21c/0x250
Sep 15 07:54:44 cache104 kernel: [<ffffffff8114fcea>] ? handle_mm_fault+0x22a/0x300
Sep 15 07:54:44 cache104 kernel: [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
Sep 15 07:54:44 cache104 kernel: [<ffffffff811497b0>] ? sys_madvise+0x350/0x790
Sep 15 07:54:44 cache104 kernel: [<ffffffff8152ffde>] ? do_page_fault+0x3e/0xa0
Sep 15 07:54:44 cache104 kernel: [<ffffffff8152d395>] ? page_fault+0x25/0x30

We cannot associate this problem with any particular action (read, write, scan, etc.).


#2

Would it be possible to install gdb and provide the output for each asd frame of the stack trace by doing the following:

sudo gdb asd

At the gdb prompt type

info line *0x46f718

and hit [Enter]

Continue with the other asd frames of the stack trace, like so:

info line *0x4e0bbd
info line *0x4e68ff
info line *0x4f035c
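If there are several crashes to go through, the addresses can be pulled out of the log automatically. The following is only a rough sketch: the helper name `gdb_commands` and the regular expression are my own, and it assumes the stack trace lines have the same format as the ones you posted. It keeps only the frames that belong to the asd binary and skips the libc/libpthread ones.

```python
import re

# Matches stack frames that belong to the asd binary, e.g.
#   "... frame 2: /usr/bin/asd(as_partition_reinit+0x2c8) [0x4e0bbd]"
# Frames from shared libraries (libc, libpthread) do not match.
FRAME_RE = re.compile(r"frame \d+: \S*/asd\([^)]*\) \[(0x[0-9a-fA-F]+)\]")


def gdb_commands(log_text):
    """Return one 'info line *ADDR' command per asd frame, in log order."""
    return ["info line *" + m.group(1) for m in FRAME_RE.finditer(log_text)]


# Sample input: the frames from the log posted above.
SAMPLE = """\
stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x59) [0x46f718]
stacktrace: frame 1: /lib64/libc.so.6(+0x326a0) [0x7f7231aed6a0]
stacktrace: frame 2: /usr/bin/asd(as_partition_reinit+0x2c8) [0x4e0bbd]
stacktrace: frame 3: /usr/bin/asd(as_partition_balance+0x1e9a) [0x4e68ff]
stacktrace: frame 4: /usr/bin/asd(as_paxos_thr+0xf12) [0x4f035c]
stacktrace: frame 5: /lib64/libpthread.so.0(+0x79d1) [0x7f72329129d1]
stacktrace: frame 6: /lib64/libc.so.6(clone+0x6d) [0x7f7231ba38fd]
"""

if __name__ == "__main__":
    for command in gdb_commands(SAMPLE):
        print(command)
```

The printed lines can be pasted at the gdb prompt one after another.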

Your /var/log/messages suggests that the server ran out of memory, which triggered the OOM killer.

Could you provide the sar memory trend output for that day:

sar -r 

Please also look at the Aerospike capacity planning guide for memory:

http://www.aerospike.com/docs/operations/plan/capacity/#memory-required
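As a rough illustration of the kind of estimate that page describes, here is a back-of-the-envelope calculation for the primary index alone. It uses the 64 bytes per record that the primary index takes in RAM. The replication factor of 2 is an assumption on my part (you have not said what yours is), the function name is made up for this example, and secondary index memory and any data kept in memory are not included.

```python
PRIMARY_INDEX_BYTES_PER_RECORD = 64  # RAM used per record by the primary index


def primary_index_gib_per_node(records, replication_factor, nodes):
    """Estimate primary index RAM per node in GiB, assuming an even spread."""
    total_bytes = records * PRIMARY_INDEX_BYTES_PER_RECORD * replication_factor
    return total_bytes / nodes / 2**30


# About 2 billion + 5 million records on 4 nodes.
# replication_factor=2 is an ASSUMPTION, not a value taken from this thread.
estimate = primary_index_gib_per_node(
    records=2_005_000_000, replication_factor=2, nodes=4
)

if __name__ == "__main__":
    print(f"primary index: about {estimate:.1f} GiB per node")
```

With these inputs the estimate comes out at roughly 60 GiB per node before anything else is counted, so it is worth comparing that figure against the RAM installed on each server.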