Aerospike crashed and reported “Connection timed out”


#1

I use Aerospike CE 3.8.3. The three-node cluster works properly most of the time, but today I found that it had crashed, reporting “Connection timed out” on all three nodes. I did not find any connection issues when I became aware of the crash, so I don’t know why it crashed. The error log follows. Thanks.

Jul 14 2017 04:08:04 GMT: INFO (info): (thr_info.c:4837)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Jul 14 2017 04:08:04 GMT: INFO (info): (thr_info.c:4854) {test} disk bytes used 5534753920 : avail pct 61
Jul 14 2017 04:08:04 GMT: INFO (info): (thr_info.c:4856) {test} memory bytes used 2293166823 (index 488249792 : sindex 9082931 : data 1795834100) : used pct 26.70
Jul 14 2017 04:08:04 GMT: INFO (info): (thr_info.c:4904) {test} migrations - complete
Jul 14 2017 04:08:04 GMT: INFO (info): (thr_info.c:4911)    partitions: actual 1413 sync 1297 desync 0 zombie 0 absent 1386
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:137) histogram dump: reads (515 total) msec
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:163)  (00: 0000000514) (01: 0000000001)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:137) histogram dump: writes_master (6914 total) msec
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:154)  (00: 0000006467) (01: 0000000160) (02: 0000000252) (03: 0000000035)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:137) histogram dump: proxy (0 total) msec
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:137) histogram dump: udf (0 total) msec
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:137) histogram dump: query (1620 total) msec
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:154)  (00: 0000000685) (01: 0000000084) (02: 0000000192) (03: 0000000066)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:154)  (04: 0000000148) (05: 0000000052) (06: 0000000136) (07: 0000000158)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:154)  (08: 0000000013) (09: 0000000001) (12: 0000000002) (13: 0000000022)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:154)  (14: 0000000027) (15: 0000000028) (16: 0000000004) (17: 0000000001)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:163)  (18: 0000000001)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:137) histogram dump: query_rec_count (1412 total) count
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:154)  (01: 0000000169) (02: 0000000169) (03: 0000000161) (04: 0000000063)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:154)  (05: 0000000220) (06: 0000000044) (07: 0000000018) (08: 0000000134)
Jul 14 2017 04:08:04 GMT: INFO (info): (hist.c:163)  (11: 0000000294) (13: 0000000140)
Jul 14 2017 04:08:09 GMT: CRITICAL (cf:socket): (socket.c:117) recv() failed: 110 Connection timed out
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:96) SIGABRT received, aborting Aerospike Community Edition build 3.8.3 os el7
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: found 9 frames
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x35) [0x4a218d]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 1: /lib64/libc.so.6(+0x35670) [0x7ffffd235670]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 2: /lib64/libc.so.6(gsignal+0x37) [0x7ffffd2355f7]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 3: /lib64/libc.so.6(abort+0x148) [0x7ffffd236ce8]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 4: /usr/bin/asd(cf_fault_sink_hold+0) [0x5383fc]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 5: /usr/bin/asd(cf_socket_recv+0xaf) [0x53e812]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 6: /usr/bin/asd(thr_demarshal+0xe8c) [0x4bd2e6]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 7: /lib64/libpthread.so.0(+0x7dc5) [0x7ffffee07dc5]
Jul 14 2017 04:08:09 GMT: WARNING (as): (signal.c:100) stacktrace: frame 8: /lib64/libc.so.6(clone+0x6d) [0x7ffffd2f6ced]
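
For anyone triaging a similar log, here is a minimal Python sketch that pulls out the CRITICAL lines and the stack frames. The regular expressions are based only on the line layout visible in the log above, and the function name `summarize_crash` is hypothetical, not part of any Aerospike tooling.

```python
import re

# Assumed layout, inferred from the log above:
# "<timestamp> GMT: <LEVEL> (<context>): (<file>:<line>) <message>"
LINE_RE = re.compile(
    r"^(?P<ts>.+? GMT): (?P<level>[A-Z]+) \((?P<ctx>[^)]+)\): "
    r"\((?P<src>[^)]+)\)\s+(?P<msg>.*)$"
)

# Assumed layout: "stacktrace: frame N: /path/to/binary(symbol+0xOFF) [0xADDR]"
FRAME_RE = re.compile(
    r"stacktrace: frame (\d+): (\S+?)\(([^)]*)\) \[(0x[0-9a-f]+)\]"
)


def summarize_crash(lines):
    """Return (critical, frames) extracted from an iterable of log lines.

    critical: list of (timestamp, source_location, message) tuples
    frames:   list of (frame_number, binary, symbol_plus_offset) tuples
    """
    critical = []
    frames = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a log line
        if m["level"] == "CRITICAL":
            critical.append((m["ts"], m["src"], m["msg"]))
        f = FRAME_RE.search(m["msg"])
        if f:
            frames.append((int(f[1]), f[2], f[3]))
    return critical, frames


if __name__ == "__main__":
    import sys

    # Usage: python summarize_crash.py /path/to/aerospike.log
    with open(sys.argv[1]) as fh:
        crit, stack = summarize_crash(fh)
    for ts, src, msg in crit:
        print(f"{ts} [{src}] {msg}")
    for num, binary, symbol in stack:
        print(f"  frame {num}: {binary} {symbol}")
```

Run against the log above, this would surface the `recv() failed: 110 Connection timed out` line together with the `cf_socket_recv` and `thr_demarshal` frames, which is the part of the trace that shows where the abort was raised.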

#2

I do not see any release notes that directly address this crash, and I cannot find any internal reference either. I do know that demarshal and sockets received quite a bit of attention around 3.11. I would suggest upgrading; we would be very interested to hear whether the issue happens on a newer build.


#3

What do your ifconfig output and heartbeat configuration look like? This reminds me of a crash I experienced when my iDRAC interface was online and I had “address any” specified.