3.6.1 crash


#1

Hey,

We upgraded to 3.6.1 (Enterprise) today and it seems pretty unstable.

Our cluster crashed with:

Sep 26 2015 21:48:35 GMT: CRITICAL (hb): (hb.c:as_hb_start_receiving:1338) unable to add socket 72 to epoll fd list: File exists


Sep 26 2015 21:48:36 GMT: WARNING (as): (signal.c::94) SIGABRT received, aborting Aerospike Enterprise Edition build 3.6.1 os debian7
Sep 26 2015 21:48:38 GMT: WARNING (as): (signal.c::96) stacktrace: found 8 frames
Sep 26 2015 21:48:39 GMT: WARNING (as): (signal.c::96) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_abort+0x5d) [0x48ee59]
Sep 26 2015 21:48:40 GMT: WARNING (as): (signal.c::96) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x321e0) [0x7f7af99e31e0]
Sep 26 2015 21:48:40 GMT: WARNING (as): (signal.c::96) stacktrace: frame 2: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f7af99e3165]
Sep 26 2015 21:48:40 GMT: WARNING (as): (signal.c::96) stacktrace: frame 3: /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7f7af99e63e0]
Sep 26 2015 21:48:40 GMT: WARNING (as): (signal.c::96) stacktrace: frame 4: /usr/bin/asd(cf_fault_event+0x22a) [0x51c2d3]
Sep 26 2015 21:48:40 GMT: WARNING (as): (signal.c::96) stacktrace: frame 5: /usr/bin/asd(as_hb_thr+0xec8) [0x4e9708]
Sep 26 2015 21:48:40 GMT: WARNING (as): (signal.c::96) stacktrace: frame 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f7afa7fdb50]
Sep 26 2015 21:48:40 GMT: WARNING (as): (signal.c::96) stacktrace: frame 7: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f7af9a8c95d]

Any ideas how to solve it? We downgraded to 3.6.0.
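
For context on the CRITICAL line: "File exists" is the standard message for the errno EEXIST, which epoll_ctl() returns when EPOLL_CTL_ADD is called for a file descriptor that is already registered with that epoll instance. The snippet below is only a minimal sketch of that kernel behaviour on Linux, written in Python rather than the server's C. The names add_twice and demo are made up for illustration, and it does not claim to show what actually happens inside as_hb_start_receiving.

```python
import errno
import os
import select
import socket


def add_twice(ep, sock):
    """Register sock with ep twice; return the errno of the second attempt, or None."""
    ep.register(sock.fileno(), select.EPOLLIN)  # first add succeeds
    try:
        ep.register(sock.fileno(), select.EPOLLIN)  # same fd added again
    except OSError as exc:
        return exc.errno
    return None


def demo():
    """Hypothetical helper: create a UDP socket and an epoll instance, then add the fd twice."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ep = select.epoll()  # Linux only
    try:
        return add_twice(ep, sock)
    finally:
        ep.close()
        sock.close()


if __name__ == "__main__":
    err = demo()
    # Print the symbolic errno name and the message text the C library uses for it.
    print(errno.errorcode.get(err), os.strerror(err))
```

If the second registration fails with EEXIST here, that is the same error condition the log line reports for socket 72. Why the heartbeat thread would hit it is a separate question.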


#2

Hello. Thanks for the report. Could you please send us the full log so we can see the configuration as well as the logs that happened prior to the crash?

Also, did you receive this same type of stack on multiple nodes?


#3

We have tried to reproduce the problem on 2- and 3-node Debian 7 and CentOS 6 mesh and multicast clusters, but the reported crash has not occurred. If you are still seeing this issue, could you please give us more info so we can help resolve it? Otherwise, please close this issue. Thanks.


#4

Hey,

I sent you a private message with a more detailed log file; that's all we've got. We haven't tried to upgrade to 3.6.1 again. Yes, we noticed this crash on 3 cluster nodes.

We had 2 servers running. Here's a short schema:

server1 
ID1 ip:3000 v 3.6.0 
ID2 ip:4000 v 3.6.1

server2 
ID3 ip:3000 v 3.6.0 
ID4 ip:4000 v 3.6.1

During this incident ID1, ID2 and ID4 crashed. In case it's relevant, all instances run in Docker.

Greetings Sascha


#5

Hello. Thanks for the info. We did not receive the private message containing the log file. Exactly who / which address did you send it to?

The fact that you are using Docker is an important clue. Are you using host networking? Does everything always work when using 3.6.0 for all 4 cluster nodes?

While I haven’t reproduced the crash using Docker yet, we can probably make progress on this issue if you can keep giving us more info. Thanks for your help!!


#6

Hello,

I sent it to you again (private message here in this forum).


#7

Got it this time ~~ Thanks! Looking into the cause. Will let you know what I find.


#8

Hello. Are you using Amazon AWS? Either way, could you please give the kernel version and Linux distro you are using? Specifically, could you please give the output of “uname -a”? Thanks!


#9

Hello psi,

Thanks for your further investigation. We're running on our own custom dedicated servers hosted at OVH.

We're running Debian 7 stable (including the latest updates).

About the kernel:

We're currently running a custom kernel (4.0.0).

The config is a copy & paste of the Debian 7 stable kernel config; if you like or need it, I can upload it for you.

PS: Anyway, we have now updated to 3.6.3 and it seems stable so far ~

Greetings Sascha