We have an 8-node Aerospike cluster running version 4.5.0.9 with a mesh heartbeat configuration. We are replacing the existing nodes with newer servers one at a time (stop the old node, then add the new one), but every time we do this, the asd process on the new node aborts after a couple of hours. Sometimes, after a restart, it runs fine for a few days, but it eventually aborts again with the same error. We never see this issue on the old nodes:
May 10 2024 09:01:18 GMT: WARNING (fabric): (fabric.c:2019) fabric_connection_process_readable(0x7f111c20d848) invalid msg_size 1702259049 remote 0xbb9b6f1e0efec3c
May 10 2024 09:01:23 GMT: WARNING (fabric): (fabric.c:1991) fabric_connection_process_readable() recv_sz -1 msg_sz 0 errno 14 Bad address
May 10 2024 09:01:23 GMT: WARNING (fabric): (fabric.c:1914) msg_parse_fields failed for fc 0x7f111c20d848
May 10 2024 09:01:23 GMT: FAILED ASSERTION (socket): (socket.c:1424) shutdown() failed on FD -1: 9 (Bad file descriptor)
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:209) SIGUSR1 received, aborting Aerospike Community Edition build 4.5.0.9 os el7
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: registers: rax 0000000000000000 rbx 00000000021629b8 rcx 00007f767920f4fb rdx 000000000000000a rsi 00000000000012b3 rdi 0000000000000b00 rbp 00007f312cbfbc40 rsp 00007f312cbfb728 r8 0000000000000000 r9 0000000000000078 r10 00007f312cbfab60 r11 0000000000000206 r12 0000000000000001 r13 0000000000000000 r14 0000000000000006 r15 0000000000000079 rip 00007f767920f4fb
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: found 9 frames: 0x490497 0x7f767920f630 0x7f767920f4fb 0x52d098 0x53d9d5 0x4e5596 0x4e7540 0x7f7679207ea5 0x7f76776feb0d offset 0x400000
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_usr1+0x10e) [0x490497]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 1: /lib64/libpthread.so.0(+0xf630) [0x7f767920f630]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 2: /lib64/libpthread.so.0(raise+0x2b) [0x7f767920f4fb]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 3: /usr/bin/asd(cf_fault_event+0x1f0) [0x52d098]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 4: /usr/bin/asd() [0x53d9d5]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 5: /usr/bin/asd() [0x4e5596]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 6: /usr/bin/asd() [0x4e7540]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 7: /lib64/libpthread.so.0(+0x7ea5) [0x7f7679207ea5]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 8: /lib64/libc.so.6(clone+0x6d) [0x7f76776feb0d]
After this, the node becomes unresponsive and we have to reboot it. There is nothing relevant in the kernel logs.
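One detail that may help with diagnosis: the rejected msg_size in the first fabric warning decodes to four printable ASCII bytes. Our assumption (not something stated in the logs) is that printable bytes where a fabric protocol header should be would suggest the fabric port is receiving non-fabric traffic. A quick sketch of that check:

```python
# Interpret the rejected msg_size from the fabric warning as raw bytes.
# Assumption: if all four bytes are printable ASCII, the fabric socket
# probably read plain text rather than a binary fabric protocol header.
msg_size = 1702259049  # from "invalid msg_size 1702259049"

raw = msg_size.to_bytes(4, "big")
print(raw)                               # b'evii'
print(all(32 <= b < 127 for b in raw))   # True -> all printable ASCII
```

If that holds, it might be worth checking what else can reach the fabric port on the new nodes (scanners, monitoring agents, etc.).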
We have not been able to identify the root cause, and because of these restarts the migrations keep running for days. We use the same Aerospike configuration on both the new and old nodes; the only difference is that auto-pin is disabled on the new nodes (because of the bonded interface), while it is set to cpu on the old nodes.
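For reference, the auto-pin difference looks like this in the service stanza (a minimal sketch; everything else in the configuration file is identical on both sets of nodes):

```
# service stanza on the old nodes
service {
    auto-pin cpu
}

# service stanza on the new nodes: auto-pin left at its default (none),
# since cpu pinning is not usable with the bonded interface
service {
    auto-pin none
}
```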
The only other difference is hardware: the new nodes have more RAM, CPU, and network capacity and run newer kernel versions. All nodes are physical servers.