Frequent node restart due to invalid msg_size

We have an 8-node cluster running version 4.5.0.9 with a mesh heartbeat configuration. We are replacing the existing nodes with newer servers one by one, but whenever we swap an old node for a new one (stop the old node, add the new one), the asd process on the new node aborts after a couple of hours. Sometimes after restarting it runs fine for a few days, but then it aborts again with the same error. We don't see this issue on any of the old nodes:

May 10 2024 09:01:18 GMT: WARNING (fabric): (fabric.c:2019) fabric_connection_process_readable(0x7f111c20d848) invalid msg_size 1702259049 remote 0xbb9b6f1e0efec3c
May 10 2024 09:01:23 GMT: WARNING (fabric): (fabric.c:1991) fabric_connection_process_readable() recv_sz -1 msg_sz 0 errno 14 Bad address
May 10 2024 09:01:23 GMT: WARNING (fabric): (fabric.c:1914) msg_parse_fields failed for fc 0x7f111c20d848
May 10 2024 09:01:23 GMT: FAILED ASSERTION (socket): (socket.c:1424) shutdown() failed on FD -1: 9 (Bad file descriptor)
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:209) SIGUSR1 received, aborting Aerospike Community Edition build 4.5.0.9 os el7
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: registers: rax 0000000000000000 rbx 00000000021629b8 rcx 00007f767920f4fb rdx 000000000000000a rsi 00000000000012b3 rdi 0000000000000b00 rbp 00007f312cbfbc40 rsp 00007f312cbfb728 r8 0000000000000000 r9 0000000000000078 r10 00007f312cbfab60 r11 0000000000000206 r12 0000000000000001 r13 0000000000000000 r14 0000000000000006 r15 0000000000000079 rip 00007f767920f4fb
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: found 9 frames: 0x490497 0x7f767920f630 0x7f767920f4fb 0x52d098 0x53d9d5 0x4e5596 0x4e7540 0x7f7679207ea5 0x7f76776feb0d offset 0x400000
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_usr1+0x10e) [0x490497]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 1: /lib64/libpthread.so.0(+0xf630) [0x7f767920f630]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 2: /lib64/libpthread.so.0(raise+0x2b) [0x7f767920f4fb]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 3: /usr/bin/asd(cf_fault_event+0x1f0) [0x52d098]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 4: /usr/bin/asd() [0x53d9d5]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 5: /usr/bin/asd() [0x4e5596]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 6: /usr/bin/asd() [0x4e7540]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 7: /lib64/libpthread.so.0(+0x7ea5) [0x7f7679207ea5]
May 10 2024 09:01:23 GMT: WARNING (as): (signal.c:211) stacktrace: frame 8: /lib64/libc.so.6(clone+0x6d) [0x7f76776feb0d]

After this, the node becomes unresponsive and we have to reboot it. There is nothing in the kernel logs.

We have not been able to identify the root cause, and because of it the migrations keep running for days. We are using the same Aerospike configuration on both the new and old nodes; the only difference is that auto-pin is off on the new nodes (because of the bonded interface) and set to cpu on the old nodes.

The other hardware difference is that the new nodes have more RAM, CPU, and network capacity, and run newer kernel versions. All are physical servers.
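To double-check that the Aerospike configuration really is identical apart from the auto-pin setting, a minimal sketch along these lines could diff a config copied from an old node against one from a new node (the local file names are assumptions; copy each node's /etc/aerospike/aerospike.conf somewhere first):

    #!/usr/bin/env python3
    # Diff two aerospike.conf copies to confirm auto-pin is the only difference.
    # The file names below are placeholders for configs copied off an old and a new node.
    import difflib
    from pathlib import Path

    old_conf = Path("aerospike.conf.old-node").read_text().splitlines()
    new_conf = Path("aerospike.conf.new-node").read_text().splitlines()

    # Print a unified diff; empty output means the files are identical.
    for line in difflib.unified_diff(old_conf, new_conf,
                                     fromfile="old-node", tofile="new-node",
                                     lineterm=""):
        print(line)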

4.5.0.9 was released in March 2019, and it isn't the latest hotfix for that lineage. I see one fabric crash fix documented in a later hotfix of that lineage. I'd suggest upgrading to at least the latest 4.9, since it was a “jump version” and received hotfixes for an extended duration. Be sure to read through the special upgrades document linked from the release notes.

  • [AER-6197] - (FABRIC) Incorrect handling of unsupported message types can cause a crash.

Are you referring to this fix in 4.5.0.24? If so, I can give it a try, since I assume there will be no breaking changes in a sub-version release. Can you please confirm?

We were planning to upgrade to the latest 4.9 version, but I am a little worried it could break something in production, so we are holding off for a while.

Another thing we noticed is that the old nodes have an MTU of 1500 while the new nodes have an MTU of 9000. Do you think that could be the problem?

Yes, that is the fix I was referring to, though your problem doesn't really match that issue. Regressions in hotfix releases are very rare; we use the last digit to indicate such a release. On the few occasions a regression did occur in a hotfix, it was quickly caught and a subsequent release was made to address it. I suggest you at least go to the latest hotfix for this lineage. If the problem still occurs, your best option would be the latest version of 4.9 - I assume production is broken already, or is this a test cluster?

I'm not aware of issues with larger MTUs; running with jumbo frames is fairly common in on-prem deployments. The mixture of jumbo and standard frame sizes within the same cluster is likely an untested environment configuration.

Regarding migration time, migrations complete much faster on Enterprise versions due to delta migrations.

Thanks. I will first try 4.5.0.24, the latest in the 4.5.0 series, and then 4.9.0.37, the latest 4.9.

I assume production is broken already, or is this a test cluster?

It becomes unstable frequently (sometimes multiple times a day, sometimes only after a few days). Because we are seeing this issue only on the new nodes (1-2 of them) and not on the old nodes (the remaining 5-6), and we have replicas, the cluster is still in a workable state (thanks to Aerospike :slight_smile: ), although it is a lot of operational overhead.

The mixture of jumbo and standard frame sizes within the same cluster is likely an untested environment configuration.

Interesting! I will change the MTU, observe, and update here if it fixes the issue.
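For anyone verifying the same thing, a minimal sketch that reads the MTU from sysfs on each node; the interface name and expected value are assumptions, so substitute whichever interface actually carries the Aerospike fabric/heartbeat traffic:

    #!/usr/bin/env python3
    # Print this node's MTU for the given interface and flag a mismatch.
    # "bond0" and 1500 are assumptions -- use the interface Aerospike binds to
    # and the MTU value you standardize the cluster on.
    from pathlib import Path

    INTERFACE = "bond0"
    EXPECTED_MTU = 1500

    mtu = int(Path(f"/sys/class/net/{INTERFACE}/mtu").read_text().strip())
    status = "OK" if mtu == EXPECTED_MTU else "MISMATCH"
    print(f"{status}: {INTERFACE} MTU is {mtu} (expected {EXPECTED_MTU})")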

It seems the issue was the mismatched MTU across nodes. We set the same MTU on all nodes and have not seen the issue since.