Communication problem between nodes

version:6.0.0.1

The information asadm displays is inconsistent from one login to the next: occasionally it reports that 10.101.105.5 is offline, and connections to port 3002 time out. However, the firewall between the two hosts is active, ping works fine, and all the nodes in the cluster appear to work normally. What could cause this behavior?

host:10.101.105.4

(base) [root@ali-sin101-aerospike-105-4 ~]# asadm
Seed: [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
Aerospike Interactive Shell, version 2.7.3

Found 6 nodes
Online: 10.101.105.4:3000, 10.101.105.13:3000, 10.101.105.1:3000, 10.101.105.14:3000, 10.101.105.3:3000
Offline: 10.101.105.5:3000

Admin> exit
(base) [root@ali-sin101-aerospike-105-4 ~]# asadm
Seed: [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
Aerospike Interactive Shell, version 2.7.3

Found 6 nodes
Online: 10.101.105.4:3000, 10.101.105.5:3000, 10.101.105.13:3000, 10.101.105.1:3000, 10.101.105.3:3000, 10.101.105.14:3000
Extra nodes in alumni list: 10.101.105.1:3000, 10.101.105.3:3000, 10.101.105.13:3000, 10.101.105.14:3000, 10.101.105.4:3000
Admin> exit
You have new mail in /var/spool/mail/root
(base) [root@ali-sin101-aerospike-105-4 ~]# asadm
Seed: [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
Aerospike Interactive Shell, version 2.7.3

Found 6 nodes
Online: 10.101.105.4:3000, 10.101.105.13:3000, 10.101.105.5:3000, 10.101.105.3:3000, 10.101.105.14:3000, 10.101.105.1:3000

Admin> exit

host:10.101.105.5

(py38) [root@ali-sin101-ymaerospike-105-5 ~]# asadm
Seed: [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
Aerospike Interactive Shell, version 2.7.3

Found 6 nodes
Online: 10.101.105.5:3000, 10.101.105.3:3000, 10.101.105.14:3000, 10.101.105.13:3000, 10.101.105.4:3000
Offline: 10.101.105.1:3000
Extra nodes in alumni list: 10.101.105.3:3000, 10.101.105.5:3000, 10.101.105.14:3000, 10.101.105.1:3000, 10.101.105.13:3000

Admin> exit
(py38) [root@ali-sin101-ymaerospike-105-5 ~]# asadm
Seed: [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
Aerospike Interactive Shell, version 2.7.3

Found 6 nodes
Online: 10.101.105.5:3000, 10.101.105.14:3000, 10.101.105.13:3000, 10.101.105.1:3000, 10.101.105.4:3000, 10.101.105.3:3000

Admin> exit
(py38) [root@ali-sin101-ymaerospike-105-5 ~]#

[root@ali-sin101-aerospike-105-4 ~]# traceroute 10.101.105.5
traceroute to 10.101.105.5 (10.101.105.5), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 * * *
4 * * *
5 * * *
6 * 10.101.105.5 (10.101.105.5) 0.203 ms 0.264 ms
[root@ali-sin101-aerospike-105-4 ~]# traceroute 10.101.105.5
traceroute to 10.101.105.5 (10.101.105.5), 30 hops max, 60 byte packets
1 10.101.105.5 (10.101.105.5) 0.201 ms 0.248 ms 0.240 ms
[root@ali-sin101-aerospike-105-4 ~]# traceroute 10.101.105.5
traceroute to 10.101.105.5 (10.101.105.5), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 * * *
4 * * *
5 * * *
6 * 10.101.105.5 (10.101.105.5) 0.203 ms 0.214 ms
[root@ali-sin101-aerospike-105-4 ~]# traceroute 10.101.105.5
traceroute to 10.101.105.5 (10.101.105.5), 30 hops max, 60 byte packets
1 10.101.105.5 (10.101.105.5) 0.189 ms 0.202 ms *
[root@ali-sin101-aerospike-105-4 ~]# traceroute 10.101.105.5
traceroute to 10.101.105.5 (10.101.105.5), 30 hops max, 60 byte packets
1 10.101.105.5 (10.101.105.5) 0.211 ms * 0.227 ms
[root@ali-sin101-aerospike-105-4 ~]# traceroute 10.101.105.5
traceroute to 10.101.105.5 (10.101.105.5), 30 hops max, 60 byte packets
1 10.101.105.5 (10.101.105.5) 0.204 ms * 0.200 ms
[root@ali-sin101-aerospike-105-4 ~]# traceroute 10.101.105.5
traceroute to 10.101.105.5 (10.101.105.5), 30 hops max, 60 byte packets
1 10.101.105.5 (10.101.105.5) 0.187 ms * 0.229 ms
[root@ali-sin101-aerospike-105-4 ~]#

It looks like a network problem.

I’ll need to verify, but I believe that “Offline” in asadm means that asadm couldn’t reach the node within the specified timeout (default 5s).

I agree, I’d suspect a network issue.
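One way to check whether the node is intermittently unreachable at the TCP level, independently of asadm, is to open repeated connections with a timeout and count the failures. The following is only a minimal sketch: the function names `probe` and `probe_many` are hypothetical, and the 5-second timeout simply mirrors the asadm default mentioned above, which has not been verified here.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return (ok, seconds) for one TCP connection attempt to host:port."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False, time.monotonic() - start


def probe_many(targets, attempts=20, timeout=5.0, pause=0.5):
    """Probe every (host, port) pair several times and count the failures."""
    failures = {target: 0 for target in targets}
    for _ in range(attempts):
        for host, port in targets:
            ok, _elapsed = probe(host, port, timeout)
            if not ok:
                failures[(host, port)] += 1
        time.sleep(pause)
    return failures
```

For example, calling `probe_many([("10.101.105.5", 3000), ("10.101.105.5", 3002)])` from 10.101.105.4 would show how many of the 20 attempts per port failed. A non-zero count while ping succeeds would suggest packet loss or drops on the TCP path rather than a problem in asadm itself.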

After analyzing the dmesg log, we confirmed that a kernel bug caused the problem.

[Sat May 25 07:18:11 2024] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[Sat May 25 07:18:11 2024]   cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
[Sat May 25 07:18:11 2024]   node 0: slabs: 743, objs: 21632, free: 29
[Sat May 25 07:18:11 2024] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[Sat May 25 07:18:11 2024]   cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
[Sat May 25 07:18:11 2024]   node 0: slabs: 743, objs: 21632, free: 29
[Sat May 25 07:18:11 2024] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[Sat May 25 07:18:11 2024]   cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
[Sat May 25 07:18:11 2024]   node 0: slabs: 743, objs: 21632, free: 29
[Sat May 25 07:18:11 2024] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[Sat May 25 07:18:11 2024]   cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
[Sat May 25 07:18:11 2024]   node 0: slabs: 744, objs: 21648, free: 29
[Sat May 25 13:16:31 2024] warn_alloc_failed: 2258 callbacks suppressed
[Sat May 25 13:16:31 2024] swapper/13: page allocation failure: order:0, mode:0x20
[Sat May 25 13:16:31 2024] swapper/7: page allocation failure: order:0, mode:0x20
[Sat May 25 13:16:31 2024] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G           OE  ------------   3.10.0-957.5.1.el7.x86_64 #1
[Sat May 25 13:16:31 2024] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
[Sat May 25 13:16:31 2024] Call Trace:
[Sat May 25 13:16:31 2024]  <IRQ>  [<ffffffff8cf61e41>] dump_stack+0x19/0x1b
[Sat May 25 13:16:31 2024] swapper/15: page allocation failure: order:0, mode:0x20
[Sat May 25 13:16:31 2024]  [<ffffffff8c9bcae0>] warn_alloc_failed+0x110/0x180
[Sat May 25 13:16:31 2024] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G           OE  ------------   3.10.0-957.5.1.el7.x86_64 #1
[Sat May 25 13:16:31 2024]  [<ffffffff8ce235aa>] ? kfree_skb+0x3a/0xa0
[Sat May 25 13:16:31 2024] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
[Sat May 25 13:16:31 2024]  [<ffffffff8cf5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[Sat May 25 13:16:31 2024] Call Trace:
[Sat May 25 13:16:31 2024]  [<ffffffff8c9c1145>] __alloc_pages_nodemask+0x405/0x420
[Sat May 25 13:16:31 2024]  [<ffffffff8c9c14b8>] page_frag_alloc+0x158/0x170
[Sat May 25 13:16:31 2024]  <IRQ>  [<ffffffff8cf61e41>] dump_stack+0x19/0x1b
[Sat May 25 13:16:31 2024]  [<ffffffff8ce27d26>] __netdev_alloc_skb+0xa6/0x110
[Sat May 25 13:16:31 2024]  [<ffffffff8c9bcae0>] warn_alloc_failed+0x110/0x180
[Sat May 25 13:16:31 2024]  [<ffffffff8ce235aa>] ? kfree_skb+0x3a/0xa0
[Sat May 25 13:16:31 2024]  [<ffffffff8cf5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[Sat May 25 13:16:31 2024]  [<ffffffffc03dc24e>] page_to_skb+0x4e/0x1f0 [virtio_net]
[Sat May 25 13:16:31 2024]  [<ffffffff8cea9e00>] ? tcp_v4_rcv+0x770/0x9c0
[Sat May 25 13:16:31 2024]  [<ffffffffc03de259>] virtnet_poll+0x2c9/0x750 [virtio_net]
[Sat May 25 13:16:31 2024]  [<ffffffff8c9c1145>] __alloc_pages_nodemask+0x405/0x420
[Sat May 25 13:16:31 2024]  [<ffffffff8ce39f1f>] net_rx_action+0x26f/0x390
[Sat May 25 13:16:31 2024]  [<ffffffff8c9c14b8>] page_frag_alloc+0x158/0x170
[Sat May 25 13:16:31 2024]  [<ffffffff8c8a0f45>] __do_softirq+0xf5/0x280
[Sat May 25 13:16:31 2024]  [<ffffffff8ce27d26>] __netdev_alloc_skb+0xa6/0x110
[Sat May 25 13:16:31 2024]  [<ffffffff8cf7832c>] call_softirq+0x1c/0x30
[Sat May 25 13:16:31 2024]  [<ffffffff8c82e675>] do_softirq+0x65/0xa0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8a12c5>] irq_exit+0x105/0x110
[Sat May 25 13:16:31 2024]  [<ffffffff8cf795e6>] do_IRQ+0x56/0xf0
[Sat May 25 13:16:31 2024]  [<ffffffffc03dc24e>] page_to_skb+0x4e/0x1f0 [virtio_net]
[Sat May 25 13:16:31 2024]  [<ffffffff8cf6b362>] common_interrupt+0x162/0x162
[Sat May 25 13:16:31 2024]  [<ffffffffc03de259>] virtnet_poll+0x2c9/0x750 [virtio_net]
[Sat May 25 13:16:31 2024]  [<ffffffff8ce39f1f>] net_rx_action+0x26f/0x390
[Sat May 25 13:16:31 2024]  <EOI>  [<ffffffff8cf69aa0>] ? __cpuidle_text_start+0x8/0x8
[Sat May 25 13:16:31 2024]  [<ffffffff8c8a0f45>] __do_softirq+0xf5/0x280
[Sat May 25 13:16:31 2024]  [<ffffffff8cf69ca6>] ? native_safe_halt+0x6/0x10
[Sat May 25 13:16:31 2024]  [<ffffffff8cf7832c>] call_softirq+0x1c/0x30
[Sat May 25 13:16:31 2024]  [<ffffffff8cf69abe>] default_idle+0x1e/0xc0
[Sat May 25 13:16:31 2024]  [<ffffffff8c82e675>] do_softirq+0x65/0xa0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8366f0>] arch_cpu_idle+0x20/0xc0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8a12c5>] irq_exit+0x105/0x110
[Sat May 25 13:16:31 2024]  [<ffffffff8c8fc4da>] cpu_startup_entry+0x14a/0x1e0
[Sat May 25 13:16:31 2024]  [<ffffffff8cf795e6>] do_IRQ+0x56/0xf0
[Sat May 25 13:16:31 2024]  [<ffffffff8c857db7>] start_secondary+0x1f7/0x270
[Sat May 25 13:16:31 2024]  [<ffffffff8cf6b362>] common_interrupt+0x162/0x162
[Sat May 25 13:16:31 2024]  [<ffffffff8c8000d5>] start_cpu+0x5/0x14
[Sat May 25 13:16:31 2024]  <EOI>  [<ffffffff8cf69aa0>] ? __cpuidle_text_start+0x8/0x8
[Sat May 25 13:16:31 2024]  [<ffffffff8cf69ca6>] ? native_safe_halt+0x6/0x10
[Sat May 25 13:16:31 2024]  [<ffffffff8cf69abe>] default_idle+0x1e/0xc0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8366f0>] arch_cpu_idle+0x20/0xc0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8fc4da>] cpu_startup_entry+0x14a/0x1e0
[Sat May 25 13:16:31 2024]  [<ffffffff8c857db7>] start_secondary+0x1f7/0x270
[Sat May 25 13:16:31 2024]  [<ffffffff8c8000d5>] start_cpu+0x5/0x14
[Sat May 25 13:16:31 2024] swapper/15: page allocation failure: order:0, mode:0x20
[Sat May 25 13:16:31 2024] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G           OE  ------------   3.10.0-957.5.1.el7.x86_64 #1
[Sat May 25 13:16:31 2024] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
[Sat May 25 13:16:31 2024] Call Trace:
[Sat May 25 13:16:31 2024]  <IRQ>  [<ffffffff8cf61e41>] dump_stack+0x19/0x1b
[Sat May 25 13:16:31 2024]  [<ffffffff8c9bcae0>] warn_alloc_failed+0x110/0x180
[Sat May 25 13:16:31 2024]  [<ffffffff8ce9e30c>] ? tcp_rcv_state_process+0x1bc/0xf50
[Sat May 25 13:16:31 2024]  [<ffffffff8ce8c666>] ? __inet_lookup_established+0x46/0x130
[Sat May 25 13:16:31 2024]  [<ffffffff8cf5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[Sat May 25 13:16:31 2024]  [<ffffffff8c9c1145>] __alloc_pages_nodemask+0x405/0x420
[Sat May 25 13:16:31 2024]  [<ffffffff8c9c14b8>] page_frag_alloc+0x158/0x170
[Sat May 25 13:16:31 2024]  [<ffffffff8ce27d26>] __netdev_alloc_skb+0xa6/0x110
[Sat May 25 13:16:31 2024]  [<ffffffffc03dc24e>] page_to_skb+0x4e/0x1f0 [virtio_net]
[Sat May 25 13:16:31 2024]  [<ffffffffc03de259>] virtnet_poll+0x2c9/0x750 [virtio_net]
[Sat May 25 13:16:31 2024]  [<ffffffff8ce39f1f>] net_rx_action+0x26f/0x390
[Sat May 25 13:16:31 2024]  [<ffffffff8c8a0f45>] __do_softirq+0xf5/0x280
[Sat May 25 13:16:31 2024]  [<ffffffff8cf7832c>] call_softirq+0x1c/0x30
[Sat May 25 13:16:31 2024]  [<ffffffff8c82e675>] do_softirq+0x65/0xa0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8a12c5>] irq_exit+0x105/0x110
[Sat May 25 13:16:31 2024]  [<ffffffff8cf795e6>] do_IRQ+0x56/0xf0
[Sat May 25 13:16:31 2024]  [<ffffffff8cf6b362>] common_interrupt+0x162/0x162
[Sat May 25 13:16:31 2024]  <EOI>  [<ffffffff8cf69aa0>] ? __cpuidle_text_start+0x8/0x8
[Sat May 25 13:16:31 2024]  [<ffffffff8cf69ca6>] ? native_safe_halt+0x6/0x10
[Sat May 25 13:16:31 2024]  [<ffffffff8cf69abe>] default_idle+0x1e/0xc0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8366f0>] arch_cpu_idle+0x20/0xc0
[Sat May 25 13:16:31 2024]  [<ffffffff8c8fc4da>] cpu_startup_entry+0x14a/0x1e0
[Sat May 25 13:16:31 2024]  [<ffffffff8c857db7>] start_secondary+0x1f7/0x270
[Sat May 25 13:16:31 2024]  [<ffffffff8c8000d5>] start_cpu+0x5/0x14
[Sat May 25 13:16:31 2024] swapper/15: page allocation failure: order:0, mode:0x20
[Sat May 25 13:16:31 2024] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G           OE  ------------   3.10.0-957.5.1.el7.x86_64 #1
[Sat May 25 13:16:31 2024] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014

https://lore.kernel.org/lkml/20210927170235.589030577@linuxfoundation.org/
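To see how often these allocation failures occur and whether they line up with the times asadm reported a node as offline, the log can be summarized per timestamp. This is a rough sketch that assumes input in the `dmesg -T` format shown above; the function name `summarize` and the two regular expressions are illustrative and match only the two message types that appear in this log.

```python
import re
from collections import Counter

# Patterns modeled on the two kinds of failure messages in the log above.
PAGE_FAILURE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<comm>\S+): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)
SLUB_FAILURE = re.compile(r"^\[(?P<ts>[^\]]+)\] SLUB: Unable to allocate memory")


def summarize(lines):
    """Count allocation-failure messages per (timestamp, kind)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in (("page", PAGE_FAILURE), ("slub", SLUB_FAILURE)):
            match = pattern.match(line)
            if match:
                counts[(match.group("ts"), kind)] += 1
    return counts
```

Passing the lines of a saved `dmesg -T` capture to `summarize` returns a counter keyed by timestamp and failure kind. Note that the kernel rate-limits these warnings (see the "callbacks suppressed" line in the log), so the counts are a lower bound.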