Number of heartbeat connections per node


#1

Hi, I have a 4-node cluster (version 3.5.8 CE) running a single namespace.

While analyzing the heartbeat info logs, I extracted the “trans_in_progress” lines to see the hb connections.

Server 1:

trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (27, 790021, 789994) : hb (5, 464, 459) : fab (58, 128, 70)

Server 2:

trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (22, 754969, 754947) : hb (4, 96, 92) : fab (58, 116, 58)

Server 3:

trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (20, 758292, 758272) : hb (6, 552, 546) : fab (58, 160, 102)

Server 4:

trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (16, 664968, 664952) : hb (5, 16, 11) : fab (58, 114, 56)

Is it normal that the number of heartbeat connections is not the same on every node? Shouldn't it be 6 (2 directions for each of the other nodes)?

I’m surprised to see different values in each log. Does it mean I have network problems?

Thanks a lot.

Emmanuel


#2

Looking at the logs with the detail level enabled, I see that several connections are open to the same nodes:

May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_rx_process:1716) Got heartbeat pulse from node identifying itself as 10.240.12.31:3002
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_rx_process:1716) Got heartbeat pulse from node identifying itself as 10.240.12.31:3002
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_rx_process:1716) Got heartbeat pulse from node identifying itself as 10.240.118.17:3002
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_rx_process:1716) Got heartbeat pulse from node identifying itself as 10.240.226.153:3002
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_rx_process:1716) Got heartbeat pulse from node identifying itself as 10.240.226.153:3002
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_thr:2106) sending tcp heartbeat to index 97 : msg size 339
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_thr:2106) sending tcp heartbeat to index 104 : msg size 339
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_thr:2106) sending tcp heartbeat to index 116 : msg size 339
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_thr:2106) sending tcp heartbeat to index 131 : msg size 339
May 13 2015 13:43:37 GMT: DETAIL (hb): (hb.c:as_hb_thr:2106) sending tcp heartbeat to index 150 : msg size 339 

Is this normal?

I also noticed another problem at the debug level:

May 13 2015 13:43:59 GMT: DEBUG (hb): (hb.c:as_hb_try_connecting_remote:1028) could not create heartbeat connection to node 10.240.112.35:3002 

The heartbeat is trying to connect to a node that has been dead for more than a week. The cluster has been fully restarted since that node died: it was replaced with a new instance on GCE with local SSD, and its IP address has changed.

Thanks


#3

The ideal number of heartbeat sockets on a given node is the cluster size minus 1 (3 in your 4-node cluster). However, there is a benign issue where 2 sockets may exist to some nodes, which is what you are seeing.
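
If you want to check this against your logs, here is a minimal Python sketch. It assumes the three numbers in `hb (a, b, c)` are (currently open, total opened, total closed); that reading is an assumption, though it is consistent with every line you posted, where the first number equals the second minus the third. The function names are hypothetical helpers, not part of any Aerospike tool.

```python
import re
from collections import Counter

# Matches the "hb (open, opened, closed)" group in a trans_in_progress line.
HB_FDS = re.compile(r"hb \((\d+), (\d+), (\d+)\)")

# Matches a DETAIL heartbeat pulse line; captures the timestamp and the node.
PULSE = re.compile(
    r"^(.*? GMT): DETAIL \(hb\): .*"
    r"Got heartbeat pulse from node identifying itself as (\S+)"
)


def hb_socket_report(line, cluster_size):
    """Compare the open hb socket count with the ideal (cluster size - 1)."""
    m = HB_FDS.search(line)
    if m is None:
        raise ValueError("no hb fd counters found in line")
    open_now, opened, closed = map(int, m.groups())
    expected = cluster_size - 1
    return {
        "open": open_now,
        "expected": expected,
        "extra": open_now - expected,
        # Assumed meaning of the counters: open == opened - closed.
        "consistent": open_now == opened - closed,
    }


def duplicate_pulses(lines):
    """Return (timestamp, node) pairs that were logged more than once."""
    counts = Counter()
    for line in lines:
        m = PULSE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

A positive `extra` value together with duplicated pulses from the same node within one timestamp would match the benign double-socket case described above.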

As for the connection to the dead node: was the old IP still in aerospike.conf when the servers started? Each server periodically checks whether the nodes defined there have come back. To clear the issue, you should only need to remove the old node from the configs and then restart each node.
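
To find out which nodes still carry the old address, something like the following sketch may help. It pulls the failed targets out of the debug log lines (using the message format from your post) and reports those whose IP still appears anywhere in the config text. The function names are hypothetical, and the check is a plain text search that does not parse the config file.

```python
import re

# Matches the debug message quoted in the post and captures IP and port.
FAILED = re.compile(
    r"could not create heartbeat connection to node "
    r"(\d+\.\d+\.\d+\.\d+):(\d+)"
)


def failed_heartbeat_targets(log_lines):
    """Collect the distinct (ip, port) pairs the heartbeat failed to reach."""
    targets = set()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            targets.add((m.group(1), int(m.group(2))))
    return targets


def stale_addresses(log_lines, config_text):
    """Return failed targets whose IP is still mentioned in the config text."""
    stale = []
    for ip, port in sorted(failed_heartbeat_targets(log_lines)):
        # Guard both sides so 10.240.112.35 does not match 10.240.112.350.
        pattern = r"(?<![\d.])" + re.escape(ip) + r"(?![\d.])"
        if re.search(pattern, config_text):
            stale.append((ip, port))
    return stale
```

You would run it on each node with that node's log lines and the contents of its aerospike.conf; any address it returns is a candidate for removal before the restart.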


#4

Ok, thanks a lot for your reply. Emmanuel