FAQ - What does “found redundant connections to same node” mean?

Detail

Occasional occurrences, or a continuous stream, of the following message may be observed in the logs:

found redundant connections to same node, fds 210 209 - choosing at random

Answer

By itself, this message indicates that the Aerospike heartbeat (HB) protocol has found two connections open to the same destination node. Normally there should be only one connection open to each node. To deal with this, Aerospike prints the message above and chooses one fd (connection handle) at random from the list of available descriptors.
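
To gauge how often the message appears and which descriptors it reports, the log can be scanned with a short script. The following is a minimal sketch, not an Aerospike tool: the function names are hypothetical, and the pattern assumes the message always has the shape of the sample line above, with a whitespace-separated list of integer fds.

```python
import re
from collections import Counter

# Assumption: the fd list is a run of whitespace-separated integers,
# exactly as in the sample message quoted above.
REDUNDANT_RE = re.compile(
    r"found redundant connections to same node, fds ((?:\d+\s+)+)- choosing at random"
)


def redundant_fd_sets(lines):
    """Yield the tuple of fds reported by each redundant-connection message."""
    for line in lines:
        match = REDUNDANT_RE.search(line)
        if match:
            yield tuple(int(fd) for fd in match.group(1).split())


def summarize(lines):
    """Count how many times each set of fds was reported."""
    return Counter(redundant_fd_sets(lines))


# Illustrative input only; in practice the lines would come from the server log.
sample = [
    "found redundant connections to same node, fds 210 209 - choosing at random",
    "an unrelated log line",
    "found redundant connections to same node, fds 210 209 - choosing at random",
]
summary = summarize(sample)
```

A handful of hits suggests a transient event, whereas a count that keeps growing points to an ongoing connectivity problem.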

To see why this might happen, look at a wider range of log messages, typically those associated with the HB protocol.

The example below shows messages that may accompany redundant connections:

Jun 04 2019 21:15:42 GMT: WARNING (socket): (socket.c:959) Error while connecting socket to 172.17.0.7:3002
Jun 04 2019 21:15:42 GMT: WARNING (hb): (hb.c:4882) could not create heartbeat connection to node {172.17.0.7:3002}
Jun 04 2019 21:15:42 GMT: WARNING (socket): (socket.c:891) Timeout while connecting

These messages indicate a communication issue between the nodes. The issue is severe enough to affect heartbeats between them, so connectivity must be checked and investigated. Note that this does not mean the nodes are unable to communicate at all; it may instead indicate that the nodes intermittently lose their connections and/or TCP packets.
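
To see which peers are affected, the warnings can be grouped by destination address. This is a rough sketch under stated assumptions: the function name is hypothetical, the patterns are taken only from the sample lines above, and only IPv4 address:port pairs are matched.

```python
import re
from collections import Counter

# Assumption: the peer appears as an IPv4 address:port, either bare after
# "connecting socket to" or inside braces after "connection to node".
PEER_RE = re.compile(
    r"(?:Error while connecting socket to |"
    r"could not create heartbeat connection to node \{)"
    r"(\d{1,3}(?:\.\d{1,3}){3}:\d+)"
)


def failures_by_peer(lines):
    """Count connection-failure warnings per destination address:port."""
    counts = Counter()
    for line in lines:
        match = PEER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The three sample lines from the log excerpt above.
sample = [
    "Jun 04 2019 21:15:42 GMT: WARNING (socket): (socket.c:959) Error while connecting socket to 172.17.0.7:3002",
    "Jun 04 2019 21:15:42 GMT: WARNING (hb): (hb.c:4882) could not create heartbeat connection to node {172.17.0.7:3002}",
    "Jun 04 2019 21:15:42 GMT: WARNING (socket): (socket.c:891) Timeout while connecting",
]
peer_counts = failures_by_peer(sample)
```

If the failures cluster around a single peer, that node or its network path is the natural place to start; failures spread across all peers point more toward the local node or a shared network segment.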

The redundant connection message in and of itself indicates that a connectivity issue recovered while Aerospike was creating new connections, leaving Aerospike holding more than one working connection to a destination node. This is most often caused by intermittent connectivity issues, such as delayed packets.
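
Intermittent connectivity of this kind can sometimes be made visible by repeatedly opening a TCP connection to the peer and recording how long each attempt takes. The sketch below is an illustration only: `probe` and `failure_ratio` are hypothetical helpers, the host and port are supplied by the caller, and probing a live heartbeat port may itself add noise to the peer's logs, so a separate test listener on the peer may be preferable.

```python
import socket
import time


def probe(host, port, attempts=5, timeout=1.0, pause=0.2):
    """Attempt a TCP connect `attempts` times.

    Returns a list of (ok, seconds) tuples, one per attempt, where `ok`
    is False if the connect failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # The connection is closed again as soon as it is established.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:  # includes timeouts and refused connections
            ok = False
        results.append((ok, time.monotonic() - start))
        time.sleep(pause)
    return results


def failure_ratio(results):
    """Fraction of attempts that failed (0.0 for an empty result list)."""
    if not results:
        return 0.0
    return sum(1 for ok, _ in results if not ok) / len(results)
```

For example, `failure_ratio(probe("172.17.0.7", 3002, attempts=50))` would give a rough failure rate for the peer from the log excerpt. Widely varying connect times, even without outright failures, are consistent with delayed packets.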

Some things to be initially investigated include:

  • Packet loss, dropped packets and overrun packet statistics on Linux, most commonly using ifconfig or ip -s link ls.
  • Packet loss and sustained connectivity tests, most commonly using iperf.
  • Any evidence of packet loss or other issues on routers and switches between the nodes.
  • Kernel-level messages in dmesg that may relate to the issue.
  • Available bandwidth between the nodes.
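
On Linux, the error, drop and overrun (FIFO) counters that ifconfig and ip report are also exposed in /proc/net/dev, so the first check in the list above can be scripted. The following is a minimal sketch assuming the standard 16-column /proc/net/dev layout; the function names are hypothetical and the sample text is made up for illustration.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: {counter: value}}.

    Only the error, drop and FIFO (overrun) counters are kept.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, data = line.rpartition(":")
        if not sep:
            continue  # the two header lines contain no colon
        fields = data.split()
        if len(fields) < 16 or not all(f.isdigit() for f in fields[:16]):
            continue  # not the expected layout; skip rather than guess
        values = [int(f) for f in fields[:16]]
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_errs": values[2], "rx_drop": values[3], "rx_fifo": values[4],
            "tx_errs": values[10], "tx_drop": values[11], "tx_fifo": values[12],
        }
    return stats


def interfaces_with_problems(stats):
    """Return the sorted names of interfaces with any non-zero counter."""
    return sorted(name for name, c in stats.items() if any(c.values()))


# Made-up sample in the /proc/net/dev layout.
sample = "\n".join([
    "Inter-|   Receive                                                |  Transmit",
    " face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed",
    "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0",
    "  eth0: 5000 50 1 7 2 0 0 0 4000 40 0 3 0 0 0 0",
])
stats = parse_net_dev(sample)
problems = interfaces_with_problems(stats)
```

On a real node the input would come from reading /proc/net/dev. The counters are cumulative since boot, so taking two readings some time apart and comparing them shows whether drops are still occurring.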

Note that a connectivity issue may not necessarily indicate network problems. Whilst this is the most common root cause, the issue could be within just one node.

In a scenario where only a single node has connectivity problems, that node could have a range of problems resulting in missed heartbeats. These include, but are not limited to:

  • hardware or driver faults, most commonly found by checking dmesg
  • overloaded hard disks resulting in CPU interrupts hanging on requests and affecting the network, most commonly checked using iostat
  • misconfigured kernel parameters or a misconfigured Aerospike node

Keywords

FOUND REDUNDANT CONNECTIONS TO SAME NODE HB HEARTBEAT

Timestamp

March 2020
