Cluster Visibility False



Summary:

Asmonitor, asadm, and/or AMC report “cluster visibility” as false on one or more nodes. This occurs when the list of nodes seen by the tools contains addresses that are not in the list returned by asinfo -v services on that particular node, or when the services lists do not match across the nodes in the cluster. This does not necessarily indicate that the cluster is not fully operational, but it is recommended to find the cause and address it. Starting with the version of the tools shipped with Aerospike 3.7.5, asadm (0.0.17) shows only “cluster integrity” instead of “cluster visibility”. However, if there are cluster visibility issues (a mismatch of services lists across the nodes in the cluster), asadm prints a warning when launched.

Details:

The cause may be a minor caching issue in the monitoring tool. Try exiting the tool and relaunching it to check whether cluster visibility is consistently false.

The following two commands should help identify why cluster visibility is false:

asadm -e "asinfo -v service"

This command returns the broadcast service address(es) for each node in the cluster.

asadm -e "asinfo -v services"

This command returns the list of neighbor addresses that each node is reporting.

Let’s now look at the different causes of cluster visibility being false.

Some nodes may be reporting more than one access-address.

If any node reports multiple service addresses, cluster visibility will be false because the cluster visibility indicator in the tools does not support this configuration. This should also cause all nodes shown by the tool to report false. Many clients also do not support this configuration, so it may indicate a misconfigured server, in which case you will need to configure access-address in the network.service context of /etc/aerospike/aerospike.conf.
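As a rough aid for spotting this case, the sketch below flags nodes whose service value contains more than one address. It assumes you have already collected the output of asinfo -v service for each node into a dictionary, and that the value is a semicolon-separated list of host:port entries; the function names and sample addresses are hypothetical.

```python
def parse_addresses(info_value):
    """Split a semicolon-separated 'host:port' list into a list of addresses."""
    # Assumption: entries are separated by ';' and may carry stray whitespace.
    return [addr.strip() for addr in info_value.split(";") if addr.strip()]


def nodes_with_multiple_service_addresses(service_by_node):
    """Return {node: [addresses]} for nodes advertising more than one address."""
    multi = {}
    for node, value in service_by_node.items():
        addresses = parse_addresses(value)
        if len(addresses) > 1:
            multi[node] = addresses
    return multi


if __name__ == "__main__":
    # Hypothetical sample data: the second node advertises two addresses.
    sample = {
        "10.0.0.1:3000": "10.0.0.1:3000",
        "10.0.0.2:3000": "10.0.0.2:3000;192.168.5.2:3000",
    }
    for node, addresses in nodes_with_multiple_service_addresses(sample).items():
        print(f"{node} advertises {len(addresses)} addresses: {', '.join(addresses)}")
```

Any node listed by this check is a candidate for an explicit access-address setting.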

Subset or extra IPs in the services list

One or more nodes in the cluster may be advertising only a subset of their peers’ access-addresses, or may be advertising the access-address of a node that has departed from the cluster. Currently there is no easy way to verify that this is the case, or to tell which nodes are missing or present when they should not be. However, nodes leaving the cluster or a node having recently joined could cause such issues.

You can run the following command, compare the list of IPs reported by each node, and check whether the number of entries differs between nodes:

asadm -e "asinfo -v services"
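Comparing the lists by eye becomes tedious on larger clusters, so here is a minimal sketch of the comparison. It assumes each node advertises a single service address, that the dictionary keys are those addresses, and that each value is the semicolon-separated host:port list returned by asinfo -v services on that node; the function names and sample addresses are hypothetical.

```python
def parse_addresses(info_value):
    """Split a semicolon-separated 'host:port' list into a set of addresses."""
    return {addr.strip() for addr in info_value.split(";") if addr.strip()}


def compare_services(services_by_node):
    """Report peers that each node is missing or lists unexpectedly.

    services_by_node maps a node's own service address to the raw
    `services` value collected from that node.
    """
    all_nodes = set(services_by_node)
    report = {}
    for node, value in services_by_node.items():
        reported = parse_addresses(value)
        expected = all_nodes - {node}   # every other node in the cluster
        missing = expected - reported   # peers this node does not list
        extra = reported - expected     # addresses not known as cluster nodes
        if missing or extra:
            report[node] = {"missing": sorted(missing), "extra": sorted(extra)}
    return report


if __name__ == "__main__":
    # Hypothetical sample: node 1 still lists a departed node, node 3 misses node 2.
    sample = {
        "10.0.0.1:3000": "10.0.0.2:3000;10.0.0.3:3000;10.0.0.9:3000",
        "10.0.0.2:3000": "10.0.0.1:3000;10.0.0.3:3000",
        "10.0.0.3:3000": "10.0.0.1:3000",
    }
    for node, diff in compare_services(sample).items():
        print(node, "missing:", diff["missing"], "extra:", diff["extra"])
```

Nodes that do not appear in the report list exactly the peers expected of them.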

Resolution

For release 3.7 or later, set paxos-recovery-policy to auto-reset-master if its value is currently “manual”.

Among the lists returned, a few may be shorter (or longer) than the rest, which lets you identify the node causing the problem. In general, if most nodes report all of their peers and the ones reporting false are missing only a small fraction of the peers, no action is required: the clients are able to work around this issue, and it can be treated as a false negative. If a large number of peers are missing from the services list, you can either dun/undun the missing node on the nodes reporting false (option 1) or dun/undun the node reporting incorrect values across the cluster (option 2).

Note: Doing a dun/undun on the cluster will trigger migrations as the cluster rebalances.

# Option 1:
asadm -e "cluster dun [Missing IP] with [IP returning false visibility]"
   
# Option 2:
asadm -e "cluster dun [IP returning false visibility]"

# Then run the following command after waiting for a few seconds: 
asadm -e "cluster undun [Missing IP / IP returning false visibility]"

Other related links



3.9.1+ deprecates ‘dun’. See FAQ - How can a node be removed from a cluster in Aerospike 3.9.1 and higher?