Cluster visibility issues when one node goes down


#1

We are experiencing an issue where the entire cluster suffers serious cluster visibility issues when one of our nodes goes down. asinfo service reports that all nodes experience cluster integrity issues and half of the nodes experience cluster visibility issues (AS: 3.5.14, 10+ nodes in the cluster).

We had a similar situation a few weeks ago, but luckily we managed to get the missing host up and running quite quickly, and (as far as I remember) putting it back resolved the issue.

Is there any way to bring the cluster back into shape in such a situation? A rolling restart is probably one option, but it is quite painful. Maybe there is another way?


#2

Cluster visibility issues are normally not serious. Basically, it means that some nodes are not advertising all of their neighbors. Events affecting cluster membership can have transient effects on the cluster visibility reported by our tools. I wouldn’t be too alarmed by this unless it persists for minutes after the cluster disruption (even then it is likely still benign).

If it does persist, you can run:

asadm -e "asinfo -v 'services'"

Starting from any node, the client needs to be able to discover all of the nodes in the cluster. So the client asks a node for its services and then continues to ask those nodes for their services until the set of discovered services ceases to grow.
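
To illustrate the discovery loop described above, here is a minimal Python sketch. It is not the actual client code: discover_cluster, get_services, and the sample addresses are hypothetical names made up for this example, and get_services stands in for whatever sends the 'services' info request to a node and returns the peers it advertises.

```python
from collections import deque


def discover_cluster(seed, get_services):
    """Crawl from a seed node until the set of known nodes stops growing.

    get_services is a hypothetical callable: given a node address, it
    returns the list of peer addresses that node advertises.
    """
    discovered = {seed}
    pending = deque([seed])
    while pending:
        node = pending.popleft()
        for peer in get_services(node):
            if peer not in discovered:
                # A new node was found, so it must be asked for its peers too.
                discovered.add(peer)
                pending.append(peer)
    # The queue is empty: no node advertised anything new.
    return discovered


# Made-up adjacency data standing in for real 'services' responses.
# Note that 10.0.0.1 does not advertise 10.0.0.3.
fake_cluster = {
    "10.0.0.1:3000": ["10.0.0.2:3000"],
    "10.0.0.2:3000": ["10.0.0.1:3000", "10.0.0.3:3000"],
    "10.0.0.3:3000": ["10.0.0.2:3000"],
}

nodes = discover_cluster("10.0.0.1:3000", lambda n: fake_cluster.get(n, []))
print(sorted(nodes))
```

In this made-up data the first node does not advertise the third one, yet the crawl still reaches it through the second node. That is why a node failing to advertise every neighbor is usually harmless, as long as each node remains reachable through some chain of advertised peers.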