Cluster Visibility Error using loopback address


#1

We have Aerospike running on 14 nodes, and they have been running for almost a year now with no issues. One node is now having a very strange problem and I can’t seem to find the solution. When you try to use the asadm tool, you get this:

Found 14 nodes
Online:  10.100.0.138:3000,**127.0.0.1:3000**, 10.100.0.149:3000, 10.100.0.154:3000, 10.100.0.143:3000, 10.100.0.153:3000, 10.100.0.150:3000, 10.100.0.152:3000, 10.100.0.151:3000, 10.100.0.155:3000, 10.100.0.140:3000, 10.100.0.137:3000, 10.100.0.139:3000, 10.100.0.142:3000
Cluster Visibility error (Please check services list): 10.100.0.138:3000, 10.100.0.149:3000, 10.100.0.154:3000, 10.100.0.143:3000, 10.100.0.153:3000, 10.100.0.150:3000, 10.100.0.152:3000, 10.100.0.151:3000, 10.100.0.155:3000, 10.100.0.140:3000, 10.100.0.137:3000, 10.100.0.139:3000, 10.100.0.142:3000

Notice the 127.0.0.1 that is bold? I have no idea why that is the case! If you run it again it might get the correct 10.100.0.x address or it might use the 127.0.0.1.

The other odd thing is that there is a 127.0.0.1 to node mapping that is incorrect as well.

Also, this is what I see on the server that is having an issue. There should only be 14 nodes, not sure why the local ip is bound to the node.

< ~IP to NODE-ID Mapping~
               IP           NODE-ID
10.100.0.136:3000   BB9ACD16E7AC40C
10.100.0.137:3000   BB918D66E7AC40C
10.100.0.138:3000   BB908D76E7AC40C
10.100.0.139:3000   BB9DED56E7AC40C
10.100.0.140:3000   BB9E2D56E7AC40C
10.100.0.142:3000   BB9E2D26E7AC40C
10.100.0.143:3000   BB980D76E7AC40C
10.100.0.149:3000   BB912BFDE7AC40C
10.100.0.150:3000   BB906BFDE7AC40C
10.100.0.151:3000   BB9DA1BDF7AC40C
10.100.0.152:3000   BB9F01CDF7AC40C
10.100.0.153:3000   BB91C1BDF7AC40C
10.100.0.154:3000   BB96E95987AC40C
10.100.0.155:3000   BB9EE0E9C7AC40C
127.0.0.1:3000      BB9ACD16E7AC40C
Number of rows: 15
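
A quick way to confirm that the extra row is a duplicate of an existing node rather than a 15th node is to group the addresses by node ID. Below is a minimal Python sketch that uses a few rows copied from the table above; the helper name `duplicate_node_ids` is made up for illustration and is not part of asadm.

```python
from collections import defaultdict

# A few rows copied from the mapping table above (not the full list).
MAPPING = """\
10.100.0.136:3000   BB9ACD16E7AC40C
10.100.0.137:3000   BB918D66E7AC40C
127.0.0.1:3000      BB9ACD16E7AC40C
"""

def duplicate_node_ids(text):
    """Return {node_id: [addresses]} for node IDs listed under more than one address."""
    by_id = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        # Keep only "host:port  NODE-ID" rows; skip headers, blanks, and the row count.
        if len(parts) != 2 or ":" not in parts[0]:
            continue
        address, node_id = parts
        by_id[node_id].append(address)
    return {nid: addrs for nid, addrs in by_id.items() if len(addrs) > 1}

print(duplicate_node_ids(MAPPING))
# {'BB9ACD16E7AC40C': ['10.100.0.136:3000', '127.0.0.1:3000']}
```

Here the loopback row shares its node ID with 10.100.0.136, so it is the same node listed under two addresses.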

Any help with this? I have added the access-address to the Aerospike configuration file, but nothing has worked.

All help is appreciated as this is causing some issues in our production environment.

Thanks!


#2

What kind of issues are you seeing? Can you post the network configuration section of your aerospike.conf?


#3

Here is the network config on the server that is having issues. It’s been suggested to try to remove the node from the cluster and re-add. Going to try that first and then see if that solves the issue.

    network {
        service {
            address 10.100.0.136
            access-address 10.100.0.136
            port 3000
        }

        heartbeat {
            mode multicast
            multicast-group 239.1.99.222
            port 9918

            # To use unicast-mesh heartbeats, remove the 3 lines above, and see
            # aerospike_mesh.conf for alternative.

            interval 150
            timeout 10
        }

        fabric {
            port 3001
        }

        info {
            port 3003
        }
    }

#4

Either this is a bug in asadm or one of the nodes is advertising 127.0.0.1 as an access address - it could be 10.100.0.136 or a different node and you happen to be running asadm from 10.100.0.136.

What version of asadm are you running?

asadm --version

What version are the servers?

asadm -h 10.100.0.136 -v "asinfo -v build"

What are the access-addresses being advertised by the servers?

asadm -h 10.100.0.136 -v "asinfo -v 'services'"
asadm -h 10.100.0.136 -v "asinfo -v 'peers-clear-std'"

#5

asadm version

asadm --version
0.1.5

build info

asinfo -v build
3.10.0-.3

access address info

asinfo -v "services"
10.100.0.155:3000;10.100.0.152:3000;10.100.0.140:3000;10.100.0.154:3000;10.100.0.139:3000;10.100.0.142:3000;10.100.0.137:3000;10.100.0.151:3000;10.100.0.138:3000;10.100.0.153:3000;10.100.0.143:3000;10.100.0.149:3000;10.100.0.150:3000

asinfo -v "peers-clear-std"
27,3000,[[BB9EE0E9C7AC40C,,[10.100.0.155]],[BB9F01CDF7AC40C,,[10.100.0.152]],[BB9E2D56E7AC40C,,[10.100.0.140]],[BB96E95987AC40C,,[10.100.0.154]],[BB9DED56E7AC40C,,[10.100.0.139]],[BB9E2D26E7AC40C,,[10.100.0.142]],[BB918D66E7AC40C,,[10.100.0.137]],[BB9DA1BDF7AC40C,,[10.100.0.151]],[BB908D76E7AC40C,,[10.100.0.138]],[BB91C1BDF7AC40C,,[10.100.0.153]],[BB980D76E7AC40C,,[10.100.0.143]],[BB912BFDE7AC40C,,[10.100.0.149]],[BB906BFDE7AC40C,,[10.100.0.150]]]
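
Rather than checking a long peers list by eye, it can be scanned for loopback addresses. Below is a rough Python sketch. It assumes the `peers-clear-std` response has the shape shown above (`generation,port,[[NODE-ID,,[addresses]],...]`) and that it holds only IPv4 addresses; the function name `loopback_peers` is made up for illustration.

```python
import ipaddress
import re

# Shortened sample in the same shape as the peers-clear-std output above.
SAMPLE = "27,3000,[[BB9EE0E9C7AC40C,,[10.100.0.155]],[BB9F01CDF7AC40C,,[10.100.0.152]]]"

# One peer entry: [NODE-ID,<optional name>,[addr1,addr2,...]]
PEER_ENTRY = re.compile(r"\[([0-9A-F]+),[^,\[]*,\[([^\]]*)\]\]")

def loopback_peers(response):
    """Return (node_id, address) pairs whose advertised address is a loopback address."""
    found = []
    for node_id, addr_list in PEER_ENTRY.findall(response):
        for addr in filter(None, addr_list.split(",")):
            host = addr.split(":")[0]  # drop an optional :port suffix (IPv4 only)
            if ipaddress.ip_address(host).is_loopback:
                found.append((node_id, addr))
    return found

print(loopback_peers(SAMPLE))  # [] means no peer in the sample advertises loopback
```

An empty result for every node's response would support the idea that the 127.0.0.1 entry comes from the tool rather than from the servers.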

#6

The weird thing is that all the other nodes work fine. There is no reference to a 127.0.0.1 address anywhere. The even stranger thing is that they were all installed at roughly the same time using the same base config.


#7

None of the nodes are advertising 127.0.0.1 (local loopback address) so this must be a bug in asadm.

Basically, the host asadm uses defaults to 127.0.0.1 if none is provided, and in this case it remains in the output as one of the hosts.

If you were to specify the host’s access address instead of using the default, this issue should go away:

asadm -h 10.100.0.136

#8

Well here is a weird one… If I try that I get this, but not every time, sometimes it connects as I would expect. I’m definitely puzzled on this one.

asadm -h 10.100.0.136
Aerospike Interactive Shell, version 0.1.5
Found 1 nodes
Offline: 10.100.0.136:3000

Not able to connect any cluster.

Config files location: /root/.aerospike/

The config files location is incorrect also. I’ve looked at the running process and it’s running the correct config.

Aerospi+ ...  Ssl  19:53  96:28 /usr/bin/asd --config-file /etc/aerospike/aerospike.conf

#9

It is telling you where the config files for asadm are located.

The connection attempt is probably timing out, or you may have exceeded ulimit -n or proto-fd-max connections to this server.


#10

The default value for the seed IP is 127.0.0.1. asadm uses the input seed node IP to get the access-address of every Aerospike node, and then uses those access-addresses to connect to the actual nodes. For the seed node it uses the access-address plus the input seed address: it tries the access-address first and falls back to the seed address only if that fails. In this case asadm has two IPs for the seed node (10.100.0.136): [10.100.0.136, 127.0.0.1], and it tries the second only when the first fails to connect. So it seems that asadm is sometimes unable to connect to 10.100.0.136, which is what you experienced when you passed this IP as a parameter to asadm.
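
The fallback order described above can be sketched in a few lines of Python. This is only an illustration of "try the access-address first, then the seed address", not asadm's actual code. The function name `connect_first_reachable` and the stub `flaky_connect` are made up for the example, and the stub simulates a failure so that nothing touches the network.

```python
import socket

def connect_first_reachable(candidates, port=3000, timeout=1.0,
                            connect=socket.create_connection):
    """Try each address in order; return (address, connection) for the first success."""
    for address in candidates:
        try:
            return address, connect((address, port), timeout=timeout)
        except OSError:
            continue  # this address failed, so fall through to the next candidate
    return None, None

def flaky_connect(addr, timeout):
    """Stub standing in for a real connection: the access-address always fails."""
    if addr[0] == "10.100.0.136":
        raise ConnectionRefusedError("simulated failure")
    return "fake-socket"

# Access-address first, default seed address second.
print(connect_first_reachable(["10.100.0.136", "127.0.0.1"], connect=flaky_connect))
# ('127.0.0.1', 'fake-socket')
```

When the first address fails, the loopback address is the one that ends up connected, which matches 127.0.0.1 showing up in the node list only some of the time.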

As Kavin suggested, this might be an issue of proto-fd-max or ulimit -n being exceeded. The client_connections stat gives the number of open connections.

asadm -e "show statistics service like client_connection"

If the proto-fd-max limit is exceeded, the Aerospike log shows a warning about dropping incoming connections.
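
For the ulimit -n side, the limits of the running asd process can be compared with its open descriptor count on Linux. Below is a minimal Python sketch, assuming a Linux host with /proc and enough privileges to read the asd process entries. The helper names are made up for illustration, and the sample text only mimics the layout of /proc/<pid>/limits.

```python
import os

# Sample text mimicking two lines of /proc/<pid>/limits (values are illustrative).
SAMPLE_LIMITS = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)

def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' line; None stands for 'unlimited'."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            to_int = lambda v: None if v == "unlimited" else int(v)
            return to_int(soft), to_int(hard)
    return None

def open_fd_count(pid):
    """Count the open file descriptors of a process (Linux only, needs permission)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

print(parse_max_open_files(SAMPLE_LIMITS))  # (1024, 4096)
```

On a live server, the text would come from reading /proc/<pid>/limits for the asd PID, and `open_fd_count` with that same PID gives the number to compare against the soft limit.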


#11

First of all, thanks to Hemant_Patre and kporter for all the help!! :beers: After looking into this further, I found that for some reason we were running out of ports on the server, which was causing the 127.0.0.1 issue because asadm could no longer connect to the 10.x.x.x address. I have increased the ports and the issue has subsided.

Thanks again for all the help, I was tearing my hair out on this one.
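
For anyone landing here with the same symptom: the post above does not say which port setting was changed. One common candidate on Linux is the ephemeral port range, so the sketch below assumes that is what "ports" refers to. It computes how many ephemeral ports a range provides, given text in the format of /proc/sys/net/ipv4/ip_local_port_range; the function name is made up for illustration.

```python
def ephemeral_port_count(range_text):
    """Return the number of ports in a 'low high' range string (bounds inclusive)."""
    low, high = (int(value) for value in range_text.split())
    return high - low + 1

# "32768 60999" is a commonly seen Linux default, used here only as sample input.
print(ephemeral_port_count("32768\t60999"))  # 28232
```

On a live host the input would be the contents of /proc/sys/net/ipv4/ip_local_port_range.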

