We are currently running a 3-node Aerospike cluster (v3.8.4). We hit the cluster with a fairly high volume of traffic from around 30 Node.js (v5.5) instances running v1.0 of the Aerospike Node.js client. I recently upgraded the client to v3.5, running on Node.js 6.9.2. When I deploy, the Node.js instances all seem happy initially, but then start continuously throwing the error below and become unresponsive.
WARN (1) [as_peers.c:186] [as_peers_validate_node] - Failed to connect to peer 192.168.1.1 3000. AEROSPIKE_ERR_CONNECTION Socket write error: 111, 192.168.1.1:3000
This is strange, as none of the Aerospike nodes in the cluster has an IP address of 192.168.1.1.
Given that the system runs with high availability 24/7 on the v1.0 client, I’m struggling to understand why we’re having these problems with the v3.5 client. Just to confirm (as I’ve seen a previous post on this), I connect to the cluster on startup and close the connection when the Node.js process shuts down.
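In code, the lifecycle looks roughly like this (a minimal sketch; the seed host address is a placeholder, not our real cluster address):

```javascript
// Minimal sketch of our client lifecycle (seed host is a placeholder).
const Aerospike = require('aerospike')

let client = null

// Connect once at startup and reuse the client for all requests.
Aerospike.connect({ hosts: '10.0.0.1:3000' })
  .then(c => { client = c })
  .catch(err => {
    console.error('Connect failed:', err)
    process.exit(1)
  })

// Close the connection when the process shuts down.
process.on('SIGTERM', () => {
  if (client) client.close()
  process.exit(0)
})
```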
I’ve tried to reproduce this issue in a more controlled development environment, but I don’t think I’ve managed to emulate the high volume of traffic hitting the cluster.
Did you make any changes on the server side as well (e.g. an upgrade to a newer version, config changes), or did you only upgrade the client?
If you have the asinfo tool installed, can you run asinfo -h <ip> -p <port> -v services when this issue occurs? Does the 192.168.1.1 address appear in the results? (*)
Are the cluster nodes in the 192.168.1.0/24 subnet, or do they use an entirely different IP range?
Cheers,
Jan
(*) If you do not have asinfo installed, you can also use the client to fetch the same info: AEROSPIKE_HOSTS=<ip:port> node -e 'require("aerospike").connect().then(client => client.infoAll("services").then(console.info).then(() => client.close()))'
@rdaero, you mentioned that all the client nodes are “initially happy” and only throw the connection error after some time. If you run the same info query before the clients start showing this issue, do you get the same response?
The 192.168.1.1 address is advertised by your cluster nodes as one of their service addresses. Do your servers have a second network interface that is bound to this IP address? If so, you should configure the access-address of your servers to advertise only the 10.3.4.x address; see the General Network Configuration docs for details.
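The relevant setting lives in the service sub-stanza of the network stanza in aerospike.conf. A sketch (the 10.3.4.10 address is a placeholder for your node’s actual client-facing address):

```
network {
  service {
    address any
    port 3000
    # Advertise only the client-facing address to clients,
    # so the second (192.168.1.x) interface is not announced.
    access-address 10.3.4.10
  }
}
```

Each node advertises its own access-address, so this needs to be set per node. After changing it, restart the node and re-run the services info query to confirm the 192.168.1.x address no longer appears.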
What’s strange, though, is that the 1.0 client doesn’t seem to be affected; it’s possible that it simply ignores this error. Let’s see whether setting access-address fixes the issue for the 3.5 client. Otherwise we’ll have to dig deeper and see whether there is some other, possibly unrelated, issue.
Yes indeed. After a bit of investigation (the nodes are Docker instances), I can see that they do have a second IP address in the 192.168.1.x subnet. I’ll make the config change over the next few days and let you know if that fixes it.