Aql client overwhelmed by "WARN AEROSPIKE_ERR_TIMEOUT"

billbargens · January 23, 2017, 6:20am

I have an Aerospike cluster of two nodes, labeled nodeA and nodeB. The cluster has been running normally for a long time.

But recently, when I connected to the server node using aql, the client was overwhelmed by warning messages: “WARN AEROSPIKE_ERR_TIMEOUT”.

I ran some commands to check the issue, and the results are as follows:

asinfo -v service -h nodeA -p 6000
nodeA-ip1:6000;nodeA-ip2:6000

asinfo -v service -h nodeB -p 6000
nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000

“172.17.42.1:6000” is an unknown IP that does not belong to nodeB.

asadm -e "asinfo -v services" -p 6000

nodeA returned: nodeB-ip1;nodeB-ip2;172.17.42.1:6000

nodeB returned: nodeA-ip1;nodeA-ip2

172 (172.17.42.1) returned: Invalid command or Could not connect to node 172.17.42.1

So how can I get rid of the warning messages?

billbargens · January 25, 2017, 8:23am

Any help will be appreciated!

billbargens · February 6, 2017, 3:38am

@rbotzer Can you give me some advice?

billbargens · February 6, 2017, 3:43am

@system Can anyone give me some advice?

kporter · February 6, 2017, 3:57am

If that isn’t a recognized address, could it be a rogue node?

pgupta · February 6, 2017, 5:46am

Can you share the exact config files of both nodes? You can mask the IP addresses with “nodeA”, “nodeB”, etc. if you want to.

Or you may try to run on nodeA:

asinfo -v "services-alumni-reset"

and see if that clears it up.

billbargens · February 6, 2017, 6:38am

My server version is 3.5.8

The command failed with an error:

asinfo -v "services-alumni-reset" -h nodeA
request to nodeA returned error

billbargens · February 6, 2017, 6:47am

The config files are as follows:

service {
  user root
  group root
  paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
  pidfile /disk1/jijw/aerospike/item/var/run/aerospike.pid
  transaction-queues 32
  transaction-threads-per-queue 32
  service-threads 32
  proto-fd-max 15000
  work-directory /disk1/jijw/aerospike/item/var
}

logging {
  # Log file must be an absolute path.
  file /disk1/jijw/aerospike/item/logs/aerospike.log {
    context any info
  }
}

mod-lua {
  system-path /disk1/jijw/aerospike/item/share/udf/lua
  user-path /disk1/jijw/aerospike/item/var/udf/lua
}

network {
  service {
    address any
    port 6000
    reuse-address
  }

  heartbeat {
    mode multicast
    address ******(masked)
    port 9921
    interval 150
    timeout 10
  }

  fabric {
    port 6001
  }

  info {
    port 6003
  }
}



namespace item {
  single-bin false
  replication-factor 2
  memory-size 20G
  default-ttl 0 # 30 days, use 0 to never expire/evict.
  high-water-memory-pct 85
  high-water-disk-pct 85
  stop-writes-pct 90
  write-commit-level-override all
  storage-engine device {
    device /dev/sdc1    # raw device.# device /dev/<device>  # (optional) another raw device.
    write-block-size 1M
    data-in-memory false
    cold-start-empty true
  }
}

pgupta · February 6, 2017, 7:26am

Since your server is listening on port 6000 instead of the default 3000, add -p 6000 to your asinfo command.

Unrelated: cold-start-empty true leaves you vulnerable to losing all your data should the entire cluster restart after a cluster-wide fault. I hope you understand the implications of having that in your config file. What you are saying is: always ignore the data on the persistent storage medium when booting this node up. This is generally not recommended.

billbargens · February 6, 2017, 7:44am

Still the same:

asinfo -v "services-alumni-reset" -h nodeA -p 6000
request to nodeA : 6000 returned error

My server version is 3.5.8. Does it support this command?

Also thanks for your advice about the cold-start-empty configuration!

pgupta · February 6, 2017, 7:55am

As of version 3.9.1, dun was deprecated and services-alumni-reset was introduced, so no, it will not work on 3.5.8. Use dun instead:

asinfo -v 'dun:nodes=BB936F106CA0568'

where BB… is the node ID that you want to remove.

What does asadm>info show? Do you see the rogue node ID?

billbargens · February 6, 2017, 8:49am

The “asadm>info” result is as follows:

kporter · February 6, 2017, 7:21pm

In your configuration, under network.service, set access-address to the appropriate client-reachable address. This configuration is static, so you will need to restart each node after changing it.
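As a minimal sketch of that change, assuming the 3.5.x config syntax used in this thread: this is the network.service block from the config posted above with one added line. `<client-reachable-ip>` is a placeholder, not a value from the thread, and each node would get its own address there.

```
network {
  service {
    address any
    port 6000
    # Assumption: replace the placeholder with this node's client-reachable IP.
    access-address <client-reachable-ip>
    reuse-address
  }

  # heartbeat, fabric and info blocks stay as they are.
}
```

With access-address set, the node should advertise only that address rather than every interface address it finds, which is what would drop an entry like 172.17.42.1:6000 from the service list.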

billbargens · February 7, 2017, 3:22am

But requests from the Java client work normally!

billbargens · February 7, 2017, 3:29am

The aerospike cluster seems to be running normally except for the endless aql exception!

kporter · February 7, 2017, 3:47am

Run:

asadm -e "asinfo -v service" -p 6000

This will show the services each node is advertising.

pgupta · February 7, 2017, 3:50am

I think your cluster keeps looking for this non-existent node at 172.17.42.1:6000. Try:

asinfo -v 'tip-clear:host-port-list=172.17.42.1:6000' -h nodeA -p 6000

Then see if asadm>info shows only the two good nodes. Also, it is a good idea to do an asbackup of your data before trying anything exotic!

Once you have a backup, you can try:

asinfo -v 'dun:nodes=0' -h nodeA -p 6000

because the node ID seems to be 0 for this non-existent node.

billbargens · February 7, 2017, 5:46am

I ran some commands to check the issue, and the results are as follows:

asinfo -v service -h nodeA -p 6000
nodeA-ip1:6000;nodeA-ip2:6000

asinfo -v service -h nodeB -p 6000
nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000

“172.17.42.1:6000” is an unknown IP that does not belong to nodeB.

asadm -e "asinfo -v services" -p 6000

nodeA returned: nodeB-ip1;nodeB-ip2;172.17.42.1:6000

nodeB returned: nodeA-ip1;nodeA-ip2

172 (172.17.42.1) returned: Invalid command or Could not connect to node 172.17.42.1

Does this mean the addresses that nodeB advertised to the cluster were nodeB-ip1, nodeB-ip2 and 172.17.42.1:6000? But “172.17.42.1:6000” is an unknown IP that does not belong to nodeB.

billbargens · February 7, 2017, 6:09am

I tried your suggestions, but they did not work.

According to the official doc (Info Command Reference | Aerospike Documentation), the command “asinfo -v service -h nodeB -p 6000” returns the list of IPs that nodeB advertises to other cluster nodes.

asinfo -v service -h nodeB -p 6000
nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000

It seems that nodeB advertised an IP, 172.17.42.1, that does not belong to it. So what we need to do is get rid of that IP. Is that right?
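One way to make that check repeatable is to compare what a node advertises against the addresses you know it owns. A minimal shell sketch follows; the function name unexpected_endpoints and the expected list are hypothetical, and nodeB-ip1/nodeB-ip2 are the masked placeholders from this thread, not real addresses.

```shell
#!/bin/sh
# unexpected_endpoints ADVERTISED EXPECTED
#   ADVERTISED: semicolon-separated host:port list, as printed by `asinfo -v service`.
#   EXPECTED:   semicolon-separated list of endpoints that really belong to the node.
# Prints every advertised endpoint that is missing from EXPECTED, one per line.
unexpected_endpoints() {
  advertised="$1"
  expected="$2"
  printf '%s\n' "$advertised" | tr ';' '\n' | while IFS= read -r endpoint; do
    [ -n "$endpoint" ] || continue
    case ";$expected;" in
      *";$endpoint;"*) ;;                  # known endpoint, nothing to report
      *) printf '%s\n' "$endpoint" ;;      # advertised but not expected
    esac
  done
}

# Example using the masked values posted in this thread.
unexpected_endpoints \
  "nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000" \
  "nodeB-ip1:6000;nodeB-ip2:6000"
```

Against a live node, the first argument could come from "$(asinfo -v service -h nodeB -p 6000)". Anything the function prints is an address the node advertises that you did not expect.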

pgupta · February 7, 2017, 6:12am

Is some other non-Aerospike process also listening on port 6000 on nodeB? On nodeB, can you try using netstat and see which processes are using port 6000?
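For the netstat check, something along these lines could narrow the output to listeners on one port. It is a sketch that assumes the Linux net-tools column layout of netstat -tlnp (local address in column 4, state in column 6, PID/program in column 7); the helper name listeners_on_port is made up for illustration.

```shell
#!/bin/sh
# listeners_on_port PORT
# Reads `netstat -tlnp` style output on stdin and prints the local address
# and PID/program name of every socket listening on PORT.
listeners_on_port() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      # The local address is ADDR:PORT; IPv6 addresses contain extra colons,
      # so take the last colon-separated piece as the port.
      n = split($4, parts, ":")
      if (parts[n] == port) print $4, $7
    }'
}

# Usage on nodeB (not executed here):
#   sudo netstat -tlnp | listeners_on_port 6000
```

Run netstat as root or with sudo; otherwise -p only shows process names for sockets owned by your own user.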
