Aql client overwhelmed by "WARN AEROSPIKE_ERR_TIMEOUT"


#1

I have an Aerospike cluster of two nodes, referred to here as nodeA and nodeB. The cluster has been running normally for a long time.

Recently, however, when I connected to a server node using aql, the client was flooded with the warning message “WARN AEROSPIKE_ERR_TIMEOUT”.

I ran some commands to investigate, and the results are as follows:

asinfo -v service -h nodeA -p 6000
nodeA-ip1:6000;nodeA-ip2:6000

asinfo -v service -h nodeB -p 6000
nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000

“172.17.42.1:6000” is an unknown address that does not belong to nodeB.

asadm -e "asinfo -v services" -p 6000

nodeA returned: nodeB-ip1;nodeB-ip2;172.17.42.1:6000

nodeB returned: nodeA-ip1;nodeA-ip2

172 (172.17.42.1) returned: Invalid command or Could not connect to node 172.17.42.1
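For what it's worth, the “Could not connect” result above can be reproduced outside asadm with a plain TCP probe. This is only a minimal Python sketch and assumes nothing about Aerospike itself: `can_connect` and `check_all` are names made up for this example, and the 1-second timeout is an arbitrary choice.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_all(addresses, timeout=1.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(host, port): can_connect(host, port, timeout)
            for host, port in addresses}
```

Calling `check_all([("nodeB-ip1", 6000), ("172.17.42.1", 6000)])` from the client machine would show which advertised addresses that client can actually open a socket to. A successful connect only shows that something is listening there, not that it is an Aerospike node.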

So how can I get rid of the warning messages?


#2

Any help will be appreciated!


#3

@rbotzer Can you give me some advice?


#4

@system Can anyone give me some advice?


#5

If that isn’t a recognized address could it be a rogue node?


#6

Can you share the exact config files of both nodes? You can mask the IP addresses with “nodeA”, “nodeB”, etc. if you want.

Alternatively, try running this on nodeA:

asinfo -v "services-alumni-reset"

and see if that clears it up.


#7

My server version is 3.5.8.

The command failed with an error:

asinfo -v "services-alumni-reset" -h nodeA
request to nodeA returned error


#9

The config files are as follows:

service {
  user root
  group root
  paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
  pidfile /disk1/jijw/aerospike/item/var/run/aerospike.pid
  transaction-queues 32
  transaction-threads-per-queue 32
  service-threads 32
  proto-fd-max 15000
  work-directory /disk1/jijw/aerospike/item/var
}

logging {
  # Log file must be an absolute path.
  file /disk1/jijw/aerospike/item/logs/aerospike.log {
    context any info
  }
}

mod-lua {
  system-path /disk1/jijw/aerospike/item/share/udf/lua
  user-path /disk1/jijw/aerospike/item/var/udf/lua
}

network {
  service {
    address any
    port 6000
    reuse-address
  }

  heartbeat {
    mode multicast
    address ******(masked)
    port 9921
    interval 150
    timeout 10
  }

  fabric {
    port 6001
  }

  info {
    port 6003
  }
}



namespace item {
  single-bin false
  replication-factor 2
  memory-size 20G
  default-ttl 0 # 0 means records never expire/evict.
  high-water-memory-pct 85
  high-water-disk-pct 85
  stop-writes-pct 90
  write-commit-level-override all
  storage-engine device {
    device /dev/sdc1    # raw device.
    # device /dev/<device>  # (optional) another raw device.
    write-block-size 1M
    data-in-memory false
    cold-start-empty true
  }
}

#10

Since your server is listening on port 6000 instead of the default 3000, add -p 6000 to your asinfo command.

Unrelated: cold-start-empty true leaves you vulnerable to losing all your data if the entire cluster restarts after a cluster-wide fault. I hope you understand the implications of having that in your config file. It tells the server to always ignore the data on the persistent storage device when this node boots. This is generally not recommended.
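To make the port point concrete, here is a small Python sketch that builds the asinfo command line with an explicit port. `asinfo_argv` is a hypothetical helper name; the `-v`, `-h` and `-p` flags and the default of 3000 are the ones already mentioned in this thread.

```python
import shlex


def asinfo_argv(command, host, port=3000):
    """Build an asinfo argument list; port defaults to 3000 as noted above."""
    return ["asinfo", "-v", command, "-h", host, "-p", str(port)]


# nodeA is the masked host name used in this thread.
argv = asinfo_argv("services-alumni-reset", "nodeA", port=6000)
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would execute it without any shell quoting issues, which also sidesteps the curly-quote problem that copy-pasted commands sometimes have.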


#11

Still the same:

asinfo -v "services-alumni-reset" -h nodeA -p 6000
request to nodeA : 6000 returned error

My server version is 3.5.8. Does it support this command?

Also thanks for your advice about the cold-start-empty configuration!


#12

dun was deprecated and services-alumni-reset was introduced in version 3.9.1, so yes, it will not work on 3.5.8. Try this instead:

asinfo -v 'dun:nodes=BB936F106CA0568'

where BB... is the node ID that you want to remove.

What does asadm>info show? Do you see the rogue node ID?
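The version cutoff described above can be written down as a tiny Python sketch. It takes the 3.9.1 boundary from this post at face value and does not claim the two commands are equivalent in effect; `parse_version` and `stale_node_command` are made-up names for illustration.

```python
def parse_version(text):
    """Turn a dotted version such as '3.5.8' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def stale_node_command(server_version, node_id):
    """Pick the info command this post suggests for the given server version."""
    if parse_version(server_version) >= (3, 9, 1):
        return "services-alumni-reset"
    # Older servers: fall back to dun with the node ID to remove.
    return "dun:nodes=" + node_id


print(stale_node_command("3.5.8", "BB936F106CA0568"))  # dun:nodes=BB936F106CA0568
```

Comparing tuples rather than strings matters here, since a string comparison would rank "3.10.0" below "3.9.1".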


#13

The “asadm>info” result is as follows:


#14

In your configuration, under network.service, set access-address to the appropriate client reachable address. This configuration is static so you will need to restart each node after configuring.
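One way to decide which address belongs in access-address is to filter the node's interface addresses by the network your clients sit on. A rough Python sketch of that filtering follows; the 10.0.x.x addresses and the /16 client network are invented for the example (the real nodeB addresses are masked in this thread), and `client_reachable` is a hypothetical helper.

```python
import ipaddress


def client_reachable(interface_ips, client_network):
    """Keep only the interface addresses that fall inside client_network."""
    net = ipaddress.ip_network(client_network)
    return [ip for ip in interface_ips if ipaddress.ip_address(ip) in net]


# Hypothetical interface list for nodeB; only 172.17.42.1 comes from the thread.
interfaces = ["10.0.0.12", "10.0.1.12", "172.17.42.1"]
print(client_reachable(interfaces, "10.0.0.0/16"))  # ['10.0.0.12', '10.0.1.12']
```

Any address that drops out of this list, like 172.17.42.1 here, is one that clients presumably cannot reach and that the node should not be advertising.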


#15

But requests from the Java client work normally!


#16

The Aerospike cluster seems to be running normally except for the endless aql warnings!


#17

Run:

asadm -e "asinfo -v service" -p 6000

This will show the services each node is advertising.


#18

I think your cluster keeps looking for this non-existent node at 172.17.42.1:6000. Try:

asinfo -v 'tip-clear:host-port-list=172.17.42.1:6000' -h nodeA -p 6000

Then see if asadm>info shows only the two good nodes. Also, it is a good idea to take an asbackup of your data before trying anything exotic!

Once you have a backup, you can try:

asinfo -v 'dun:nodes=0' -h nodeA -p 6000

because the node ID seems to be 0 for this non-existent node.


#19

I ran some commands to investigate, and the results are as follows:

asinfo -v service -h nodeA -p 6000
nodeA-ip1:6000;nodeA-ip2:6000

asinfo -v service -h nodeB -p 6000
nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000

“172.17.42.1:6000” is an unknown address that does not belong to nodeB.

asadm -e "asinfo -v services" -p 6000

nodeA returned: nodeB-ip1;nodeB-ip2;172.17.42.1:6000

nodeB returned: nodeA-ip1;nodeA-ip2

172 (172.17.42.1) returned: Invalid command or Could not connect to node 172.17.42.1

Does this mean that the addresses nodeB advertised to the cluster were nodeB-ip1, nodeB-ip2 and 172.17.42.1:6000? But “172.17.42.1:6000” is an unknown address that does not belong to nodeB.


#20

I tried your suggestions, but they did not work.

According to the official docs (http://www.aerospike.com/docs/reference/info#service), the command “asinfo -v service -h nodeB -p 6000” returns the list of addresses that nodeB advertises to other cluster nodes.

asinfo -v service -h nodeB -p 6000
nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000

It seems that nodeB is advertising an IP, 172.17.42.1, that does not belong to it. So what we need to do is get rid of that IP. Is that right?
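The check described here, comparing what a node advertises against the addresses it is known to own, can be sketched in a few lines of Python. The function names are made up for this example, and the host names are the masked placeholders from the thread rather than real addresses.

```python
def parse_service_list(raw):
    """Split a semicolon-separated 'service' response into (host, port) pairs."""
    pairs = []
    for entry in raw.strip().split(";"):
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def unexpected_addresses(raw, known_hosts):
    """Return advertised (host, port) pairs whose host is not in known_hosts."""
    return [(h, p) for h, p in parse_service_list(raw) if h not in known_hosts]


# Output copied from the thread; nodeB-ip1 and nodeB-ip2 are masked placeholders.
advertised = "nodeB-ip1:6000;nodeB-ip2:6000;172.17.42.1:6000"
print(unexpected_addresses(advertised, {"nodeB-ip1", "nodeB-ip2"}))
# [('172.17.42.1', 6000)]
```

Running the same comparison against each node's output after a config change is a quick way to confirm that the stray address is no longer being advertised.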


#21

Could some other non-Aerospike process also be listening on port 6000 on nodeB? On nodeB, can you run netstat and see which processes are using port 6000?
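If the netstat output is long, a short Python filter can pick out the listeners on one port. This sketch assumes the column layout of Linux `netstat -tlnp` (proto, recv-q, send-q, local address, foreign address, state, PID/program); the sample lines and PIDs below are invented for illustration and are not output from the poster's machine.

```python
def listeners_on_port(netstat_output, port):
    """Return (local_address, pid/program) for LISTEN sockets on the given port."""
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a listening socket.
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        local_addr = fields[3]
        if local_addr.rsplit(":", 1)[-1] == str(port):
            found.append((local_addr, fields[6]))
    return found


# Made-up sample in netstat -tlnp format.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN      1234/asd
tcp        0      0 0.0.0.0:6001            0.0.0.0:*               LISTEN      1234/asd
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      987/sshd
"""
print(listeners_on_port(sample, 6000))  # [('0.0.0.0:6000', '1234/asd')]
```

If anything other than the Aerospike daemon (asd) shows up for port 6000, that would be worth reporting back here.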