Removing a node without causing client failure


#1

We’re in the middle of transitioning to some new hardware, and as part of the process I have to remove a few old nodes from our cluster. I’ve been careful to remove nodes at the right time (e.g., removing one node at a time, waiting for migrations to finish, etc.), but whenever I remove a node, all of the Aerospike clients that are reading from the cluster temporarily throw exceptions like this:

com.aerospike.client.AerospikeException$Timeout: Client timeout: timeout=0 iterations=2 failedNodes=0 failedConns=2
    at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:131)
    at com.aerospike.client.command.BatchExecutor.execute(BatchExecutor.java:53)
    at com.aerospike.client.AerospikeClient.get(AerospikeClient.java:606)

What is the preferred way to remove a node from the cluster in such a way that it is transparent to currently active clients?


#2

I will check this out and see if I can locate a best practice. If I can, I will share it on this thread.


#3

I checked this out and there is no best practice for removing cluster nodes, gracefully or otherwise. The problem is that the client holds a partition map that still locates records on the node that has been removed. There is no way to notify the client that the partition map has changed before the node goes away.

What you can do is set a very low tend interval, which controls how often the client requests the partition map from the cluster. This would minimise the impact. That said, it would also increase network traffic, so it should only be done on a temporary basis.
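To make the trade-off above concrete, here is a rough back-of-the-envelope sketch in plain Java. It does not use the Aerospike client at all. It assumes (this is an idealisation, not something stated in this thread) that every client polls every node once per tend interval and that a stale partition map survives for at most one interval. The client and node counts are hypothetical.

```java
// Rough sketch: how a lower tend interval trades polling traffic for a
// shorter stale-map window. All numbers are illustrative assumptions.
class TendTradeoff {
    // Approximate cluster-wide polling requests per second, ASSUMING each
    // client polls each node once per tend interval.
    static double infoRequestsPerSecond(int clients, int nodes, int tendIntervalMs) {
        return clients * nodes * (1000.0 / tendIntervalMs);
    }

    // Idealised upper bound on how long a client keeps using a stale
    // partition map: one full tend interval.
    static int worstCaseStaleMs(int tendIntervalMs) {
        return tendIntervalMs;
    }

    public static void main(String[] args) {
        int clients = 20; // hypothetical number of client processes
        int nodes = 6;    // hypothetical cluster size
        for (int interval : new int[] {1000, 250, 100}) {
            System.out.printf("tend=%dms -> ~%.0f polls/s, stale window <= %dms%n",
                    interval,
                    infoRequestsPerSecond(clients, nodes, interval),
                    worstCaseStaleMs(interval));
        }
    }
}
```

Under these assumptions the polling load grows in inverse proportion to the interval, which is why lowering it only for the duration of the maintenance seems reasonable.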

The tend interval is documented for the Java API here:

http://www.aerospike.com/apidocs/java/ (Class ClientPolicy)

and the C API here:

http://www.aerospike.com/apidocs/c/d0/d52/structas__cluster.html#a47002904a9035f9d36143f4d05ad8a65

If you know you are bringing a node down, it is probably easier to treat these messages as informational only.
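One way to treat such errors as informational during planned maintenance is to retry the read in the application. The following is only a minimal sketch in plain Java, not part of the Aerospike API: withRetry, its parameters, and the simulated operation are hypothetical names for illustration. It assumes the failing call throws an unchecked exception and that the operation is safe to repeat (for example, a read).

```java
import java.util.function.Supplier;

// Minimal sketch: retry an idempotent operation a few times with a fixed
// pause, so a brief stale-partition-map window is absorbed by the caller.
class RetryOnTransientError {
    static <T> T withRetry(Supplier<T> op, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure; log it as informational here
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs); // give the client time to refresh its map
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated read that fails twice, then succeeds.
        String value = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated client timeout");
            }
            return "record";
        }, 5, 10);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

In a real application the lambda would wrap the client read, and the pause would be chosen to be at least as long as the tend interval so that a retry has a chance of seeing the refreshed map.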


#4

This is a major issue for us in adopting Aerospike, and it causes us downtime each time we add, remove, or restart a node. Not to mention that we have never suffered from anything like this with any of the following technologies: Cassandra, ScyllaDB, Couchbase, MongoDB. It’s not clear why, when a node is being removed or added gracefully, the cluster does not notify the client with an up-to-date partition map once node shutdown is initiated.


#5

I can suggest an operational workaround.

On the node being removed, use iptables to block all heartbeat-related and Aerospike fabric-related traffic, both incoming and outgoing (i.e., block the ports used for heartbeat and for fabric/Aerospike internal communication).

This would make the rest of the cluster think that the node is dead, so the remaining nodes would form a new cluster and a new partition map.

But since the node is still up and running and serving clients on port 3000, it can happily serve the reads.

Caveat - any writes which still go to this “ghost” node (my own term) will be lost from the actual cluster.

Workaround for the caveat:

On the node being removed, first dynamically reduce the stop-writes configuration to less than the actual usage, thereby triggering stop writes on the node. Then block the heartbeat traffic.

Once the new cluster without the ghost node has been happily up and running for a few seconds (you can verify whether reads on the ghost node have stopped using asloglatency), run sudo service aerospike stop on the ghost node. @Aerospike_Staff what do you guys say?

PS: I am not working with Aerospike, so take the above with a pinch of salt and as a purely operational suggestion, possibly with lots of other unknown gotchas!


#6

This could maybe be a feature request: when I do a sudo service aerospike stop, mimic something like the above, i.e., enable a read-only mode somehow (using something like what is suggested above, or something else) and then do the actual shutdown a few seconds later, to give the clients enough time to get the new partition table from the cluster.

@Alexander_Piavka Btw, I think this should only have an impact on node removal, not on node addition. The server does not know about or track the clients, so it cannot push notifications to them. The clients pull the partition map. So the errors you see happen within the tend interval mentioned by @Ben_Bates above, i.e., in the window after the server has updated the partition map but before the client pulls the next revision.
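The window described above can be put into numbers with a small sketch. This is plain Java with hypothetical names and inputs. It assumes the client pulls exactly once per tend interval, that requests are spread evenly over time and over nodes, and it ignores the time the cluster itself needs to notice the removal.

```java
// Sketch of the stale-map window: the time between the server-side map
// change and the client's next pull, plus a rough count of affected requests.
class StaleMapWindow {
    // lastPullMs:     when the client last pulled the partition map
    // mapChangeMs:    when the cluster's partition map changed (after lastPullMs)
    // tendIntervalMs: how often the client pulls the map
    static long staleWindowMs(long lastPullMs, long mapChangeMs, long tendIntervalMs) {
        if (mapChangeMs < lastPullMs || mapChangeMs > lastPullMs + tendIntervalMs) {
            throw new IllegalArgumentException("map change must fall between two pulls");
        }
        return (lastPullMs + tendIntervalMs) - mapChangeMs;
    }

    // Rough estimate of requests routed to the removed node during the window,
    // ASSUMING that node owned the given fraction of the partitions.
    static double affectedRequests(long windowMs, double requestsPerSecond, double nodeFraction) {
        return (windowMs / 1000.0) * requestsPerSecond * nodeFraction;
    }

    public static void main(String[] args) {
        // Hypothetical timeline: pull at t=0, node removed at t=300, 1 s tend interval.
        long window = staleWindowMs(0, 300, 1000);
        double affected = affectedRequests(window, 2000.0, 1.0 / 5);
        System.out.println("stale window: " + window + " ms");
        System.out.println("requests at risk (estimate): " + affected);
    }
}
```

The estimate scales linearly with the tend interval, which matches the suggestion earlier in the thread to lower it temporarily while nodes are being removed.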


#7

I would also suggest creating a feature request.

Anshu’s proposed solution, though clever, essentially trades one problem for a new set of problems.