Getting "timeouts" after one node shutdown

manana · August 19, 2015, 1:54pm

I have a 4-node cluster with replication factor 2. The client is a multiprocess application that connects to the cluster in the parent process and reads random data. When I shut down one node, the application experiences a few (~1 per worker) errors like: -1 - Bad file descriptor, -1 - Socket read error: 104

These are followed by many “timeouts” like: 9 - Client timeout: timeout=3000 iterations=4 failedNodes=0 failedConns=0, even though aerospike_key_get takes only a few ms. This bad state lasts 2-3 s (but affects many requests), and then everything goes back to normal.

Can I force the library to simply ask the replica that holds the data instead of the dead node? It has more than 2990 ms to spare to do so.

Server: CE 3.5.15, client: 3.1.20.

I’m trying to configure the client to be very defensive:

cfg.policies.timeout = 3000;
cfg.policies.retry = 3;
cfg.use_shm = true;
cfg.shm_takeover_threshold_sec = 1;

Whole example client with log: fget.c · GitHub

wchu · August 23, 2015, 10:30pm

It is possible to set the as_policy_replica to AS_POLICY_REPLICA_ANY, which will read randomly from any replica copy.

Please note that when a write for the same key is in transit (from a different client instance), a read from a non-master replica may not yet see the updated data that the master has.
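As a rough sketch of what that could look like in the C client: the fragment below sets the replica policy both as a default on the config and as a per-call override. It assumes the 3.x headers expose a `replica` field on `as_policy_read` and a `read` member under `cfg.policies`; the two helper function names are made up for illustration, so check as_policy.h in your client version before relying on it.

```c
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_config.h>
#include <aerospike/as_policy.h>

/* Hypothetical helper: make "any replica" the default for all reads issued
 * through a client created from this config. Assumes the read policy
 * defaults live in cfg->policies.read. */
static void use_any_replica_by_default(as_config *cfg)
{
    cfg->policies.read.replica = AS_POLICY_REPLICA_ANY;
}

/* Hypothetical helper: per-call override, so only this read may be served
 * by a non-master replica. Other fields keep the library defaults that
 * as_policy_read_init() fills in. */
static as_status get_from_any_replica(aerospike *as, as_error *err,
                                      const as_key *key, as_record **rec)
{
    as_policy_read policy;
    as_policy_read_init(&policy);
    policy.replica = AS_POLICY_REPLICA_ANY;

    return aerospike_key_get(as, err, &policy, key, rec);
}
```

The per-call form is probably the safer choice if only some reads can tolerate slightly stale data, given the caveat about in-flight writes above.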

manana · August 25, 2015, 6:34am

Thanks wchu, that solved the fake timeouts. I still get a few "connection reset by peer" errors while the node is going down, but I can live with that.
