Getting "timeouts" after one node shutdown


#1

I have a 4-node cluster with replication factor 2. The client is a multiprocess application that connects to the cluster in the parent process and reads random data. When I shut down one node, the application experiences a few (~1 per worker) errors like: -1 - Bad file descriptor, -1 - Socket read error: 104

After that, I get many “timeouts” like: 9 - Client timeout: timeout=3000 iterations=4 failedNodes=0 failedConns=0, even though aerospike_key_get takes only a few ms. This bad state lasts 2-3 s (but affects many requests), and then everything goes back to normal.

Can I force the library to simply ask the replica for the data instead of the dead node? It has more than 2990 ms to spare for that :wink:

Server: CE 3.5.15, client: 3.1.20. I’m trying to configure the client to be very defensive:

cfg.policies.timeout = 3000;
cfg.policies.retry = 3;
cfg.use_shm = true;
cfg.shm_takeover_threshold_sec = 1;

Whole example client with log: https://gist.github.com/jarda-manana/237e57bf84111fc1a8c1


#2

It is possible to set the as_policy_replica to AS_POLICY_REPLICA_ANY, which will read randomly from any replica copy.

Please note that while a write for the same key is in transit (from a different client instance), a read from a non-master replica may not yet return the updated data that the master would have.
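
As a rough sketch, the setting could be added to the config from the first post like this. It assumes the 3.x C client, where the read policy carries a `replica` field; please check as_policy.h in your client version, since the field layout may differ.

```c
as_config cfg;
as_config_init(&cfg);

// Settings from the original post.
cfg.policies.timeout = 3000;
cfg.policies.retry = 3;

// Assumption: the read policy exposes a `replica` field in this client version.
// Allow reads to be served by any replica copy, not only the master.
cfg.policies.read.replica = AS_POLICY_REPLICA_ANY;
```

This only affects reads; writes still go to the master.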


#3

Thanks wchu, it solved the fake timeouts. I still get a few connection reset by peer errors while the node is going down, but I can live with that :wink: