Write should continue if one of the nodes fails


#1

Hi All,

I am using the latest client & server (Community Edition). If one of the nodes in the cluster (replication factor 2) fails, I want write operations to continue and eventually complete.

I’ve tried setting the policies below.

cfg.policies.read.replica = AS_POLICY_REPLICA_ANY;
cfg.policies.write.replica = AS_POLICY_REPLICA_ANY;
cfg.policies.write.commit_level = AS_POLICY_COMMIT_LEVEL_MASTER;

cfg.policies.write.base.total_timeout = 5000;        // 5 secs
cfg.policies.write.base.max_retries = 1;
cfg.policies.write.base.sleep_between_retries = 300; // 300 ms

But I don’t see the write operation completing; it gets stuck forever. Can anyone please help with some pointers?
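For context, here is how that configuration fits into a minimal client setup. This is a sketch assuming the standard C client initialization flow; the seed host, namespace/set names, and the bin name `bin1` are illustrative:

```c
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_config.h>
#include <aerospike/as_record.h>
#include <stdio.h>

int main(void)
{
    as_config cfg;
    as_config_init(&cfg);
    as_config_add_host(&cfg, "192.168.4.10", 3000); // any seed node

    // Policies from the snippet above.
    cfg.policies.read.replica = AS_POLICY_REPLICA_ANY;
    cfg.policies.write.replica = AS_POLICY_REPLICA_ANY;
    cfg.policies.write.commit_level = AS_POLICY_COMMIT_LEVEL_MASTER;
    cfg.policies.write.base.total_timeout = 5000;        // 5 secs
    cfg.policies.write.base.max_retries = 1;
    cfg.policies.write.base.sleep_between_retries = 300; // 300 ms

    aerospike as;
    aerospike_init(&as, &cfg);

    as_error err;
    if (aerospike_connect(&as, &err) != AEROSPIKE_OK) {
        fprintf(stderr, "connect failed: %d %s\n", err.code, err.message);
        return 1;
    }

    as_key key;
    as_key_init_str(&key, "DIRTY", "eg-set", "key-1");

    as_record rec;
    as_record_inita(&rec, 1);
    as_record_set_int64(&rec, "bin1", 1);

    // On failure, err.in_doubt indicates whether the write may have
    // been applied, which matters when deciding whether to re-issue it.
    if (aerospike_key_put(&as, &err, NULL, &key, &rec) != AEROSPIKE_OK) {
        fprintf(stderr, "put failed: %d %s (in_doubt=%d)\n",
                err.code, err.message, err.in_doubt);
    }

    as_record_destroy(&rec);
    aerospike_close(&as, &err);
    aerospike_destroy(&as);
    return 0;
}
```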


#2

All modes should achieve this result. How are you determining that the write isn’t completing?

What do you mean by ‘stuck forever’?


#4

Hi Kevin,
I’ve modified the async_get example to write 30000 records.
It works fine: I see 60000 records in the DB (masters plus replicas, given the replication factor of 2).

Admin> show statistics sets 
~~~~~~~~~~~~~~~~~~~~~~~~DIRTY eg-set Set Statistics (2018-12-13 11:51:33 UTC)~~~~~~~~~~~~~~~~~~~~~~~~
NODE             :   192.168.4.10:3000   192.168.4.164:3000   192.168.6.192:3000   192.168.7.186:3000   
disable-eviction :   false               false                false                false                
memory_data_bytes:   0                   0                    0                    0                    
ns               :   DIRTY               DIRTY                DIRTY                DIRTY                
objects          :   14720               15161                14689                15430                
set              :   eg-set              eg-set               eg-set               eg-set               
set-enable-xdr   :   use-default         use-default          use-default          use-default          
stop-writes-count:   0                   0                    0                    0                    
tombstones       :   0                   0                    0                    0                    
truncate_lut     :   0                   0                    0                    0                    

Admin> 

Now, when I bring down 192.168.4.10 (while writes are still happening), I see fewer than 60000 records.

Admin> show statistics sets 
~~~~~~~~~~~~~~DIRTY eg-set Set Statistics (2018-12-13 12:27:52 UTC)~~~~~~~~~~~~~~
NODE             :   192.168.4.164:3000   192.168.6.192:3000   192.168.7.186:3000   
disable-eviction :   false                false                false                
memory_data_bytes:   0                    0                    0                    
ns               :   DIRTY                DIRTY                DIRTY                
objects          :   19821                20093                20006                
set              :   eg-set               eg-set               eg-set               
set-enable-xdr   :   use-default          use-default          use-default          
stop-writes-count:   0                    0                    0                    
tombstones       :   0                    0                    0                    
truncate_lut     :   0                    0                    0                    

Admin> 

I get the logs below from the Aerospike C client, which detects the node’s removal from the cluster.

write succeeded 5000
[src/main/aerospike/as_cluster.c:547][as_cluster_tend] 2 - Node BB91ABB97565000 refresh failed: AEROSPIKE_ERR_CONNECTION Bad file descriptor
[src/main/aerospike/as_cluster.c:547][as_cluster_tend] 2 - Node BB91ABB97565000 refresh failed: AEROSPIKE_ERR_CONNECTION Socket write error: 111, 192.168.4.10:3000, 35954
[src/main/aerospike/as_cluster.c:547][as_cluster_tend] 2 - Node BB91ABB97565000 refresh failed: AEROSPIKE_ERR_CONNECTION Socket write error: 111, 192.168.4.10:3000, 35956
[src/main/aerospike/as_cluster.c:362][as_cluster_remove_nodes_copy] 2 - Remove node BB91ABB97565000 192.168.4.10:3000
count:5001, ev-loop:0, pending-cmds:0

write succeeded 15000
[src/main/aerospike/as_node.c:158][as_node_destroy] 2 - as_node_destroy start for node 192.168.4.10:3000
[src/main/aerospike/as_node.c:205][as_node_destroy] 2 - as_node_destroy end for node 192.168.4.10:3000
count:15001, ev-loop:0, pending-cmds:0

The end goal is to get all 60000 records into the DB even if one of the nodes fails while writes are happening.
Is there a configuration policy for the Aerospike C client that can retry a write operation against a different node?


#5

You could quiesce the node before taking it down; that way the shutdown would be graceful. Otherwise, check whether the C client supports retries and/or build your own retry logic.
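On the "build your own retry logic" suggestion, here is a minimal, self-contained sketch of an application-level retry helper in plain C. The names (`retry_op`, `op_fn`) are illustrative, not part of the client API; the actual `aerospike_key_put` call would go inside the callback:

```c
#include <stdbool.h>
#include <unistd.h> // usleep (POSIX)

// Hypothetical operation signature: returns true on success.
typedef bool (*op_fn)(void *udata);

// Attempt `op` once, plus up to `max_retries` retries,
// sleeping `sleep_ms` milliseconds between attempts.
static bool retry_op(op_fn op, void *udata, int max_retries, unsigned sleep_ms)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (op(udata)) {
            return true; // operation succeeded
        }
        if (attempt < max_retries) {
            usleep(sleep_ms * 1000); // back off before the next attempt
        }
    }
    return false; // all attempts exhausted
}
```

One caveat when wrapping writes this way: blindly re-issuing a write that failed with `in_doubt` set can apply it twice, so non-idempotent operations (e.g. increments) need extra care.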