Here is our network config for node 1 (we have 2 servers, x.x.x.1 and x.x.x.2, in mesh mode in the same datacenter):
network {
    service {
        address any
        port 3000
    }
    heartbeat {
        mode mesh
        port 3002
        address x.x.x.1
        mesh-seed-address-port x.x.x.1 3002
        mesh-seed-address-port x.x.x.2 3002
        interval 150
        timeout 10
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
We’ve also found another issue: after the first error occurs (Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE), if we restart one node and then restart the other, all servers become unavailable (seemingly forever):
AEROSPIKE_ERR_CLIENT Socket write error: 111
Error -1: Failed to connect
We also have this in the Aerospike logs:
Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:193) SIGTERM received, shutting down
Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:196) startup was not complete, exiting immediately
We’ve just tried the same sequence with 4 nodes instead of 2. Even if we only stop one node before or during the data insertion, we get the AEROSPIKE_ERR_CLUSTER_CHANGE error, and worse, if we restart 2 nodes sequentially during this ‘cluster change’ state, it brings the whole cluster down.
After new tests, we just have to follow these steps to bring the whole cluster down (reproduced with 2 nodes):
1. Stop node 1.
2. Insert some data.
3. Start node 1.
   → Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE
   (not so bad, as we can still query - potentially inaccurate - data with the Java client and failOnClusterChange = false; see the sketch after these steps)
4. Restart node 2.
   → AEROSPIKE_ERR_CLIENT Socket write error: 111
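For reference, here is a minimal sketch of that Java-client workaround; the seed host, namespace, and set names are placeholders, and it assumes a 3.x/4.x Java client where QueryPolicy exposes failOnClusterChange:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class QueryDuringClusterChange {
    public static void main(String[] args) {
        // Placeholder seed host; use one of your cluster nodes.
        AerospikeClient client = new AerospikeClient("x.x.x.1", 3000);
        try {
            QueryPolicy policy = new QueryPolicy();
            // Do not abort the query while the cluster is reconfiguring;
            // results may be incomplete or duplicated during migrations.
            policy.failOnClusterChange = false;

            // Placeholder namespace and set names.
            Statement stmt = new Statement();
            stmt.setNamespace("test");
            stmt.setSetName("demo");

            RecordSet rs = client.query(policy, stmt);
            try {
                while (rs.next()) {
                    System.out.println(rs.getRecord());
                }
            } finally {
                rs.close();
            }
        } finally {
            client.close();
        }
    }
}
```

Even with the flag cleared, results returned while partitions are migrating can be incomplete, so this is a read-availability workaround rather than a fix.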
So it seems that if a node is sending its data to another node for replication and that target node is stopped, the whole cluster dies.
The really bad part is that the only way we’ve found to repair it is to delete the whole data partition.
@Loic_D Thanks for the link. Allowing reads from any replica node would reduce this issue a bit. For Python you would change the client’s replica policy to POLICY_REPLICA_ANY.
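The constant named above is for the Python client; for the Java client mentioned earlier in the thread, a rough equivalent (assuming a 3.x+ client where Policy.replica and the Replica enum are available, and mapping POLICY_REPLICA_ANY to Replica.MASTER_PROLES) would look like this sketch:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;

public class ReadFromAnyReplica {
    public static void main(String[] args) {
        // Placeholder seed host; use one of your cluster nodes.
        AerospikeClient client = new AerospikeClient("x.x.x.1", 3000);
        try {
            Policy readPolicy = new Policy();
            // Spread single-record reads over the master and replica (prole)
            // partitions instead of always going to the master.
            readPolicy.replica = Replica.MASTER_PROLES;

            // Placeholder namespace, set, and key.
            Key key = new Key("test", "demo", "some-key");
            Record record = client.get(readPolicy, key);
            System.out.println(record);
        } finally {
            client.close();
        }
    }
}
```

This only affects reads; writes still have to go to the partition’s master node, so it reduces errors during a cluster change rather than eliminating them.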