Replication issue: all nodes down when synchronizing after a node restart


Hi all,

We are encountering an issue with a configuration based on 2 Aerospike nodes on 2 different machines (same namespace, replication factor = 2).

If a node is stopped and then restarted, each node becomes unavailable after the restart:


We have the following configuration on each machine:

namespace calypso {
        replication-factor 2
        memory-size 128G
        default-ttl 0

        storage-engine device {
                device /dev/sda
                device /dev/sdb
                write-block-size 256K
        }
}

Is it because we are using only 2 nodes for replication? Do we need more nodes to handle replication + failover?



Aerospike Server version?

Can we see the network config context?



We’re using Aerospike 3.9.0.

Here is our network config for node 1 (we have 2 servers, x.x.x.1 and x.x.x.2, in mesh mode in the same datacenter):

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002
                address x.x.x.1
                mesh-seed-address-port x.x.x.1 3002
                mesh-seed-address-port x.x.x.2 3002

                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}
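As a side note on the heartbeat settings above: assuming the usual Aerospike semantics, where interval is in milliseconds and timeout is the number of missed heartbeats before a node is declared dead, the cluster takes roughly interval × timeout to detect a failed node:

```python
# Back-of-the-envelope sketch, assuming `interval` is in ms and `timeout`
# counts missed heartbeat intervals (check your server version's docs).
interval_ms = 150        # heartbeat interval from the config above
timeout_intervals = 10   # missed heartbeats before the node is declared dead

detection_time_s = interval_ms * timeout_intervals / 1000
print(detection_time_s)  # 1.5 seconds until the cluster declares the node gone
```

So with this config, a stopped node should be noticed within about 1.5 seconds; the failures described here persist far longer than that, so they are not just detection latency.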

We’ve also found another issue: after the first error occurs (Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE), if we restart one node and then restart the other, all servers become unavailable (seemingly forever):

AEROSPIKE_ERR_CLIENT Socket write error: 111

Error -1: Failed to connect

We also have this in the Aerospike logs:

Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:193) SIGTERM received, shutting down
Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:196) startup was not complete, exiting immediately


We’ve just tried the same sequence with 4 nodes instead of 2. Even if we only stop one node before or during insertion of data, we get the AEROSPIKE_ERR_CLUSTER_CHANGE error, and worse, if we restart 2 nodes sequentially during this ‘cluster change’ state, it kills the whole cluster.


New tests done: we just have to follow these steps to take the whole cluster down (done with 2 nodes):

  • stop node 1
  • insert some data
  • start node 1

–> Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE (not so bad, as we can still query potentially inaccurate data with the Java client and failOnClusterChange = false)

  • restart node 2

–> AEROSPIKE_ERR_CLIENT Socket write error: 111

So it seems that if a node that is sending its data to another node for replication is stopped, the whole cluster dies.
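On the client side, a transient cluster-change error can at least be retried with backoff while the cluster re-forms. A minimal, library-agnostic sketch (the exception class here is a stand-in, not the real aerospike client API; a real client would raise its own exception carrying error code 7):

```python
import time

CLUSTER_CHANGE = 7  # error code seen in the messages above


class ClusterChangeError(Exception):
    """Stand-in for a client exception carrying error code 7."""
    code = CLUSTER_CHANGE


def with_retry(op, attempts=5, base_delay=0.1):
    """Call `op`, retrying with exponential backoff on cluster-change errors."""
    for attempt in range(attempts):
        try:
            return op()
        except ClusterChangeError:
            if attempt == attempts - 1:
                raise  # cluster never stabilized; give up
            time.sleep(base_delay * 2 ** attempt)
```

This only papers over the transient error during rebalancing; it cannot help with the total outage described above, where nodes exit during startup.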

The very bad part is that the only way we have found to repair it is to delete all the data partitions.



We are also facing exactly the same issue. Could you please share how you resolved this?



We didn’t solve it… With more nodes it seems to work better, but we still get some errors.


Could either of you provide server logs during this event?

Also @chintan.gandhi, what version of Aerospike Server and which client are you using?


@kporter The full problem is described in this issue, with more logs in the comments.


@Loic_D Thanks for the link. Allowing reads from any replica node would reduce this issue a bit. For Python you would change the client’s replica policy to POLICY_REPLICA_ANY.
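For reference, a sketch of that policy change with the aerospike Python client. The exact nesting of the policies dict differs between client versions (older 2.x clients take a top-level 'replica' key, newer ones nest it under the per-transaction 'read' policy), so treat the layout below as an assumption to check against your client's docs; the hosts are the two servers from this thread:

```python
# Sketch assuming the aerospike Python client (pip install aerospike).
# The fallback branch lets this snippet run without the client installed.
try:
    import aerospike
    REPLICA_ANY = aerospike.POLICY_REPLICA_ANY
except ImportError:
    aerospike = None
    REPLICA_ANY = 1  # placeholder value, for illustration only

config = {
    "hosts": [("x.x.x.1", 3000), ("x.x.x.2", 3000)],
    "policies": {
        # Read from any replica, not only the master partition's node,
        # so reads can survive a master being briefly unreachable.
        "read": {"replica": REPLICA_ANY},
    },
}

if aerospike is not None:
    client = aerospike.client(config).connect()
```

Note this only relaxes where reads are served from; writes still go to the master for each partition, so it does not prevent the restart problem itself.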