Replication issue: all nodes down when synchronizing after a node restart


#1

Hi all,

We are encountering an issue with a configuration based on 2 Aerospike nodes on 2 different machines (same namespace, replication factor = 2).

If a node is stopped and then restarted, both nodes become unavailable after the restart:

Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE

We have the following configuration on each machine:

namespace calypso {
        replication-factor 2
        memory-size 128G
        default-ttl 0

        storage-engine device {
                device /dev/sda
                device /dev/sdb
                write-block-size 256K
        }
}

Is this because we are using only 2 nodes for replication? Do we need more nodes to handle replication plus failover?

Thanks


#2

Aerospike Server version?

Can we see the network config context?


#3

Hi,

We’re using Aerospike 3.9.0.

Here is our network config for node 1 (we have 2 servers, x.x.x.1 and x.x.x.2, in mesh mode in the same datacenter):

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002
                address x.x.x.1
                mesh-seed-address-port x.x.x.1 3002 
                mesh-seed-address-port x.x.x.2 3002

                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

We’ve also found another issue: after the first error occurs (Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE), if we restart one node and then restart the other, all servers become unavailable (seemingly permanently):

AEROSPIKE_ERR_CLIENT Socket write error: 111

Error -1: Failed to connect

We also see this in the Aerospike logs:

Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:193) SIGTERM received, shutting down
Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:196) startup was not complete, exiting immediately

#4

We’ve just tried the same sequence with 4 nodes instead of 2. Even if we stop only one node before or during data insertion, we get the AEROSPIKE_ERR_CLUSTER_CHANGE error, and worse, if we restart 2 nodes sequentially during this ‘cluster change’ state, it brings down the whole cluster.


#5

After further testing, we only have to follow these steps to bring the whole cluster down (tested with 2 nodes):

  • stop node 1
  • insert some data
  • start node 1

–> Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE (not so bad, as we can still query potentially inaccurate data with the Java client and failOnClusterChange = false)

  • restart node 2

–> AEROSPIKE_ERR_CLIENT Socket write error: 111

So it seems that if a node is migrating its data to another node for replication and that node is stopped, the whole cluster dies.

The worst part is that the only way to repair it seems to be to wipe the data partitions.
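One way to avoid triggering the second failure is to wait for migrations to finish before restarting the next node. A minimal sketch, assuming aerospike-tools is installed and the server runs as an init-script service (host addresses here mirror the x.x.x.1/x.x.x.2 placeholders above):

```shell
# Check cluster state and in-flight migrations on both nodes (3.x statistics):
asinfo -h x.x.x.1 -v 'statistics' -l | grep -E 'cluster_size|cluster_integrity|migrate_progress'
asinfo -h x.x.x.2 -v 'statistics' -l | grep -E 'cluster_size|cluster_integrity|migrate_progress'

# Only restart the second node once migrate_progress_send and
# migrate_progress_recv are 0 on every node and cluster_integrity is true:
sudo service aerospike restart
```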


#6

Hello

We are also facing exactly the same issue. Could you please share how you resolved it?

–Chintan


#7

We didn’t solve it… With more nodes it seems to work better, but we still see some errors.


#8

Could either of you provide server logs during this event?

Also @chintan.gandhi, what version of Aerospike Server and which client are you using?


#9

@kporter The full problem is described in this issue, with more logs in the comment https://github.com/aerospike/aerospike-server/issues/142


#10

@Loic_D Thanks for the link. Allowing reads from any replica node would mitigate this issue somewhat. For Python, you would change the client’s replica policy to POLICY_REPLICA_ANY.
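A minimal sketch of that policy change with the Aerospike Python client (the hosts are the placeholder addresses from this thread; the flat `'replica'` policy key matches the 2016-era client and may differ in newer client versions):

```python
import aerospike

# POLICY_REPLICA_ANY lets reads go to either the master or a replica
# partition, instead of failing when the master is unreachable.
config = {
    'hosts': [('x.x.x.1', 3000), ('x.x.x.2', 3000)],
    'policies': {
        'replica': aerospike.POLICY_REPLICA_ANY,
    },
}

client = aerospike.client(config).connect()
```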