Replication issue: all nodes down when synchronizing after a node restart


#1

Hi all,

We are encountering an issue with a configuration based on 2 Aerospike nodes on 2 different machines (same namespace, replication factor = 2).

If a node is stopped and then restarted, both nodes become unavailable after the restart:

Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE

We have the following configuration on each machine:

namespace calypso {
        replication-factor 2
        memory-size 128G
        default-ttl 0

        storage-engine device {
                device /dev/sda
                device /dev/sdb
                write-block-size 256K
        }
}

Is this because we are using only 2 nodes for replication? Do we need more nodes to handle replication plus failover?

Thanks


#2

Aerospike Server version?

Can we see the network config context?


#3

Hi,

We’re using Aerospike 3.9.0.

Here is our network config for node 1 (we have 2 servers, x.x.x.1 and x.x.x.2, in mesh mode in the same datacenter):

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002
                address x.x.x.1
                mesh-seed-address-port x.x.x.1 3002 
                mesh-seed-address-port x.x.x.2 3002

                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

We’ve also found another issue: after the first error occurs (Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE), if we restart one node and then restart the other, all servers become unavailable (seemingly permanently):

AEROSPIKE_ERR_CLIENT Socket write error: 111

Error -1: Failed to connect

We also see this in the Aerospike logs:

Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:193) SIGTERM received, shutting down
Jul 21 2016 14:37:23 GMT: WARNING (as): (signal.c:196) startup was not complete, exiting immediately

#4

We’ve just tried the same sequence with 4 nodes instead of 2. Even if we stop only one node before or during data insertion, we get the AEROSPIKE_ERR_CLUSTER_CHANGE error, and worse, if we restart 2 nodes sequentially during this ‘cluster change’ state, it brings down the whole cluster.


#5

After further testing, we only have to follow these steps to bring the whole cluster down (tested with 2 nodes):

  • stop node 1
  • insert some data
  • start node 1

–> Error: (7) AEROSPIKE_ERR_CLUSTER_CHANGE (not so bad, as we can still query potentially inaccurate data with the Java client and failOnClusterChange = false)

  • restart node 2

–> AEROSPIKE_ERR_CLIENT Socket write error: 111

So it seems that if a node is migrating its data to another node for replication and that node is stopped, the whole cluster dies.

The worst part is that the only way to repair it seems to be to wipe the data partitions.
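One way to avoid triggering the second failure is to wait for migrations to finish before restarting the next node. A minimal sketch, assuming aerospike-tools is installed and the server runs as an init-script service (host addresses here mirror the x.x.x.1/x.x.x.2 placeholders above):

```shell
# Check cluster state and in-flight migrations on both nodes (3.x statistics):
asinfo -h x.x.x.1 -v 'statistics' -l | grep -E 'cluster_size|cluster_integrity|migrate_progress'
asinfo -h x.x.x.2 -v 'statistics' -l | grep -E 'cluster_size|cluster_integrity|migrate_progress'

# Only restart the second node once migrate_progress_send and
# migrate_progress_recv are 0 on every node and cluster_integrity is true:
sudo service aerospike restart
```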


#6

Hello

We are also facing exactly the same issue. Could you please share how you resolved it?

–Chintan


#7

We didn’t solve it… With more nodes it seems to work better, but we still see some errors.


#8

Could either of you provide server logs during this event?

Also @chintan.gandhi, what version of Aerospike Server and which client are you using?


#9

@kporter The full problem is described in this issue, with more logs in the comment https://github.com/aerospike/aerospike-server/issues/142


#10

@Loic_D Thanks for the link. Allowing reads from any replica node would mitigate this issue somewhat. For Python, you would change the client’s replica policy to POLICY_REPLICA_ANY.
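A minimal sketch of that policy change with the Aerospike Python client (the hosts are the placeholder addresses from this thread; the flat `'replica'` policy key matches the 2016-era client and may differ in newer client versions):

```python
import aerospike

# POLICY_REPLICA_ANY lets reads go to either the master or a replica
# partition, instead of failing when the master is unreachable.
config = {
    'hosts': [('x.x.x.1', 3000), ('x.x.x.2', 3000)],
    'policies': {
        'replica': aerospike.POLICY_REPLICA_ANY,
    },
}

client = aerospike.client(config).connect()
```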