"Initial partition balance unresolved" after restart 1 node in Cluster (AER-3863)

nizsheanez · May 22, 2015, 2:57am

Hello, I just restarted one node in a cluster (3 nodes, replication factor 2) and see this in the logs of the restarted node:

May 22 2015 02:55:32 GMT: WARNING (tsvc): (thr_tsvc.c::424) rejecting client transaction - initial partition balance unresolved

and this on the other nodes:

May 22 2015 02:57:02 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb920103bcb2b78 and self bb9f0afb752aed4
May 22 2015 02:57:02 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb920103bcb2b78

What is this about? How can I deal with it?

nizsheanez · May 22, 2015, 3:27am

After stopping the node, waiting, and starting it again, the warning is gone. Is it related to the restart? Aerospike version 3.5.8.

kporter · May 22, 2015, 8:53pm

There is a very small window when a node joins a cluster: the other nodes begin to advertise the new node before it has finished creating its partition table. If a client picks up the advertised service and makes a request during that window, the request is rejected with this warning.

I wouldn’t expect this window to last long at all. In fact, this is the first time I have seen this message actually being logged, so congratulations. How long did the message persist?
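
If the window really is as short as described, a client-side retry with a brief backoff would normally ride over it. The following is only a minimal sketch of that idea and is not tied to any Aerospike client API: `TransientRejection`, `call_with_retry`, and the `op` callable are hypothetical names introduced for illustration, and the delay values are arbitrary assumptions.

```python
import time


class TransientRejection(Exception):
    """Hypothetical stand-in for the error a client sees when a node rejects a request."""


def call_with_retry(op, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call op(); on TransientRejection, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientRejection:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles on each failed attempt: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
```

A retry like this only helps with a short-lived rejection. If the warning never stops, as in a persistent cluster integrity fault, the retries will simply be exhausted and the underlying cluster problem still needs to be fixed.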

nizsheanez · May 23, 2015, 5:44am

I see this warning non-stop until I stop the node. If I restart the node, it appears again.

kporter · May 24, 2015, 2:41am

Could you share information about the environment you are running?

What OS? Kernel?

Is this running in a virtualized environment?

Have you been able to reproduce the issue? If so can you provide your method?

Are the Aerospike nodes on different machines?

Also could you share your /etc/aerospike/aerospike.conf?

nizsheanez · May 24, 2015, 5:40am

These are not virtual machines; the cluster is 3 bare-metal servers. My configs are here: How to increase threads used by UDFs? - #10 by raj

I was able to reproduce it just by running /etc/init.d/aerospike restart. I will check on Monday whether it is still reproducible.

kporter · May 25, 2015, 5:49am

The configuration there has the replication-factor configured to 0! And the namespace is not persisted.

If this is still the case, then I would expect data loss when a node is dropped. Also, replication-factor 0 should be an illegal setting; I am not sure what behavior you will see with it.

The minimum replication-factor should be 1, which is to say that there is only a single copy of the data in the cluster. This means that if a single node drops, a portion of your data will not be in the cluster.

If you want 2 copies in the cluster then replication-factor needs to be configured to 2.
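
As a rough illustration of the check described above, here is a small sketch that scans configuration text for namespaces whose replication-factor is below 1. The sample configuration is invented for the example (it is not the poster's actual file), and `find_bad_replication_factors` is a hypothetical helper, not part of any Aerospike tool.

```python
import re

# Invented sample, loosely following the aerospike.conf namespace layout.
SAMPLE_CONF = """
namespace test {
    replication-factor 0
    storage-engine memory
}
namespace prod {
    replication-factor 2
    storage-engine memory
}
"""


def find_bad_replication_factors(conf_text, minimum=1):
    """Return {namespace: value} for namespaces whose replication-factor is below minimum."""
    bad = {}
    current = None  # name of the namespace stanza we are currently inside
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        m = re.match(r"namespace\s+(\S+)\s*\{", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"replication-factor\s+(\d+)$", line)
        if m and current is not None and int(m.group(1)) < minimum:
            bad[current] = int(m.group(1))
    return bad


print(find_bad_replication_factors(SAMPLE_CONF))
```

Running a check like this against the real /etc/aerospike/aerospike.conf on each node would flag a namespace left at replication-factor 0, which is the setting questioned above.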

nizsheanez · May 25, 2015, 9:24am

Sorry, I sent you the wrong config. That config is for a single-server installation. I will send the right one a bit later.

Mnemaudsyne · May 27, 2015, 12:52am

@nizsheanez,

A JIRA ticket has been filed to make replication-factor 0 an illegal setting. It’s AER-3863, just for reference.

We look forward to seeing your config!

Cheers,

Maud
