"Initial partition balance unresolved" after restart 1 node in Cluster (AER-3863)


#1

Hello, I just restarted 1 node in a cluster (3 nodes, replication factor 2) and see this in the logs of the restarted node:

May 22 2015 02:55:32 GMT: WARNING (tsvc): (thr_tsvc.c::424) rejecting client transaction - initial partition balance unresolved

and this on the other nodes:

May 22 2015 02:57:02 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb920103bcb2b78 and self bb9f0afb752aed4
May 22 2015 02:57:02 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb920103bcb2b78

What is this about? How can I deal with it?


#2

After a stop, a wait, and then a start, it is gone. Is it related to the restart? AS version 3.5.8.


#3

There is a very small window in which a node has joined a cluster and the other nodes have begun to advertise it, but the node hasn’t finished creating its partition table. If a client picks up that advertised service and makes a request during this window, the node logs this warning.

I wouldn’t expect this window to last very long at all. This is actually the first time I have seen this message being logged, so congratulations. How long did this message last?
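
To answer the "how long" question from a log file, something like the following may help. It is only a rough sketch: it assumes the log lines look exactly like the ones quoted in post #1 (same timestamp layout, same warning text), and `warning_window` is a made-up helper name, not an Aerospike tool.

```python
import re
from datetime import datetime

# Timestamp prefix plus the warning text, as quoted in post #1.
# Assumption: every log line uses this exact layout.
WARNING_RE = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) GMT: WARNING \(tsvc\): .*"
    r"initial partition balance unresolved"
)


def warning_window(lines):
    """Return (first, last, count) for the warning, or None if it never appears."""
    stamps = []
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            # %b expects English month abbreviations (default C locale).
            stamps.append(datetime.strptime(match.group("ts"), "%b %d %Y %H:%M:%S"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)


# The three log lines quoted in post #1; in practice, pass an open log file instead.
sample = [
    "May 22 2015 02:55:32 GMT: WARNING (tsvc): (thr_tsvc.c::424) rejecting client transaction - initial partition balance unresolved",
    "May 22 2015 02:57:02 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb920103bcb2b78 and self bb9f0afb752aed4",
    "May 22 2015 02:57:02 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb920103bcb2b78",
]

result = warning_window(sample)
if result is not None:
    first, last, count = result
    print(f"{count} warning(s) between {first} and {last} (duration {last - first})")
```

A file object can be passed directly, since it iterates line by line. A duration of a few seconds would match the short window described above; minutes or more would point to something else.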


#4

I see this warning non-stop until I stop the node. If I restart it, the warning appears again.


#5

Could you share information about the environment you are running?

What OS? Kernel?

Is this running in a virtualized environment?

Have you been able to reproduce the issue? If so can you provide your method?

Are the Aerospike nodes on different machines?

Also could you share your /etc/aerospike/aerospike.conf?


#6

These are not virtual machines; the cluster is 3 bare-metal servers. My configs are here: How to increase threads used by UDFs?

I was able to reproduce it just by running /etc/init.d/aerospike restart. I will check on Monday whether it is still reproducible.


#7

The configuration there has the replication-factor configured to 0! And the namespace is not persisted.

If this is still the case then I would expect data loss when a node is dropped. Also, replication-factor 0 should be an illegal setting; I am not sure what behavior you will see with that.

The minimum replication-factor should be 1, which is to say that there is only a single copy of the data in the cluster. This means that if a single node drops, a portion of your data will no longer be in the cluster.

If you want 2 copies in the cluster then replication-factor needs to be configured to 2.
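
For anyone who wants to check a config for this before restarting a node, here is a rough sketch of a check. It is not an Aerospike tool: `replication_factors` and `invalid_namespaces` are made-up helper names, the namespace names `test` and `bar` are placeholders, and the parsing assumes the usual `namespace <name> { ... }` block layout with one setting per line.

```python
import re


def replication_factors(conf_text):
    """Map each namespace name to its replication-factor (None if not set)."""
    factors = {}
    current, depth = None, 0
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if current is None:
            match = re.match(r"namespace\s+(\S+)\s*\{$", line)
            if match:
                current, depth = match.group(1), 1
                factors[current] = None
            continue
        match = re.match(r"replication-factor\s+(\d+)$", line)
        if match and depth == 1:
            factors[current] = int(match.group(1))
        # Track nested blocks so the closing brace of the namespace is found.
        depth += line.count("{") - line.count("}")
        if depth == 0:
            current = None
    return factors


def invalid_namespaces(conf_text):
    """Namespaces whose explicit replication-factor is below the minimum of 1."""
    return [
        name
        for name, factor in replication_factors(conf_text).items()
        if factor is not None and factor < 1
    ]


# Hypothetical config fragment: one in-memory namespace with the bad setting,
# one with two copies of the data.
SAMPLE_CONF = """
namespace test {
    replication-factor 0   # should be at least 1
    storage-engine memory
}

namespace bar {
    replication-factor 2
    storage-engine memory
}
"""

print(invalid_namespaces(SAMPLE_CONF))
```

To run it against a real file, read /etc/aerospike/aerospike.conf and pass its contents as `conf_text`. Namespaces that do not set replication-factor at all are reported as None by `replication_factors` and are not flagged.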


#8

Sorry, I sent you the wrong config. That config is for a single-server installation. I will send the right one a bit later.


#9

@nizsheanez,

A JIRA ticket has been filed to make replication-factor 0 an illegal setting. It’s AER-3863, just for reference.

We look forward to seeing your config!

Cheers,

Maud