Aerospike cluster split in VMware ESXi

hiroki_kana · September 8, 2015, 4:46am

Hi.

When heartbeats failed, a node formed a single-node cluster, a cluster integrity fault occurred, and my cluster split.

aerospike.log-20150829.gz:Aug 28 2015 08:59:35 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway …
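To see how often this warning appears and when, the log lines can be scanned with a short script. The following is a minimal sketch that assumes every line has the same shape as the one quoted above; the names `LINE_RE`, `parse_line` and `quorum_loss_times` are made up for illustration, and other server versions may format their logs differently.

```python
import re
from datetime import datetime

# Pattern for lines shaped like the warning quoted above. The optional
# "<file>:" prefix is what grep adds when it searches several files.
LINE_RE = re.compile(
    r"^(?:(?P<file>[^:\s]+):)?"
    r"(?P<ts>[A-Z][a-z]{2} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) GMT: "
    r"(?P<level>[A-Z]+) \((?P<context>[^)]+)\): "
    r"\((?P<source>[^)]+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%b %d %Y %H:%M:%S")
    return fields


def quorum_loss_times(lines):
    """Return the timestamps of paxos 'quorum visibility lost' warnings."""
    times = []
    for line in lines:
        record = parse_line(line)
        if (
            record is not None
            and record["context"] == "paxos"
            and "quorum visibility lost" in record["message"]
        ):
            times.append(record["ts"])
    return times


# The only real log line here is the one from the post; the second entry
# just shows that non-matching text is skipped.
sample = [
    "aerospike.log-20150829.gz:Aug 28 2015 08:59:35 GMT: WARNING (paxos): "
    "(paxos.c::1890) quorum visibility lost! Continuing anyway …",
    "not a log line",
]

for ts in quorum_loss_times(sample):
    print(ts.isoformat())
```

Feeding it the lines of each node's log gives a list of timestamps that can be compared across nodes to see whether the heartbeat failures coincide.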

My cluster is configured as follows:

node : 4
replication factor : 2
heartbeat mode : mesh
heartbeat timing : 150
heartbeat timeout : 10

These nodes run on VMware ESXi (the hosts are shared with non-Aerospike VMs).

My questions are:

I want the split cluster to merge back automatically. Is there a configuration setting for this?
Are the frequent heartbeat failures caused by the nodes being virtual machines?
I think this phenomenon is called “split brain”. Is there a solution?

lucien · September 29, 2015, 5:27pm

The cluster split may be due to instability in your VMware environment. You may be able to increase these heartbeat settings:

heartbeat timing : 150
heartbeat timeout : 10

Please see:

http://www.aerospike.com/docs/reference/configuration/#timeout

and

http://www.aerospike.com/docs/reference/configuration/#interval

Split brain will occur with network partitions. The database should recover once the network is restored. You may be able to force a cluster reset by running an asadm dun/undun command:

asadm -e "cluster dun all; shell sleep 5; cluster undun all"
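If the reset has to be triggered from a script rather than by hand, the command above can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, the asadm command string is copied from this reply, and it assumes asadm is on the PATH and that your asadm version supports these cluster commands.

```python
import subprocess


def build_cluster_reset_command(pause_seconds=5):
    """Build the argv for the dun / sleep / undun sequence shown above."""
    if not isinstance(pause_seconds, int) or pause_seconds < 0:
        raise ValueError("pause_seconds must be a non-negative integer")
    script = f"cluster dun all; shell sleep {pause_seconds}; cluster undun all"
    return ["asadm", "-e", script]


def run_cluster_reset(pause_seconds=5):
    """Run the reset; raises CalledProcessError if asadm exits non-zero."""
    return subprocess.run(build_cluster_reset_command(pause_seconds), check=True)


# Only print the command here; call run_cluster_reset() on a host where
# asadm is installed and a reset is really intended.
print(build_cluster_reset_command())
```

Keeping command construction separate from execution makes it possible to log or review the exact command before anything is sent to the cluster.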

lucien · September 30, 2015, 5:03pm

Can you explain what you mean by “auto merge” cluster?

After a split brain, when the network recovers, the database uses its conflict resolution policy to resolve conflicts between records.

Please see:

https://discuss.aerospike.com/t/conflict-resolution-policy-setting-usage/818

hiroki_kana · October 8, 2015, 10:32am

Thank you very much for your reply.

By “auto merge” I mean the cluster automatically re-forming after a split.

I understand that this is due to instability of the VMware environment. However, an unstable network can also occur on physical machines. For example, nodes may be unable to communicate for a few seconds when a network device (even one with a redundant configuration) fails. I would like dun/undun to happen automatically, because I think a split brain occurs at that time and we are not always ready to troubleshoot.

I know that auto-dun and auto-undun settings exist in the config. Is it better to use them to achieve automatic dun/undun?

lucien · October 13, 2015, 11:05pm

You may be able to use auto-dun on a small cluster of less than 6 nodes.

asinfo -v 'set-config:context=service;paxos-recovery-policy=auto-dun-all'

In case of a flaky network you could also increase both the interval and timeout values

http://www.aerospike.com/docs/operations/configure/network/heartbeat/

Please see example below:

heartbeat {
    ...
    interval 250                    # Number of milliseconds between heartbeats
    timeout 30                      # Number of heartbeat intervals to wait
    ...
}
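To get a feel for what these values change, multiply the interval by the timeout: that is roughly how long a node can go without hearing heartbeats from a peer before it treats the peer as gone. The sketch below compares the settings from the original post with the example above. The function names are made up for illustration, and the result is an approximation, since the exact moment depends on when the last heartbeat arrived.

```python
def detection_window_ms(interval_ms, timeout_intervals):
    """Approximate silence, in ms, before a peer is considered gone."""
    return interval_ms * timeout_intervals


def tolerates_outage(outage_ms, interval_ms, timeout_intervals):
    """True if an outage of outage_ms is shorter than the detection window."""
    return outage_ms < detection_window_ms(interval_ms, timeout_intervals)


current = detection_window_ms(150, 10)    # settings from the original post
suggested = detection_window_ms(250, 30)  # settings from the example above

print(f"current:   {current} ms ({current / 1000:.1f} s)")
print(f"suggested: {suggested} ms ({suggested / 1000:.1f} s)")

# A hypothetical 3-second network blip, like the device failure described
# earlier in the thread.
print(tolerates_outage(3000, 150, 10))   # False: 3000 ms exceeds 1500 ms
print(tolerates_outage(3000, 250, 30))   # True: 3000 ms is under 7500 ms
```

With 150 ms and 10 intervals the window is about 1.5 seconds; with 250 ms and 30 intervals it is about 7.5 seconds. The trade-off is that a node that has really failed also takes that much longer to be noticed.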

hiroki_kana · October 19, 2015, 12:05pm

Thank you for your reply.

I understand that setting paxos-recovery-policy=auto-dun-all makes the dun automatic.

In this case, does a dunned node come back into the cluster?

lucien · November 3, 2015, 2:56am

That is correct. auto-dun-all in a small cluster could help.
