Aerospike cluster split in VMware ESXi


#1

Hi.

When heartbeats failed, nodes formed single-node clusters and a cluster integrity failure occurred. As a result, the cluster was split.

aerospike.log-20150829.gz:Aug 28 2015 08:59:35 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! Continuing anyway …
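To see how often this happens, the logs can be scanned for the same warning. Below is a minimal sketch in Python, assuming every occurrence has the same format as the line above. The function name `find_quorum_warnings` and the second sample line are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# Aug 28 2015 08:59:35 GMT: WARNING (paxos): (paxos.c::1890) quorum visibility lost! ...
PATTERN = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}) GMT: "
    r"WARNING \(paxos\): .*quorum visibility lost"
)

def find_quorum_warnings(lines):
    """Return the timestamp of every 'quorum visibility lost' warning."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append(datetime.strptime(match.group("ts"), "%b %d %Y %H:%M:%S"))
    return hits

sample = [
    # The line from my log, including the grep file-name prefix.
    "aerospike.log-20150829.gz:Aug 28 2015 08:59:35 GMT: WARNING (paxos): "
    "(paxos.c::1890) quorum visibility lost! Continuing anyway …",
    # Placeholder line, not real server output.
    "Aug 28 2015 08:59:36 GMT: some unrelated line",
]

for ts in find_quorum_warnings(sample):
    print(ts.isoformat())
```

For compressed logs, the lines could come from `gzip.open(path, "rt")` instead of a list.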

My cluster configuration:

  • nodes : 4
  • replication factor : 2
  • heartbeat mode : mesh
  • heartbeat interval : 150
  • heartbeat timeout : 10

The nodes run on VMware ESXi (the hosts are shared with non-Aerospike VMs).

My questions are:

  1. I want the split cluster to merge back automatically. Is there a configuration setting for this?
  2. Are the frequent heartbeat failures caused by the nodes being virtual machines?
  3. I think this phenomenon is called “split brain”. Is there a solution?

#2

The cluster split may be due to instability in your VMware environment. You may be able to avoid it by increasing the heartbeat interval (currently 150) and the heartbeat timeout (currently 10).

Please see:

http://www.aerospike.com/docs/reference/configuration/#timeout

and

http://www.aerospike.com/docs/reference/configuration/#interval

Split brain will occur with network partitions. The database should recover once the network is restored. You may be able to force a cluster reset by running an asadm dun/undun command:

asadm -e "cluster dun all; shell sleep 5; cluster undun all"

#3

Can you explain what you mean by “auto merge” of the cluster?

After a split brain, when the network recovers, the database uses its conflict resolution policy to resolve conflicts between copies of the same record.
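As a rough illustration of what that means, here is a toy model in Python. It is not the server's implementation: the `RecordCopy` type, the `resolve` function, and the policy names `generation` and `last-update-time` are assumptions for the sketch, so check the linked thread for the values your server version accepts.

```python
from dataclasses import dataclass

@dataclass
class RecordCopy:
    # Hypothetical, simplified view of one copy of a record.
    generation: int        # how many times this copy has been written
    last_update_time: int  # time of the last write (any increasing unit)
    value: str

def resolve(a, b, policy="generation"):
    """Pick the surviving copy under a simplified conflict resolution policy."""
    if policy == "generation":
        key = lambda r: r.generation
    elif policy == "last-update-time":
        key = lambda r: r.last_update_time
    else:
        raise ValueError(f"unknown policy: {policy}")
    # On a tie this sketch keeps the first copy; the real server may differ.
    return a if key(a) >= key(b) else b

# The same record, written independently on each side of the split.
side_a = RecordCopy(generation=5, last_update_time=1000, value="A")
side_b = RecordCopy(generation=3, last_update_time=2000, value="B")

print(resolve(side_a, side_b, "generation").value)
print(resolve(side_a, side_b, "last-update-time").value)
```

Either way, only one copy survives, so writes made to the losing copy during the split are discarded.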

For more information, please see:

https://discuss.aerospike.com/t/conflict-resolution-policy-setting-usage/818


#4

Thank you very much for your reply.

By “auto merge” I mean automatic reconstruction of the cluster after it splits.

I understand that the split may be due to instability of the VMware environment. However, an unstable network can also occur with physical machines. For example, nodes may be unable to communicate for a few seconds when a network device breaks, even one with a redundant configuration. I think split brain would happen at that point, and we are not always available to troubleshoot, so I want dun/undun to happen automatically.

I know that auto-dun and auto-undun settings exist in the configuration. Is it better to use them to get automatic dun/undun?


#5

You may be able to use auto-dun on a small cluster of fewer than 6 nodes:

asinfo -v 'config-set:context=service;paxos-recovery-policy=auto-dun-all'
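Note that asinfo talks to a single node, so the command has to be run against each node in the cluster. A minimal Python sketch of that loop follows; the host names are hypothetical placeholders, and it assumes asinfo is on the PATH and accepts `-h` for the host.

```python
import subprocess

# Hypothetical host names; replace with the addresses of your own nodes.
NODES = ["node1", "node2", "node3", "node4"]

COMMAND = "config-set:context=service;paxos-recovery-policy=auto-dun-all"

def build_asinfo_args(host, command=COMMAND):
    """Build the asinfo argument list for one node."""
    return ["asinfo", "-h", host, "-v", command]

def apply_to_all(nodes=NODES, dry_run=True):
    """Run the command on each node, or only print it when dry_run is True."""
    for host in nodes:
        args = build_asinfo_args(host)
        if dry_run:
            print(" ".join(args))
        else:
            subprocess.run(args, check=True)

apply_to_all()  # dry run: prints the four commands without executing them
```

A value set this way does not survive a restart, so you would also want to add it to the service section of the configuration file.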

With a flaky network you could also increase both the interval and timeout values:

http://www.aerospike.com/docs/operations/configure/network/heartbeat/

Please see example below:

  heartbeat {
    ...
    interval 250                    # Number of milliseconds between heartbeats
    timeout 30                      # Number of heartbeat intervals to wait
    ...
  }
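To put numbers on this: since the timeout is counted in heartbeat intervals, a node is treated as gone after roughly interval × timeout milliseconds without heartbeats. A small Python sketch of the arithmetic (the function name is just for illustration):

```python
def detection_window_ms(interval_ms, timeout_intervals):
    """Approximate time without heartbeats before a node is considered gone."""
    return interval_ms * timeout_intervals

current = detection_window_ms(150, 10)   # settings from the first post
proposed = detection_window_ms(250, 30)  # settings from the example above

print(f"current:  {current} ms")   # 1500 ms
print(f"proposed: {proposed} ms")  # 7500 ms
```

So the example settings tolerate network hiccups of several seconds, at the cost of taking longer to notice a node that has really failed.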

#6

Thank you for the reply.

I understand that nodes are dunned automatically when paxos-recovery-policy=auto-dun-all is set.

In this case, does a dunned node come back to the cluster?


#7

That is correct. auto-dun-all in a small cluster could help.