Single network partitioned node and XDR recoveries

Detail

A node that loses visibility to all other nodes becomes a single-node cluster and can take ownership of all partitions.

When a node becomes a single-node cluster because of network connectivity issues, XDR recoveries on that node cause a jump in the reported XDR lag.

The node that becomes a single-node cluster suddenly owns partitions for which it was previously neither master nor replica. Aerospike does not maintain a per-partition last ship time (LST) for such partitions, so the node holds a very stale LST for them. The observed lag is measured approximately from the start time of the Aerospike service on the node. As a result, when XDR recoveries start for those partitions, nothing is actually shipped (the node has no data for those partitions), but the lag still spikes to match the time since startup.
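The effect can be illustrated with a small Python model. This is only a sketch under simplifying assumptions, not Aerospike's actual XDR implementation: the class `NodeModel`, its methods, and the rule "lag equals the current time minus the oldest LST among owned partitions" are hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class NodeModel:
    """Toy model of one node's per-partition last ship time (LST).

    Assumption: a partition the node never owned has no tracked LST,
    so it falls back to the service start time.
    """
    start_time: float
    lst: Dict[int, float] = field(default_factory=dict)

    def ship(self, partition: int, now: float) -> None:
        # Record that this partition was shipped at time `now`.
        self.lst[partition] = now

    def lag(self, owned: Iterable[int], now: float) -> float:
        # Simplified lag: distance from `now` to the oldest LST
        # among the partitions the node currently owns.
        oldest = min(self.lst.get(p, self.start_time) for p in owned)
        return now - oldest


# The service started at t=0 and the node has owned partitions 0 and 1.
node = NodeModel(start_time=0.0)
node.ship(0, now=9990.0)
node.ship(1, now=9995.0)

# Healthy cluster: the node only looks at partitions it already tracked.
lag_before = node.lag(owned=[0, 1], now=10000.0)

# Isolated node: it now also owns partitions 2 and 3, which have no LST.
lag_after = node.lag(owned=[0, 1, 2, 3], now=10000.0)
```

In this model `lag_before` is a few seconds, while `lag_after` equals the full time since the service started, even though partitions 2 and 3 hold no data to ship.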

A good practice is to set the min-cluster-size configuration parameter to prevent small sub-clusters from forming.
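As a rough illustration, min-cluster-size is set in the service context of aerospike.conf. The value below is an assumption chosen for a hypothetical three-node cluster, not a recommendation; pick a value that suits the actual cluster size and failure tolerance.

```
service {
    # Assumed value for a hypothetical 3-node cluster: a node that can
    # see fewer than 2 nodes (itself included) will not form a cluster.
    min-cluster-size 2
}
```

With a setting like this, a node that loses visibility to every other node would not form a single-node cluster, so it would not take ownership of partitions it has no LST for.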

Resolution

A fix has been implemented and released as part of versions 5.1.0.10 and 5.0.0.13:

[AER-6288] - (XDR) Keep the last ship time of non-owned partitions relatively current.

Non-owned partitions’ safe LST will be kept current using the partition’s last persisted LST. This will potentially reduce the number of records shipped in rare cases such as nodes becoming single-node clusters.

Keywords

XDR NETWORK PARTITIONING ORPHAN SINGLE NODE

Timestamp

September 2020

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.