Single network partitioned node and XDR recoveries

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Detail

A node that loses visibility of all the other nodes becomes a single-node cluster and can take ownership of all partitions.

When a node becomes a single-node cluster because of network connectivity issues, the XDR recoveries it runs cause a jump in XDR lag.

The node that becomes a single-node cluster suddenly owns partitions for which it was previously neither master nor replica. Aerospike does not maintain a per-partition last ship time (LST) for such partitions, so the node holds a very stale LST for them, approximately the start time of the Aerospike service on that node. When XDR recoveries start for those partitions, nothing is actually shipped (the node has no data for them), but the reported lag still spikes to reflect the time elapsed since the service started.
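The effect can be illustrated with a minimal sketch. This is a toy model and not Aerospike code: the PartitionState class, the observed_lag function and all the timestamps are hypothetical, and the lag is assumed to be driven by the oldest LST among the partitions under recovery.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionState:
    """Hypothetical per-partition view held by one node."""
    owned_before_split: bool
    last_ship_time: float  # LST, in seconds since the epoch


def observed_lag(partitions: List[PartitionState], now: float) -> float:
    """Assumed model: the lag is the age of the oldest LST being recovered."""
    return max(now - p.last_ship_time for p in partitions)


# Illustrative numbers only: the service started 30 days ago.
service_start = 1_000_000.0
now = service_start + 30 * 24 * 3600

# Partitions the node already owned have a recent LST.
previously_owned = [
    PartitionState(True, now - 2.0),
    PartitionState(True, now - 1.0),
]

# Partitions acquired when the node became a single-node cluster:
# their LST never advanced past the service start time.
newly_owned = [PartitionState(False, service_start)]

lag_before_split = observed_lag(previously_owned, now)
lag_after_split = observed_lag(previously_owned + newly_owned, now)

print(f"lag before split: {lag_before_split:.0f} s")
print(f"lag after split:  {lag_after_split:.0f} s")
```

In this model a single newly owned partition with a stale LST is enough to push the lag from a couple of seconds to the full uptime of the service, even though no records are shipped for it.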

A good practice is to use the min-cluster-size configuration parameter to prevent small sub-clusters from forming.

Resolution

A fix has been implemented and released as part of versions 5.1.0.10 and 5.0.0.13:

[AER-6288] - (XDR) Keep the last ship time of non-owned partitions relatively current.

Non-owned partitions’ safe LST will be kept current using the partition’s last persisted LST. This will potentially reduce the number of records shipped in rare cases such as nodes becoming single-node clusters.
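The idea behind the fix can be sketched as follows. This is a simplified illustration under stated assumptions and not the server's implementation: the refresh_safe_lst function, the dictionary layout and the timestamps are all hypothetical.

```python
from typing import Dict, Set


def refresh_safe_lst(
    local_lst: Dict[int, float],
    owned: Set[int],
    persisted_lst: Dict[int, float],
) -> Dict[int, float]:
    """Return a copy of local_lst in which non-owned partitions adopt the
    last persisted LST when it is newer than the local value.

    Assumption: persisted_lst maps a partition id to its last persisted LST.
    """
    updated = dict(local_lst)
    for pid, lst in local_lst.items():
        if pid not in owned:
            updated[pid] = max(lst, persisted_lst.get(pid, lst))
    return updated


# Illustrative values: partition 0 is owned, partition 1 is not.
local = {0: 100.0, 1: 5.0}       # partition 1 is stale (service start)
persisted = {1: 98.0}            # last persisted LST for partition 1
refreshed = refresh_safe_lst(local, owned={0}, persisted_lst=persisted)

now = 101.0
lag_without_fix = now - min(local.values())
lag_with_fix = now - min(refreshed.values())
print(lag_without_fix, lag_with_fix)
```

With the non-owned partition's LST kept close to current, a node that suddenly takes ownership of it starts recovery from a recent point rather than from the service start time.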

Keywords

XDR NETWORK PARTITIONING ORPHAN SINGLE NODE

Timestamp

September 2020