How to recover from long term Data Center outages in XDR

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Context

This article applies to Aerospike versions 3.13.0.6+ and 3.14.1.2+. When two Data Centers use XDR to synchronise one or more namespaces, one of the Data Centers may suffer an outage.
For short term outages, such as a network issue, XDR handles recovery automatically via link down processing.
When there has been a longer term outage during which the digest log has overflowed on the source, it may be necessary to take manual steps to recover.

When the digest log (which is a ring buffer) has overflowed, the following string can be found in the log:

Digest log file overflow. Losing XDR records.
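
To check whether this has happened, the server log can be searched for the message, for example (assuming the default log location; adjust the path for your installation):

grep "Digest log file overflow" /var/log/aerospike/aerospike.log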

Earlier versions would print the following string:

Digest log got wrapped. Continuing from the current start marker

The consequence of an overflow in the digest log is that the oldest entries are overwritten, and those changes may never be shipped (if no further updates to those records are made subsequently). Whenever either xdr_queue_overflow_error or dlog_overwritten_error is not zero, the destination Data Center may be missing some updates.
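
These counters can be checked on every node, for example by splitting the semicolon-separated output of the statistics/xdr info command into one statistic per line:

asadm -e "asinfo -v 'statistics/xdr'" | tr ';' '\n' | grep -E 'dlog_overwritten_error|xdr_queue_overflow_error'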

At this point the topology of XDR becomes significant. Consider the following two scenarios:

  • Case 1: active-passive topology, where Data Center 1 (DC1) ships changes to Data Center 2 (DC2).
  • Case 2: active-active topology, meaning that both Data Centers are shipping changes to each other.

In the following example, let’s assume DC2 has become unavailable and the digest log on DC1 has overflowed.

Before following either procedure, make sure the cluster is stable and there are no xdr_active_failed_node_sessions, as checked below.
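
For example, to check this statistic on every node:

asadm -e "asinfo -v 'statistics/xdr'" | tr ';' '\n' | grep 'xdr_active_failed_node_sessions'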

Case 1

  1. If DC2 is still down, bring it back up to force link down processing to kick in from DC1; otherwise, skip this step. This is required to move the cluster state from CLUSTER_DOWN to CLUSTER_WINDOWSHIP so that the digest log is drained completely.

  2. Disassociate DC2 from the NS1 namespace dynamically:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=remove'"

Since there is no destination to ship to for this namespace, the link down processing should finish quickly. If multiple namespaces are shipping to DC2, DC2 must be disassociated from all of those namespaces, as sketched below.
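
A minimal sketch for several namespaces (NS1 and NS2 are placeholders for your namespace names):

for ns in NS1 NS2; do
  asadm -e "asinfo -v 'set-config:context=namespace;id=${ns};xdr-remote-datacenter=DC2;action=remove'"
done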

  3. Wait for link down processing to complete. Check that xdr_active_link_down_sessions and xdr_ship_outstanding_objects both reach zero, for example with the polling loop below.
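
A simple polling sketch (the 5 second interval is illustrative, and the loop assumes one name=value pair per line after splitting on semicolons):

while asadm -e "asinfo -v 'statistics/xdr'" | tr ';' '\n' \
      | grep -E 'xdr_active_link_down_sessions=|xdr_ship_outstanding_objects=' \
      | grep -qv '=0$'; do
  sleep 5
done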

  4. Disable XDR shipping (this disables shipping to all DCs):

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=false'"

At this point XDR will not ship any incoming writes or deletes but it will still log them in the digest log.
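
To verify the setting has taken effect on every node (assuming xdr-shipping-enabled is reported under the xdr context of get-config, mirroring the set-config command above):

asadm -e "asinfo -v 'get-config:context=xdr'" | tr ';' '\n' | grep 'xdr-shipping-enabled'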

  5. Re-associate DC2 with the NS1 namespace:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=add'"
  6. On DC1, take a backup of any namespaces being shipped to DC2 using asbackup, for example as shown below.
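
A minimal asbackup sketch (the host address and backup directory are placeholders):

asbackup --host <DC1_node_IP> --namespace NS1 --directory /backup/NS1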

  7. Clean up the NS1 namespace on DC2 (using the truncate command) to remove any records whose deletes were missed by XDR when the digest log overflowed on DC1. An example is sketched below.
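
For example, with asadm connected to DC2 (this removes every record in NS1, so double-check which cluster you are connected to first):

asadm -e "asinfo -v 'truncate:namespace=NS1'"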

  8. Using asrestore, restore the backup taken in step 6 on DC2, for example as shown below.
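
Continuing the asbackup sketch above (the DC2 host address is a placeholder):

asrestore --host <DC2_node_IP> --directory /backup/NS1 --namespace NS1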

  9. Re-enable XDR shipping on DC1. Any write or delete transactions received in the meantime will have been held in the digest log and will now be shipped to DC2.

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=true'"

  10. Applications shouldn't immediately read from the cluster that has just been restored. It is better to wait until XDR catches up by making sure xdr_ship_outstanding_objects reaches zero on DC1.

Case 2

The XDR setup is active-active (DC1 ↔ DC2), one Data Center is down, and the xdr-digest-log has overflowed on the other. For ease of understanding, let's assume DC2 is down; note that in an active-active setup it can be either DC1 or DC2. The definition of a DC being down here is from XDR's perspective, not from the application's perspective.

  1. If DC2 is still down, bring it back up to force link down processing to kick in from DC1; otherwise, skip this step. This is required to move the cluster state from CLUSTER_DOWN to CLUSTER_WINDOWSHIP so that the digest log is drained completely.

  2. Disassociate DC2 from its namespace NS1 dynamically:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=remove'"

Since there is no destination to ship to for this namespace, the link down processing should finish quickly. If multiple namespaces are shipping to DC2, DC2 must be disassociated from all of those namespaces.

  3. Wait for link down processing to complete. Check that xdr_active_link_down_sessions and xdr_ship_outstanding_objects both reach zero.

  4. Disable XDR shipping (this disables shipping to all DCs):

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=false'"

At this point XDR will not ship any incoming writes or deletes but it will still log them in the digest log.

  5. Re-associate DC2 with the NS1 namespace:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=add'"

  6. On DC1, take a backup of any namespaces being shipped to DC2 using asbackup (see the example in Case 1).

  7. Stop application traffic on DC2. DC2 will be temporarily unavailable.

  8. Wait for xdr_ship_outstanding_objects to reach zero on DC2.

  9. Clean up the NS1 namespace on DC2 (using the truncate command) to remove any records whose deletes were missed by XDR when the digest log overflowed on DC1 (see the example in Case 1).

  10. Set enable-xdr to false on DC2 for that namespace (NS1) so that the asrestore writes are not logged to the digest log and hence are not shipped back to DC1 or any other destination DCs.

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;enable-xdr=false'"

  11. Using asrestore, restore the backup taken in step 6 on DC2 (see the example in Case 1).

  12. Once the restore on DC2 is complete, set enable-xdr back to true for that namespace (NS1):

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;enable-xdr=true'"

  13. Re-enable XDR shipping on DC1. Any writes and deletes received in the meantime will have been held in the digest log and will now be shipped to DC2.

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=true'"
  14. Applications shouldn't read from DC2 right away; it is better to wait until XDR catches up by making sure xdr_ship_outstanding_objects reaches zero.

  15. Start application traffic on DC2.

Applies To

Server versions prior to 5.0