How to recover from long term Data Center outages in XDR

Context

This article applies to Aerospike versions 3.13.0.6+ and 3.14.1.2+. In a scenario where two Data Centers use XDR to synchronise one or more namespaces, one of the Data Centers may suffer an outage.
For short term outages, such as a network issue, XDR recovers automatically via link down processing.
When there has been a longer term outage during which the digest log has overflowed on the source, it may be necessary to take manual steps to recover. When the digest log (which is a ring buffer) overflows, the following string can be found in the log:

Digest log got wrapped. Continuing from the current start marker
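
For example, assuming the default log location of /var/log/aerospike/aerospike.log (an assumption; your deployment may log elsewhere), you can check for the overflow message with:

grep "Digest log got wrapped" /var/log/aerospike/aerospike.log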

The consequence of an overflow in the digest log is that the oldest entries are overwritten, and the changes they recorded may never be shipped (unless further updates to the same records put new entries back into the digest log).

At this point the topology of XDR becomes significant. We will consider two cases.

  • Case 1: active/passive topology where Data Center 1 (DC1) ships changes to Data Center 2 (DC2).
  • Case 2: active/active topology, meaning that both Data Centers ship changes to each other.

In the following examples we will assume DC2 has become unavailable and the digest log in DC1 has overflowed.

Make sure the cluster is stable and there are no xdr_active_failed_node_sessions.
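
You can check this statistic across the cluster with asadm, for example (a sketch, assuming a version of asadm that supports the 'like' filter):

asadm -e "show statistics like xdr_active_failed_node_sessions"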

Case-1

  1. If DC2 is still down, get DC2 back up to force link down processing to kick in from DC1; otherwise, skip this step and proceed further. This is required because the cluster state has to move from cluster_down to cluster_window_ship, which is essential to drain the digest log completely.
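
You should be able to verify the DC state from DC1 with the dc info command (DC2 here is the datacenter name used in these examples):

asadm -e "asinfo -v 'dc/DC2'"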

  2. Disassociate DC2 from the NS1 namespace dynamically:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=remove'"

When we do this, since there is no longer a destination to ship to for this namespace, the link down processing should finish quickly. If multiple namespaces are shipping to DC2, you will have to disassociate DC2 from all of those namespaces.

  3. Wait for link down processing to complete. Check that xdr_active_link_down_sessions and xdr_ship_outstanding_objects drop to zero.
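
For example, the following commands poll those statistics across the cluster (again assuming asadm's 'like' filter):

asadm -e "show statistics like xdr_active_link_down_sessions"
asadm -e "show statistics like xdr_ship_outstanding_objects"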

  4. Disable XDR shipping (this disables shipping for all DCs):

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=false'"

At this point XDR will not ship any incoming writes or deletes but will still log them in the digest log.

  5. Re-associate DC2 with the NS1 namespace:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=add'"
  6. On DC1, take a backup of any namespaces being shipped to DC2 using asbackup.
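
A minimal asbackup sketch, assuming a DC1 seed node of dc1-node1 and a backup directory of /backup/NS1 (both hypothetical):

asbackup -h dc1-node1 -n NS1 -d /backup/NS1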

  7. Clean up the NS1 namespace on DC2 (using the truncate command) to make sure we truncate all those records whose deletes were missed by XDR when the digest log overflowed on DC1.
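
For example, to truncate the whole NS1 namespace, run against a DC2 node (dc2-node1 is a hypothetical hostname; the truncate info command is available in the server versions this article covers):

asadm -h dc2-node1 -e "asinfo -v 'truncate:namespace=NS1'"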

  8. Using asrestore, restore the backup taken in step 6 on DC2.
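
A matching asrestore sketch, assuming a DC2 seed node of dc2-node1 (hypothetical) and the backup directory used above:

asrestore -h dc2-node1 -n NS1 -d /backup/NS1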

  9. Re-enable XDR shipping from DC1. Any writes and deletes received in the meantime have been held in the digest log and should now be shipped to DC2:

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=true'"

  10. Do not read from the restored cluster right away; wait until XDR catches up. Make sure xdr_ship_outstanding_objects gets to zero on DC1.

Case-2

The XDR setup is active/active (DC1 <-> DC2), DC2 is down, and the XDR digest log has overflowed on DC1. For ease of understanding we assume DC2 is the downed cluster; in an active/active setup it can be either DC1 or DC2. The definition of a DC being down here is from XDR's perspective, not from the application's perspective.

  1. If DC2 is still down, get DC2 back up to force link down processing to kick in from DC1; otherwise, skip this step and proceed further. This is required because the cluster state has to move from cluster_down to cluster_window_ship, which is essential to drain the digest log completely.

  2. Disassociate DC2 from its NS1 namespace dynamically:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=remove'"

When we do this, since there is no longer a destination to ship to for this namespace, the link down processing should finish quickly. If multiple namespaces are shipping to DC2, you will have to disassociate DC2 from all of those namespaces.

  3. Wait for link down processing to complete. Check that xdr_active_link_down_sessions and xdr_ship_outstanding_objects drop to zero (the same commands as in Case-1, step 3, apply).

  4. Disable XDR shipping (this disables shipping for all DCs):

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=false'"

At this point XDR will not ship any incoming writes or deletes but will still log them in the digest log.

  5. Re-associate DC2 with the NS1 namespace:

asadm -e "asinfo -v 'set-config:context=namespace;id=NS1;xdr-remote-datacenter=DC2;action=add'"
  6. On DC1, take a backup of any namespaces being shipped to DC2 using asbackup (see the asbackup example under Case-1).

  7. Stop application traffic on DC2. DC2 will be temporarily unavailable.

  8. Wait for xdr_ship_outstanding_objects to reach zero on DC2.
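
For example, pointing asadm at a DC2 node (dc2-node1 is a hypothetical hostname):

asadm -h dc2-node1 -e "show statistics like xdr_ship_outstanding_objects"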

  9. Clean up the NS1 namespace on DC2 (using the truncate command, as in Case-1) to make sure we truncate all those records whose deletes were missed by XDR when the digest log overflowed on DC1.

  10. Using asrestore, restore the backup taken in step 6 on DC2.

  11. Re-enable shipping on DC1. Entries were held in the digest log while shipping was disabled, so no deletes should be missed:

asadm -e "asinfo -v 'set-config:context=xdr;xdr-shipping-enabled=true'"

  12. You shouldn't read from the DC2 cluster right away. Wait until XDR catches up by making sure xdr_ship_outstanding_objects is zero.

  13. Start application traffic on DC2.