XDR Recovery vs. tomb-raider-eligible-age

Dear Team, our configuration is: tomb-raider-eligible-age = 6 hours.

We have the following scenario. Case 1: A site (with multiple Aerospike nodes in a cluster) is in service, and some nodes of the cluster are down for more than 6 hours. It is clear that we need to follow the documented procedure: take down the node, wipe out the disks on this node, and restart the node.

Can you please re-confirm?

Case 2: Two geo-redundant sites keep their databases in sync with each other through XDR replication. One of the two sites is completely down, for example for a planned maintenance window (MW), for more than 6 hours. What do we need to do before the restoration, considering that:

  1. XDR is still accumulating digests on the remote site and has not yet overflowed? With respect to tomb-raider-eligible-age = 6 hours, how should we recover this local site keeping the 6-hour tombstone configuration in mind?
  2. XDR has overflowed on the remote site?

Please suggest!

Thanks Asif

Team, can you please respond to the above query?

Thanks in advance !

Regards Asif

On case 1, and independent of XDR, it is recommended to wipe out the node that was out past the tomb-raider-eligible-age: https://www.aerospike.com/docs/reference/configuration/index.html#tomb-raider-eligible-age
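If it helps, here is a minimal sketch of that wipe for a single node, assuming systemd service management and a namespace stored on a raw SSD at /dev/sdb (both are placeholders; for a file-backed namespace you would remove the namespace's data file instead):

```
sudo systemctl stop aerospike    # take the node out of the cluster
sudo blkdiscard /dev/sdb         # wipe the namespace device (or: dd if=/dev/zero of=/dev/sdb bs=1M)
sudo systemctl start aerospike   # node cold-starts empty and is repopulated by migrations
```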

For case 2, which version of XDR are you running? The latest XDR no longer has a digestlog, and versions 5 and above support rewinding shipping to a DC to a point prior to the maintenance window. See: How to rewind XDR for a namespace
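As a rough sketch of what the rewind looks like on 5.x, assuming a destination DC named DC2 and a namespace named test (both placeholders; the exact procedure and caveats are in the linked article):

```
# Dynamically remove the namespace from the DC, then re-add it with a rewind
asinfo -v 'set-config:context=xdr;dc=DC2;namespace=test;action=remove'
asinfo -v 'set-config:context=xdr;dc=DC2;namespace=test;action=add;rewind=all'
# 'rewind' also accepts a number of seconds instead of 'all'
```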

On older versions and for these cases:

  1. XDR is still accumulating digests on the remote site and has not yet overflowed? With respect to tomb-raider-eligible-age = 6 hours, how should we recover this local site keeping the 6-hour tombstone configuration in mind?

To recover the local site, and since all nodes were down past 6 hours, you could remove the digestlog on all the nodes prior to bringing that cluster up and re-sync using one of the methods from the guide below.

  2. XDR has overflowed on the remote site?

With the XDR digestlog overflowed (no longer an issue on XDR 5, which has no digestlog), the clusters will probably be out of sync. The following guide should help with the steps to recover both sides:

Since you are an Enterprise user, it may be best to open a support case for faster responses.

Thanks a lot for the feedback! I'm not sure I got the correct version of XDR; I used the following command and got this output:

Admin> summary -l

Cluster
  1. Server Version : E-4.5.2.6
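In case it is useful, the build can also be read directly from a node with asinfo; this is just another way to confirm the version:

```
asinfo -v build
```

On this cluster it should return 4.5.2.6, matching the E-4.5.2.6 reported by asadm.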

"To recover the local site, and since all nodes were down past 6 hours, you could remove the digestlog on all the nodes prior to bringing that cluster up and re-sync using one of the methods from the guide below."

  1. So, you mean removing the digestlog from the local nodes that we are trying to recover? Do we have a command for that, or is it a manual step?
  2. You mentioned a guide. Can you share which one, and which section we need to follow?

For support, do you mean reaching out to helpdesk@aerospike.com?

4.5 is an older build with a digestlog. I'd recommend trying out the newer 5.x versions, which no longer use a digestlog.

Assuming there was no lag when you shut down the entire cluster, you can remove the digestlog by simply deleting it with rm after having stopped the nodes for the more-than-6-hour maintenance. This prevents any re-shipping from the digestlog when the nodes are started:

sudo rm /opt/aerospike/digestlog
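Put together, a minimal per-node sketch, assuming systemd service management and the default digestlog location (check digestlog-path in your aerospike.conf for the actual file):

```
# On each node of the cluster being recovered, before bringing it back up:
sudo systemctl stop aerospike      # no-op if the node is already stopped for the maintenance
sudo rm /opt/aerospike/digestlog   # path comes from digestlog-path in aerospike.conf
sudo systemctl start aerospike     # starts with an empty digestlog, so nothing is re-shipped
```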

If the digestlog overflowed or any cluster becomes out of sync, sections 3 to 7 of this guide should help.

There is also this article for general XDR full cluster maintenance.

Yes, if you are an EE customer you can get support from that address: helpdesk@aerospike.com

Hi, looking at this post I can see that, if the duration of the DB outage exceeds tomb-raider-eligible-age, it is mandatory to cold-start/clean the affected node(s). I understand that https://www.aerospike.com/docs/reference/configuration/#tomb-raider-eligible-age proposes this approach, but at the same time the following link, https://aerospike.com/docs/guide/durable_deletes.html#tombstone-management, clearly states that a special background mechanism ("Tomb-Raider") is used to remove no-longer-needed tombstones. The conditions for a tombstone to be removed are as follows:

  • There are no previous copies of the record on disk.
    • This condition assures that a cold start will not bring back any older copy.
  • The tombstone's last-update-time is before the current time minus the configured tomb-raider-eligible-age.
    • This condition prevents a node that has been apart from the cluster for tomb-raider-eligible-age seconds from rejoining and re-introducing an older copy.
  • The node is not waiting for any incoming migration.

If all conditions are satisfied, the tombstone will be reclaimed.
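For reference, the configured value and the current tombstone count can be checked on a node with something like the following (a minimal sketch, assuming a namespace named test; adjust to your namespace name):

```
asinfo -v 'get-config:context=namespace;id=test' -l | grep tomb-raider
asinfo -v 'namespace/test' -l | grep tombstones
```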

So my question is: why do we need to wipe out the data? If tombstones are working as expected, they should be removed only when it is safe to do so. Still, it looks like bringing a node back after tomb-raider-eligible-age has passed causes issues. Which issues exactly would we face in that case?

Thanks Mike

Consider the case where a record was durably deleted while a node was out of the cluster, and the node remains out of the cluster for at least tomb-raider-eligible-age. In the period after tomb-raider-eligible-age has elapsed and before the node returns, the replicas holding the tombstone scan their disks, find no remaining copies of the durably deleted record, and delete their local tombstone. Now the node that was down rejoins the cluster, and it has a copy of the record from before the durable delete. The cluster, having already removed the tombstones, no longer has any knowledge of this record being deleted and allows it to be replicated within the cluster.
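As a concrete illustration of that sequence with tomb-raider-eligible-age = 6 hours (the timestamps are hypothetical):

  • 10:00 – node X goes down while holding a copy of record R.
  • 11:00 – R is durably deleted; the remaining replicas write a tombstone for R.
  • 17:00 onwards – the tomb-raider on the remaining nodes finds no older copy of R on disk, sees that the tombstone's last-update-time is more than 6 hours old and that no migrations are pending, and removes the tombstone.
  • 18:00 – node X rejoins with its pre-delete copy of R; with no tombstone left anywhere, migrations replicate R back and the durable delete is effectively undone.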


Thanks a lot for the clear explanation, really appreciated! Mike

Since you have been so kind, I would like to ask more about the initial Case 2.1, where one XDR site goes down cleanly, with no lag, so basically no information is needed from the local digestlogs.

We keep this one XDR site down for more than the configured tombstone eligible age, and in the meantime the digestlogs on the healthy site keep accumulating WITHOUT overflowing. After this interval we bring the XDR site back up.

At this point, locally we don't have issues with tombstone deletion, since all nodes are aligned and no conflict should be in place, while the remote site should start applying the XDR digestlogs entirely, including the more recent tombstones and durable deletes. So in this case, where could the out-of-sync condition possibly come from?

Thanks again! Mike

I would expect tombstones created and subsequently cleared while the remote site was down not to be replicated. The digestlog cannot ship a tombstone that has already been deleted.

Thanks a lot. If the digestlogs are also clearing the tombstones, then it makes perfect sense. Appreciated, Mike