XDR ships older version of records when node restarts

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

XDR can ship older versions of some records when a node restarts

Problem Description

In some older versions of Aerospike, records on an XDR destination cluster may have their value reverted to an older version after nodes in the source cluster are restarted.

Explanation

When a node with XDR enabled is restarted, it always resumes by re-processing the last 5 minutes of its digest log. Log messages similar to the following will be observed:

Aug 14 2019 13:17:28 GMT: INFO (xdr): (xdr.c:837) Starting XDR with resume ... to ship 12 outstanding log records
...
...
...
Aug 14 2019 13:17:29 GMT: INFO (as): (as.c:445) service ready: soon there will be cake!
Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_serverside.c:153) XDR last ship time of this node for DC 0 went back to 1565788311404 from 1565788649324
Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_handlers.c:190) replication service ready: and now you have icing!
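
To gauge how far a node rewound on restart, the two values in the "went back" message can be subtracted. The following is a minimal Python sketch, not an Aerospike tool. It assumes the message keeps the wording shown in the sample above and that both values are epoch milliseconds (they line up with the log timestamps in the sample). The function name rewind_seconds is hypothetical.

```python
import re
from typing import Optional

# Pattern modeled on the sample log line above (assumption: other
# server versions may word this message differently).
REWIND_RE = re.compile(
    r"XDR last ship time of this node for DC (\d+) "
    r"went back to (\d+) from (\d+)"
)


def rewind_seconds(log_line: str) -> Optional[float]:
    """Return how far the last ship time moved back, in seconds.

    Returns None when the line is not a "went back" message.
    Assumes both values are epoch milliseconds.
    """
    match = REWIND_RE.search(log_line)
    if match is None:
        return None
    new_ms = int(match.group(2))  # value after the rewind
    old_ms = int(match.group(3))  # value before the restart
    return (old_ms - new_ms) / 1000.0


line = (
    "Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_serverside.c:153) "
    "XDR last ship time of this node for DC 0 "
    "went back to 1565788311404 from 1565788649324"
)
print(rewind_seconds(line))  # 337.92
```

For the sample line this gives 337.92 seconds, a little over 5 minutes, which is consistent with the resume window described above.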

It is therefore possible for records that were updated while the restarted node was out of the cluster to have a previous version shipped. This happens if the restarted node re-processes the digests of such records (from its digest log) before migrations complete, because the copy it holds locally has not yet been brought up to date.

The number of affected records can be much higher if there is XDR lag when the node is restarted.

Solution

There are two potential approaches to work around this behavior:

  1. Stop the Aerospike process on the node, wait for the failed node processing to finish on the other nodes in the cluster, delete the digest log, and finally restart the Aerospike process.

  2. Set xdr-shipping-enabled to false in the config file on the node which is being restarted, and then dynamically set it to true once migrations have completed.
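
The gating step in the second approach (re-enable shipping only after migrations have completed) can be scripted. The sketch below is a hedged illustration, not a supported tool. It assumes the node's statistics are available as a "key=value;key=value" string, that migrate_partitions_remaining is the statistic to watch, and that the asinfo command shown is the correct dynamic form for the installed version. All three should be verified against the documentation for the version in use. The function names are hypothetical.

```python
def parse_info_pairs(response: str) -> dict:
    """Split a 'key=value;key=value' response into a dict of strings."""
    pairs = {}
    for item in response.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs[key] = value
    return pairs


def migrations_complete(stats_response: str,
                        stat: str = "migrate_partitions_remaining") -> bool:
    """True only when the statistic is present and equal to zero.

    The default statistic name is an assumption; a missing statistic
    is treated as "not complete" so shipping is never enabled by accident.
    """
    value = parse_info_pairs(stats_response).get(stat)
    return value is not None and int(value) == 0


def enable_shipping_command() -> str:
    # Assumed asinfo form for the XDR context in these older versions;
    # confirm it against the configuration reference before running it.
    return 'asinfo -v "set-config:context=xdr;xdr-shipping-enabled=true"'


# Usage with a made-up statistics string (illustrative values only).
sample = "cluster_size=3;migrate_partitions_remaining=0;objects=1000"
if migrations_complete(sample):
    print(enable_shipping_command())
```

Treating a missing statistic as "not complete" is a deliberate choice: the safe failure mode here is to leave shipping disabled a little longer rather than ship stale records.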

Notes

This issue is fixed in the following versions of Aerospike:

  • 4.7.0.2 onwards
  • 4.6.0.4
  • 4.5.3.6
  • 4.5.2.6
  • 4.5.1.11
  • 4.5.0.15
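
When auditing a fleet, the list above can be turned into a simple check. This Python sketch is illustrative only. It assumes four-part version strings and assumes that later hotfixes on each listed branch (for example 4.5.1.12) retain the fix, which the list does not state explicitly. It returns False for any version the list does not confirm. The name has_fix is hypothetical.

```python
# Minimum hotfix number per listed branch, taken from the list above.
FIXED_IN_BRANCH = {
    (4, 6, 0): 4,
    (4, 5, 3): 6,
    (4, 5, 2): 6,
    (4, 5, 1): 11,
    (4, 5, 0): 15,
}
# "4.7.0.2 onwards"
FIXED_FROM = (4, 7, 0, 2)


def has_fix(version: str) -> bool:
    """Return True if the list above confirms the fix for this version."""
    parts = tuple(int(p) for p in version.split("."))
    if len(parts) != 4:
        raise ValueError(f"expected a four-part version, got {version!r}")
    if parts >= FIXED_FROM:
        return True
    # Assumption: later hotfixes on a listed branch keep the fix.
    minimum = FIXED_IN_BRANCH.get(parts[:3])
    return minimum is not None and parts[3] >= minimum


for v in ("4.7.0.2", "4.5.1.11", "4.5.1.10", "4.4.0.1"):
    print(v, has_fix(v))
```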

Keywords

XDR FAILED NODE SHIPPING OLDER VERSION MIGRATE MIGRATION

Timestamp

August 2019

Can you use node quiescing instead? Once the quiesced node stops taking any more writes and has finished shipping its XDR backlog (dlog_outstanding at zero), shut it down and delete its digest log before starting it again. (i.e. you don't have to worry about the failed-node-handler or migrations, and this can be scripted.)