XDR ships older versions of records when a node restarts

XDR can ship older versions of some records when a node restarts

Problem Description

Some records on an XDR destination cluster have their value reverted to an older version after nodes at the source cluster are restarted.

Explanation

When a node with XDR enabled is restarted, it always resumes shipping by re-processing the last 5 minutes of its digest log. Log messages similar to the following are written:

Aug 14 2019 13:17:28 GMT: INFO (xdr): (xdr.c:837) Starting XDR with resume ... to ship 12 outstanding log records
...
...
...
Aug 14 2019 13:17:29 GMT: INFO (as): (as.c:445) service ready: soon there will be cake!
Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_serverside.c:153) XDR last ship time of this node for DC 0 went back to 1565788311404 from 1565788649324
Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_handlers.c:190) replication service ready: and now you have icing!
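To see how far back a node rewound, the two numbers in the "went back to" message can be compared. The sketch below is only an illustration: it assumes both values are epoch timestamps in milliseconds (which is consistent with the log line's own wall-clock time), and the helper names rewind_window and to_utc are hypothetical, not part of any Aerospike tool.

```python
import re
from datetime import datetime, timezone

# Log line copied from the example above.
LINE = ("Aug 14 2019 13:17:30 GMT: INFO (xdr): (xdr_serverside.c:153) "
        "XDR last ship time of this node for DC 0 went back to "
        "1565788311404 from 1565788649324")

PATTERN = re.compile(r"went back to (\d+) from (\d+)")


def rewind_window(line):
    """Return (new_ms, old_ms, rewind_ms) from a 'went back to' line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    new_ms, old_ms = int(match.group(1)), int(match.group(2))
    return new_ms, old_ms, old_ms - new_ms


def to_utc(ms):
    """Assumption: the logged value is milliseconds since the Unix epoch."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


new_ms, old_ms, rewind_ms = rewind_window(LINE)
print("rewound from", to_utc(old_ms).strftime("%H:%M:%S"),
      "to", to_utc(new_ms).strftime("%H:%M:%S"),
      "(%.2f seconds)" % (rewind_ms / 1000))
```

For the line above, the difference is 337920 ms, a little over 5.6 minutes, which is in line with the roughly 5-minute replay window described in this article.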

Records that were updated while the restarted node was out of the cluster can therefore have a previous version shipped: if the restarted node re-processes the digests of such records (from the digest log) before migrations complete, it ships the older copy it still holds.

The number of affected records could be much higher if XDR is lagging (that is, the digest log has a backlog of unshipped entries) when the node is restarted.
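The sequence described above can be pictured with a toy model. This is not Aerospike code: plain dictionaries stand in for record storage, integers stand in for record versions, and the function name replay_digest_log is made up for the illustration.

```python
# Toy model only: nothing here calls Aerospike or reflects its internals.

def replay_digest_log(digest_log, local_copy):
    """Ship whatever version the node currently holds for each logged digest."""
    return {digest: local_copy[digest]
            for digest in digest_log if digest in local_copy}


# Versions held by the rest of the source cluster: "k1" was updated
# while the node was down.
cluster_copy = {"k1": 2, "k2": 1}

# The restarted node still holds the pre-restart versions until
# migrations bring it up to date.
stale_local_copy = {"k1": 1, "k2": 1}

# Digests found in the replayed window of the digest log.
digest_log = ["k1", "k2"]

shipped_before_migrations = replay_digest_log(digest_log, stale_local_copy)
shipped_after_migrations = replay_digest_log(digest_log, cluster_copy)

# Records whose shipped version is older than the cluster's current version.
reverted = [digest for digest, version in shipped_before_migrations.items()
            if version < cluster_copy[digest]]
print(reverted)
```

In this model only "k1" is reverted at the destination, and only when the replay happens before migrations complete; a longer replay window (for example, because of lag) simply puts more digests in digest_log.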

Solution

There are 2 potential approaches to work around this behavior:

  1. Stop the Aerospike process on the node, wait for the failed node processing to finish on the other nodes in the cluster, delete the digest log, and finally restart the Aerospike process.

  2. Set xdr-shipping-enabled to false in the config file on the node that is being restarted, and then dynamically set it to true once migrations have completed.
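The second approach could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes the asinfo tool is available on the node, that the migrate_partitions_remaining statistic is reported by the server version in use, and that xdr-shipping-enabled is set dynamically in the xdr context. The function names are hypothetical, and the check only looks at the local node, so cluster-wide migration state should be confirmed separately.

```python
import subprocess
import time


def asinfo(command):
    """Run one info command against the local node using the asinfo CLI."""
    result = subprocess.run(["asinfo", "-v", command],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def parse_stats(raw):
    """Turn a 'name=value;name=value' info response into a dictionary."""
    return dict(pair.split("=", 1) for pair in raw.split(";") if "=" in pair)


def migrations_done(run_info=asinfo):
    stats = parse_stats(run_info("statistics"))
    # Assumption: this statistic reports the partitions still to be migrated
    # for this node on the server version in use.
    return stats.get("migrate_partitions_remaining") == "0"


def enable_shipping_after_migrations(run_info=asinfo, poll_seconds=10,
                                     sleep=time.sleep):
    """Poll until migrations are done, then turn XDR shipping back on."""
    while not migrations_done(run_info):
        sleep(poll_seconds)
    # Assumption: the parameter is dynamic and lives in the xdr context.
    return run_info("set-config:context=xdr;xdr-shipping-enabled=true")
```

After restarting the node with xdr-shipping-enabled set to false in its config file, calling enable_shipping_after_migrations() on that node would wait for its migrations and then re-enable shipping. The run_info and sleep parameters exist only so the logic can be exercised without a running cluster.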

Notes

A fix for this behavior is tracked internally under Jira AER-6098.

Keywords

XDR FAILED NODE SHIPPING OLDER VERSION MIGRATE MIGRATION

Timestamp

August 2019

Could node quiescing be used instead? Once the quiesced node stops taking writes and has finished shipping its XDR backlog (dlog_outstanding at zero), shut it down and delete its digest log before starting it again. That way there is no need to worry about the failed node handler or migrations, and the procedure can be scripted.
