FAQ - Some common questions on XDR


#1

Summary

XDR (Cross-Datacenter Replication) is a feature of the Enterprise edition of Aerospike which allows one cluster to synchronize with another cluster over a longer delay link. This covers some FAQ on XDR management. For more information on XDR, please see XDR Architecture.

1) How can we detect XDR conflicts or failure in case of Active-Active setup where both source and destination are receiving writes?

Unfortunately, there are too many variables in this scenario and there is no way to detect that a conflict happened with complete reliability. A datamodel level workaround is suggested for customers who are worried by the idea of keeping the records in two different sets/namespaces, where each DC writes to its own set/ns. This would avoid real conflict. However, the user would need to read from both the records and merge them using their business logic as and when necessary.

2) If an equivalent generation count is observed in both the source cluster and destination cluster, which wins: XDR or local cluster?

XDR does not check the generation of records when writing at a destination. Older versions of XDR (prior to 3.8) had a configuration which would force XDR writes with generation check (xdr-forward-with-gencheck when set to true).

There were complications with using the generation check in some scenarios including cases with deletes and re-creation of records, therefore this configuration option was deprecated from Aerospike 3.8.1 onwards.

Whenever XDR writes to it’s destination, it will overwrite the destination copy of the record (updating the meta-data of the record such as TTL, generation number, etc).

XDR version 3.8.3 and above supports a configuration to only ship the bins that have been updated. The option is xdr-ship-bins.

3) Does XDR preserve the generation of records that are shipped?

The generation of records are independent on each cluster. Here are a couple of examples to illustrate situations where the generation will not match between a source and destination cluster:

  1. XDR has a hotkey de-duplication mechanism (refer to xdr-hotkey-time-ms for details). Therefore, multiple updates at a source cluster (multiple generation increments) could result in much fewer updates at the destination (fewer generation increments).

  2. XDR may not have been setup initially on a source cluster. In such situation, the first time a record is processed and shipped from the source cluster (for example with a generation greater than 1), it will be written up with a generation of 1 on the destination cluster (assuming the record didn’t exist there previously).

  3. In some edge cases, if XDR times-out while processing a record, the timeout could occur ‘on the way back’ or after the transaction has been successfully applied at destination side. In such cases, the transaction will be relogged and retried, resulting in an extra generation increment at the destination compared to the source.

4) What does XDR do at the point of failure of a node?

No impact is seen on the source or the destination cluster in the event of a failure of one of the nodes. In the event of a node failure on the destination, XDR automatically identifies the remaining nodes and continues to ship data. The data is rebalanced once the node returns to the cluster. See here for further details.

5) What is the impact of source or target migration on XDR?

There is no known impact on XDR when the source or the destination is undergoing migration. If there is a lag and data has migrated to a different node by the time the digest log is processed, the digestlog entry will be relogged to the node currently owning the record. For XDR versions prior to 3.8, for one-off situations when there is a huge lag built up and the cluster goes into migration, you have the option of switching to single-read XDR mode to be able to proxy to get the data shipped as expected. This is because, by default, XDR uses batch reads which do not proxy during migrations.

6) How should the digest log be sized?

A digest log entry is 80 bytes, the digest log will hold both master and replica records (the replica records will be shipped in case of a node failure at the XDR source cluster). The digest log size will have to be determined based on the expected throughput across the cluster. When the digest log fills up, the oldest entries will be overwritten and may not be shipped if they do not occur subsequently in the digest log.

7) The digest log is sitting on the root filesystem on an EBS drive that is 80GB in size. Will there be any performance issues regarding IOPS or anything else that could harm the cluster/XDR? Will it be better to use another drive for digest log?

Reads and writes to the digestlog are done in batches of 100, so not every transaction would result in an io operation. At high throughput there will still be potentially high io activity to the disk where the digestlog is persisted, despite the advantages of the file system cache. It is good practice, for high throughput workloads to monitor the key system metrics, and, if the digest log io is a potential concern to monitor this with iostat.

8) How can monthly XDR traffic be estimated?

The xdr_ship_bytes metric can be used. The total shipped in 30 days can be calculated by taking two snapshots of the metric from all nodes at the beginning and end of the month (or on a daily basis and extrapolate). This can also be done on a per DC basis uisng dc_ship_bytes.

9) Is it possible to find the number of writes on a cluster that are initiated by an XDR client, aside from application writes?

The TPS obseved on the cluster represents all client writes, this includes any writes originating from an XDR client. If a cluster is an XDR destination and clients are also writing to it then the write TPS observed includes both XDR and client writes.

To differentiate between XDR and client writes, refer to the following 2 statistics:

10) Is it possible to specify which exact records XDR ships?

At the time of writing, it is not possible to specify shipping at the record level. Sets can be used to ship only specific records. The following article gives more details.

11) Do the source cluster nodes and destination cluster nodes connect to each other n-n? What happens if one destination node cannot be reached for connectivity or firewall reasons?

XDR functions as a normal Aerospike client. If one of the nodes at the destination goes down, it will trigger a reclustering and using the new partition map, the xdr client will ship the records to the correct cluster nodes. There is no one-to-one correspondence between nodes on the local cluster and nodes on the remote cluster. Every master node of the local cluster can write to any remote cluster node. Like any other client, XDR writes a record to the remote master node which then takes care of any replica writes.

If a destination node goes down, the following errors and warnings will be shown on the source cluster.

Mar 15 2018 09:21:24 GMT: WARNING (xdr): (xdr_ship.c:825) Marking cluster DC2 as suspicious because of a connection error.
Mar 15 2018 09:21:26 GMT: WARNING (xdr): (as_cluster.c:560) Node FA1BCEAAC270008 peers refresh failed: AEROSPIKE_ERR_CONNECTION Bad file descriptor
Mar 15 2018 09:21:26 GMT: INFO (xdr): (as_cluster.c:354) Remove node FA1BCEAAC270008 10.0.0.1:3000

XDR will continue to ship to the newly reformed cluster, like a normal client. If a destination node is unreachable due to connectivity or firewall issues, the writes/ships for that node (master for that record) will timeout and will be relogged in the digest log. In this case, the following warnings/errors on the source cluster can be observed and XDR will throttle for a 30 second period:

Mar 15 2018 10:19:48 GMT: WARNING (xdr): (xdr_ship.c:844) Marking cluster DC2 as suspicious because of a timeout.
Mar 15 2018 10:19:49 GMT: INFO (xdr): (as_cluster.c:536) Node 13898BE4CA005452 refresh failed: AEROSPIKE_ERR_TIMEOUT 

12) Which records will be shipped by XDR?

At XDR startup, the local cluster discovers namespaces that must be replicated to the remote cluster. XDR ships every write that happens after it is set-up and enabled. Any writes that happened prior to setting up xdr will not be shipped as they are not present in the digest log.

Once XDR has been enabled and configured, it will keep track of any writes taken during a destination cluster outage via the link down handler / window shipper thread, but it will not be able to replicate records that are not touched/written/updated before it has been setup.

13) What’s the recommended way to populate a new cluster with existing data (prior to xdr setup)?

To populate a new cluster with records that were written prior to setting XDR up, refer to the following article:

14) Can XDR be used to ship from one namespace to another namespace of the remote DC?

No, it is not possible at this time. You can pre-populate data using asrestore which will allow use of a different namespace.

15) Can XDR be used to ship from one cluster to another cluster with different version or different hardware?

XDR is just a client running within the server code and writing to remote destination clusters. It is possible to have different hardware (cloud or baremetal) or versions. For cloud to non-cloud, you may have to consider the transfer cost.

16) Are there any metrics to show how many records were written by XDR on a specific DC for a specific namespace?

There are no such direct metrics from a source XDR node, but there is a statistic on a node receiving XDR traffic. The xdr_write_success statistic is the number of records which are written by XDR on a namespace on a node.

17) What is the effect on XDR of node shutdowns in a high lag scenario?

Caution should be exercised when shutting down nodes when there is a high xdr_timelag. The lag should be allowed to drop to 0 before a node is removed from the cluster permanently. The reason for this is that migration writes between nodes are not entered in to the digest logs of those nodes and so the nodes owning the master or replica partitions must ship them. If a node leaves the cluster failed node processing will ensure that replica records for the partitions for which the removed node was master will be shipped to the destination, other nodes (which are neither master nor replica) cannot ship these records as they would be migrations and, as stated above, not present in digest log.

Keywords

XDR TTL GENERATION SOURCE DESTINATION

Timestamp

01/08/2018


Merging Bins during XDR?