The different delays a transaction is subject to in XDR 5.0
XDR was completely redesigned in version 5.0. One of the biggest change, if not the biggest one, is the removal of the ‘on file’ digestlog. This removal enabled:
- streamlined configuration
- higher efficiency and performance by decoupling shipments of records across different destination clusters
- new features to manage and keep namespaces ‘in sync’
This article focuses on the different delays a transaction would be subject to in XDR 5.0.
Note: XDR replicates records asynchronously by design. For synchronous replication, Aerospike supports multi-site clustering. The Active-Active Capabilities in Aerospike Database 5 blog post is a good read on this topic.
As a quick reminder, here are the main points of delay in the previous XDR:
- buffering writes and reads of digestlog entries (
- hotkey de-duplication delays through a cache (
- lockstep shipping across destination data centers (which could cause unexpected delays against otherwise healthy destinations)
The 5.0 XDR has the following parameters influencing how fast records get asynchronously shipped (some are not configurable in version 5.0 but would be in the upcoming version 5.1 (at the time of writing this article) .
This article will go over those different parameters in details and explain the relationship between them.
period-ms – DC thread’s period
period-ms delay is the run period of the XDR DC Manager thread. Its default is set to 100ms. The XDR DC thread is responsible for sequentially processing all the partitions at the source node for a specific destination cluster (referred to as DC or data center). Processing a partition typically consists of processing all pending entries in the XDR transaction queues (one per partition) but could also process entries from retry queues (if any pending in those) or instructing the DC recovery thread to proceed with the recovery process for relevant partitions (if recoveries are ongoing – when XDR is not able to keep up for a potential variety of reasons).
This would be the typical source of any delay in shipping a write transaction. As the DC thread goes sequentially through each partition, based on when in a cycle a write transaction occurs, the DC thread could process the transaction quasi instantly (based on the other parameters discussed in this article) but could also process it only on the next cycle which should be up to
Ignoring other parameters at this point, the time for a write transaction to be processed by the DC thread would be: between 0 and
period-ms milliseconds (100ms by default)
This is only valid if the DC thread is able to process all partitions faster than the configured
period-ms. If the DC thread
lap_us(which measures, in microseconds, how long a full lap takes) is higher than the
period-ms, it would cause extra delays. Being in
recoveries_pendingmode or being throttled due to entries pending in the retry queue would also cause extra delays.
period-msconfigured too low may cause too many passes without much actual processing being done and could adversely impact the overall performance of a node due to spending extra unnecessary CPU cycles. A minimum of 25ms would typically be recommended for use cases requiring less than 100ms.
When measuring the overall time it takes for a record to reach a destination after it has been written at the source (based on the last update time (LUT), there may be additional latency also noticed in cases of low throughput. XDR may not have yet established connections for all the service threads which would add to the delay in shipping. This latency may be more prevalent in a test environment with low throughput than production with ongoing high throughput traffic. A workaround, for testing, would be have a warm up workload to eliminate this connection initialization latency.
In version 5.0, the
period-msconfiguration parameter can only be set dynamically at runtime. It will be statically configurable in the upcoming version 5.1 (at the time of this writing). Example asinfo command to set period-ms to 25ms:
asinfo -v 'set-config:context=xdr;dc=<DC_NAME>;period-ms=25'
delay-ms – XDR ‘artificial’ delay
delay-ms, also called ‘artificial delay’ can be set between 0 and 5000ms. It defaults to 0 and must also be lower than the
delay-ms, when configured, applies to all entries to be processed by a DC thread and forces those to not be processed until they have been waiting for the configured
delay-ms. This could be useful to prevent the same record to be shipped too frequently by forcing entries to stay in the XDR transaction queue longer. This gives a chance to potential subsequent write transactions to not have to be separately processed if a predecessor is still in the XDR transaction queue.
Time for a write transaction to be processed by the DC thread:
delay-ms (0ms by default) + anywhere between 0 and
period-ms milliseconds (100ms by default)
For example, with a
delay-ms of 50 and the default
period-ms of 100, records would take anywhere between 50 and 150ms to be processed by the DC thread. For the time it would take a record to reach its destination, this time would be added to the time it takes to read the record locally (typically negligible if reading from the
post-write-queue as well as the link latency to the remote cluster.
sc-replication-wait-ms – XDR SC delay
sc-replication-wait-ms only applies to
strong-consistency enabled (SC) namespaces. This is hard coded to 100ms and not configurable at all in version 5.0. It will be configurable in version 5.1 (at the time of this writing). This SC specific delay is to prevent records from being shipped before fully replicated. In order to prevent unnecessary attempts to process a record that has not replicated yet, this delay is always added. Its mechanism is identical to the
delay-ms configuration parameter. The main difference being that
delay-ms is set by default to 0 whereas for an SC namespace it would be necessary and always beneficial to have the
sc-replication-wait-ms configured to a few milliseconds at least in order to maximize the chances of replication to have completed before processing a record.
Time for a write transaction in an SC namespace to be processed by the DC thread: max(
sc-replication-wait-ms) + anywhere between 0 and
period-ms milliseconds (100ms by default)
Therefore, by default, in an SC namespace, records would take anywhere between 100ms and 200ms to be processed (under normal circumstances, i.e, no recoveries or flooding retries, etc.).
Recap on SC transactions: write transactions in SC must be acknowledged by all replicas before a node acks back to the client. In an SC namespace, records have 2 bits of the primary index reserved for the replication flag with the following possible states: OK, Replicating, Re-replicating and Unreplicated. In the case of a normal SC write (RF=2) and no timeouts or re-replications:
- starting state on master is OK.
- state on master goes to Replicating when in process of writing to the replica(s).
- replica(s) send an ack to master when the record is successfully written which would switch the state on the master copy to OK.
- master sends an OK state to the replica(s).
- in case of failure to replicate, the state would switch to Unreplicated.
If a record is still in Replicating state, it will be moved the retry queue and XDR would keep trying it indefinitely from there (every DC thread’s lap).
If a record is in Unreplicated state, XDR will trigger re-replication. The re-replication will be like a new write (with a new LUT) which would trigger this whole process again, as a new write.
It is important to avoid having XDR attempt to process records which are in the Replicating state unless there was an unexpected slow down or disruption causing records to stay in the Replicating state for longer than expected. Encountering a record in a ‘natural’ Replicating state, meaning it is still replicating under normal circumstances, there is a very good chance that the next record in the queue will also be in the Replicating state (assuming there is a continuous flow of writes). This would end up wasting CPU on retrying a lot of records. It is therefore necessary to judiciously configure the
sc-replication-wait-msif the default is not adequate.
hot-key-ms – XDR Hotkey separation time
hot-key-ms parameter is actually very different from the previous three as it does not involves a delay for shipping but controls the frequency at which hot-keys are processed by XDR. It is set to 100ms by default and dictates how deep to check for an existing entry in the XDR transaction queue for an incoming write transaction. This configuration parameter would help situations where the XDR transaction queue is potentially large and avoid having to check across the whole queue for a potential previous entry corresponding to the current incoming transaction. Such check could be slow and the
hot-key-ms provides a limit for how far in the past to check in the current XDR transaction queue. This parameter can also be summarized as ‘the maximum frequency at which a hot-key would be shipped’.
Always test prior to changing default values that may impact a production use case.
XDR delays replication SC period-ms delay shipping hotkey