FAQ - What can cause XDR throttling?




In Aerospike 3.8.x and higher, XDR reports throughput values in the aerospike.log file. At times XDR intentionally decreases its throughput, and this is visible in the log. The following log line shows details about XDR throttling.

[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 0 rsas-ms 0.020 rsas-pct 10.0

In the line above, rsas-ms shows the average sleep time for each write to the DC. When this value starts to increase, XDR is sleeping more. Putting read threads to sleep is the mechanism by which XDR throttles throughput. This value is reported in the dc_remote_ship_avg_sleep statistic. Throttling is also shown in rsas-pct, which gives the percentage of throttled writes to the DC. This value is reported in the dc_remote_ship_avg_sleep_pct statistic.
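As an illustration of how these two fields can be pulled out of the log for monitoring, here is a minimal Python sketch. It assumes the line has the same field layout as the sample above (alternating name/value tokens, with annotations in parentheses). The function names parse_dc_line and is_throttling are hypothetical and not part of any Aerospike tool.

```python
import re

def parse_dc_line(line):
    """Return the fields of an XDR per-DC log line as a dict of strings."""
    # Locate the DC name and the field list, ignoring any log prefix before it.
    match = re.search(r"(\S+): (dc-state .*)", line)
    if match is None:
        raise ValueError("not an XDR dc-state line")
    dc_name, rest = match.groups()
    # Drop parenthesised annotations such as "(2016-07-08 ... GMT)" or "(-)".
    rest = re.sub(r"\([^)]*\)", "", rest)
    tokens = rest.split()
    # Remaining tokens alternate between field name and field value.
    fields = dict(zip(tokens[0::2], tokens[1::2]))
    fields["dc"] = dc_name.strip("[]")
    return fields

def is_throttling(fields):
    """Assumption: any non-zero sleep time or throttled percentage means throttling."""
    return float(fields["rsas-ms"]) > 0 or float(fields["rsas-pct"]) > 0

SAMPLE = ("[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 "
          "mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) "
          "wslst 0 (-) shlat-ms 0 rsas-ms 0.020 rsas-pct 10.0")

if __name__ == "__main__":
    parsed = parse_dc_line(SAMPLE)
    print(parsed["dc"], parsed["rsas-ms"], parsed["rsas-pct"], is_throttling(parsed))
```

Tracking rsas-ms and rsas-pct over time, rather than looking at a single line, gives a clearer picture of when throttling starts.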

What are the key reasons why XDR will throttle throughput to a remote DC?


Excessive latency

XDR will begin to throttle when the latency gets too high. The latency is considered too high when it reaches 25% of the xdr-write-timeout. The default value for xdr-write-timeout is 10000 milliseconds, so when latency exceeds 2500ms XDR will start to throttle throughput. This stops XDR from flooding a datacenter that cannot cope with high throughput.
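The threshold arithmetic can be written out as a short Python sketch. The function names are hypothetical; only the 25% ratio and the 10000 ms default come from the description above.

```python
DEFAULT_XDR_WRITE_TIMEOUT_MS = 10000  # default xdr-write-timeout
THROTTLE_FRACTION = 0.25              # throttling starts at 25% of the timeout

def latency_threshold_ms(write_timeout_ms=DEFAULT_XDR_WRITE_TIMEOUT_MS):
    """Latency above which XDR is described as starting to throttle."""
    return write_timeout_ms * THROTTLE_FRACTION

def exceeds_threshold(latency_ms, write_timeout_ms=DEFAULT_XDR_WRITE_TIMEOUT_MS):
    """True when the observed latency is over the throttling threshold."""
    return latency_ms > latency_threshold_ms(write_timeout_ms)

if __name__ == "__main__":
    print(latency_threshold_ms())        # threshold with the default timeout
    print(exceeds_threshold(3000))       # 3000 ms against the default timeout
    print(exceeds_threshold(3000, 20000))  # same latency, doubled timeout
```

Raising xdr-write-timeout therefore raises the latency at which throttling begins in the same proportion.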

Maximum configured throughput hit

To avoid flooding the network, XDR can be configured with a maximum allowed throughput (number of records being written to the destination per second). This is controlled by the xdr-max-ship-throughput parameter. XDR actually turns this into a maximum number of objects that can be in flight, based on the link latency for a given DC. For example, if a link between 2 DCs has a round trip latency of 10ms, putting 1 record at a time on the link (1 record in flight) would allow for 100 records to be written every second (throughput of 100). In the default configuration (no xdr-max-ship-throughput set) the derived value for the maximum number of objects that can be in flight at one time is 50000. If the number of records in flight exceeds this value, XDR will start to throttle.
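The relationship in the example above can be sketched in Python. This assumes the simple proportional model implied by the example (records in flight = throughput × round trip time); the exact conversion XDR performs internally is not given here, and the function names are hypothetical.

```python
def throughput_for_inflight(inflight, rtt_ms):
    """Records per second achievable with a given number of records in flight."""
    return inflight * 1000 / rtt_ms

def inflight_for_throughput(records_per_sec, rtt_ms):
    """Records that must be in flight to sustain a given throughput."""
    return records_per_sec * rtt_ms / 1000

if __name__ == "__main__":
    # The example from the text: 1 record in flight on a 10 ms link.
    print(throughput_for_inflight(1, 10))
    # A hypothetical limit of 5000 records per second on the same link.
    print(inflight_for_throughput(5000, 10))
```

Under this model, the same throughput limit corresponds to more records in flight as the link latency grows.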

The client has run out of connections

XDR uses the Aerospike C client to ship records, and will use 64 connections per node at the destination cluster (objects are pipelined on those connections, meaning more than one record at a time can be in flight on a single connection). During startup of XDR (and startup of the underlying C client), the client can run out of connections for a short period of time, as those are still being established. This can also happen immediately following a link failure, when connections are being re-established.

When the remote DC reported an error

When a DC reports an error, XDR will throttle down to avoid potential 100% CPU usage scenarios. In the case of a remote DC error (xdr_ship_destination_error), the DC is given a 30 second window to recover. If errors keep happening continuously for those 30 seconds, the DC is considered down and is put into cluster down mode, and a window shipper thread is started. Shipping then recovers through the window shipper once the DC is reachable again.

Network Error or Timeout

In the event of a network error when attempting to ship to the destination, or a timeout on an XDR write, the throughput drops to 1 record shipped per second per XDR read thread (xdr-read-threads, which is 4 by default) for each remote DC. These errors are considered transient. In this situation, XDR ships at 1 transaction per second per XDR read thread for a 30 second interval. After that, the throughput (actually the number of records on the wire) is doubled every 2 seconds until the maximum is reached, which is either 50k or the configured maximum throughput (xdr-max-ship-throughput).

Note: As of version 4.4, XDR gradually slows down when encountering network errors or timeouts:

  • On the first error/timeout, the throughput is reduced by 50% for 1 second.
  • If errors/timeouts continue, the throughput is reduced by another 50% for 1 second.
  • This continues down to 1 transaction per thread per second.
  • Once no error or timeout is encountered, XDR doubles the throughput every 1 second, up to the maximum described above.
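The version 4.4 behaviour listed above can be sketched as a small Python simulation. This is a model for illustration only, not XDR's implementation. It assumes one decision per second, a floor of 4 (1 transaction per second for each of the 4 default read threads) and a ceiling of 50000; the function names are hypothetical.

```python
def next_limit(current, had_error, floor, ceiling):
    """Halve the limit after an error or timeout, otherwise double it."""
    if had_error:
        return max(floor, current // 2)
    return min(ceiling, current * 2)

def simulate(events, start, floor=4, ceiling=50000):
    """Apply one event per second and return the limit after each second.

    events is a sequence of booleans: True means an error or timeout
    was seen during that second.
    """
    limit = start
    history = []
    for had_error in events:
        limit = next_limit(limit, had_error, floor, ceiling)
        history.append(limit)
    return history

if __name__ == "__main__":
    # Two bad seconds followed by three clean seconds.
    print(simulate([True, True, False, False, False], start=50000))
```

In this model a short burst of errors costs only a few seconds of reduced throughput, whereas a sustained problem drives the limit down to the floor.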

Kernel misconfiguration

XDR will also throttle if one or more nodes on the destination do not have the Aerospike default values kernel.shmmax = 68719476736 and kernel.shmall = 4294967296. On the nodes where these are not set correctly, you will start seeing warnings like this:

Jan 16 2018 00:00:23 GMT: WARNING (arenax): (arenax_ee.c:170) could not allocate 1073741824-byte arena stage 18: block size error
Jan 16 2018 00:00:23 GMT: WARNING (index): (index.c:710) arenax alloc failed
Jan 16 2018 00:00:23 GMT: WARNING (rw): (write.c:533) {ns_seg} write_master: fail as_record_get_create() <Digest>:0x786478ecb1e535271805ce5e742c769dc2b8230f

The source nodes will then start relogging, and this in turn will cause throttling.
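A quick way to check a destination node against these values is sketched below in Python. It assumes a Linux node where the settings are exposed under /proc/sys; the helper names are hypothetical, and the expected values are the ones quoted above.

```python
EXPECTED = {
    "kernel.shmmax": 68719476736,
    "kernel.shmall": 4294967296,
}

def read_sysctl(name):
    """Read an integer kernel parameter from /proc/sys (Linux only)."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as handle:
        return int(handle.read().split()[0])

def mismatched(values, expected=EXPECTED):
    """Return {name: (actual, expected)} for every parameter that differs."""
    return {
        name: (values.get(name), want)
        for name, want in expected.items()
        if values.get(name) != want
    }

def check_local_node():
    """Compare this node's kernel settings with the expected values."""
    actual = {name: read_sysctl(name) for name in EXPECTED}
    return mismatched(actual)
```

Calling check_local_node() on each destination node returns an empty dict when both parameters match, and otherwise lists the actual and expected value for each parameter that differs.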

Errors returned by the destination cluster

Here are errors for which XDR will throttle:

  • Error during connection establishment (includes issues with Authentication and TLS).
  • Timeout errors returned by the server.
  • AEROSPIKE_ERR_SERVER: Generic server error.
  • AEROSPIKE_ERR_FAIL_FORBIDDEN: Temporary forbidden errors like stop writes due to clock skew.

Here are errors for which XDR will not throttle:

  • AEROSPIKE_ERR_RECORD_NOT_FOUND: Can only happen for a delete. This is the only non-permanent error in this category.
  • AEROSPIKE_ERR_ALWAYS_FORBIDDEN: Happens when allow-xdr-writes is set to false on the remote DC or, more generally, when anything is forbidden due to configuration on the remote DC.


  • Detail on xdr-write-timeout
  • Detail on xdr-max-ship-throughput
  • Relationship between XDR latency and throttling in detail
  • Log messages explained in detail






How to identify a bad DC that cause XDR throttling
What are the options for reducing XDR's network utilization?
How do I handle a planned network maintenance between XDR source and destination?
What is the unit for XDR max ship throughput?
XDR lag starts to increase to remote data centers