XDR lag starts to increase to remote data centers

Aerospike_Knowledge · August 19, 2016, 8:31pm

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Troubleshooting XDR lag increase to remote data centers

Let’s first revisit the definition of the XDR time lag:

The xdr_timelag is the delta between the timestamp of the last record which was attempted to ship and the current time.

Another way of wording this is:

The xdr_timelag is the time a record is waiting at the source before being attempted to be shipped.

There are a variety of reasons that would lead into the lag increasing. Let’s pick one example to review some of the things to check.

Aug 13 2016 06:25:41 GMT: INFO (xdr): (xdr.c:2027) sh 31465282087 : ul 208 : lg 20736463319 : rlg 18682259 : lproc 20715167324 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15529572 : hkskip 1906225345 1107652918 : flat 0
...
Aug 13 2016 06:36:33 GMT: INFO (xdr): (xdr.c:2027) sh 31465291084 : ul 98 : lg 20736883719 : rlg 18682834 : lproc 20715173224 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15530147 : hkskip 1906225404 1107652977 : flat 0
...
Aug 13 2016 06:38:13 GMT: INFO (xdr): (xdr.c:2027) sh 31465292421 : ul 23 : lg 20736913519 : rlg 18682925 : lproc 20715173924 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15530238 : hkskip 1906225410 1107652982 : flat 0

In those log lines, it can be seen that rlg and errsrv are increasing. These metrics record the number of records being relogged into the digest log and the number of errors generated when the remote cluster rejects the write from XDR. An error logged under errsrv indicates the error has been reported after the C client used by XDR has executed the put or delete operation on the remote cluster. It is also apparent that the delta between rlg values (575 and 91) is the same as the delta between errsrv. The following conclusions can be drawn:

XDR write operations are being rejected by the remote server
There is a 1:1 correlation between these rejected operations and relogged records

The question then becomes, why are XDR writes being rejected? An examination of the logs on the remote cluster is necessary. The first stage of any log check should be a simple check for warnings.

$grep -i warn aerospike.log
aerospike.log:Aug 13 2016 06:20:12 GMT: WARNING (namespace): (namespace.c:440) {xdr_target} hwm_breached false, stop_writes true (memory), memory sz:108724019840 (108724019840 + 0) hwm:130459631616 sw:108716359680, disk sz:2073798377600 hwm:3006477107200

The answer to the question of why XDR lag is increasing is immediately evident. There are nodes in the remote cluster that have hit stop_writes and as a result when XDR tries to write to that cluster, the write is rejected. XDR then increments errsrv (xdr_ship_destination_error stat) and relogs the record in the digest log, incrementing rlg (dlog_relogged stat). The XDR lag increases as the timestamp of the last successfully shipped record remains constant while the current time moves forward, and lag represents the difference between these two times.

Solution

In this case the solution is to correct the issue with the remote cluster, by looking into why stop_writes is true. Although, in certain cases, monitoring lag can be a useful metric, in this instance the issue would have been more apparent if there was tracking of xdr_ship_destination_error and dlog_relogged metrics, these would have pointed to the real root issue much more efficiently possibly. XDR lag is - at best - a top level metric and in some cases is actively disingenuous.

Notes

Whenever errsrv (or xdr_ship_destination_error ) is increasing the first thing to check is the state of the remote cluster.
A scenario where XDR lag would be high due to completely normal behaviour is if a window shipper is shipping records to a cluster that has recovered from an outage. The records shipped by the window shipper may well be signicantly older than the current time.
Another scenario is due to misconfiguration of xdr-hotkey-time-ms. For example, setting it to 10 minutes will have XDR lags close to 10 minutes. You will observe a zip-zap pattern. This is because a hotkey record can be shipped delayed by up to 10 minutes.
Here is an article that covers different causes for throttling which of course will then also impact the XDR lag: FAQ - What can cause XDR throttling?

Applies To

Server prior to v. 5.0

Keywords

XDR LAG REMOTE CLUSTER DC RELOG HOTKEY SKIP

Timestamp

10/12/2018

Topic		Replies	Views
Question about max-throughput and lag XDR (Cross Data Center Replication)	4	828	August 10, 2022
Latency spike daily at same time. Aerospike build 3.13.0.10	2	593	November 3, 2021
Is there any way for Aerospike to max out the MSS? XDR (Cross Data Center Replication)	2	1604	July 30, 2015
Replacement for xdr_ship_bytes metric post Aerospike 5.0? Monitoring	3	529	July 8, 2022
Very strange Aerospike behaviour - completely unknown cause PHP Client Library	22	3404	September 22, 2017