XDR lag starts to increase to remote data centers


#1

XDR lag starts to increase to remote data centers

Problem Description

XDR is shipping data in an active/passive configuration to a number of remote DCs. Lag starts to increase to two of these DCs but not all.

Aug 13 2016 06:25:41 GMT: INFO (xdr): (xdr.c:2027) sh 31465282087 : ul 208 : lg 20736463319 : rlg 18682259 : lproc 20715167324 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15529572 : hkskip 1906225345 1107652918 : flat 0
...
Aug 13 2016 06:36:33 GMT: INFO (xdr): (xdr.c:2027) sh 31465291084 : ul 98 : lg 20736883719 : rlg 18682834 : lproc 20715173224 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15530147 : hkskip 1906225404 1107652977 : flat 0
...
Aug 13 2016 06:38:13 GMT: INFO (xdr): (xdr.c:2027) sh 31465292421 : ul 23 : lg 20736913519 : rlg 18682925 : lproc 20715173924 : rproc 1173447 : lkdproc 10519322 : errcl 3174416 : errsrv 15530238 : hkskip 1906225410 1107652982 : flat 0

The log excerpts above show us that the XDR lag is indeed increasing (it is shown by the lg metric in the log lines above)

Explanation

When looking at an XDR lag issue an important thing to consider is what XDR lag actually means and how it should be used. XDR lag is defined as follows:

XDR lag is the delta between the timestamp of the last record which was attempted to ship and the current time.

From that definition we can infer that, at best, XDR lag is an indication of other issues within a cluster and indeed that there are legitimate reasons why lag could be high. In the scenario described above other metrics within those same log lines should be examined.

At the same time as the XDR lag increases it can be seen that rlg and errsrv are also increasing. These metrics record the number of records being relogged into the digest log and the number of errors generated when the remote cluster rejects the write from XDR. An error logged under errsrv indicates the error has been reported after the C client used by XDR has executed the put or delete operation on the remote cluster. Taking the log lines above it is apparent that the delta between rlg values (575 and 91) is the same as the delta between errsrv. The following conclusions can be drawn:

  • XDR put and delete operations are being rejected by the remote server
  • There is a 1:1 correlation between these rejected operations and relogged records

The question then becomes, why are XDR writes being rejected? An examination of the logs from the remote cluster is necessary. The first stage of any log check should be a simple check for warnings.

$grep -i warn aerospike.log
aerospike.log:Aug 13 2016 06:20:12 GMT: WARNING (namespace): (namespace.c:440) {xdr_target} hwm_breached false, stop_writes true (memory), memory sz:108724019840 (108724019840 + 0) hwm:130459631616 sw:108716359680, disk sz:2073798377600 hwm:3006477107200

The answer to the question of why XDR lag is increasing is immediately evident. There are nodes in the remote cluster that have hit stop_writes and as a result when XDR tries to write to that cluster, the write is rejected. XDR then increments errsrv and relogs the record in the digest log, incrementing rlg. The XDR lag increases as the timestamp of the last successfully shipped record remains constant while the current time moves forward, and lag represents the difference between these two times.

Solution

In this case the solution is to correct the issue with the remote cluster, by looking into why stop_writes is true. Although, in certain cases, monitoring lag can be a useful metric, in this instance the issue would have been more apparent if there was tracking of the errsrv and rlg metrics, these would have pointed to the real root issue much more quickly. XDR lag is - at best - a top level metric and in some cases is actively disingenuous.

Notes

  • Whenever errsrv is increasing the first thing to check is the state of the remote cluster.
  • A scenario where XDR lag would be high due to completely normal behaviour is if a window shipper is shipping records to a cluster that has recovered from an outage. The records shipped by the window shipper may well be signicantly older than the current time.

Keywords

XDR LAG REMOTE CLUSTER DC RELOG

Timestamp

16th August 2016