FAQ Can there be XDR lag without a client or server error?
xdr_timelag is observed on a source XDR cluster it may be due to a variety of reasons, usually some error in the shipping of the records to the remote DC. Is there a scenario where XDR lag can build even in the absence of any errors?
Yes, it is not inconceivable to experience
xdr_timelag when the system is not relogging records or indeed throttling the records that are being shipped. The most likely cause of this would be the impact of a hotkey on XDR operation. To understand how this can happen it is necessary to understand how XDR works internally.
When records are inserted, updated or deleted in an XDR enabled system, the digest for that record is inserted into the digest log, a ring buffer. The digest log is read sequentially by a thread which reads digests and places those digests on read thread queues. There will be a read thread queue for each
xdr-read-thread. The first role of the
xdr-read-threads is to take the digests and pass them to the Aerospike transaction or service threads (dependant on version) which then read the corresponding records to be shipped.
The read thread queues for the
xdr-read-threads is deterministic which means that the same digest will always go to the same queue. Each
xdr-read-thread has a cache to help it manage hotkeys, which are keys that are written with a very high frequency. The way in which the cache does this is simple. When a digest is read from the queue it is placed in the cache and held there for a pre-defined time period. The time period is configured using
xdr-hotkey-time-ms. If the same digest enters the cache within
xdr-hotkey-time-ms the digest is marked as dirty and not shipped. At the end of the
xdr-hotkey-time-ms any dirty digests are shipped and the dirty flag is removed. Any digests without the dirty flag are removed from the cache. The net effect is that a given digest will not be shipped any more frequently than once per
When there is a hot key, as that key will always go to the same read thread queue, and as the queue size has a limit of 1000 records, it is possible that the read thread queue for a given
xdr-read-thread can fill up. An example of this is shown below:
Collectinfo-analyzer> show stat like xdr_read_reqq_used -flip ~~~~~~~~~Service Statistics (2020-03-11 13:57:07 UTC)~~~~~~~~~ NODE xdr_read_reqq_used xdr_read_reqq_used_pct 10.0.0.10:3000 85 2.12 10.0.0.11:3000 1000 25.00 10.0.0.12:3000 1790 44.75 10.0.0.1:3000 30 0.75 10.0.0.2:3000 1995 49.88 10.0.0.3:3000 1061 26.52 10.0.0.4:3000 1000 25.00 10.0.0.6:3000 19 0.47 10.0.0.7:3000 1000 25.00 10.0.0.8:3000 1000 25.00 10.0.0.9:3000 0 0.00 Number of rows: 11
The output below shows that there is a queue on nodes
.8 that is totally full. This is indicated as
xdr_read_req_used is 1000 exactly. This metric represents the amount of digests sitting in the read thread queues. By default there are 4
xdr-read-threads and 4 queues. So one queue sitting at 1000 digests would imply 25% capacity used (which is shown in
It is also evident that the queues for nodes
.3 are also backing up as they are above 1000.
The question is, why are some queues sitting at 1000 and more digests? It is because the digest log is read sequentially and digests go to queues deterministically. If the next digest in the log waiting to be processed is one that must go to a queue that has hit the limit of 1000, it will block and wait until the queue it would end up in makes some room. The net effect will be that other queues drain as they are serviced and the queue being hit by the hotkey stays full (or multiple queues get high usage if multiple hotkeys).
In terms of how to mitigate this phenomenon, the answer is counter-intuitive. Logic would suggest that
xdr-hotkey-time-ms should be decreased from the default value of 100ms to increase the rate at which the hotkeys are shipped but this is not correct. The caches used to manage XDR hotkeys are capable of holding 25,000 digests and so they are extremely unlikely to be blocked. In any event, a digest will only appear in this cache once, whether it is marked as clean or dirty.
xdr_timelag represents the time (in seconds) it took the latest shipped record from the moment it was first written at the source until it was attempted to be shipped to the destination cluster, it can be considered to be a measure of how long a given digest waited in the digest log before being processed. When a read thread queue is blocked and the digests from the digest log are not being processed, the
xdr_timelag will increase even in the absence of shipping errors.
The solution is actually to increase
xdr-hotkey-time-ms. This allows the system to ship repeated (very hot or not) digests less frequently reducing the overall load on the system and allowing the blocked read thread queue to drain out. Once the read thread queue drains it can be refilled and the digest log can be processed once again.
To explain further. Each time the
xdr-read-thread takes a digest from the read thread queue and reads this from the server the thread will be occupied solely with this task to the exclusion of others. If the thread is blocked it cannot remove digests from the read thread queue. As described above, when the read thread queue is full, the digests cannot be read from the digest log and lag builds. Increasing the
xdr-hotkey-time-ms means pressure on the
xdr-read-thread is reduced.
If a given digest is updated once every 100ms then the
xdr-read-thread has to fetch the digest every 100ms (either from the
post-write-queue but, more likely when there is lag in the system, from the storage subsystem). This is not the case, however, if the digest is found in the de-duplication cache. If the digest is found in the de-duplication cache then it is discarded as soon as it is picked off the read thread queue, this is extremely quick and so the thread is blocked for a very small amount of time.
xdr-hotkey-time-ms to, for example, 500ms, means that the a digest will stay in the de-duplication cache for 5 times longer meaning that for every 5 updates (using the example above), 4 will be simply discarded straight away. The net effect is that, with a change in
xdr-hotkey-time-ms from 100ms to 500ms, the amount of digests discarded is increased and the
xdr-read-thread is able to drain the queue far more efficiently.
By reducing the time spent fetching repeated digests and instead simply discarding those digests as they come off the read thread queue, the thread is free to do more work on unique digests and can drain the read thread queue faster. It is worth considering that a record being updated every 100ms would not necessarily be considered a hotkey.
- The caveat to increasing
xdr-hotkey-time-msis that records that are updated repeatedly will be shipped less frequently.
- Discussion on XDR hotkeys.
- Throttling of shipping is a process XDR executes automatically to compensate for certain types of issue. It corresponds to the metric
dc_ship_idle_avgand any non zero value indicates throttling.
- The following log line shows whether XDR is throttling. The entry to check is rsas-ms which reports
[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 0 rsas-ms 0.004 rsas-pct 0.0 con 384 errcl 0 errsrv 0 sz 6
- The log line above also shows
xdr_timelagwhich is listed as
- The facility to trace individual transactions being processed by XDR is discussed in the XDR Tracing article and can help isolate potential issues with XDR shipping.
- Common XDR throttling reasons.
XDR HOTKEY LAG NO ERROR DIGESTS