FAQ - What metrics can be used to determine a correct value for xdr-read-threads?
When shipping data to a remote data centre via XDR the architecture is straightforward. The digest log stores the digests of records incoming to the node. Digest log reader threads read the digests from the digest log and place them on in-memory read request queues. Threads known as xdr-read-threads pick the digests from these in-memory request queues, process them through the de-duplication cache, schedule the read for the associated record via the service threads (or transaction queues/threads for versions prior to 4.7), potentially apply compression and, finally, pass the record to an embedded Aerospike C client which ships it to the destination cluster(s).
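For reference, the per-node metrics discussed below appear in the node's statistics info output on pre-5.0 servers. A minimal sketch, assuming the usual semicolon-delimited `name=value` format returned by the info protocol (the sample string and its values are illustrative, not real cluster output):

```python
# Sketch: parse a semicolon-delimited statistics string (the format
# returned by the info protocol, e.g. via `asinfo -v statistics`) and
# pull out the XDR read-thread related metrics. Sample values are made up.

def parse_statistics(raw):
    """Turn 'k1=v1;k2=v2;...' into a dict of string values."""
    return dict(item.split("=", 1) for item in raw.strip().split(";") if "=" in item)

SAMPLE = ("xdr_timelag=2;xdr_read_active_avg_pct=85.2;"
          "xdr_read_reqq_used_pct=40.0;xdr_read_txnq_used=120")

stats = parse_statistics(SAMPLE)
print({k: v for k, v in stats.items() if k.startswith("xdr_read")})
```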
The most common reasons for a build-up in xdr_timelag are described in the FAQ - What Can Cause XDR Throttling article, but in some cases a build-up in xdr_timelag may be observed due to slow performance of the tasks assigned to the xdr-read-threads. In that instance, increasing the number of xdr-read-threads may be an appropriate solution. What Aerospike metrics exist to determine whether an increase in xdr-read-threads could alleviate the lag?
xdr_read_active_avg_pct: This describes the amount of time the xdr read threads spend working as opposed to waiting for digests to appear on the queues they service. A high percentage for this metric along with higher CPU usage may indicate a need to increase the number of xdr-read-threads. When CPU utilisation is lower, the expectation is that the default number of xdr-read-threads should be sufficient to handle the XDR load.
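The reasoning above can be sketched as a simple check. The 85% and 70% thresholds below are illustrative assumptions for this sketch, not Aerospike recommendations:

```python
# Hypothetical check following the guidance above: a high
# xdr_read_active_avg_pct together with high CPU usage suggests the
# xdr-read-threads may be the bottleneck; at low CPU utilisation the
# default thread count is expected to cope. Thresholds are illustrative.

def may_need_more_read_threads(xdr_read_active_avg_pct, host_cpu_pct,
                               busy_threshold=85.0, cpu_threshold=70.0):
    return (xdr_read_active_avg_pct >= busy_threshold
            and host_cpu_pct >= cpu_threshold)

print(may_need_more_read_threads(92.0, 80.0))  # True: busy threads, high CPU
print(may_need_more_read_threads(92.0, 30.0))  # False: CPU utilisation is low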
xdr_read_reqq_used_pct: This gives a value, in percent, for how full the read request queues are. This metric should be used with care: a slow disk will also cause it to be high, so on its own it is not a good indicator of a need to increase xdr-read-threads.
There is a maximum of 10,000 transactions that can be in flight in the internal XDR transaction queue. This is a hard limit; if the number of in-flight transactions is at or near 10,000, increasing the number of xdr-read-threads will not solve a source-side lag issue. This is measured in raw numbers by xdr_read_txnq_used or as a percentage by xdr_read_txnq_used_pct.
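That limit check can be sketched as follows. The 95% "near the limit" threshold is an illustrative assumption, not an Aerospike default:

```python
# The internal XDR transaction queue has a hard limit of 10,000
# in-flight transactions. When xdr_read_txnq_used is at or near that
# limit, adding xdr-read-threads will not reduce source-side lag.
XDR_TXNQ_HARD_LIMIT = 10_000

def txnq_saturated(xdr_read_txnq_used, near_pct=95.0):
    """near_pct is an illustrative 'near the limit' threshold."""
    return xdr_read_txnq_used >= XDR_TXNQ_HARD_LIMIT * near_pct / 100.0

print(txnq_saturated(9_800))   # True: more read threads will not help
print(txnq_saturated(1_200))   # False: the queue has plenty of headroom
```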
- Dynamically decreasing xdr-read-threads has been known to cause node crashes in some rare situations; it is therefore advisable to decrease this configuration parameter statically (restart required).
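As a sketch, a static change would be made in the xdr stanza of aerospike.conf and picked up on restart (the value 8 below is purely illustrative):

```
xdr {
    # ... existing digest log / datacenter configuration ...
    xdr-read-threads 8
}
```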
- The following log line shows the time lag reported per datacenter (timelag-sec):
[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 0 rsas-ms 0.004 rsas-pct 0.0 con 384 errcl 0 errsrv 0 sz 6
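A quick way to extract the lag from such lines, as a sketch assuming exactly this dc-state log format:

```python
import re

# Sketch: extract the timelag-sec value from a dc-state log line of the
# format shown above. Returns None when the field is not present.
LOG_LINE = ("[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 "
            "mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) "
            "wslst 0 (-) shlat-ms 0 rsas-ms 0.004 rsas-pct 0.0 con 384 "
            "errcl 0 errsrv 0 sz 6")

def timelag_seconds(line):
    match = re.search(r"timelag-sec (\d+)", line)
    return int(match.group(1)) if match else None

print(timelag_seconds(LOG_LINE))  # 2
```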
- The most common reason for xdr_timelag to build up is actually throttling. Refer to the article titled "What Can Cause XDR Throttling" for details.
Keywords: XDR-READ-THREADS XDR-TIMELAG LAG XDR