FAQ - What metrics can be used to determine a correct value for xdr-read-threads

Aerospike_Knowledge · December 8, 2019, 9:43pm

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

FAQ - What metrics can be used to determine a correct value for `xdr-read-threads`?

Detail

When shipping data to a remote data centre via XDR, the architecture is straightforward. The digest log stores the digests of records incoming to the node. Digest log reader threads read the digests from the digest log and put them onto in-memory read request queues. Threads known as xdr-read-threads pick the digests from these in-memory request queues, process them through the de-duplication cache, schedule the read for the associated record via the service threads (or transaction queues/threads for versions prior to 4.7), potentially apply compression and, finally, pass the records to an embedded Aerospike C client, which ships them to the destination cluster(s).
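The flow described above can be pictured with a toy model. This is only an illustrative sketch, not Aerospike code: the names `run_pipeline`, `request_queue`, `seen` and `shipped` are hypothetical, a plain set stands in for the de-duplication cache, and appending to a list stands in for reading the record and handing it to the embedded client.

```python
import queue
import threading


def run_pipeline(digest_log, num_read_threads=4):
    """Toy model: digests are queued, then read threads de-duplicate and 'ship' them."""
    request_queue = queue.Queue()   # stands in for the in-memory read request queues
    seen = set()                    # stands in for the de-duplication cache
    shipped = []                    # stands in for records passed to the embedded client
    lock = threading.Lock()
    stop = object()                 # sentinel telling a read thread to exit

    def read_thread():
        while True:
            digest = request_queue.get()
            if digest is stop:
                break
            with lock:
                if digest in seen:
                    continue        # duplicate digest, nothing to ship
                seen.add(digest)
                shipped.append(digest)

    workers = [threading.Thread(target=read_thread) for _ in range(num_read_threads)]
    for worker in workers:
        worker.start()
    # The "digest log reader" role: move digests from the log onto the queue.
    for digest in digest_log:
        request_queue.put(digest)
    for _ in workers:
        request_queue.put(stop)
    for worker in workers:
        worker.join()
    return sorted(shipped)
```

In this model, adding read threads only helps when the work done per digest is the bottleneck, which is the situation the rest of this article tries to identify from metrics.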

The most common reasons for a build-up in xdr_timelag are described in the article FAQ - What Can Cause XDR Throttling. In some cases, however, xdr_timelag may build up because the tasks assigned to the xdr-read-threads are performing slowly. In that instance, increasing the number of xdr-read-threads may be an appropriate solution. What Aerospike metrics exist to determine whether an increase in xdr-read-threads could alleviate the xdr_timelag?

Answer

The following metrics are used to track the behaviour of xdr-read-threads. These are fully documented in the Aerospike Metrics Reference.

xdr_read_active_avg_pct. This describes the proportion of time the XDR read threads spend working, as opposed to waiting for digests to appear on the queues they service. A high percentage for this metric, together with high CPU usage, may indicate a need to increase the number of xdr-read-threads. When CPU utilisation is low, the expectation is that the default number of xdr-read-threads should be sufficient to handle the XDR load.
xdr_read_reqq_used_pct. This gives, as a percentage, how full the read request queues are. This metric should be used with care. A slow disk will also cause this metric to be high, so it is not a good indicator of a need to increase xdr-read-threads.
There is a maximum of 10,000 transactions that can be in flight in the internal XDR transaction queue. This is a hard limit and, as such, if the number of in-flight transactions is at or near 10,000, increasing the number of xdr-read-threads will not solve a source-side lag issue. This is measured in raw numbers by xdr_read_txnq_used or as a percentage by xdr_read_txnq_used_pct.
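The guidance above can be condensed into a rough decision helper. The following is a minimal sketch under stated assumptions: the function name `assess_read_threads`, the verdict strings, and the thresholds (`active_high`, `cpu_high`, `txnq_near`) are hypothetical and not Aerospike recommendations. Only the metric names and the 10,000 limit come from this article. The statistics are assumed to have already been collected into a dictionary, and CPU usage is assumed to come from a separate system monitor.

```python
TXNQ_HARD_LIMIT = 10_000  # hard limit on in-flight XDR transactions, per this article


def assess_read_threads(stats, cpu_pct, active_high=80.0, cpu_high=70.0, txnq_near=0.9):
    """Return a rough verdict on whether raising xdr-read-threads might help.

    stats   : dict of XDR statistics (values may be strings or numbers)
    cpu_pct : node CPU utilisation in percent, gathered outside Aerospike
    The three threshold arguments are illustrative assumptions only.
    """
    txnq_used = int(stats.get("xdr_read_txnq_used", 0))
    active_pct = float(stats.get("xdr_read_active_avg_pct", 0))

    # At or near the hard limit, more read threads cannot reduce the lag.
    if txnq_used >= txnq_near * TXNQ_HARD_LIMIT:
        return "txnq-limit"

    # Busy read threads combined with high CPU usage suggest trying more threads.
    if active_pct >= active_high and cpu_pct >= cpu_high:
        return "consider-increase"

    # xdr_read_reqq_used_pct is deliberately ignored: a slow disk also inflates it.
    return "no-change"
```

For example, `assess_read_threads({"xdr_read_active_avg_pct": "95", "xdr_read_txnq_used": "120"}, cpu_pct=85)` falls into the "consider-increase" branch, whereas the same statistics with `cpu_pct=20` do not.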

Notes

Dynamically decreasing xdr-read-threads has been known to cause node crashes in some rare situations. It is therefore advisable to decrease this configuration parameter statically (restart required).
The following log line shows xdr_timelag in the timelag-sec field:

[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 0 rsas-ms 0.004 rsas-pct 0.0 con 384 errcl 0 errsrv 0 sz 6
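To track the lag over time, the timelag-sec value can be pulled out of such log lines. The snippet below is a small sketch that assumes the line layout matches the sample above; the names `TIMELAG_RE` and `parse_timelag` are hypothetical, and the pattern may need adjusting for other server versions.

```python
import re

# Matches the start of the per-DC line: "[<dc>]: dc-state <state> timelag-sec <n>"
TIMELAG_RE = re.compile(
    r"\[(?P<dc>[^\]]+)\]: dc-state (?P<state>\S+) timelag-sec (?P<lag>\d+)"
)


def parse_timelag(line):
    """Return (dc_name, dc_state, timelag_seconds), or None if the line does not match."""
    match = TIMELAG_RE.search(line)
    if match is None:
        return None
    return match.group("dc"), match.group("state"), int(match.group("lag"))


sample = (
    "[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 "
    "mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) wslst 0 (-) "
    "shlat-ms 0 rsas-ms 0.004 rsas-pct 0.0 con 384 errcl 0 errsrv 0 sz 6"
)
result = parse_timelag(sample)
```

Here `result` holds the data centre name, its state, and the lag in seconds as an integer, so it can be fed into a graphing or alerting tool.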

The most common reason for the xdr_timelag to build up is actually throttling. Refer to the article titled “What Can Cause XDR Throttling” for details.

Applies To

Server prior to v. 5.0

Keywords

XDR-READ-THREADS XDR-TIMELAG LAG XDR

Timestamp

November 2019
