FAQ - What metrics can be used to determine a correct value for xdr-read-threads

Detail

When shipping data to a remote data centre via XDR, the architecture is straightforward. The digest log stores the digests of records incoming to the node. Digest log reader threads read the digests from the digest log and put them onto in-memory read request queues. Threads known as xdr-read-threads pick the digests from these in-memory request queues, process them through the de-duplication cache, schedule the read of the associated record via the service threads (or transaction queues/threads for versions prior to 4.7), optionally apply compression and, finally, pass the records to an embedded Aerospike C client, which ships them to the destination cluster(s).

The most common reasons for a build-up in xdr_timelag are described in the article FAQ - What Can Cause XDR Throttling. In some cases, however, xdr_timelag may build up because the tasks assigned to the xdr-read-threads are performing slowly. In that case, increasing the number of xdr-read-threads may be an appropriate solution. Which Aerospike metrics help determine whether an increase in xdr-read-threads could reduce xdr_timelag?

Answer

The following metrics are used to track the behaviour of xdr-read-threads. These are fully documented in the Aerospike Metrics Reference.

  • xdr_read_active_avg_pct. This describes the percentage of time the xdr-read-threads spend working, as opposed to waiting for digests to appear on the queues they service. A high percentage for this metric together with high CPU usage may indicate a need to increase the number of xdr-read-threads. When CPU utilisation is low, the default number of xdr-read-threads is expected to be sufficient to handle the XDR load.

  • xdr_read_reqq_used_pct. This gives, as a percentage, how full the read request queues are. This metric should be used with care: a slow disk will also cause it to be high, so on its own it is not a good indicator of a need to increase xdr-read-threads.

  • xdr_read_txnq_used and xdr_read_txnq_used_pct. These report, as a raw count and as a percentage respectively, how many transactions are in flight in the internal XDR transaction queue. The queue has a hard limit of 10,000 in-flight transactions, so if the number in flight is at or near 10,000, increasing the number of xdr-read-threads will not solve a source-side lag issue.
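
The way these three metrics are read together can be sketched in Python. This is only an illustration of the reasoning above, not an Aerospike tool: the threshold values, the function names and the "name=value;name=value" input format are assumptions made for the example, and CPU usage is taken as a separate input from whatever system monitoring is in place. Only the metric names and the 10,000 transaction limit come from this article.

```python
from typing import Dict

# Hard limit on in-flight transactions, as stated in this article.
TXNQ_LIMIT = 10_000

# Hypothetical thresholds for illustration only; not Aerospike recommendations.
ACTIVE_PCT_HIGH = 80.0
CPU_PCT_HIGH = 70.0
TXNQ_PCT_NEAR_LIMIT = 95.0


def parse_stats(raw: str) -> Dict[str, float]:
    """Parse 'name=value;name=value' text, keeping only numeric values.

    The input format is an assumption made for this sketch.
    """
    stats: Dict[str, float] = {}
    for pair in raw.strip().split(";"):
        name, sep, value = pair.partition("=")
        if not sep:
            continue  # skip fragments without '='
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            pass  # ignore non-numeric statistics
    return stats


def assess_read_threads(stats: Dict[str, float], cpu_pct: float) -> str:
    """Return a coarse hint: 'txnq-limited', 'consider-increase' or 'no-change'."""
    txnq_pct = stats.get("xdr_read_txnq_used_pct")
    if txnq_pct is None:
        # Derive the percentage from the raw count and the hard limit.
        txnq_pct = 100.0 * stats.get("xdr_read_txnq_used", 0.0) / TXNQ_LIMIT
    if txnq_pct >= TXNQ_PCT_NEAR_LIMIT:
        # At or near the hard limit: more read threads will not help.
        return "txnq-limited"
    active_pct = stats.get("xdr_read_active_avg_pct", 0.0)
    if active_pct >= ACTIVE_PCT_HIGH and cpu_pct >= CPU_PCT_HIGH:
        # Busy read threads together with high CPU usage.
        return "consider-increase"
    # xdr_read_reqq_used_pct is deliberately not used: a slow disk also raises it.
    return "no-change"
```

For example, assess_read_threads(parse_stats("xdr_read_active_avg_pct=92;xdr_read_txnq_used=1200"), cpu_pct=85.0) yields "consider-increase" under these assumed thresholds, whereas the same statistics with xdr_read_txnq_used=9900 yield "txnq-limited".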

Notes

  • Dynamically decreasing xdr-read-threads has been known to cause node crashes in rare situations. It is therefore advisable to decrease this configuration parameter statically (restart required).
  • The following log line shows xdr_timelag in the timelag-sec field:
[DC_NAME]: dc-state CLUSTER_UP timelag-sec 2 lst 1468006386894 mlst 1468006389647 (2016-07-08 19:33:09.647 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 0 rsas-ms 0.004 rsas-pct 0.0 con 384 errcl 0 errsrv 0 sz 6
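
To track the lag over time, the value can be pulled out of such log lines. The following Python sketch assumes the line layout matches the sample above exactly (DC name in square brackets, then dc-state, then timelag-sec); the regular expression and the function name parse_dc_state are illustrative and would need adjusting if the log format differs between versions.

```python
import re
from typing import Dict, Optional, Union

# Pattern based solely on the sample log line in this article.
DC_STATE_RE = re.compile(
    r"\[(?P<dc>[^\]]+)\]: dc-state (?P<state>\S+) timelag-sec (?P<lag>\d+)"
)


def parse_dc_state(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Extract DC name, state and time lag (seconds) from a dc-state log line.

    Returns None when the line does not match the assumed layout.
    """
    match = DC_STATE_RE.search(line)
    if match is None:
        return None
    return {
        "dc": match.group("dc"),
        "state": match.group("state"),
        "timelag_sec": int(match.group("lag")),
    }
```

Applied to the sample line above, this returns the DC name "DC_NAME", the state "CLUSTER_UP" and a timelag_sec of 2.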

Keywords

XDR-READ-THREADS XDR-TIMELAG LAG XDR

Timestamp

November 2019

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.