FAQ - What might cause XDR to re-log a record?

Aerospike_Knowledge · August 26, 2016, 9:14pm

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

FAQ - What would cause XDR to re-log a record?

Detail

In certain situations XDR cannot ship a record, so the digest log entry is re-logged and shipping can be attempted again later. What are these situations?

For the purpose of this article, the terms record re-logging and digest entry re-logging can be considered interchangeable, though in truth, only digest log entries are re-logged.

Answer

When XDR ships to a remote DC it acts as a client to that remote DC. The Aerospike C client is used by XDR internally to either put or delete records in the remote cluster. If something goes wrong during either of these operations, the C client reports an error. There are two categories of error reported by the C client.

Client side errors - These are errors where something has gone wrong before the put or delete operation has been sent to the remote cluster. These are reported as xdr_ship_source_error / dc_ship_source_error.
Server side errors - These are errors where the C client has issued the put or delete operation to the remote DC and the remote DC has returned an error. These are reported as xdr_ship_destination_error / dc_ship_destination_error.

It is important to state that not all errors will result in a record being re-logged.

The most common client side error is a problem with the network connection to the remote cluster, or some other form of connectivity issue. Entries in the XDR log will indicate that the client cannot write. They look as follows:

Aug 10 2016 03:19:24 GMT: INFO (xdr): (as_cluster.c:821) Node BB9E5DB57A55794 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111

The most common server side errors for which records are re-logged are as follows:

Timeouts - The server notices that it is taking too long to process the put or delete operation and aborts it. The abort is reported back to the C client.
Stop-writes - If the node is in stop-writes it will refuse the XDR write as it would any other client write.
Hotkeys - If there is a hotkey (a record being updated too frequently) and transaction-pending-limit is being breached, the server gives a KEY BUSY error and XDR re-logs the record.
Write load - If the server write load is too high and the SSDs are not keeping up, the write queue overflows and the server generates queue too deep errors. These are passed back to XDR by the C client and the record is re-logged.
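The list above can be sketched as a small lookup. This is only an illustration of the decision described in this article, not XDR's internal code: the ShipOutcome names and the is_listed_relog_cause helper are hypothetical, and outcomes the article does not list are simply reported as "not listed" rather than as "never re-logged".

```python
from enum import Enum


class ShipOutcome(Enum):
    """Hypothetical labels for what the remote DC returned (not XDR identifiers)."""
    OK = "ok"
    TIMEOUT = "timeout"                # server aborted a slow put/delete
    STOP_WRITES = "stop-writes"        # remote node refuses all client writes
    KEY_BUSY = "key busy"              # hotkey, transaction-pending-limit breached
    QUEUE_TOO_DEEP = "queue too deep"  # write queue overflowing on the remote node
    OTHER = "other"                    # any error this article does not list


# Server side causes that this article lists as leading to a re-log.
LISTED_RELOG_CAUSES = frozenset({
    ShipOutcome.TIMEOUT,
    ShipOutcome.STOP_WRITES,
    ShipOutcome.KEY_BUSY,
    ShipOutcome.QUEUE_TOO_DEEP,
})


def is_listed_relog_cause(outcome: ShipOutcome) -> bool:
    """Return True if the outcome is one of the re-log causes listed above."""
    return outcome in LISTED_RELOG_CAUSES


if __name__ == "__main__":
    for outcome in ShipOutcome:
        print(f"{outcome.value:15} listed re-log cause: {is_listed_relog_cause(outcome)}")
```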

One condition that is neither client nor server related, but that will also cause re-logging, is a failed read from the local namespace. This happens when the read thread tries to retrieve a record based on a digest log entry and does not succeed. The digest log entry is then re-logged as well, but not to the node it was initially being processed on.

A situation where this could occur is when the source cluster is migrating at the time the read thread attempts to read the record. The dlog reader for a node only ships records for which that node is the master. Suppose a digest log entry for a record R has been written on node A, which is master for that record at the time; node A will try to ship that record. If there is a lag in the system and, between the writing of the digest log entry for R and its subsequent reading by a read thread on A, the master for that record has changed to node B, the read will fail.

As a consequence of the failed read, the record is re-logged. The re-logging occurs into the digest log of the new master, B, and of all nodes where a prole copy of record R now exists. The record is not re-logged on node A unless A holds one of those prole copies, which is unlikely. These re-logs are reported under the following two statistics: xdr_relogged_outgoing and xdr_relogged_incoming.
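The master-change scenario can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not XDR's implementation: the function names are hypothetical, node names are plain strings, and the failed local read is reduced to the check "is this node still the master?".

```python
from typing import List, Tuple


def relog_targets(new_master: str, new_proles: List[str]) -> List[str]:
    """Nodes whose digest log receives the re-logged entry:
    the new master plus every node now holding a prole copy."""
    return [new_master] + [node for node in new_proles if node != new_master]


def process_digest_entry(reader_node: str, current_master: str,
                         current_proles: List[str]) -> Tuple[str, List[str]]:
    """Decide what the dlog reader on reader_node does with one entry.

    A node only ships records for which it is master. If mastership moved
    between writing and reading the entry, the entry is re-logged elsewhere.
    """
    if reader_node == current_master:
        return ("ship", [])
    return ("relog", relog_targets(current_master, current_proles))


if __name__ == "__main__":
    # Entry for record R written on A; by read time, B is master and C holds a prole.
    print(process_digest_entry("A", "B", ["C"]))
    # No migration happened: A is still master and ships the record itself.
    print(process_digest_entry("A", "A", ["B"]))
```

In the first call node A does not appear among the targets, which matches the statement that the record is not re-logged on A unless A now holds a prole copy.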

Notes

In the log file, client side errors and server side errors are logged as errcl and errsrv respectively.

Aug 10 2016 03:19:46 GMT: INFO (xdr): (xdr.c:2023) detail: sh 21755950 ul 0 lg 10878780 rlg 0 lproc 10878780 rproc 0 lkdproc 0 errcl 0 errsrv 0 hkskip 126606 hkf 125801 flat 0
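To watch errcl and errsrv over time, the detail line can be split into counters. The sketch below assumes the line always has the shape shown above, that is, the text "detail: " followed by alternating name and integer tokens. The parse_xdr_detail helper is hypothetical and not part of any Aerospike tool.

```python
from typing import Dict

SAMPLE_LINE = (
    "Aug 10 2016 03:19:46 GMT: INFO (xdr): (xdr.c:2023) detail: "
    "sh 21755950 ul 0 lg 10878780 rlg 0 lproc 10878780 rproc 0 lkdproc 0 "
    "errcl 0 errsrv 0 hkskip 126606 hkf 125801 flat 0"
)


def parse_xdr_detail(line: str) -> Dict[str, int]:
    """Return the name/value pairs that follow 'detail: ' as a dict of ints.

    Lines without the 'detail: ' marker yield an empty dict.
    """
    _, marker, tail = line.partition("detail: ")
    if not marker:
        return {}
    tokens = tail.split()
    # Tokens alternate: name, value, name, value, ...
    return {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}


if __name__ == "__main__":
    counters = parse_xdr_detail(SAMPLE_LINE)
    print("client side errors (errcl):", counters["errcl"])
    print("server side errors (errsrv):", counters["errsrv"])
```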

When using asinfo, the following keys give client side errors in total and per DC respectively: xdr_ship_source_error and dc_ship_source_error.
For server side errors, the following asinfo keys can be used, again in total and per DC: xdr_ship_destination_error and dc_ship_destination_error.
Local read errors are counted under the xdr_read_error asinfo key.
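Once an asinfo response has been captured, the three node-level counters can be pulled out as below. This is a minimal sketch that assumes the response is a semicolon-separated list of key=value pairs. The sample string and its values are invented for illustration, and the helper names are hypothetical. The same parser should work for a per DC response containing the dc_ keys, assuming it uses the same format.

```python
from typing import Dict

# Node-level keys named in this article.
WATCHED_KEYS = (
    "xdr_ship_source_error",
    "xdr_ship_destination_error",
    "xdr_read_error",
)

# Invented sample response; real responses contain many more keys.
SAMPLE_RESPONSE = (
    "xdr_ship_success=1000;xdr_ship_source_error=3;"
    "xdr_ship_destination_error=7;xdr_read_error=1"
)


def parse_info(response: str) -> Dict[str, str]:
    """Split 'k1=v1;k2=v2' into a dict, ignoring fragments without '='."""
    stats = {}
    for pair in response.strip().split(";"):
        key, sep, value = pair.partition("=")
        if sep:
            stats[key] = value
    return stats


def error_counters(response: str) -> Dict[str, int]:
    """Return only the watched error counters that are present, as ints."""
    stats = parse_info(response)
    return {key: int(stats[key]) for key in WATCHED_KEYS if key in stats}


if __name__ == "__main__":
    for key, value in error_counters(SAMPLE_RESPONSE).items():
        print(f"{key} = {value}")
```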

Applies To

Server prior to v. 5.0

Keywords

XDR RELOG RECORD DIGEST CLIENT SERVER FAIL

Timestamp

8/10/16
