XDR Log line messages


#1

XDR Common Log line Details and States

XDR remote cluster States

There are only these 3 states for a remote cluster as seen from an XDR client: CLUSTER_UP, CLUSTER_DOWN and CLUSTER_WINDOW_SHIP

  1. CLUSTER_UP : Normal state with more than one node in remote cluster responding on port 3000.

  2. CLUSTER_DOWN : No connectivity for XDR client over port 3000 to any of the nodes of the remote cluster.

  3. CLUSTER_WINDOW_SHIP: Indicates that the remote cluster suffered an outage and is now up and ready to receive queued objects. XDR will start shipping the queued objects.

  4. CLUSTER_INACTIVE: Temporary state if the source cluster is not yet seeded with remote node IP’s. Can also show up in a running cluster when remote IP’s are unseeded dynamically.

How does a cluster go from CLUSTER_UP to CLUSTER_DOWN state in XDR?

Cluster UP/Down is independent of replication factor. Think of XDR as a regular Aerospike client which uses a seed node to discover all nodes in a cluster and is continuously aware through polling and partition map updates of the state of the cluster. After a certain polling time (every 1 minute) XDR will check for connectivity over port 3000 and determine a remote cluster to be down if not able to connect. XDR does this check multiple times (3 times) to make sure the remote cluster is really down. Once a Cluster is marked as down, XDR will create a separate thread to track the down state. As soon as the remote cluster comes back up and is able to connect, XDR will start shipping the queued objects in the “Window Shipping” queue. XDR goes back in time (with some additional buffer to be on safe side) and starts shipping the queued objects. The lag for the DC that went down will eventually catch up, there could be a sudden drop in lags as the lag head to zero.

See here for more details on how XDR handles local node loss and remote destination loss - Knowledge-Base

XDR states Flow Diagram:

XDR states has the following flow:

CLUSTER_UP -> CLUSTER_DOWN -> CLUSTER_WINDOW_SHIP -> CLUSTER_UP

Cluster State log line definitions

Following log line is used to confirm the statistics and the state of a destination cluster:

INFO (xdr): (xdr.c:607) [DC2]: dc-state CLUSTER_UP timelag-sec 0 lst 1525433886830 mlst 1525433886830 (2018-05-04 11:38:06 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 2 rsas-ms 0.000 rsas-pct 0.0 con 128 errcl 31 errsrv 5963 sz 2

Review server log messages reference for details on these statistics: https://www.aerospike.com/docs/reference/serverlogmessages

Outstanding log line message

Following log line can be used to see the throughput and the outstanding records.

INFO (xdr): (xdr.c:628) summary: throughput 32 inflight 0 dlog-outstanding 0 dlog-delta-per-sec 0.0

Review server log messages reference for details on these statistics: https://www.aerospike.com/docs/reference/serverlogmessages

Throughput log line definitions

Following log line can be used to review details statistics on records that were attempted to ship, re-logged due to remote error and if there are hot-keys when shipping via XDR.

INFO (xdr): (xdr.c:638) detail: sh 7798569 ul 77 lg 15045385 rlg 371 rlgi 1118 rlgo 372 lproc 15045385 rproc 21769 lkdproc 0 errcl 31 errsrv 5963 hkskip 1347 hkf 552 flat 0

Review server log messages reference for details on these statistics: https://www.aerospike.com/docs/reference/serverlogmessages

Waiting to discover log line

Aug 06 2015 21:51:54 GMT: INFO (xdr): (xdr.c::4675) Waiting to discover all the nodes in the 'us-west-1c' remote cluster. Cluster size=3

XDR client is having trouble communicating to all the nodes at the remote location when XDR is started for the first time. One possible cause is missing some entries for dc-int-ext-ipmap (XDR private to public IP mapping at remote location) or a connectivity issue.

Encountered a block with bad magic

Feb 03 2018 14:55:29 GMT: WARNING (xdr): (xdr.c:4608) Sleep & Retry: Encountered a block with bad magic...

“Encountered a block with bad magic” usually means the digestlog file has been corrupted.

The solution is to delete the XDR digestlog-file and restart XDR (requires restarting the asd daemon) in a rolling manner in a cluster (and potentially waiting for migrations to complete between each node if using a version not running the new cluster protocol introduced in 3.13), as per these steps below :

sudo service aerospike stop
delete the xdr digestlog file
sudo service aerospike start

And you may alternatively also be able to instruct XDR to skip the outstanding record by issuing the following command (but this may cause some records not to be shipped by XDR), as discussed below. You should ensure that there are no window shipping and failed node processing process ongoing: https://www.aerospike.com/docs/reference/info/#xdr-command

asinfo -v 'xdr-command:skip-outstanding'

Sleep & Retry: Filesystem error

Jan 30 2018 10:43:51 GMT: WARNING (xdr): (xdr.c:4391) Sleep & Retry: Filesystem error…

This is filesystem error while accessing the digestlog.

Please check dmesg/syslog for system errors, it could be filesystem errors (partition full or driver issues) which need to be resolved at the OS level.

You may have additional message like this: Jan 29 2018 10:45:51 GMT: WARNING (cf:rbuffer): (rbuffer.c:950) Fread failed with error 0 1

The last number is an indication of EOF. The second to last number represent any other error. In this case, the digestlog is likely corrupted so removing the digestlog (and asd restart) may help.

Digest Log Write Failed log line

Aug 07 2015 19:09:46 GMT: WARNING (xdr): (xdr.c::3698) Digest Log Write Failed !!! ... Critical error

The digestlog partition is most likely out of space. You may need to delete the existing digestlog file and recreate it at a new location. Moving the digestlog requires stopping the XDR service, deleting the disgestlog and configuring the new digestlog partition size in aerospike.conf prior to restarting the service.

Couldn’t update the server about XDR last ship time

Aug 10 2015 22:06:54 GMT: WARNING (xdr): (xds.c:625) Couldn't update the server about the XDR last ship time

If XDR is not able to communicate with server (means receiving empty buffer (no writer) on digest_log pipe), It waits collectively for approximately 10 seconds. If still there is no writer, it sends message to server to create new pipe (freshstart). It does 3 attempts to send this message (3 sec). Wait for 2 sec and attempt to open new pipe for read. If fails to open, XDR goes down.

It’s not a XDR crash, XDR shuts itself down, due to not able to communicate with server.

Solution is to find out the status of the destination server, to ensure that it’s reachable.

bad parameters are passed for write of destination DC warning

On version prior to 3.8.x

`WARNING (xdr): (xdr.c::2615) Error: It looks bad parameters are passed for write of destination DC. Maybe namespace config is not correct. Please check.`

On version 3.8.x and above:

Source cluster warning:

WARNING (xdr): (xdr.c:3785) Error: It seems bad parameters are passed for write at DC XXXXX. Check remote DC's errorlog. Code 4: AEROSPIKE_ERR_REQUEST_INVALID at /work/source/modules/ee/xdr/src/as_event.c:528. <Digest>:XXXXXXXX

Destination cluster warning:

WARNING (xdr): (xdr.c::2615) Error: It looks bad parameters are passed for write of destination DC. Maybe namespace config is not correct. Please check

When introducing a new namespace into existing cluster, this warning will be logged if the destination DC does not have the new namespace. This error can also occur during an XDR upgrade from a version prior 3.8.x.

The issue can be resolved by stopping the server and deleting the digestlog or resetting the digeslog with a skip-outstanding (or no-resume restart on versions prior to 3.8.x) and start afresh after fixing the configuration if this was caused by a bad configuration.

asinfo -v 'xdr-command:skip-outstanding'

Got digest packet for an unknown namespace with id

This warning is applicable on for versions prior to 3.8.

WARNING (xdr): (xdr.c::3852) Got digest packet for an unknown namespace with id 3 from named pipe... queuing it for digest logger WARNING (xdr): (xdr.c::3667) Got digest packet for an unknown namespace with id 3 from the queue ... Still log

Above warning messages indicate possible corruption of named-pipe.

Solution:

Create a fresh named-pipe by following these steps :

  1. Stop XDR.

  2. Delete namedpipe - (path is specified in configuration file.)

  3. Restart XDR.

Please do not delete digest-log for this error.

Note: Empty digest-log (digest-log deleted) and starting XDR in resume mode is equivalent to starting XDR in noresume mode.

Keywords

XDR log line states

Timestamp

05/04/2018


How to identify a bad DC that cause XDR throttling