XDR remote data center down but dc-state does not show CLUSTER_DOWN


#1

XDR remote data center down but dc-state does not show CLUSTER_DOWN

Problem Description

When testing XDR, remote data centers are shut down and while tend requests are seen to be failing, the dc-state of the remote data center is not listed as CLUSTER_DOWN.

Aug 12 2016 12:18:22 GMT: INFO (xdr): (as_cluster.c:821) Node BB901BE75270008 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111
Aug 12 2016 12:18:22 GMT: INFO (xdr): (as_cluster.c:821) Node BB91B0BF7270008 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111

Explanation

This is expected behaviour when there are no records being sent to the remote DC. When Aerospike starts, XDR tries to issue tend requests to the remote cluster and reports an error if these fail. XDR does not initiate a full healthcheck until there are writes being sent from the source DC to the remote DC. As there are no writes, there is no need to start a window shipper process and therefore it is not necessary to determine an exact status for the remote cluster. The following log excerpts show the sequence of events when XDR is shipping records. In this example, the local DC is shipping to 2 remote DCs, DC2 and DC3, these are then taken down in turn and brought back in the same order.

Point 1: Both DCs are available and so dc-state is reporting CLUSTER_UP and XDR is shipping and reclaiming space in the digest log as it does so.

Aug 12 2016 12:52:11 GMT: INFO (xdr): (xdr.c:1887) dlog-q: capacity 64 used-elements 0 read-offset 0 write-offset 0
Aug 12 2016 12:52:11 GMT: INFO (xdr): (xdr.c:1773) dlog: reclaimed 31300
Aug 12 2016 12:52:11 GMT: INFO (xdr): (xdr.c:1998) [DC2]: dc-state CLUSTER_UP timelag-sec 24 lst 1471006306795 mlst 1471006309465 (2016-08-12 12:51:49.465 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 7 rsas-ms 285.576 rsas-pct 30.9
Aug 12 2016 12:52:11 GMT: INFO (xdr): (xdr.c:1998) [DC3]: dc-state CLUSTER_UP timelag-sec 24 lst 1471006306795 mlst 1471006309465 (2016-08-12 12:51:49.465 GMT) fnlst 0 (-) wslst 0 (-) shlat-ms 6 rsas-ms 0.055 rsas-pct 1.9
Aug 12 2016 12:52:11 GMT: INFO (xdr): (xdr.c:2015) summary: throughput 7 inflight 0 dlog-outstanding 58700 dlog-delta-per-sec 2860.0

Point 2: DC2 is stopped, tend requests start to fail, the dc-state for DC2 is changed to CLUSTER_DOWN. So that shipping to DC3 can continue uninterrupted, XDR spawns a window shipper thread.

Aug 12 2016 12:52:16 GMT: INFO (xdr): (as_cluster.c:821) Node BB901BE75270008 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111
Aug 12 2016 12:52:17 GMT: INFO (xdr): (as_cluster.c:821) Node BB901BE75270008 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111
Aug 12 2016 12:52:18 GMT: INFO (xdr): (as_cluster.c:821) Node BB901BE75270008 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111
Aug 12 2016 12:52:19 GMT: INFO (xdr): (xdr.c:2938) Connection error when writing to cluster DC2. Checking its health.
Aug 12 2016 12:52:19 GMT: INFO (xdr): (xdr.c:2972) Changing state of cluster DC2 to CLUSTER_DOWN
Aug 12 2016 12:52:19 GMT: WARNING (xdr): (xdr.c:3002) Cluster DC2 is down! Spawning a thread.
Aug 12 2016 12:52:19 GMT: INFO (xdr): (xdr.c:2310) Windowshipper: Added a new window from 1471006009496 (2016-08-12 12:46:49.496 GMT). Waiting for DC to be UP.
Aug 12 2016 12:52:19 GMT: INFO (xdr): (xdr.c:2312) Windowshipper: Cluster DC2: Number of windows on list 1

Point 3: DC3 is taken offline, tend requests fail and eventually the dc-state is changed to CLUSTER_DOWN. As with DC2, a window shipper is started. This is the first outage for both DCs so only one window is required.

Aug 12 2016 12:52:57 GMT: INFO (xdr): (as_cluster.c:821) Node BB901BE75270008 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111
Aug 12 2016 12:52:58 GMT: INFO (xdr): (as_cluster.c:821) Node BB91B0BF7270008 refresh failed: AEROSPIKE_ERR_CLIENT Socket write error: 111
Aug 12 2016 12:52:58 GMT: INFO (xdr): (xdr.c:2938) Connection error when writing to cluster DC3. Checking its health.
Aug 12 2016 12:52:58 GMT: INFO (xdr): (xdr.c:2972) Changing state of cluster DC3 to CLUSTER_DOWN
Aug 12 2016 12:52:58 GMT: WARNING (xdr): (xdr.c:3002) Cluster DC3 is down! Spawning a thread.
Aug 12 2016 12:52:58 GMT: INFO (xdr): (xdr.c:2310) Windowshipper: Added a new window from 1471006033692 (2016-08-12 12:47:13.692 GMT). Waiting for DC to be UP.
Aug 12 2016 12:52:58 GMT: INFO (xdr): (xdr.c:2312) Windowshipper: Cluster DC3: Number of windows on list 1

Point 4: Both remote DCs are restarted, tend reqests begin to succeed and XDR notices the clusters are back. At this stage the window shipper starts to ship records now the cluster is back dc-state changes to show this with the CLUSTER_WINDOW_SHIP state. An estimate for the finish time of the _window shipper_is given.

Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2415) Windowshipper: Cluster DC2 has come up : 0
Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2425) Windowshipper: Changing state of cluster DC2 to CLUSTER_WINDOW_SHIP
Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2330) Setting end marker=1471006694590 (2016-08-12 12:58:14.590 GMT) for the window with start marker=1471006009496 (2016-08-12 12:46:49.496 GMT)
Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2447) Windowshipper: Cluster DC2: For node id 0, start window shipping between 1471006009496 (2016-08-12 12:46:49.496 GMT) and 1471006694590 (2016-08-12 12:58:14.590 GMT)
Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2449) Windowshipper: Cluster DC2: Number of windows on list 1
Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2097) Start xdr_do_windowship
Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2111) Windowshipper: Start digest timestamp 1471006306733 (2016-08-12 12:51:46.733 GMT)
Aug 12 2016 12:53:14 GMT: INFO (xdr): (xdr.c:2156) Data center recovery for DC2 pending, 242500 records left to check (at 0/s, ~0s left)

The same happens for DC3

Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2415) Windowshipper: Cluster DC3 has come up : 0
Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2425) Windowshipper: Changing state of cluster DC3 to CLUSTER_WINDOW_SHIP
Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2330) Setting end marker=1471006708757 (2016-08-12 12:58:28.757 GMT) for the window with start marker=1471006033692 (2016-08-12 12:47:13.692 GMT)
Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2447) Windowshipper: Cluster DC3: For node id 0, start window shipping between 1471006033692 (2016-08-12 12:47:13.692 GMT) and 1471006708757 (2016-08-12 12:58:28.757 GMT)
Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2449) Windowshipper: Cluster DC3: Number of windows on list 1
Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2097) Start xdr_do_windowship
Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2111) Windowshipper: Start digest timestamp 1471006306733 (2016-08-12 12:51:46.733 GMT)
Aug 12 2016 12:53:28 GMT: INFO (xdr): (xdr.c:2156) Data center recovery for DC3 pending, 258500 records left to check (at 0/s, ~0s left)

Point 5: When the window shippers have finished for each cluster respectively, dc-state changes back to CLUSTER_UP

Aug 12 2016 12:55:33 GMT: INFO (xdr): (xdr.c:2476) Windowshipper: Cluster DC2: For node id 0, done window shipping between 1471006009496 (2016-08-12 12:46:49.496 GMT) and 1471006694590 (2016-08-12 12:58:14.590 GMT). Shipped 356952 records
Aug 12 2016 12:55:33 GMT: INFO (xdr): (xdr.c:2486) Windowshipper: Cluster DC2: Number of windows on list 0
Aug 12 2016 12:55:33 GMT: INFO (xdr): (xdr.c:2490) Windowshipper: Cluster DC2: Done with shipping of all windows. Shipped 356952 records
Aug 12 2016 12:55:33 GMT: INFO (xdr): (xdr.c:2528) Windowshipper: Changing state of cluster DC2 to CLUSTER_UP

And

Aug 12 2016 12:55:39 GMT: INFO (xdr): (xdr.c:2476) Windowshipper: Cluster DC3: For node id 0, done window shipping between 1471006033692 (2016-08-12 12:47:13.692 GMT) and 1471006708757 (2016-08-12 12:58:28.757 GMT). Shipped 356952 records
Aug 12 2016 12:55:39 GMT: INFO (xdr): (xdr.c:2486) Windowshipper: Cluster DC3: Number of windows on list 0
Aug 12 2016 12:55:39 GMT: INFO (xdr): (xdr.c:2490) Windowshipper: Cluster DC3: Done with shipping of all windows. Shipped 356952 records
Aug 12 2016 12:55:39 GMT: INFO (xdr): (xdr.c:2528) Windowshipper: Changing state of cluster DC3 to CLUSTER_UP

Solution

There is no need to resolve this as XDR is behaving as designed. If there is no data to ship and no writes are failing, the dc-state is not relevant.

Notes

  • XDR log messages are explained in detail here:

http://www.aerospike.com/docs/reference/serverlogmessages

Keywords

XDR DC-STATE CLUSTER_UP CLUSTER_DOWN

Timestamp

12th August 2016


How to identify a bad DC that cause XDR throttling