asinfo with single XDR remote DC does not show DC state correctly when remote DC is down
When XDR is shipping to a single remote DC, and that DC is the only DC entry in the aerospike.conf file, the DC is not marked as down when it goes down: retrieving the dc_state metric with asinfo still does not show it as down. Although asinfo does not report the DC as unavailable, the log shows messages such as those below.
Jul 28 2016 21:54:47 GMT: INFO (xdr): (xdr.c:2895) Connection error when writing to cluster remote2. Checking its health.
Jul 28 2016 21:54:47 GMT: WARNING (xdr): (xdr.c:2908) Cluster remote2 is down. Not changing state as it is the only destination. Retrying...
Jul 28 2016 21:54:48 GMT: INFO (xdr): (as_cluster.c:795) Node BB9FF57AA3E16FA refresh failed: AEROSPIKE_ERR_TIMEOUT Network timeout
As the log message indicates, this is designed behaviour.
Cluster remote2 is down. Not changing state as it is the only destination. Retrying...
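Because asinfo does not report the DC as down in this case, a monitor could look for this warning in the log instead. The following is a minimal Python sketch, not an Aerospike tool: the pattern is built only from the WARNING text quoted above, and the function name down_single_dcs is hypothetical. The wording of the message may differ between server versions.

```python
import re

# Pattern built from the WARNING line quoted in this article.
# The capture group is the DC name (for example "remote2").
ONLY_DESTINATION_DOWN = re.compile(
    r"Cluster (\S+) is down\. Not changing state as it is the only destination\."
)


def down_single_dcs(log_lines):
    """Return the set of DC names the log reports as down but left unmarked."""
    names = set()
    for line in log_lines:
        match = ONLY_DESTINATION_DOWN.search(line)
        if match:
            names.add(match.group(1))
    return names
```

Passing the lines of the Aerospike log file to down_single_dcs gives the names of any DCs that are down while being the only destination.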
When there is only a single DC, there is no need to mark it as down when it becomes unavailable. DCs are marked as down because XDR uses lock-step shipping: shipping to all DCs proceeds at a constant pace, that of the slowest DC. If a DC goes down, the pace of the slowest DC is exactly 0, so shipping to all DCs would stop if this were left unchecked. Marking the DC as down allows a window shipper to spawn and take care of the unavailable DC while shipping to the other DCs continues. For this reason, a DC is only marked as down if multiple DCs are defined. As none of this is required for a single DC, the DC is never marked as down and no window shipper is ever created. XDR continues to try to ship to the downed DC, relogging records if they fail.
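The lock-step argument above can be illustrated with a toy Python model. This is only a sketch of the reasoning, not XDR's implementation: the function names and the shipping rates are hypothetical, and a rate of 0 stands for a DC that is down.

```python
def lock_step_pace(dc_rates):
    """Pace applied to every DC under lock-step shipping: the slowest DC's rate."""
    return min(dc_rates.values())


def pace_with_down_dcs_excluded(dc_rates):
    """Pace once DCs with a rate of 0 (down) are taken out of the lock step."""
    up_rates = [rate for rate in dc_rates.values() if rate > 0]
    return min(up_rates) if up_rates else 0


# Two DCs, one of them down: lock step stalls unless the down DC is excluded.
two_dcs = {"dc1": 500, "dc2": 0}
# One DC, down: excluding it changes nothing, since nothing else is held back.
one_dc = {"dc1": 0}
```

With two_dcs, the lock-step pace is 0 until the down DC is excluded; with one_dc, the pace is 0 either way, which is why marking a single DC as down brings no benefit.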
This behaviour is expected and does not require a fix. If asinfo is being used to monitor dc-state, the behaviour can be changed by configuring a second, skeleton data center in aerospike.conf. Even though that definition is inactive, XDR considers it to be a second data center and adjusts its behaviour accordingly. In that scenario the following would happen:
- The remote DC would be marked as down according to the asinfo dc-state output
- A window shipper would be created to serve the remote DC
Therefore, if monitoring is based around the asinfo call returning dc-state even in a single remote DC scenario, adding a skeleton DC will cause XDR to mark the DC as down if it becomes unavailable.
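For monitoring built around this call, the state has to be extracted from the asinfo response. The following Python sketch assumes the response is the usual semicolon-separated list of key=value pairs; the function names are hypothetical, and the sample state value in the comment is an illustrative assumption, not output taken from this article. Both spellings of the metric name used in this article are accepted.

```python
def parse_info(response):
    """Split a semicolon-separated key=value response into a dict."""
    stats = {}
    for item in response.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            stats[key] = value
    return stats


def dc_state(response):
    """Return the DC state from a response, or None if it is not present."""
    stats = parse_info(response)
    # Accept either spelling of the metric name.
    return stats.get("dc_state", stats.get("dc-state"))


# Hypothetical response string, for illustration only:
#   dc_state("dc_state=CLUSTER_UP;some_other_stat=1")
```

A monitor would pass the raw asinfo response for the DC to dc_state and alert on the value it returns.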
Another common metric used to monitor XDR is xdr_outstanding_objects. It is very important to note that adding an inactive skeleton DC will affect this metric. xdr_outstanding_objects only counts objects that are being shipped by the dlog reader, which is used to ship to DCs that are up and available.
An improvement for this behavior is under review and tracked through internal jira AER-5193.
The proposed solution (skeleton DC) has the following drawback: in the case of lag, having a window shipper will cause some records to be shipped twice.
General information about Aerospike metrics.
- dc-state details.
- xdr_outstanding_objects details.
- How to configure XDR including skeleton DC definitions.
XDR SINGLE DC REMOTE DOWN DC-STATE INCORRECT AEROSPIKE_ERR_TIMEOUT