Monitoring XDR on a live Aerospike source cluster



Details

XDR (Cross-Datacenter Replication) is an Aerospike Enterprise feature designed to synchronize clusters asynchronously over higher-latency links. For more details on XDR, see XDR Architecture.

This knowledge-base article covers the key metrics to monitor for XDR performance on a source cluster.

For server version 3.9.0 and above

General Performance

  1. Average latency to ship a record to remote Aerospike cluster: xdr_ship_latency_avg

  2. Maximum of time lag across all remote data centers: xdr_timelag

  3. Number of outstanding records: xdr_ship_outstanding_objects

  4. Current throughput: xdr_throughput

  5. Free digest-log percentage: dlog_free_pct

Errors:

  1. Number of records being relogged at the source: dlog_relogged

  2. Number of errors while shipping records: xdr_ship_source_error

  3. Number of errors from the remote while shipping: xdr_ship_destination_error

  4. Number of local read errors: xdr_read_error
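The error metrics above are most useful when watched for change between two samples rather than read as absolute values. A minimal sketch in Python, assuming these statistics are cumulative counters and that two samples have already been collected as dictionaries; the function name and the sample values are hypothetical:

```python
# Error-related metric names for server version 3.9.0 and above.
ERROR_COUNTERS = (
    "dlog_relogged",
    "xdr_ship_source_error",
    "xdr_ship_destination_error",
    "xdr_read_error",
)


def error_deltas(previous, current):
    """Return how much each error counter grew between two samples.

    Assumption: the counters are cumulative. If a counter is lower than
    in the previous sample (for example after a restart), the current
    value is used as the delta.
    """
    deltas = {}
    for name in ERROR_COUNTERS:
        before = int(previous.get(name, 0))
        after = int(current.get(name, 0))
        deltas[name] = after - before if after >= before else after
    return deltas


# Hypothetical samples taken some time apart.
earlier = {"dlog_relogged": "10", "xdr_ship_source_error": "2"}
later = {"dlog_relogged": "25", "xdr_ship_source_error": "2"}

for name, delta in error_deltas(earlier, later).items():
    if delta > 0:
        print(f"{name} increased by {delta}")
```

A steadily growing delta is generally more informative than a large but static total.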

At the per-DC level:

  1. Moving average of shipping latency for the specific datacenter: dc_ship_latency_avg

  2. Time lag for this specific datacenter: dc_timelag

Digest-log latency for reads and writes: these metrics no longer exist in version 3.9.

Latency to read the records from the local Aerospike server:

  1. Moving average latency to read a record/batch of records from local Aerospike server: xdr_read_latency_avg
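As a rough illustration of how the 3.9.0 metrics might be checked, the Python sketch below parses a semicolon-delimited key=value statistics string (the form in which Aerospike info commands return statistics) and flags two of the general performance metrics. The threshold values, the function names, and the sample string are hypothetical, and xdr_timelag is assumed to be reported in seconds.

```python
def parse_info_stats(raw):
    """Parse a 'key=value;key=value' statistics string into a dict."""
    stats = {}
    for pair in raw.strip().split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.split("=", 1)
        stats[key] = value
    return stats


# Hypothetical alert thresholds; tune them for your own deployment.
MAX_TIMELAG_SECS = 60
MIN_DLOG_FREE_PCT = 20


def check_xdr_health(stats):
    """Return a list of warnings based on 3.9.0+ metric names."""
    warnings = []

    timelag = int(stats.get("xdr_timelag", 0))
    if timelag > MAX_TIMELAG_SECS:
        warnings.append(f"xdr_timelag is {timelag} (limit {MAX_TIMELAG_SECS})")

    # Strip a trailing '%' defensively in case the value carries one.
    dlog_free = int(str(stats.get("dlog_free_pct", "100")).rstrip("%"))
    if dlog_free < MIN_DLOG_FREE_PCT:
        warnings.append(f"dlog_free_pct is {dlog_free} (minimum {MIN_DLOG_FREE_PCT})")

    return warnings


# Hypothetical statistics string for demonstration.
sample = "xdr_timelag=120;dlog_free_pct=15;xdr_throughput=3500"
for warning in check_xdr_health(parse_info_stats(sample)):
    print(warning)
```

The same pattern extends to the other metrics listed above, such as xdr_ship_outstanding_objects or xdr_ship_latency_avg.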

For server versions from 3.8.1 up to (but not including) 3.9.0

General Performance

  1. Average latency to ship a record to remote Aerospike cluster: latency_avg_ship

  2. Maximum of time lag across all remote data centers: xdr_timelag

  3. Number of outstanding records: stat_recs_outstanding

  4. Current throughput: cur_throughput

  5. Free digest-log percentage: free-dlog-pct

Errors:

  1. Number of records being relogged at the source: stat_recs_relogged

  2. Number of errors while shipping records: err_ship_client

  3. Number of errors from the remote while shipping: err_ship_server

  4. Number of local read errors: local_recs_error

At the per-DC level:

  1. Moving average of shipping latency for the specific datacenter: dc_latency_avg_ship

  2. Time lag for this specific datacenter: dc_timelag

Digest-log latency for reads and writes (note that these were removed in version 3.9 because they were not considered critical to monitor):

  1. Moving average latency to read from the digest log: latency_avg_dlogread

  2. Moving average latency to write to the digest log: latency_avg_dlogwrite

Latency to read the records from the local Aerospike server:

  1. Moving average latency to read a record/batch of records from local Aerospike server: local_recs_fetch_avg_latency

For server version 3.8.0 and earlier

General Performance

  1. Average latency to ship a record to remote Aerospike cluster: latency_avg_ship

  2. Maximum of time lag across all remote data centers: timediff_lastship_cur_secs

  3. Number of outstanding records: stat_recs_outstanding

  4. Current throughput: cur_throughput

  5. Free digest-log percentage: free-dlog-pct

Errors:

  1. Number of records being relogged at the source: stat_recs_relogged

  2. Number of errors while shipping records: err_ship_client

  3. Number of errors from the remote while shipping: err_ship_server

  4. Number of local read errors: local_recs_error

At the per-DC level:

  • DC-level statistics were introduced in server version 3.8.1 and are not available in earlier versions.

Digest-log latency for reads and writes (note that these were removed in version 3.9 because they were not considered critical to monitor):

  1. Moving average latency to read from the digest log: latency_avg_dlogread

  2. Moving average latency to write to the digest log: latency_avg_dlogwrite

Latency to read the records from the local Aerospike server:

  1. Moving average latency to read a record/batch of records from local Aerospike server: local_recs_fetch_avg_latency
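Because the same measurements carry different names across server versions, a monitoring script that spans mixed versions may find it convenient to normalize them. The Python sketch below maps the older names to their 3.9.0 equivalents, pairing names that share the same description in the lists above; the dictionary and function names are hypothetical, and the digest-log latency metrics are left untouched since they have no 3.9.0 counterpart.

```python
# Older metric name -> 3.9.0+ metric name, paired by identical
# descriptions in the version lists of this article.
LEGACY_TO_CURRENT = {
    "latency_avg_ship": "xdr_ship_latency_avg",
    "timediff_lastship_cur_secs": "xdr_timelag",  # 3.8.0 and earlier
    "stat_recs_outstanding": "xdr_ship_outstanding_objects",
    "cur_throughput": "xdr_throughput",
    "free-dlog-pct": "dlog_free_pct",
    "stat_recs_relogged": "dlog_relogged",
    "err_ship_client": "xdr_ship_source_error",
    "err_ship_server": "xdr_ship_destination_error",
    "local_recs_error": "xdr_read_error",
    "dc_latency_avg_ship": "dc_ship_latency_avg",
    "local_recs_fetch_avg_latency": "xdr_read_latency_avg",
}


def normalize_stats(stats):
    """Rename legacy metric keys to 3.9.0+ names; keep others as-is."""
    return {LEGACY_TO_CURRENT.get(name, name): value
            for name, value in stats.items()}


# Hypothetical sample from a pre-3.9.0 node.
old_sample = {"cur_throughput": "1200", "err_ship_server": "0"}
print(normalize_stats(old_sample))
```

Names that are already current, such as xdr_timelag or dc_timelag on 3.8.1 and above, pass through unchanged.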

Reference

See the metrics reference for details on the statistics above and on other available statistics: http://www.aerospike.com/docs/reference/metrics

Keywords

XDR MONITOR STATISTICS

Timestamp

07/27/2016

