Monitoring XDR on a live Aerospike source cluster

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Details

XDR (Cross-Datacenter Replication) is an Aerospike Enterprise feature designed to replicate data asynchronously between clusters over higher-latency links. For more details on XDR, see XDR Architecture.

This article covers the key metrics to monitor for XDR performance on a source cluster.

For server version 5.x and above

General Performance:

  1. Average latency to ship a record to remote Aerospike cluster: latency_ms

  2. Time lag for a given data center: lag

  3. Number of records pending completion: in_progress

  4. Current throughput: throughput

  5. Time taken to process records across partitions in one lap: lap_us

  6. Number of write requests in the XDR in-memory queue: in_queue

  7. Number of records successfully shipped: success

  8. Number of partitions that are recovered by reducing the primary index of that partition: recoveries

Errors:

  1. Number of records being retried at the source due to connection reset: retry_conn_reset

  2. Number of records being retried at the source due to a temporary error returned by the destination node: retry_dest

  3. Number of records abandoned due to permanent errors returned by the destination node: abandoned
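
The metrics above lend themselves to a simple periodic check. The following is a minimal Python sketch, not an Aerospike-provided tool: the function name `xdr_warnings` and both thresholds are hypothetical, and it assumes the error statistics behave as running counters, so it compares two samples instead of looking at absolute values.

```python
def xdr_warnings(prev, curr, max_lag=10, max_in_queue=100000):
    """Compare two XDR stat samples (dicts of name -> number) and list concerns.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    warnings = []

    # Gauges: check the current value against a limit.
    if curr.get("lag", 0) > max_lag:
        warnings.append(f"lag is {curr['lag']} (limit {max_lag})")
    if curr.get("in_queue", 0) > max_in_queue:
        warnings.append(f"in_queue is {curr['in_queue']} (limit {max_in_queue})")

    # Error statistics: assumed to be running counters, so report growth only.
    # A negative difference (for example after a restart) is ignored.
    for name in ("retry_conn_reset", "retry_dest", "abandoned"):
        delta = curr.get(name, 0) - prev.get(name, 0)
        if delta > 0:
            warnings.append(f"{name} increased by {delta}")

    return warnings
```

Calling `xdr_warnings(previous_sample, current_sample)` at a fixed interval returns an empty list when nothing needs attention. Suitable limits depend on the workload and on the link between the clusters.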

Monitoring info command:

XDR in server version 5 introduces the get-stats info command.

It allows monitoring of DC-level stats:

asinfo -v 'get-stats:context=xdr;dc=<DCNAME>' -l

and namespace level stats for XDR:

asinfo -v 'get-stats:context=xdr;dc=<DCNAME>;namespace=<NAMESPACE>' -l
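
When these commands are scripted, the info string can be assembled from the DC and namespace names. The sketch below assumes Python; the helper name `xdr_stats_command` is hypothetical, and it only builds the argument list for the two commands shown above without executing anything.

```python
def xdr_stats_command(dc, namespace=None):
    """Return the asinfo argument list for an XDR get-stats request."""
    if not dc:
        raise ValueError("a DC name is required")

    info = f"get-stats:context=xdr;dc={dc}"
    if namespace:
        # Namespace-level stats add a namespace= field after the DC name.
        info += f";namespace={namespace}"

    # -l prints one statistic per line, as in the commands above.
    return ["asinfo", "-v", info, "-l"]
```

On a host where asinfo is installed, the returned list can be passed to `subprocess.run(..., capture_output=True, text=True)`. Passing a list instead of a single shell string avoids quoting problems with the semicolons in the info string.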

Examples:

Admin> asinfo -v 'get-stats:context=xdr;dc=REMOTE_DC_1' -l
ubuntu-bionic:3000 (1.1.1.201) returned:
lag=0
in_queue=0
in_progress=0
success=0
abandoned=0
not_found=0
filtered_out=0
retry_conn_reset=0
retry_dest=0
recoveries=4096
recoveries_pending=0
hot_keys=0
uncompressed_pct=0.000
compression_ratio=1.000
throughput=0
latency_ms=0
lap_us=1629

Admin> asinfo -v 'get-stats:context=xdr;dc=REMOTE_DC_1;namespace=test' -l
ubuntu-bionic:3000 (1.1.1.201) returned:
lag=0
in_queue=0
in_progress=0
success=0
abandoned=0
not_found=0
filtered_out=0
retry_conn_reset=0
retry_dest=0
recoveries=4096
recoveries_pending=0
hot_keys=0
uncompressed_pct=0.000
compression_ratio=1.000
throughput=0
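
Output in this form is straightforward to turn into a dictionary for a monitoring script. The following is a minimal Python sketch; the function name `parse_xdr_stats` is hypothetical, and it assumes the `-l` layout shown above (a header line followed by one name=value pair per line, without the `Admin>` prompt line).

```python
def parse_xdr_stats(output):
    """Parse `asinfo ... -l` output (one name=value per line) into a dict."""
    stats = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and headers such as "host:3000 (ip) returned:".
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        try:
            # Most values are integers; a few (e.g. compression_ratio) are floats.
            stats[name] = float(value) if "." in value else int(value)
        except ValueError:
            stats[name] = value  # keep anything non-numeric as text
    return stats


sample = """ubuntu-bionic:3000 (1.1.1.201) returned:
lag=0
recoveries=4096
compression_ratio=1.000
lap_us=1629"""
print(parse_xdr_stats(sample))
```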

For server versions from 3.9.0 up to, but not including, 5.x

General Performance:

  1. Average latency to ship a record to remote Aerospike cluster: xdr_ship_latency_avg

  2. Maximum of time lag across all remote data centers: xdr_timelag

  3. Number of outstanding records: xdr_ship_outstanding_objects

  4. Current throughput: xdr_throughput

  5. Free digest-log percentage: dlog_free_pct

  6. Number of write requests initiated by XDR that succeeded on the namespace on this node: xdr_write_success

Errors:

  1. Number of records being relogged at the source: dlog_relogged

  2. Number of errors while shipping records: xdr_ship_source_error

  3. Number of errors from the remote while shipping: xdr_ship_destination_error

  4. Number of local read errors: xdr_read_error

At the per-DC level:

  1. Moving average of shipping latency for the specific datacenter: dc_ship_latency_avg

  2. Time lag for this specific datacenter: dc_timelag

Digest-log latency for reads and writes: these statistics no longer exist as of version 3.9.

Latency to read the records from the local Aerospike server:

  1. Moving average latency to read a record/batch of records from local Aerospike server: xdr_read_latency_avg

For server versions between 3.8.1 and 3.9.0

General Performance:

  1. Average latency to ship a record to remote Aerospike cluster: latency_avg_ship

  2. Maximum of time lag across all remote data centers: xdr_timelag

  3. Number of outstanding records: stat_recs_outstanding

  4. Current throughput: cur_throughput

  5. Free digest-log percentage: free-dlog-pct

Errors:

  1. Number of records being relogged at the source: stat_recs_relogged

  2. Number of errors while shipping records: err_ship_client

  3. Number of errors from the remote while shipping: err_ship_server

  4. Number of local read errors: local_recs_error

At the per-DC level:

  1. Moving average of shipping latency for the specific datacenter: dc_latency_avg_ship

  2. Time lag for this specific datacenter: dc_timelag

Digest-log latency for reads and writes (these statistics were removed in version 3.9, as they were not considered critical to monitor):

  1. Moving average latency to read from the digest log: latency_avg_dlogread

  2. Moving average latency to write to the digest log: latency_avg_dlogwrite

Latency to read the records from the local Aerospike server:

  1. Moving average latency to read a record/batch of records from local Aerospike server: local_recs_fetch_avg_latency

For server version 3.8.0 and earlier

General Performance:

  1. Average latency to ship a record to remote Aerospike cluster: latency_avg_ship

  2. Maximum of time lag across all remote data centers: timediff_lastship_cur_secs

  3. Number of outstanding records: stat_recs_outstanding

  4. Current throughput: cur_throughput

  5. Free digest-log percentage: free-dlog-pct

Errors:

  1. Number of records being relogged at the source: stat_recs_relogged

  2. Number of errors while shipping records: err_ship_client

  3. Number of errors from the remote while shipping: err_ship_server

  4. Number of local read errors: local_recs_error

At the per-DC level:

  • Not available. DC-level statistics were introduced in server version 3.8.1.

Digest-log latency for reads and writes (these statistics were removed in version 3.9, as they were not considered critical to monitor):

  1. Moving average latency to read from the digest log: latency_avg_dlogread

  2. Moving average latency to write to the digest log: latency_avg_dlogwrite

Latency to read the records from the local Aerospike server:

  1. Moving average latency to read a record/batch of records from local Aerospike server: local_recs_fetch_avg_latency
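
Because the statistic names change between server versions, a monitoring script that targets mixed fleets needs a lookup from version to name. The following Python sketch copies the names from the lists above. The labels (`ship_latency`, `lag`, `outstanding`, `throughput`) and the function name `stat_name` are hypothetical, and it assumes that 3.9.0 and 3.8.1 each belong to the range they open.

```python
# Minimum server version for each naming scheme, newest first.
XDR_STAT_NAMES = [
    ((5, 0, 0), {"ship_latency": "latency_ms",
                 "lag": "lag",
                 "outstanding": "in_progress",
                 "throughput": "throughput"}),
    ((3, 9, 0), {"ship_latency": "xdr_ship_latency_avg",
                 "lag": "xdr_timelag",
                 "outstanding": "xdr_ship_outstanding_objects",
                 "throughput": "xdr_throughput"}),
    ((3, 8, 1), {"ship_latency": "latency_avg_ship",
                 "lag": "xdr_timelag",
                 "outstanding": "stat_recs_outstanding",
                 "throughput": "cur_throughput"}),
    ((0, 0, 0), {"ship_latency": "latency_avg_ship",
                 "lag": "timediff_lastship_cur_secs",
                 "outstanding": "stat_recs_outstanding",
                 "throughput": "cur_throughput"}),
]


def stat_name(label, version):
    """Return the statistic name for a label on a given server version string."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    parts += (0,) * (3 - len(parts))  # pad "4.9" to (4, 9, 0)
    for minimum, names in XDR_STAT_NAMES:
        if parts >= minimum:
            return names[label]
    raise ValueError(f"unsupported version: {version}")
```

For example, `stat_name("lag", "4.9")` looks up the name used before version 5. Note that the version 5 `lag` statistic is reported per data center, while the older names report the maximum across all remote data centers, so the values are not directly comparable.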

Reference

See the Aerospike metrics reference for details on the statistics above and on other available statistics.

Keywords

XDR MONITOR STATISTICS

Timestamp

June 2020