Ran out of queue space... XDR cannot keep up with write

Aerospike_Knowledge · June 16, 2017, 11:13pm

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Problem Description

Aerospike log file contains this warning:

WARNING (xdr): (xdr_dlog.c:568) (repeated:173276) XDR digestlog cannot keep up with writes. Dropping record.

Older versions would have the following:

WARNING (xdr): (xdr.c:5021) Ran out of queue space... XDR cannot keep up with write .. some records may be lost!!!

Explanation

The warning refers to an internal in-memory queue used to batch digest log entries before they are persisted to disk. The queue holds up to 1,000,000 entries; when it is full, new entries are dropped and the server prints the message above.
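The behavior described above can be pictured with a toy model. This is only an illustrative sketch, not Aerospike's implementation: the class name `BoundedDigestQueue`, its methods, and the `dropped` counter are all hypothetical, and the capacity is shrunk from 1,000,000 to 5 so the overflow is easy to see.

```python
from collections import deque


class BoundedDigestQueue:
    """Toy model (hypothetical) of a fixed-capacity queue that drops entries when full."""

    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.entries = deque()
        self.dropped = 0  # stands in for an overflow counter

    def push(self, digest):
        """Queue one entry; return False and count a drop if the queue is full."""
        if len(self.entries) >= self.capacity:
            self.dropped += 1
            return False
        self.entries.append(digest)
        return True

    def flush(self, max_entries):
        """Simulate the writer persisting up to max_entries to the digest log."""
        count = min(max_entries, len(self.entries))
        for _ in range(count):
            self.entries.popleft()
        return count


# Eight writes arrive before the flusher runs, but only five fit.
queue = BoundedDigestQueue(capacity=5)
accepted = [queue.push(f"digest-{i}") for i in range(8)]
flushed = queue.flush(max_entries=2)
print(accepted, queue.dropped, flushed, len(queue.entries))
```

The point of the sketch is that drops start as soon as producers outpace the flusher for long enough to fill the queue, regardless of why the flusher is slow.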

Here are the three common situations that would cause this:

The filesystem partition is full, so the digest log, which is a sparse file, cannot expand. In this case the following WARNING also appears in the log file:
```
WARNING (xdr): (xdr_dlog.c:390) Digest Log Write Failed !!! ... Critical error
```

Older Aerospike versions have a different line number:

```
WARNING (xdr): (xdr.c:4887) Digest Log Write Failed !!! ... Critical error
```

The disk is not keeping up, so the internal queue fills faster than its entries can be written to the digest log. This can be due to a slow disk, a remote mount, or simply too much data being written to a potentially shared partition.
The last case is a bug (AER-5617), which was addressed in Enterprise Edition release 3.14.1.1.

When XDR is enabled but the remote DCs are all INACTIVE, XDR does not reclaim processed entries from the digest log, causing it to grow until it reaches its full size (and starts overwriting older records). As the digest log grows, the internal logic that determines the last ship time can take longer and, because it runs under a lock, it prevents new entries from being flushed to disk. If the load is high enough that digest log entries accumulate in the internal queue until it reaches its limit, this WARNING message is triggered (ul is 1000101 in the example below).

Apr 24 2017 10:00:26 GMT-0700: INFO (xdr): (xdr.c:2027) sh 0 : ul 65 : lg 7516764045 : rlg 0 : lproc 7516764000 : rproc 479081584 : lkdproc 0 : errcl 0 : errsrv 0 : hkskip 1250365174 745690834 : flat 0
Apr 24 2017 10:02:25 GMT-0700: INFO (xdr): (xdr.c:1773) Reclaimed 0 records space in digest log...
Apr 24 2017 10:02:25 GMT-0700: INFO (xdr): (xdr.c:2027) sh 0 : ul 1000101 : lg 7516889145 : rlg 0 : lproc 7516889100 : rproc 479081584 : lkdproc 0 : errcl 0 : errsrv 0 : hkskip 1250365641 745691255 : flat 0
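To spot this condition in a log file, the periodic XDR statistics line can be parsed and its `ul` value compared with the queue limit. The following is a minimal sketch that assumes the line layout shown in the example above (fields separated by ` : `, emitted from `xdr.c`) and assumes, as the example suggests, that `ul` reflects the number of entries waiting in the internal queue. The function names and the `QUEUE_LIMIT` constant are hypothetical.

```python
import re

QUEUE_LIMIT = 1_000_000  # queue size stated in this article

# Matches the statistics line shown above; the layout is taken from that example only.
STAT_LINE = re.compile(r"\(xdr\.c:\d+\)\s+(sh \d+ : .+)$")


def parse_xdr_stat_line(line):
    """Return {field: [int, ...]} for an XDR statistics line, or None for other lines."""
    match = STAT_LINE.search(line)
    if match is None:
        return None
    fields = {}
    for part in match.group(1).split(" : "):
        name, *values = part.split()
        fields[name] = [int(value) for value in values]
    return fields


def queue_is_full(fields, limit=QUEUE_LIMIT):
    """True when the (assumed) queue depth 'ul' has reached the limit."""
    return fields["ul"][0] >= limit


# The last statistics line from the example above.
sample = (
    "Apr 24 2017 10:02:25 GMT-0700: INFO (xdr): (xdr.c:2027) sh 0 : ul 1000101 : "
    "lg 7516889145 : rlg 0 : lproc 7516889100 : rproc 479081584 : lkdproc 0 : "
    "errcl 0 : errsrv 0 : hkskip 1250365641 745691255 : flat 0"
)
fields = parse_xdr_stat_line(sample)
print(fields["ul"], queue_is_full(fields))
```

Lines that are not statistics lines, such as the "Reclaimed 0 records space in digest log..." entry, return `None` and can simply be skipped.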

Solution

Refer to the Digestlog partition out of space article for how to handle situations where the digestlog runs out of space.
If the remote datacenter is out of sync as a result, refer to the How to recover from long term data center outages article.
If the digestlog is not keeping up, note that you should NOT use direct disk-based storage for xdr-digestlog-path. The digestlog is designed to work very well on files. Keeping the digestlog directly on a raw disk may result in writes not keeping up. File-backed storage allows more writes to happen, because: a) reads will, in most cases, be served directly from the filesystem's in-RAM cache, and b) bursts of writes will land in the filesystem dirty cache before being slowly flushed to disk.

If you use raw disk-backed storage directly, you cannot take advantage of these features. For example, if you intend to use /dev/sdb for your digestlog, first create a filesystem on /dev/sdb, then mount /dev/sdb somewhere (e.g. on /xdr) and add the mount to /etc/fstab. Then configure xdr-digestlog-path /xdr/digestlog.
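A quick pre-flight check can catch the two storage problems discussed in this article: a digestlog path that points at a raw block device, and a filesystem with too little free space for the sparse digestlog file to grow. The sketch below is only an illustration using the Python standard library; the function name `check_digestlog_path` and the 1 GiB default threshold are arbitrary assumptions, not Aerospike recommendations.

```python
import os
import shutil
import stat


def check_digestlog_path(path, min_free_bytes=1 << 30):
    """Return a list of warnings for a candidate xdr-digestlog-path location."""
    warnings = []

    # A raw device such as /dev/sdb bypasses the filesystem cache entirely.
    if os.path.exists(path) and stat.S_ISBLK(os.stat(path).st_mode):
        warnings.append(
            f"{path} is a block device; use a file on a mounted filesystem instead"
        )
        return warnings

    # The digestlog is a sparse file, so it needs free space on its filesystem to grow.
    parent = os.path.dirname(os.path.abspath(path))
    free = shutil.disk_usage(parent).free
    if free < min_free_bytes:
        warnings.append(
            f"only {free} bytes free under {parent}; the digestlog may be unable to expand"
        )
    return warnings


if __name__ == "__main__":
    # /xdr/digestlog is the example path from this article; the directory must exist.
    for message in check_digestlog_path("/xdr/digestlog"):
        print("WARNING:", message)
```

An empty list means neither problem was detected; it does not say anything about whether the underlying disk is fast enough for the write load.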

Notes

The metric to track for this situation is xdr_queue_overflow_error.
The digest log is implemented as a circular buffer and will overwrite old records. The metric dlog_overwritten_error keeps track of digest log overwrites and is not related to the internal queue.
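Both counters can be pulled out of a statistics dump for alerting. The sketch below assumes the statistics arrive as a single semicolon-separated string of `name=value` pairs and that the queue counter is exposed as `xdr_queue_overflow_error`; both are assumptions to verify against your server version. The function names and the sample values are made up for illustration.

```python
def parse_stats(raw):
    """Parse a 'name=value;name=value' string into a dict of strings."""
    stats = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            name, value = pair.split("=", 1)
            stats[name] = value
    return stats


def xdr_overflow_counters(raw):
    """Return the two counters discussed above, defaulting to 0 when absent."""
    stats = parse_stats(raw)
    names = ("xdr_queue_overflow_error", "dlog_overwritten_error")
    return {name: int(stats.get(name, 0)) for name in names}


# Made-up sample input; real output contains many more statistics.
sample_stats = "xdr_queue_overflow_error=12;dlog_overwritten_error=0;other_stat=abc"
counters = xdr_overflow_counters(sample_stats)
print(counters)
```

A non-zero xdr_queue_overflow_error points at the internal queue described in this article, while a non-zero dlog_overwritten_error points at the on-disk circular buffer, so the two should be alerted on separately.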

Applies To

Server prior to v. 5.0

Keywords

XDR DIGESTLOG OVERFLOW

Timestamp

July 12 2019
