When XDR bin convergence is enabled, Aerospike tracks changes by storing an LUT (Last Update Time) value for each bin. This adds an overhead of 7 bytes per bin:
- 1 byte is for the src-id, which bin convergence uses as a tie-breaker when writes from the XDR source and the XDR destination collide.
- The other 6 bytes are for the LUT of the bin.
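As a rough illustration of how these 7 bytes could be used, the sketch below packs a bin's LUT and src-id into 7 bytes and picks a winner between two colliding versions of a bin. The `BinMeta` type, the byte layout, and the rule that the higher src-id wins on equal LUTs are all assumptions made for this example; they are not taken from the Aerospike implementation.

```python
from typing import NamedTuple

LUT_BYTES = 6      # per the text: 6 bytes for the bin LUT
SRC_ID_BYTES = 1   # per the text: 1 byte for the src-id


class BinMeta(NamedTuple):
    """Hypothetical per-bin metadata used only in this sketch."""
    lut: int     # last update time; unit and epoch are left unspecified here
    src_id: int  # identifier of the originating cluster, must fit in one byte


def pack(meta: BinMeta) -> bytes:
    """Assumed layout: 6-byte big-endian LUT followed by a 1-byte src-id."""
    return (meta.lut.to_bytes(LUT_BYTES, "big")
            + meta.src_id.to_bytes(SRC_ID_BYTES, "big"))


def winner(a: BinMeta, b: BinMeta) -> BinMeta:
    """Later LUT wins; on equal LUTs the src-id breaks the tie.

    Assumption: the higher src-id wins. Tuple comparison orders by
    lut first, then by src_id.
    """
    return max(a, b)
```

For instance, `len(pack(BinMeta(lut=1_000, src_id=3)))` is 7, and two versions of a bin with the same LUT are resolved by src-id alone.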
A record could have both a record-level LUT and a bin-level LUT for each bin. However, the bin LUT is not stored for bins written by a non-XDR client write, because the bin LUTs would then all match the record LUT. This is not the case for an XDR client write: the bin LUTs carry over from the source cluster's bin LUTs, but the record LUT is set when the record reaches, and is written at, the destination cluster. It is important to understand that a record LUT is always the time the record was written in a given cluster, no matter its origin, whereas the bin-level LUTs are 'sticky' across clusters connected via XDR.
XDR bin convergence is supported with record deletes as long as they are durable deletes. Internally, these durable deletes are converted to writes which delete all the bins.
When a bin is deleted (because of a durably deleted record or as an individual bin delete), the 7-byte overhead described at the beginning of this article is still required, because the bin tombstone must keep its LUT and src-id to support the bin convergence feature.
A record could therefore be larger at an XDR destination than at the source.
A durable delete leaves behind a set of bin tombstones, each with its LUT and src-id. Such records occupy more space than regular record tombstones, which do not have any bins. They are referred to as XDR bin cemeteries, and a statistic (xdr_bin_cemeteries) tracks their number. They should be taken into account when doing capacity planning.
Because of the 7-byte overhead per bin, a record written by XDR can breach the write-block-size limit.
Let’s dig into more details with an example.
Consider a 2048-byte record with 5 bins. When the client writes that record at the XDR source, the record LUT and the bin LUTs are all the same.
The LUT overhead is optimized on the source cluster, but not on the destination cluster. The shipped record would have the record LUT when written to storage, as well as a bin LUT for each bin of the record.
The record from this example would therefore have a size at the destination of
2048 bytes + (5 * 7 bytes) = 2083 bytes.
Such overhead may be negligible for larger records with few bins but could breach the write-block-size for records that are close to the limit.
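The arithmetic above can be captured in a small helper for capacity checks. This is a minimal sketch: the function names are made up for this example, and it treats write-block-size as a plain byte limit on the record, ignoring any other per-record storage overhead.

```python
BIN_CONVERGENCE_OVERHEAD = 7  # bytes per bin: 6 for the LUT + 1 for the src-id


def size_at_destination(source_size: int, bin_count: int) -> int:
    """Record size in bytes once every bin carries its LUT and src-id."""
    return source_size + bin_count * BIN_CONVERGENCE_OVERHEAD


def fits_in_write_block(source_size: int, bin_count: int,
                        write_block_size: int) -> bool:
    """True if the shipped record still fits under the given limit.

    Simplification: compares against write_block_size directly and
    ignores other storage overheads.
    """
    return size_at_destination(source_size, bin_count) <= write_block_size


# The example from the text: 2048 bytes, 5 bins.
print(size_at_destination(2048, 5))  # 2083
```

With a limit of 1 MiB (1048576 bytes), a 5-bin record that is only 20 bytes under the limit at the source would no longer fit at the destination, since it gains 35 bytes.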
The size of the records could also be kept in check from the application side or, in the case of an XDR active-passive topology, by leveraging the max-record-size threshold. This wouldn't help for active-active topologies.
Is the generation count of a record written by XDR expected to match the generation of the record at the source?
Not necessarily. Durably deleted records could have been removed by the tomb raider at different times on different clusters, causing the generation to reset independently on each cluster. In addition, multiple updates of a given record (each incrementing the generation at the source) could be shipped in one shot, depending on the frequency of the updates and on XDR configuration such as hot-key-ms and delay-ms.
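The effect of shipping several updates in one shot can be illustrated with a toy model. It does not reproduce how XDR actually schedules shipments or how hot-key-ms and delay-ms behave; it only assumes that each shipment carries the latest state of the record and results in a single write at the destination. The function name and the timestamps are hypothetical.

```python
def destination_generation(update_times_ms, ship_times_ms):
    """Count destination writes under the toy model.

    Each shipment that carries at least one not-yet-shipped update
    produces exactly one write (one generation increment) at the
    destination, however many source updates it coalesces.
    """
    updates = sorted(update_times_ms)
    next_unshipped = 0  # index of the first update not yet shipped
    generation = 0
    for ship_time in sorted(ship_times_ms):
        pending = 0
        while next_unshipped < len(updates) and updates[next_unshipped] <= ship_time:
            next_unshipped += 1
            pending += 1
        if pending:
            generation += 1
    return generation


updates = [0, 10, 20, 500]   # four client writes at the source
shipments = [100, 600]       # two shipments to the destination
print(len(updates), destination_generation(updates, shipments))  # 4 2
```

In this run the source generation reaches 4 while the destination generation only reaches 2, because the first shipment coalesces three updates.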
- Proper sizing is important when planning to use the XDR bin convergence feature. The record and bin overheads should be taken into consideration.
- There are currently no APIs to retrieve the bin level LUTs. Record level LUTs can be accessed through a UDF method.
Aerospike version 5.4 and later.
BIN CONVERGENCE DELETE OVERHEAD SIZING