We have a 3-cluster Aerospike setup: a 6-node local cluster, and 2 AWS remote clusters (the AWS clusters are sized differently than our local cluster). The XDR setup is active-passive from the local cluster to the remote clusters. We’ve enabled replication at the set level, so we’re not shipping the entire namespace.
All 3 clusters were running version 3.5 until recently, when I updated the AWS clusters to 3.10. Shortly thereafter it seemed that the remote clusters were quickly running out of disk space and constantly going over the high-water-disk-pct, which is set to 50%. The steps I took to remediate the problem were:
- add additional nodes
- change the default-ttl from 1095D to 15D (1095 was from the config on our local cluster and should have been edited for the remote cluster)
- set replication factor to 1 (for our use case, this is fine)
I’m still seeing what to me is unusual growth on the remote clusters. The local cluster shows ~3 billion master objects. AWS1 has ~428 million, and AWS2 has ~213 million. With both remote clusters being shipped the same sets, and the configs on both AWS clusters being the same, why would AWS1 have double the number of objects as AWS2?
The second piece of confusion for me is understanding expiration on these clusters. On all 3 clusters the low-water disk mark is the default (0) and the high-water-disk-pct is 50, so all 3 should be expiring data. My local cluster is expiring data: 46949(2614628063) expired
But the AWS clusters seem to be expiring at a slower rate, even though their TTL is set lower. AWS1: 0(559) expired
AWS2: 0(67977) expired
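To compare these counters side by side, here is a minimal Python sketch that splits each one into its two numbers. It assumes the number outside the parentheses is the count for the most recent cycle and the number inside is the cumulative total; that is my reading of the log format, not something I have confirmed. The `parse_expired` helper and the `SAMPLES` dict are made up for illustration.

```python
import re

# Counter strings copied from the log lines above.
SAMPLES = {
    "local": "46949(2614628063) expired",
    "AWS1": "0(559) expired",
    "AWS2": "0(67977) expired",
}

# Assumed layout: <this cycle>(<cumulative total>) expired
PATTERN = re.compile(r"(\d+)\((\d+)\)\s+expired")


def parse_expired(text):
    """Return (this_cycle, cumulative) from a counter such as '0(559) expired'."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError(f"no expired counter found in: {text!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        cycle, total = parse_expired(text)
        print(f"{name}: {cycle} in the last cycle, {total} cumulative")
```

Under that assumption, both AWS clusters expired nothing in the most recent cycle, while the local cluster expired tens of thousands.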
I ran the command asinfo -v 'hist-dump:ns=prod_cp;hist=ttl' to get some more insight. Local cluster:
> ttl=100,946029,74713698,73000242,130840691,38823493,40849617,48754640,52353801,55451245,21571100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,4138,0,0,952,0,637,1203,1375,2141,1665,842,2217,1322,2629,1767,2343,1798,1866,1902,1901,1862,2460,126052;
After reading a bunch of articles to figure out what the output meant, it seems that the buckets have the same width on all 3 clusters, which happens to correspond to the original default-ttl of 1095D: (100*946029)/3600 = 26278 hours (1094.9 days). On AWS2, for example, I have 18242811 records in bucket 1 that won’t expire for 21 days (if I’m reading this correctly), and 151507 in bucket 99 that won’t expire for 1095 days.
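The arithmetic above can be reproduced with a short Python sketch. It assumes the hist-dump output is laid out as `ttl=<number of buckets>,<bucket width in seconds>,<count for bucket 0>,...;` and that buckets are numbered from 0. That is how I am reading the output, not a confirmed description of the format, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def parse_ttl_hist(line):
    """Split a hist-dump TTL line into (num_buckets, width_seconds, counts).

    Assumed layout: ttl=<num_buckets>,<width_seconds>,<count0>,<count1>,...;
    """
    payload = line.strip().lstrip("> ").rstrip(";")
    if "=" in payload:
        payload = payload.split("=", 1)[1]
    fields = [int(x) for x in payload.split(",") if x]
    return fields[0], fields[1], fields[2:]


def bucket_range_days(index, width_seconds):
    """Return the (lower, upper) TTL bounds in days for a 0-indexed bucket."""
    lower = index * width_seconds / SECONDS_PER_DAY
    upper = (index + 1) * width_seconds / SECONDS_PER_DAY
    return lower, upper


if __name__ == "__main__":
    # Only the two header fields from the local cluster's output are used here;
    # paste a full line to get per-bucket counts as well.
    num_buckets, width, counts = parse_ttl_hist("ttl=100,946029;")
    span_days = num_buckets * width / SECONDS_PER_DAY
    print(f"bucket width: {width / SECONDS_PER_DAY:.2f} days")
    print(f"total span:   {span_days:.1f} days")
    for index in (1, 99):
        lower, upper = bucket_range_days(index, width)
        print(f"bucket {index}: {lower:.1f} to {upper:.1f} days")
```

With these assumptions, each bucket is a little under 11 days wide, so bucket 1 covers TTLs of roughly 11 to 22 days and bucket 99 covers the last stretch before 1095 days.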
Can I do something to change the widths of the buckets on the remote clusters to reflect the different ttl? Am I misunderstanding how this is supposed to work? Why am I growing data on the remote clusters so fast, and not evenly? Is it possible that the XDR writes are simply updating the generations on the records in a way that isn’t allowing them to expire?
I’m so confused.