Defrag not keeping up

One more question - are you using Community Edition or Enterprise Edition, and which version?

Aerospike Community Edition build 3.12.0

Thanks, I am going to test your scenario because this is not making sense to me. For the most part you should not be defragging at all, because entire blocks should be expiring due to your 5 minute TTL (300 sec per your config file).

Since you are only writing new data with a 300 sec TTL and never updating, entire blocks of data expire every 300 sec. There is no need to retrieve unexpired, longer-lived data from a block and rewrite it into a new block - which is what defragging does.

Meanwhile, can you test with defrag-lwm-pct set to a very low number - say 5 - and see whether you can run your write workload without disk usage growing? You can change defrag-lwm-pct on the fly on all nodes via asadm by running asinfo -v inside it:

$asadm
Admin>asinfo -v "set-config:context=namespace;id=Cache;defrag-lwm-pct=5"
Admin>exit

You can change it back to 50 after the test.
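For reference, reverting is just the same command with your original value of 50:

$asadm
Admin>asinfo -v "set-config:context=namespace;id=Cache;defrag-lwm-pct=50"
Admin>exit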

I am able to reproduce your problem, at least partly. It would help to see the stack trace from your server logs at the time of the crash. Can you share these?

Thanks, we cleared the logs last week and it would take a few days of running the servers to hit the space issue again. Were you able to update any settings during the test to alleviate the problem?

Thanks.

I think I know the cause, but I don't have a solution for you yet.

Please try the following:

$asadm
Admin>asinfo -v "set-config:context=service;nsup-delete-sleep=0"
Admin>exit

If this solves your problem, add this config parameter in the service context (below nsup-period).
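As a sketch, the service stanza in aerospike.conf would then look something like the following (the ... lines stand in for whatever you already have there, and 30 is your existing nsup-period):

service {
    ...
    nsup-period 30
    nsup-delete-sleep 0
    ...
}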

Works in my test setup.

Thanks, should I keep nsup-period at 30 or try lowering it?

If this is all you are doing with Aerospike, then by my calculations one node (30 GB RAM, 950 GB SSD) should be able to handle this, assuming the SSDs can handle the write rate (check write-q in the logs - it should stay at or near zero). You are using a replication factor of 1, so horizontal scaling is just for capacity. Assuming the numbers you want are exactly what you have listed, here is my calculation:

Write rate: 71,000 TPS
Record size: 740 bytes
Default block flush time: 1 second
Data rate: 71,000 x 740 bytes ≈ 52 MB/sec; at the 128 KB block size that is roughly 410 blocks/sec (you are seeing 82 write blocks/sec on 5 nodes)
TTL: 300 sec
Space needed for one TTL worth of continuously written data: 1.4 GB RAM (about 21 million records x 64 bytes of primary index), 15 GB SSD
Allow for the 50% high-water mark, to be generous: 3 GB RAM, 30 GB SSD
Allow for expirations to keep up, a 2x factor worst case (should be much less): 6 GB RAM, 60 GB SSD

Once your setup is running, check RAM and SSD usage in AMC - it should settle at a steady state around the numbers above. I would set nsup-period based on the logs: look for the Total time number under steady load and use an nsup-period somewhat higher than that - maybe 2x. The no-load output below shows 2 ms; under full load you should see something like 30 to 60 seconds, depending on the total number of records the nsup thread has to scan - just guessing. Set nsup-period to something above that number.

$grep thr_nsup /var/log/aerospike/aerospike.log
{Cache} Records: 0, 0 0-vt, 0(5000000) expired, 0(0) evicted, 0(0) set deletes. Evict ttl: 0. Waits: 0,0,0. Total time: 2 ms
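Once you see what Total time looks like under real load, nsup-period can be adjusted the same way as the other parameters, assuming it is dynamic on your build (otherwise change it in aerospike.conf and restart). The 120 below is purely illustrative - roughly 2x a hypothetical 60 second Total time:

$asadm
Admin>asinfo -v "set-config:context=service;nsup-period=120"
Admin>exit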

Finally, if you absolutely will never update any data, I would set defrag-lwm-pct to 5. This will cut out unnecessary defrag writes; in your case you should ideally have zero defragging.
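To make that permanent, the parameter lives in the storage-engine device sub-stanza of the namespace, roughly like this (the device path and the ... lines are placeholders for your own config):

namespace Cache {
    ...
    storage-engine device {
        device /dev/<your-device>
        defrag-lwm-pct 5
        ...
    }
}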

Please update here on whether all this works out for you and what your final settings end up being.


Thanks, initial testing seems to be working. With nsup-delete-sleep lowered to zero, the CPU does seem to run a little high. Going to tweak some of these settings and see if I can find a sweet spot. Thanks so much for your help!

Please do share your findings. On second thought, I think you will get better performance in this unique use case with more nodes, and you can then size the storage capacity down, because more nodes means more nsup threads running in parallel - each node has its own nsup thread. Once the nsup thread's total run time starts escalating, the problem feeds on itself, since all you have are writes and expirations.

I am only testing on a single node so I am keen to know what you find since you have a multi-node cluster.