Why defrag a block when it is still 80% full? With the low-water mark that high, ordinary updates will keep triggering defragging, so you will over-defrag unnecessarily, wear out your SSD, and incur write amplification.
Try:
defrag-lwm-pct 50
Also, an nsup-period of 1 second seems rather small. Check in the logs how long nsup takes for each namespace; nsup-period should be at least greater than the sum of those times. In the example below, namespace test is taking 372 milliseconds.
{test} Records: 393851, 0 0-vt, 0(0) expired, 7890(14047) evicted, 0(0) set deletes, 0(0) set evicted. Evict ttls: 2240,2380,0.439. Waits: 0,0,0. Total time: 372 ms
Then leave defrag-sleep at the default of 1000 microseconds and see.
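For reference, here is a minimal sketch of where those knobs live; the namespace name, device path, and the nsup-period value are placeholders, and depending on your server version nsup-period is configured in the service context or per namespace:

service {
    nsup-period 120            # placeholder; keep it well above the per-namespace nsup times shown in the log
}
namespace test {
    storage-engine device {
        device /dev/sdb        # placeholder device
        write-block-size 128K
        defrag-lwm-pct 50      # only defrag blocks that have dropped below 50% useful data
        defrag-sleep 1000      # default: 1000 microseconds of sleep per defragged block
    }
}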
Also, it is my understanding that write (xxx, 274.5) includes the defrag writes of (yyyy, 183.1) in your example. So, assuming all your writes are updates, your new writes are 274 - 183 = 91 blocks/sec, and defrag is in fact keeping up at 183 blocks/sec. What is your update rate, record size, and flush-max-ms? Is it possible that, due to that combination, you are flushing blocks that are less than 80% full, so they immediately become candidates for defrag?
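To make that arithmetic explicit, using the two figures quoted above: 274.5 total write blocks/sec minus 183.1 blocks/sec of defrag rewrites leaves roughly 91 blocks/sec of fresh client writes, i.e. about 91 x 128KB, on the order of 11 MB/s of new data being flushed.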
Thanks for the reply. I had tried the default settings initially and noticed that defrag was not keeping up, so I updated the settings hoping that would resolve it. I have since reverted the settings back, and defrag is still not able to keep up with the writes.
Well, at 71,000 write TPS you are definitely filling multiple entire 128KB write blocks within one second, so there should be no need to defrag.
What is the record TTL? Do you have records that are expiring, causing those blocks to then be defragged?
When you are initially creating records, you get fully populated blocks of data comprising multiple records - as many as fit in the write-block-size. When some of those records are either replaced by an update (written to a new block) or expire due to the default-ttl or a record ttl, the block becomes fragmented. When the useful records in a block drop below defrag-lwm-pct (50%), the block becomes a candidate for defragging. The remaining good records from that block are re-written into a new block along with records from other blocks being defragmented. This is how defrag works. So if you were just creating new records (and 71K TPS on 5 nodes is a very comfortable rate), you should see no defragging until records start expiring.
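As a rough worked example, using the ~500-byte record size estimated below: a 128KB write block holds roughly 128KB / 500B, about 260 records. Only once more than half of those records have expired or been superseded by updates written elsewhere does the block drop below the 50% lwm; at that point the remaining ~130 live records are copied into a new block and the old block goes back on the free queue.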
So what is the default TTL of your namespace or of the records when you create them?
This means you are writing a total of 141.0 blocks/sec (128KB blocks), of which 58.8 blocks/sec come from the defragger re-writing records to a new location. So you are writing 82.2 blocks/sec of new data, or 82.2 x 128KB per second; at 71K TPS on 5 nodes, I estimate your records carry about 500 bytes of data.
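Showing the back-of-the-envelope math: 82.2 blocks/sec x 128KB is roughly 10.5 MB/s of fresh data, and 71,000 writes/sec across 5 nodes is roughly 14,200 writes/sec per node, so on the order of 750 bytes land on disk per record write; allowing for per-record storage overhead, that is where the ~500 bytes of data estimate comes from.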
Defrag not keeping up is usually something you have to worry about if the disk usage is increasing while you are merely reading and updating records without writing any new ones.
One thing you can do is partition your SSD and associate those partitions with the namespace as devices. You will lose a marginal amount of storage, but gain defrag threads: whether you configure the whole physical SSD or its partitions, each configured entry is a 'device', and there is one defrag thread per device.
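As a sketch of what that looks like in the namespace's storage-engine stanza (the device paths are placeholders for partitions created on the same physical SSD):

storage-engine device {
    device /dev/sdb1
    device /dev/sdb2
    device /dev/sdb3
    device /dev/sdb4
    write-block-size 128K
    # each device listed above gets its own defrag thread
}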
As Piyush mentioned, you’re simply writing faster than defrag can occur. Once the write load drops the defrag will catch up, but if this gap is constant it won’t be sustainable.
Thanks Piyush and Ronen. 60K TPS of reads and 71K TPS of writes is the load during normal hours; at peak it can be more than double that. At the moment this namespace is primarily being used as a cache: there are only creates and reads, no updates. The TTL is set to 600 seconds.
I’ll see if partitioning the SSD into multiple devices will help speed up the defrag process.
I don’t see any evidence that defrag isn’t keeping up. In all the logs defrag-q is 0, which means there weren’t any blocks eligible for defrag at that point in time. Partitioning the disks will not change this.
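If you want to keep an eye on this yourself, the per-device storage line in the server log carries the write-q and defrag-q counters. Assuming the default log location, something like the following will show the recent values:

grep "defrag-q" /var/log/aerospike/aerospike.log | tail -20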
Let me clarify where I was headed with this.
Since you are merely creating records, your defrag-writes should be exactly ZERO.
But you do have defrag writes, which means some of the records you wrote are naturally expiring.
600 seconds ==> 10 minutes. So after 10 minutes, records that you have written are being expired by Aerospike.
Those expired records free up space in previously fully used blocks. Once a block's free space rises above the lwm (50%), i.e. its useful data drops below 50%, the block goes on the queue to be defragged.
As kporter points out, your defrag-q is 0, so defrag is absolutely keeping up.
In short, with lwm 50% and other default configs that I suggested earlier, you have no problem at all. You are running fine.
Thanks all. If creating additional partitions is not going to help solve the problem, is there anything else that can be done? The problem still exists. For example, the TTL is set to 600 seconds, and to test how quickly data is removed after the TTL, I let the service run for a little over one hour. After that hour I completely stopped all traffic going to the Aerospike cluster. Within 10 minutes, all data in the cluster should have expired. However, it is taking Aerospike more than 1 hour (with absolutely no traffic hitting it) to remove all the data.
Is there a way to have Aerospike delete data quicker from ssd after it has been expired?
This is a non-problem. If you try to read the expired data, you will not get any records back. Aerospike does not erase individual records off the SSD. Even after a block is defragged, the old data still sits in the block until a new block overwrites it.
After 10 minutes, as @pgupta said, the data is expired and therefore unreadable. I would like to see the log lines pertaining to NSUP to verify whether taking an hour for removal is expected here.
Grep the logs for a line in this format:
{ns-name} Records: 638101, 0 0-vt, 922(576066) expired, 259985(24000935) evicted, 0(0) set deletes. Evict ttl: 0. Waits: 0,0,0. Total time: 155 ms
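For example, assuming the default log location:

grep "Records:" /var/log/aerospike/aerospike.log | tail -5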
At the moment the issue is that, if I leave the application running for a day, all the disk space will be used (crashing the server) even though the data is unreadable. In addition, it seems that old data in the blocks is not being overwritten.
Look at the disk space usage in AMC while you are running your load. Do you see it going up? With the hwm at 60% you will start evicting data, and you will then hit stop-writes at 90% RAM used or less than 5% available disk space. You should not be able to crash the server. Are you actually crashing the server?
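For reference, the thresholds being referred to map to namespace settings along these lines (a sketch only; 60 is the hwm you mentioned, the other two are the defaults):

namespace test {
    high-water-disk-pct 60     # evictions begin once disk usage crosses 60%
    stop-writes-pct 90         # writes are refused once memory usage reaches 90%
    storage-engine device {
        min-avail-pct 5        # writes are refused once contiguous free device space drops below 5%
    }
}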
Total write blocks/sec = new writes + defrag writes, and the defrag blocks/sec number is also reported separately. After (ten minutes + nsup period), say after 15 minutes of running your write load (with the ten-minute TTL), you should see write blocks/sec of roughly 2 * defrag blocks/sec at the 50% lwm setting. Do you see that? (Because in your case almost an entire block will expire within ten minutes, so at some point you will reach a steady state of new blocks being written and entire blocks being expired.)
Yes, 3 out of the 5 servers actually crashed. The cluster never reached a steady state; disk usage kept going up. We let it run after the hwm was reached. After a while writes stopped, and later the servers crashed.