Defrag not keeping up

Hello,

Currently, the Aerospike defrag process does not seem to be keeping up with the writes.

Log:

INFO (drv_ssd): (drv_ssd.c:2117) {Cache} /dev/xvdd: used-bytes 26335111680 free-wblocks 7040778 write-q 0 write (492422,274.5) defrag-q 0 defrag-read (285290,228.9) defrag-write (228192,183.1)
INFO (drv_ssd): (drv_ssd.c:2117) {Cache} /dev/xvdd: used-bytes 26506883712 free-wblocks 7039803 write-q 0 write (497677,262.8) defrag-q 0 defrag-read (289570,214.0) defrag-write (231615,171.1)
INFO (drv_ssd): (drv_ssd.c:2117) {Cache} /dev/xvdd: used-bytes 26679456640 free-wblocks 7038745 write-q 0 write (502640,248.1) defrag-q 0 defrag-read (293476,195.3) defrag-write (234739,156.2)

Config:

nsup-period 1
high-water-disk-pct 60
write-block-size 128K
scheduler-mode noop
defrag-sleep 0
defrag-lwm-pct 80

Are there additional settings that would help me speed up the defrag process?

Thanks.

Why defrag a block while it is still 80% full? With defrag-lwm-pct at 80, ordinary updates will trigger defragging, and you will unnecessarily over-defrag, wear out your SSD, and get write amplification.

Try: defrag-lwm-pct 50

Also, an nsup-period of 1 sec seems rather small. Check in the logs how long each namespace's nsup cycle is taking; nsup-period should be at least greater than the sum across namespaces. In the example below, namespace test is taking 372 milliseconds.

{test} Records: 393851, 0 0-vt, 0(0) expired, 7890(14047) evicted, 0(0) set deletes, 0(0) set evicted. Evict ttls: 2240,2380,0.439. Waits: 0,0,0. Total time: 372 ms
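
If it helps, here is a rough, unofficial sketch of how you could total those per-namespace times from the log (the log path is just a placeholder for wherever yours lives):

import re

# Rough sketch: find the nsup summary lines per namespace and sum the
# worst-case "Total time" values, so nsup-period can be sized above that sum.
pattern = re.compile(r"\{(\S+)\} Records: .*Total time: (\d+) ms")

worst_ms = {}
with open("/var/log/aerospike/aerospike.log") as f:   # placeholder path
    for line in f:
        m = pattern.search(line)
        if m:
            ns, ms = m.group(1), int(m.group(2))
            worst_ms[ns] = max(worst_ms.get(ns, 0), ms)

print(worst_ms)
print(f"nsup-period should comfortably exceed {sum(worst_ms.values())} ms")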

Then leave defrag-sleep at the default of 1000 microseconds and see.

Also, it is my understanding that write (xxx,274.5) includes the defrag writes of (yyyy,183.1) in your example. So, assuming all your writes are updates, that is 274 - 183 = 91 blocks/sec of new writes, and defrag is in fact keeping up at 183 blocks/sec. What is your update rate, record size, and flush-max-ms? Is it possible that, due to that combination, you are writing blocks that are less than 80% full, so they immediately become candidates for defrag?

Thanks for the reply. I had tried the default settings initially and noticed that defrag was not keeping up, so I changed the settings hoping that would resolve it. I have since reverted the settings, and defrag is still not able to keep up with the writes.

Log:

used-bytes 10732713600 free-wblocks 7164414 write-q 0 write (113163,127.7) defrag-q 0 defrag-read (29667,89.2) defrag-write (14823,44.6)
used-bytes 11903876608 free-wblocks 7155389 write-q 0 write (130501,89.9) defrag-q 0 defrag-read (37980,15.2) defrag-write (18977,7.6)
used-bytes 12050042112 free-wblocks 7154922 write-q 0 write (133321,141.0) defrag-q 0 defrag-read (40333,117.7) defrag-write (20152,58.8)

Config:

nsup-period 30 
high-water-disk-pct 60 
write-block-size 128K 
scheduler-mode noop
defrag-sleep 1000 
defrag-lwm-pct 50

Also, I am currently running 5 nodes in a cluster and there are no updates, just creates and reads. On average, about 60k TPS on reads and 71k TPS on writes.

Thanks.

Well at 71,000 TPS writes, you are definitely filling multiple entire write blocks of 128KB within one second. So there should be no need to defrag.

What is the record TTL? Do you have records that are expiring causing the blocks to then defrag?

When you are initially creating records, you get fully populated blocks of data, each comprising multiple records, as many as can fit in the write-block-size. When some of those records are either replaced by an update (which writes them to a new block) or expire due to the default-ttl or a record TTL, the block becomes fragmented. When the useful records left in a block amount to less than defrag-lwm-pct (50%) of it, the block becomes a candidate for defragging. The remaining good records from that block are re-written into a new block, along with records from other blocks being defragmented. This is how defrag works. So if you are just creating new records at a good rate, and 71K on 5 nodes is more than adequate, you should not see any defrag until records start expiring.
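
If it helps to see that threshold as code, here is a tiny illustrative sketch (the Block class is made up for the example, not an Aerospike structure):

from dataclasses import dataclass

# Sketch of the defrag eligibility rule described above: a write block
# becomes a defrag candidate once its live ("useful") data drops below
# defrag-lwm-pct of the block size.
WRITE_BLOCK_SIZE = 128 * 1024   # write-block-size 128K
DEFRAG_LWM_PCT   = 50           # defrag-lwm-pct 50

@dataclass
class Block:
    live_bytes: int             # bytes still belonging to current (non-expired) records

    def is_defrag_candidate(self) -> bool:
        return self.live_bytes < WRITE_BLOCK_SIZE * DEFRAG_LWM_PCT / 100

# A freshly written, fully packed block is not a candidate...
print(Block(live_bytes=127 * 1024).is_defrag_candidate())   # False
# ...but once updates/expirations leave less than 50% live data, it is.
print(Block(live_bytes=40 * 1024).is_defrag_candidate())    # True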

So what is the default TTL of your namespace or of the records when you create them?

Your latest log line, with write (133321,141.0) and defrag-write (20152,58.8), means you are writing a total of 141.0 blocks/sec (128KB blocks), of which 58.8 blocks/sec are the defragger re-writing records to a new location. So you are writing 141.0 - 58.8 = 82.2 blocks/sec of new data, i.e. 82.2 * 128KB per second. At 71K TPS on 5 nodes, I estimate your records have about 500 bytes of data.
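
A back-of-the-envelope sketch of that arithmetic, assuming the log line covers all of one node's writes (per-record storage overhead is not subtracted here):

# Rough record-size estimate from the per-second block rates above.
WRITE_BLOCK_SIZE = 128 * 1024          # write-block-size 128K

total_blocks_per_sec  = 141.0          # write (...,141.0)
defrag_blocks_per_sec = 58.8           # defrag-write (...,58.8)
new_blocks_per_sec    = total_blocks_per_sec - defrag_blocks_per_sec   # 82.2

cluster_write_tps = 71_000
nodes = 5
writes_per_node = cluster_write_tps / nodes            # ~14,200 records/sec/node

stored_bytes_per_record = new_blocks_per_sec * WRITE_BLOCK_SIZE / writes_per_node
print(f"~{stored_bytes_per_record:.0f} bytes on device per record")

That comes out to roughly 750 bytes stored per record; after per-record overhead, that is consistent with records carrying a few hundred bytes of data.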

Defrag not keeping up is usually something you have to worry about if the disk usage is increasing while you are merely reading and updating records without writing any new ones.

One thing you can do is partition your SSD and associate those devices (partitions) with the namespace. You will be losing a marginal amount of storage, but gaining defrag threads. Both the physical SSD and its partitions are ‘devices’, and there’s a defrag thread per-device.

As Piyush mentioned, you’re simply writing faster than defrag can occur. Once the write load drops, defrag will catch up, but if this gap is constant it won’t be sustainable.

Thanks Piyush and Ronen. 60k TPS on reads and 71k TPS on writes is during normal hours; at peak, it could be more than double. This namespace is at the moment primarily being used as a cache server; there are only creates and reads with no updates. TTL is set to 600 seconds.

I’ll see if partitioning the SSD into multiple devices will help speed up the defrag process.

Thanks.

I don’t see any evidence that defrag isn’t keeping up. In all the logs defrag-q is 0, which means there weren’t any blocks eligible for defrag at that point in time. Partitioning the disks will not change this.

Let me clarify where I was headed with this. Since you are merely creating records, your defrag writes should be exactly ZERO. But you do have defrag writes, which means some of the records you wrote are naturally expiring. 600 seconds is 10 minutes, so 10 minutes after being written, your records are expired by Aerospike. Those expired records free up space in previously fully used blocks. Once the live data remaining in a block falls below the lwm (50%), the block gets queued to be defragged.

As kporter points out, your defrag-q is 0, so defrag is absolutely keeping up.

In short, with lwm 50% and other default configs that I suggested earlier, you have no problem at all. You are running fine.

Thanks, all. If creating additional partitions is not going to help solve the problem, is there anything that can be done? Problems still exist. For example, the TTL is set to 600 sec, and to test how fast data is removed after the TTL, I let the service run for a little over one hour. After that hour, I completely stopped all traffic going to the Aerospike cluster. After 10 mins, all data in the cluster should be expired. However, it is taking Aerospike more than 1 hour (with absolutely no traffic hitting it) to remove all the data.

Is there a way to have Aerospike delete data from the SSD more quickly after it has expired?

Thanks

This is a non-problem. If you try to read the expired data, you will not get any records back. Aerospike does not erase individual records off the SSD. Even after defragging a block, the old data is still sitting in the block until a new block overwrites it.

After 10 mins, as @pgupta said, the data is expired and therefore unreadable. I would like to see the log lines pertaining to NSUP to verify whether 1 hour before removal is expected here.

Grep the logs for a line in this format:

{ns-name} Records: 638101, 0 0-vt, 922(576066) expired, 259985(24000935) evicted, 0(0) set deletes. Evict ttl: 0. Waits: 0,0,0. Total time: 155 ms

Also run:

asadm -e "asinfo -v build"
asadm -e "show distribution"

At the moment, the issue is that, if I leave the application running for a day, all the disk space will be used (crashing the server) even though the data is unreadable. In addition, it seems that the old data in the blocks is not being overwritten.

Look at the disk space usage in AMC when you are running your load. Do you see it going up? With the hwm at 60% you will start evicting data and then hit stop-writes at 90% RAM used or 5% minimum available SSD space. You should not be crashing the server. Are you actually crashing the server?

Total write blocks/sec = new writes + defrag writes, and you also have the defrag blocks/sec number separately. After (ten minutes + nsup-period), say after 15 minutes of running your write load (with the ten-minute TTL), you should see write blocks/sec ≈ 2 * defrag blocks/sec at the 50% lwm setting. Do you see that? (Because in your case almost an entire block will expire within ten minutes, at some point you will reach a steady state of new blocks being written and entire blocks being expired.)
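
If it helps, here is a purely illustrative way to eyeball that ratio from a storage log line (using one of your earlier lines as the sample):

import re

# Pull the per-second block rates out of a storage log line and compare
# them to the steady-state expectation discussed above
# (write blocks/sec ≈ 2 × defrag-write blocks/sec with defrag-lwm-pct 50).
line = ("used-bytes 12050042112 free-wblocks 7154922 write-q 0 "
        "write (133321,141.0) defrag-q 0 defrag-read (40333,117.7) "
        "defrag-write (20152,58.8)")

write_rate        = float(re.search(r"write \(\d+,([\d.]+)\)", line).group(1))
defrag_write_rate = float(re.search(r"defrag-write \(\d+,([\d.]+)\)", line).group(1))

print(f"write / defrag-write = {write_rate / defrag_write_rate:.2f} "
      f"(expect ~2 once the TTL steady state is reached)")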

Yes, 3 out of the 5 servers actually crashed. The cluster never reached a steady state; disk usage kept going up. We let it keep running after the hwm was reached. After a while, writes stopped and, later, the servers crashed.

If the servers crashed there should be a stack trace in the logs, could you provide that?

Few questions:

  1. 5 servers, these are all separate machines with one Aerospike process running on each. True?
  2. The 5 servers are identical hardware? RAM = ?? GB SSD =?? GB
  3. What is the full namespace configuration - can you copy and paste?
  4. What is the size of your data in the record?

Few questions:

  1. 5 servers, these are all separate machines with one Aerospike process running on each. True? Correct

  2. The 5 servers are identical hardware? RAM = ?? GB SSD =?? GB Yes

  3. What is the full namespace configuration - can you copy and paste?

service {
                user root
                group root
                paxos-single-replica-limit 1
                pidfile /var/run/aerospike/asd.pid
                service-threads 4
                transaction-queues 4
                transaction-threads-per-queue 4
                proto-fd-max 30000
                proto-fd-idle-ms 60000
                nsup-period 30
}

namespace Cache {
                replication-factor 1
                memory-size 25G
                default-ttl 300
                high-water-memory-pct 70
                high-water-disk-pct 60

                storage-engine device {
                                device /dev/nvme0n1p1
                                device /dev/nvme0n1p2
                                device /dev/nvme0n1p3
                                device /dev/nvme0n1p4
                                write-block-size 128K
                                data-in-memory false
                                scheduler-mode noop
                                cold-start-empty true
                                defrag-sleep 1000
                                defrag-lwm-pct 50
                }
}

  4. What is the size of your data in the record? This will vary; it will be around 500 - 1000 bytes.

#2) What is the RAM and SSD size in GB of each server?

RAM = 30.5 GB, SSD = 950 GB