Questions about very low disk-avail-pct

Hi, I have a cluster of 8 nodes, and recently some problems occurred when dealing with high write pressure. The disk-avail-pct dropped extremely low and finally hit stop-writes. Here’s the info of my cluster:

The version of the cluster is “Aerospike Community Edition build 3.16.0.1”

The problems occurred on the namespace user_durable_list, which basically stores only list data. The namespace was initially restored from a backup; at that point its disk-usage-pct was about 14 and its disk-avail-pct was about 85. Then it began to process write requests. The write pressure is high: the monitor shows disk IO utilization over 90%, and the disk-avail-pct keeps dropping. You can see the values in the screenshot, which fell from 80+ to 10+ on certain nodes. They eventually drop to about 4 or 5, and then writes stop. I’ve read several articles about disk-usage-pct on the forum, but I still can’t figure out why the disk-avail-pct dropped so low. Please help.

BTW, I’m considering upgrading the cluster from 3.16 to the newest stable 4.x version. Are there any special operations I should know about?

What is your lwm set at, and your defrag-sleep? Can you shoot us the output of sar -d -p so we can see disk util? Since defrag isn’t keeping up, we could make it more aggressive… but if your disks are already going over 90%, chances are you need more IO.
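For example, something like the following should gather those numbers (the namespace name is taken from your post; the sar interval and count are arbitrary):

# per-device utilization, 5-second samples
sar -d -p 5 3

# current defrag settings for the namespace (run on each node)
asinfo -v "get-config:context=namespace;id=user_durable_list" | tr ';' '\n' | grep defrag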

As far as upgrade gotchas, you can find them here: Aerospike Server CE Release Note | Download | Aerospike

  • When upgrading the Aerospike server, from a version prior to 4.6, with the security feature enabled, make sure all Aerospike Clients are running a compatible version. (Enterprise Only)
  • When upgrading the Aerospike server, from a version prior to 4.5.1, follow the 4.5 special upgrade document 4.5.1+ SMD protocol change.
  • When upgrading the Aerospike server, from a version prior to 4.3 with replication-factor of 2 or greater along with the use of the rack-aware feature in AP namespaces, refer to the special considerations knowledge base article for details. (Enterprise Only)
  • When upgrading the Aerospike server, from a version prior to 4.2, follow the 4.2 special upgrade steps document Storage Format Upgrade in 4.2 Release.
  • When upgrading the Aerospike server, from a version prior to 3.14, please follow the upgrade and protocol-switching PREREQUISITES in the 3.13.0.11 documentation for Upgrade to 3.13.

Outside of that, you should be good.

I just noticed how differently some nodes are reporting. Are you running different hardware on various nodes? What kind of configuration is this? Any config diff? (asadm -e "sh config diff")

The defrag-lwm-pct is set to 75%, and defrag-sleep is at the default 1000.
The sar command shows that the util% of the disks on each node is mostly over 95%. It’s true that the IO has reached its upper limit, but there are only a few write timeouts, and the average write latency is acceptable.
The nodes have the same hardware configuration, and there is no config diff.

I don’t understand the write mechanism. Let’s take the last node as an example: the info command shows that the disk avail% is only 13%, which I take to mean 87% of the disk has been used. However, the disk used% is only 14%, so there is a discrepancy of about 73% between the data size and the actual disk usage.
So the question is: where has that 73% of disk space gone?

Setting defrag-lwm-pct to 75% means that Aerospike will defrag any block that is less than 75% full. Typically you want the LWM to be the same as the HWM. FAQ: Defragmentation

Basically, Aerospike will only write to ‘free’ blocks, and blocks can only become free once they have been defragmented and made available. This parameter is usually set to 50%, along with high-water-disk-pct. Higher values generate more write amplification by defragging more aggressively; lower values save on IO, but if you have a high amount of disk used or high write throughput then the avail% will drop.
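For reference, a minimal sketch of how those two settings sit in aerospike.conf (names and sizes here are placeholders, and the values shown are just the defaults):

namespace example_ns {
  memory-size 4G
  high-water-disk-pct 50            # namespace-level disk high-water mark
  storage-engine device {
    file /data/aerospike/example_ns.dat
    filesize 16G
    defrag-lwm-pct 50               # defrag wblocks that are less than 50% full
    defrag-sleep 1000               # microseconds to sleep between defrag wblock reads
  }
}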

Why did you set defrag-lwm to 75%? What’s most likely happening is that it’s queuing up more defrag than it can process. You can find out by looking at ‘defrag-q’ in the Aerospike log/journal.

If defrag is not keeping up, lowering defrag-sleep should be the first step. Increasing defrag-lwm has write-amplification effects, so it’s best not to raise it unless there is no other option, and only if you have the IO for it.
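If you do end up experimenting, both settings are dynamic, so something along these lines should apply them without a restart (namespace name taken from this thread; the values are only examples):

# speed defrag up by sleeping less between wblock reads
asinfo -v "set-config:context=namespace;id=user_durable_list;defrag-sleep=500"

# put defrag-lwm-pct back to the default; raise it only as a last resort
asinfo -v "set-config:context=namespace;id=user_durable_list;defrag-lwm-pct=50"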

Actually, from the beginning defrag-lwm-pct was set to the default value of 50%, but the disk avail% dropped too low. I thought it was because the defrag speed was not fast enough, so I set the param to 75% to make defrag trigger earlier.
I have also tried setting defrag-sleep to 0, but it seems the priority of defrag writes is lower than that of data writes, and the disk avail% still dropped very low. I can’t find any param that could increase the priority of defrag writes.
Since Aerospike will only write to free blocks, the disk avail% will drop if the defrag speed can’t catch up with the data write speed; that makes sense. I originally thought the disk write mechanism was something like “overwrite” or “merge rewrite”.
So the essence of the problem is that the write throughput has exceeded the IO capacity. I’ll try to control the write speed then.

defrag-sleep=0 might disable it, I’m not sure. When you had defrag-sleep set low and lwm at 50%, did you see anything in defrag-q? grep defrag-q aerospike.log ?
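For reference, one way to pull it out of the log (the path is just a common default, adjust for your install):

# extract the defrag-q values from the per-device storage summary lines
grep -o "defrag-q [0-9]*" /var/log/aerospike/aerospike.log | tail -n 20

# or follow it live
tail -f /var/log/aerospike/aerospike.log | grep defrag-q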

The log shows a large value for defrag-q.

Nov 06 2019 00:00:03 GMT+0800: INFO (drv_ssd): (drv_ssd.c:2164) {user_durable_list} /data9/aerospike/data/user_durable_list.part.0: used-bytes 63502311936 free-wblocks 350585 write-q 0 write (4732401,33.6) defrag-q 65251 defrag-read (4676364,33.8) defrag-write (1371580,6.9)

This log is printed when the data write speed has been decreased.

I don’t understand what you mean by ‘printed when the data write speed has been decreased’. Is this the file-type data storage rather than raw devices? Can we get a snippet of your aerospike.conf namespace section, and the output of lsblk from one of the machines?

Sorry for not making it clear. What I mean is that the log shows a defrag-q value of 65251, and this log was printed after I had decreased the data write speed to lower the IO pressure. Before I decreased the write speed, the value of defrag-q was even higher.
The namespace config:

namespace user_durable_list {  
  memory-size 160G
  default-ttl 180d
  
  storage-engine device {
    file /data1/aerospike/data/user_durable_list.part.0
    file /data2/aerospike/data/user_durable_list.part.0
    file /data3/aerospike/data/user_durable_list.part.0
    file /data4/aerospike/data/user_durable_list.part.0
    file /data5/aerospike/data/user_durable_list.part.0
    file /data6/aerospike/data/user_durable_list.part.0
    file /data7/aerospike/data/user_durable_list.part.0
    file /data8/aerospike/data/user_durable_list.part.0
    file /data9/aerospike/data/user_durable_list.part.0
    file /data10/aerospike/data/user_durable_list.part.0
    file /data11/aerospike/data/user_durable_list.part.0
    file /data12/aerospike/data/user_durable_list.part.0
    
    filesize 460G
    max-write-cache 16M
    flush-max-ms 500
  }
}

The file list is too long, so I removed some of the files.

The lsblk command result:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 223.6G  0 disk 
├─sda1   8:1    0     3M  0 part 
├─sda2   8:2    0     1G  0 part /boot
├─sda3   8:3    0    50G  0 part /
├─sda4   8:4    0     2G  0 part 
├─sda5   8:5    0  97.7G  0 part /apsara
└─sda6   8:6    0  72.9G  0 part /online
sdb      8:16   0   1.8T  0 disk 
└─sdb1   8:17   0   1.8T  0 part /data1
sdc      8:32   0   1.8T  0 disk 
└─sdc1   8:33   0   1.8T  0 part /data2
sdd      8:48   0   1.8T  0 disk 
└─sdd1   8:49   0   1.8T  0 part /data3
sde      8:64   0   1.8T  0 disk 
└─sde1   8:65   0   1.8T  0 part /data4
sdf      8:80   0   1.8T  0 disk 
└─sdf1   8:81   0   1.8T  0 part /data5
sdg      8:96   0   1.8T  0 disk 
└─sdg1   8:97   0   1.8T  0 part /data6
sdh      8:112  0   1.8T  0 disk 
└─sdh1   8:113  0   1.8T  0 part /data7
sdi      8:128  0   1.8T  0 disk 
└─sdi1   8:129  0   1.8T  0 part /data8
sdj      8:144  0   1.8T  0 disk 
└─sdj1   8:145  0   1.8T  0 part /data9
sdk      8:160  0   1.8T  0 disk 
└─sdk1   8:161  0   1.8T  0 part /data10
sdl      8:176  0   1.8T  0 disk 
└─sdl1   8:177  0   1.8T  0 part /data11
sdm      8:192  0   1.8T  0 disk 
└─sdm1   8:193  0   1.8T  0 part /data12

I was curious if you were doing something like that. Why not just pass device /dev/sdg instead of making a filesystem/file? You should get better performance that way, I believe. Also, in 4.2 they overhauled the storage mechanism to make files smaller (actually all storage, I think), so chances are you’ll get a decent performance boost even if you do continue using files. I’d recommend going to the latest version you’re comfortable with (4.2+) and just using raw devices. Unless there is some specific reason you’re using files?

I took over several Aerospike clusters from my former colleague. These clusters have been running for at least 3 years, serving online services. Actually, I suspect these clusters were running on HDDs, not SSDs, when they were built at the very beginning.
Also, the way the clusters are deployed may prevent them from using raw devices instead of files: the clusters share each of the disks, meaning that on each disk there are files owned by different clusters.
I’ve read articles about the namespace device config, and I have a few questions.

  1. The size upper limit of a raw device is 2TB, but I couldn’t find a param to manually set the limit to a lower value.
  2. I guess a single raw device can’t be used by different namespaces, which means a single raw device can only be monopolized by one namespace? What would happen if 2 namespaces on different clusters used the same raw device?

You might want to see if you get better performance by using raw devices. The solution to limiting how much of a drive you give to a namespace is to just use partitions. You can pass device /dev/sdx1, for example, and maybe make sdx1 only 400GB or so. It’s up to you. The only caution with multiple namespaces running off the same disk (say you give sdx1 to ns1 and sdx2 to ns2) is that if one namespace starts needing a lot of IO it may affect the other. But it sounds like you’re already in that situation.
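As a rough sketch only (device names and sizes below are placeholders, not taken from your lsblk output, and relabeling a disk destroys whatever is already on it):

# example: carve a ~400GB partition for this namespace on an empty disk
parted -s /dev/sdx mklabel gpt
parted -s /dev/sdx mkpart primary 1MiB 400GB

# the namespace storage stanza then uses device lines instead of file/filesize
namespace user_durable_list {
  memory-size 160G
  default-ttl 180d

  storage-engine device {
    device /dev/sdx1
    device /dev/sdy1
    max-write-cache 16M
    flush-max-ms 500
  }
}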

Understood, I’ll run a test with raw devices then. Many thanks for your replies.