Write-q stuck at max

I have gone through a lot of community forum threads, but I am not able to figure out what caused the issue our Aerospike cluster hit last night.

We are running a 3-node cluster on GCP: each node has 8 CPU cores, 32 GB RAM, 2 local 375 GB NVMe SSDs, and 2 persistent SSDs (shadow devices).

Our configuration includes:

- write-block-size: 128 KB
- defrag-lwm-pct: 5
- nsup-period: 120
- nsup-threads: 2
- max-write-cache: 256 MB
- replication-factor: 2
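For reference, this is roughly how those settings would sit in aerospike.conf. This is an illustrative sketch, not our actual file: the device paths are placeholders, and the exact context of nsup settings can vary by server version.

```
namespace ssd {
    replication-factor 2
    nsup-period 120
    nsup-threads 2

    storage-engine device {
        # each local NVMe SSD is backed by a persistent shadow device
        device /dev/nvme0n1 /dev/sdb
        device /dev/nvme0n2 /dev/sdc
        write-block-size 128K
        defrag-lwm-pct 5
        max-write-cache 256M
    }
}
```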

We have a write-heavy system: we use Aerospike to create records and then read them, but not to update them, hence defrag-lwm-pct is set to 5.

We have 2 namespaces: mem and ssd.

On the mem namespace we write at a throughput of 20k writes/second and read/remove at the same rate.

On the ssd namespace:

- 2k writes/s with an average record size of 12 KB (99th percentile 32 KB); average TTL is 2.5 hours, varying between 1 min and 4 hours
- 2k writes/s with an average record size of 4 KB (99th percentile 8 KB); average TTL is 3 hours, varying between 30 min and 4 hours
- 4.5k reads/s, of which only about 400 req/s hit the two record sets above; the rest of the reads hit some other persistent data with TTLs ranging from 7 days to over a month

We were earlier running with defrag-lwm-pct 50, but we were hitting a lot of device overload issues, so we decreased it to 5, after which the issues dropped to 0.

The system ran fine for 2-3 days, but thereafter we started getting device overload issues again, because out of the 3 nodes, 2 had their write-q stuck at the maximum value of 2048.

The device-available-pct was 65, so it didn't seem to be an issue with the availability of free write-blocks.

Aerospike logs showed that 80 blocks were written/s, which means around 10 MB/s; this was also verified with iostat and iotop. The write-q on one of the devices was 2048, and on another it was hovering around 200-300.

If we calculate from the above specs as well:

12 KB × 2k + 4 KB × 2k ≈ 32 MB/s of client writes
32 MB/s × 2 (replication factor) = 64 MB/s
64 MB/s ÷ 6 local SSDs ≈ 10 MB/s per device
10 MB/s ÷ 128 KB (block size) ≈ 80 blocks/s written per device
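The back-of-envelope math above can be reproduced like this (the workload numbers are the ones quoted earlier in this post; the small difference from the rounded figures above comes from using 1024-based units throughout):

```python
# Expected write load per device, from the stated workload:
# 2k writes/s at ~12 KB plus 2k writes/s at ~4 KB, replication factor 2,
# 6 local SSDs across the cluster, 128 KB write blocks.

KB = 1024

client_mb_per_s = (12 * KB * 2000 + 4 * KB * 2000) / (KB * KB)  # ~31.25 MB/s
replicated_mb_per_s = client_mb_per_s * 2                        # ~62.5 MB/s
per_device_mb_per_s = replicated_mb_per_s / 6                    # ~10.4 MB/s
blocks_per_s = per_device_mb_per_s * KB / 128                    # ~83 blocks/s

print(f"{client_mb_per_s:.2f} MB/s client writes")
print(f"{replicated_mb_per_s:.2f} MB/s after replication")
print(f"{per_device_mb_per_s:.2f} MB/s per device")
print(f"{blocks_per_s:.0f} x 128 KB blocks/s per device")
```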

Out of the 3 nodes:

- Node 1: write-q 0 on both devices, functioning without any issues
- Node 2: write-q = 2048 (max) on device 1 and write-q ~ 300 on device 2
- Node 3: write-q = 2048 (max) on device 2 and write-q ~ 300 on device 1

System load was under 6, and CPU utilization was around 500%.

We checked and there was no configuration difference between the 3 nodes.

This number was not going down, nor was it changing much. So my question here is: how does the write-q work?

I am assuming that an NVMe device should perform better than 10 MB/s of writes. So ideally, the drain rate should have increased the blocks written to disk per second from 80 to, let us say, 120 or more to empty the queue; but it stayed at 2048, which means that if 80 blocks were getting queued per second, 80 were being written, keeping the queue size constant.

We tried different resolutions but nothing worked. To increase the available-pct we increased defrag-lwm-pct from 5 back to 50 (I understand that it was a big jump, but we wanted to see the behaviour).

After doing the above change, even Node 1 went into a queued state, and Nodes 2 and 3 also collected a lot of queued buffers; write-q went to around 20-30k on all 3 nodes.

But the first node recovered in 15 minutes and its write-q went to 0; the other 2 nodes took 30 minutes to recover, but then came back to the 2048 value and got stuck there again.

Finally, we increased nsup-threads from 2 to 4 and at the same time traffic scaled down from 2k to 2.3k, so we don't know exactly what solved it, but things came back in line.

I am not able to fit any equation to, or understand the maths behind, the problem that happened. What is the best-suited way to understand and debug this kind of issue?

I did a little more research to understand the maths behind the system. Just posting it here in case someone is facing the same issue.

Also if someone can check if my reasoning is correct that would be great.

So a GCP 375 GB Local SSD (NVMe) has a write throughput of 350 MB/s, with 90,000 write IOPS.

This number is presumably measured with a block size of 4 KB, which is considered the normal SSD block size on Linux systems.

Now, being able to write 4 KB blocks at 350 MB/s means an IOPS figure of 350 × 1024 / 4 ≈ 90k.

Now, since my block size is 128 KB, my write throughput at a 128 KB block size will reduce to 350 × 4 / 128 ≈ 10.93 MB/s.

This is what I have been getting.

10.93 MB/s × 1024 / 128 ≈ 87 blocks/s

So 87 blocks/s should be the capacity of the disk. Given that there are migrations and a few defragmentation writes happening too, getting approximately 80 blocks/s on a single device seems consistent.
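The capacity estimate above can be written out explicitly. Note this restates my assumption that the per-op rate implied by the 4 KB spec is what limits the device at larger block sizes:

```python
# Rough device-capacity math for a GCP 375 GB local NVMe SSD, assuming the
# advertised 350 MB/s write throughput is measured at a 4 KB block size.

spec_mb_per_s = 350   # advertised write throughput
spec_block_kb = 4     # block size the spec is presumably measured at
my_block_kb = 128     # Aerospike write-block-size

write_iops_at_4k = spec_mb_per_s * 1024 / spec_block_kb         # ~89,600 IOPS
mb_per_s_at_128k = spec_mb_per_s * spec_block_kb / my_block_kb  # ~10.93 MB/s
blocks_per_s = mb_per_s_at_128k * 1024 / my_block_kb            # ~87.5 blocks/s

print(f"{write_iops_at_4k:.0f} write IOPS at 4 KB")
print(f"{mb_per_s_at_128k:.2f} MB/s at 128 KB blocks")
print(f"{blocks_per_s:.1f} x 128 KB blocks/s")
```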

May 22 2022 01:44:38 GMT: INFO (drv_ssd): (drv_ssd.c:1835) {ssd} /dev/nvme0n1: used-bytes 31302784112 free-wblocks 2045389 write-q 1719 write (13789437,80.2) defrag-q 0 defrag-read (11582404,0.0) defrag-write (1708291,0.0) shadow-write-q 0

May 22 2022 01:44:38 GMT: INFO (drv_ssd): (drv_ssd.c:1835) {ssd} /dev/nvme0n2: used-bytes 31265205648 free-wblocks 2045830 write-q 2048 write (13788729,80.2) defrag-q 0 defrag-read (11582218,0.0) defrag-write (1708230,0.0) shadow-write-q 0

The above are sample log lines.

Now, since I was actually writing at a speed slightly above 80 blocks/s, writes got queued and the queue kept increasing, because I was hitting a bottleneck at the disk end.

I think I misunderstood what write (13788729,80.2) means: it is not the rate at which Aerospike is writing to disk at 80.2 blocks/s, but rather the rate at which Aerospike is putting data into the swb (streaming write buffer) or the write-q.

Now, since the swb is flushing at 80 blocks/s and writes are arriving at 80 blocks/s, the queue stays stuck at 2048.
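A toy model makes this behaviour obvious: once arrivals match the drain rate, the queue length simply stops changing at whatever value it had reached. The numbers here are illustrative, not measured:

```python
# Toy 1-second-step model of the write-q: buffers drain to disk, then new
# buffers arrive, with the queue capped at max-write-cache's 2048 buffers.

def simulate(q, arrive_per_s, drain_per_s, cap=2048, seconds=600):
    for _ in range(seconds):
        q = max(0, q - drain_per_s)      # buffers flushed to disk this second
        q = min(cap, q + arrive_per_s)   # new buffers queued this second
    return q

# Arrivals equal the drain rate: the queue stays pinned wherever it is.
print(simulate(q=2048, arrive_per_s=80, drain_per_s=80))               # -> 2048

# Arrivals dip just below the drain rate: the queue slowly empties and then
# hovers near zero (at exactly 77 in this coarse model).
print(simulate(q=2048, arrive_per_s=77, drain_per_s=80, seconds=3600))  # -> 77
```

This matches what we saw: nothing about a full queue makes the disk drain faster, so only a drop in the arrival rate (or more disk throughput) moves it.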

Changing nsup-threads didn't really do anything, I guess; at the same time traffic went down, which might have brought down the rate at which I was writing blocks to disk. Although from the logs I could still see 81-87 blocks being written/s for the 5-10 minutes before both write-qs stabilised, I may have made wrong approximations in the above equations. Anyway, Aerospike logs these lines at 10-second intervals, so the logs I am seeing might not portray the complete picture.

But after 5-10 minutes of the write-q being at 0, I can see the writes going down to around 76-77 blocks/s:

May 24 2022 03:38:11 GMT: INFO (drv_ssd): (drv_ssd.c:1835) {ssd} /dev/nvme0n1: used-bytes 13557107728 free-wblocks 2802621 write-q 1 write (25599786,75.8) defrag-q 0 defrag-read (21280079,0.0) defrag-write (1962140,0.0) shadow-write-q 0

May 24 2022 03:38:11 GMT: INFO (drv_ssd): (drv_ssd.c:1835) {ssd} /dev/nvme0n2: used-bytes 13537671760 free-wblocks 2802938 write-q 1 write (25598396,76.1) defrag-q 0 defrag-read (21278833,0.0) defrag-write (1962120,0.0) shadow-write-q 0

Also, since we got the issue again today when traffic went above the expected levels, I can confirm that the nsup threads didn't really help.

The only solution I see here is to increase the Disk IOPS here.

I don't think changing the block size will do anything, because the numbers will just divide amongst themselves and yield the same result.

Also, earlier, when defrag-lwm-pct was 50, defrag writes were adding to the problem and reducing the client request throughput further.

Hi @anuragagarwal561994 welcome to the community! It sounds like you hit the device’s limit or the device is not performing as expected.

Here is some info on the max-write-cache that may help. Also, depending on what server version you are using there are mechanisms in place outlined in this article on resilience. Hope that helps!

My earlier calculations were a little bit off; what I have observed is that the problem is the 128 KB block size.

If the block size is 128 KB, at 11 MB/s write speed we are hitting 98% disk utilisation in iostat.

If the block size is 1 MB, at 11 MB/s write speed we are hitting 3% disk utilisation.
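In ops/s terms, the same 11 MB/s stream looks very different at the two block sizes, which is one plausible reading of the utilisation gap (iostat's %util reflects the fraction of time the device has ops in flight, not bytes moved):

```python
# The same 11 MB/s write stream expressed as operations/s at each block size.
# Fewer, larger ops can keep the device "busy" for much less wall-clock time.

stream_mb_per_s = 11

for block_kb in (128, 1024):
    ops_per_s = stream_mb_per_s * 1024 / block_kb
    print(f"{block_kb:>5} KB blocks -> {ops_per_s:.0f} writes/s")
```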

@aanderson is there a theory behind why this might be happening?

There are a lot of factors that go into calculating disk utilisation in iostat, so to pin it down to one thing would be pretty tough. Here is a great Knowledge Base article on the write-block-size and how disk performance can vary.

Changing the write-block-size will also have an effect on your defragmentation and post-write-queue, which may be part of what you're seeing.

I think neither of those things is affecting my workload. I checked the Aerospike logs: cache-read-pct is somewhere around 5-6%, and as I mentioned earlier, reads are very low.

Defragmentation has already been mostly turned off by setting defrag-lwm-pct to 5.

Could it simply be a characteristic of that disk? It may be best to run ACT (GitHub repo) and compare, in order to remove any other variables.


BTW, you could also give compression a shot if you are running the Enterprise version.

For now, the best explanation I have (which I can't represent in numbers) is that when I am writing with a 128 KB block size and record sizes averaging 12-16 KB, I am issuing more writes to disk but in smaller batches, while with a 1 MB block size I have a bigger batch size.

So with 128 KB blocks I must be giving the OS more flush operations, each of a small size, and here doing larger work in smaller quantity is working better for me instead.

I would say that is fair. Other than the post-write-queue and the defragmentation activity, the difference should indeed just be the size of the streaming write buffers (configured through write-block-size) and the frequency at which they are sent to be persisted. The flush-max-ms config only kicks in if you are not filling up your blocks fast enough. Also, on some public cloud environments, we have noticed that the disable-odsync config can significantly impact performance.