Why does increasing write-block-size increase write throughput?


#1

Hi, we are doing heavy continuous writes (for more than 10 minutes) on an AS cluster whose data is stored on an Azure NAS (for durability).

With dd from /dev/zero and SAR, we see around 250MB/s of write throughput on our 1.5TB Azure NAS (Standard L8s).

But with a write-block-size of 128KB, we get 60MB/s of throughput on the NAS (as reported by SAR) just before hitting “Device Overload”.

When we increased write-block-size to 512KB, we got 120MB/s of throughput on the NAS (just before hitting “Device Overload”).

With a write-block-size of 1024KB, we got 160MB/s of throughput on the NAS. But as we added more application threads, we started hitting the device overload issue (even though throughput between the AS node and the NAS was only around 160MB/s).

I have a few questions about it:

  1. Why does increasing the Aerospike write block size increase write throughput (in bytes) on the Azure NAS (just before hitting the “Device Overload” issue)?
  2. Why do “Device Overload” issues happen even when SAR shows only write operations (and NO read ops) at a throughput (160MB/s) that is far less than the 250MB/s (as measured with dd) supported by the NAS? Doesn’t the “Device Overload” issue occur when the writer thread is not able to keep up with the NAS speed?
  3. Apart from “write-block-size”, what other configs can be varied to increase write throughput before hitting the Device Overload issue?
  4. What is the config param to increase the number of threads that flush data from the write-queue to the block device?

Note: For a 1.5TB NAS, Azure supports ~5000 IOPS, so with a write-block-size of 1MB the IOPS limit alone would allow around 5GB/s. Because of the 250MB/s limit, though, I was expecting at least 250MB/s of throughput before hitting Device Overload.
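The arithmetic in this note can be checked with a short sketch. The numbers (about 5000 IOPS and a 250MB/s cap) come from the post. The function name is hypothetical, and treating each flushed write block as exactly one I/O is an assumption; Azure may meter large writes differently.

```python
# Rough ceiling on write throughput: the lower of the IOPS-derived rate and
# the bandwidth cap. ASSUMPTION: one flushed write block counts as one I/O.

def throughput_ceiling_mb_s(iops, block_kb, bandwidth_cap_mb_s):
    # MB/s that would be possible if IOPS were the only limit.
    iops_bound_mb_s = iops * block_kb / 1024
    return min(iops_bound_mb_s, bandwidth_cap_mb_s)

# Block sizes tried in the post, in KB.
for block_kb in (128, 512, 1024):
    print(block_kb, "KB ->", throughput_ceiling_mb_s(5000, block_kb, 250), "MB/s")
```

Under that assumption, all three block sizes are limited by the 250MB/s cap rather than by IOPS, so the IOPS figure by itself does not account for the 60 to 160MB/s actually observed.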


#2

This is because of the way Aerospike flushes writes to the disk. See “Increase Block Write Size from 128K to 256K”. I think it’s because there is less overhead, like using jumbo packets.
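One way to look at the “less overhead” idea is to convert the throughput figures from the original post into block writes per second. This is a minimal sketch, assuming each write block reaches the NAS as a single write request (the thread does not confirm this); the helper name is hypothetical.

```python
# Throughput observed just before "Device Overload", taken from post #1.
# Keys are write-block-size in KB, values are MB/s.
observed_mb_s = {128: 60, 512: 120, 1024: 160}

def blocks_per_second(throughput_mb_s, block_kb):
    # ASSUMPTION: one block flush == one write request to the NAS.
    return throughput_mb_s * 1024 / block_kb

for block_kb, mb_s in observed_mb_s.items():
    rate = blocks_per_second(mb_s, block_kb)
    print(f"{block_kb} KB blocks: {rate:.0f} block writes/s")
```

With these numbers, overload appears at 480, 240 and 160 block writes per second respectively. Larger blocks move more bytes with fewer requests, which would fit a fixed per-request cost (for example a network round trip) rather than raw bandwidth being the limit, though the thread does not establish the cause.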


#3

A NAS will rarely perform like a locally attached SSD, and it often suffers from network-induced latencies. You also aren’t the only user of the drive. There are cases where a NAS is acceptable, but you should base this on testing.

It’s nice that Azure offers real numbers rather than the ‘Up to 10Gbit’, ‘High’, ‘Medium’ that AWS uses to describe EBS and networking performance. Still, if you intend to use a NAS as the primary device (not as a shadow device) you should benchmark it using ACT, rather than extrapolating from Azure’s declared throughput numbers. The tool simulates actual database workloads, and should be run for an extended period of time (typically 24 hours). Using this approach, we’ve seen that 5th generation Amazon EC2 instances (C5, M5) with NVMe drivers to the new EBS can rate at 2x. That’s a number you can then use for sizing your cluster to your workload.


#4

Hi @rbotzer, as per my understanding, write (create or replace) performance on a NAS used as primary storage and on a shadow device should be the same. Hence, if a NAS-based AS node hits the “device overload” issue with X units of workload, then a shadow-device-based AS node should also hit “device overload” with the same X units of workload. Please correct my understanding if a shadow device gives better write (create or replace) performance than a NAS used as primary storage.


#5

Only buffered secondary writes go to a shadow device, whereas if you use a NAS as the primary storage, all reads, writes and defrag operations go against it. Without small object reads, large block reads for defrag and large block writes for defrag, the load on the drive is very different, especially as it relates to all things network. This drive is being accessed over a network, after all. The only time Aerospike would read from a shadow device is on startup, if the primary storage is corrupted or empty. It will then fill the primary storage from the NAS. It’s a form of backup.

Again, if you plan to use a NAS as primary storage, you should benchmark it as such. The ACT tool is one of the things you should be using to reveal what the drive can actually do under database loads.


#6

All writes go to the shadow device, including defrag’s large block writes. Also, the shadow device is always read on cold start.