Experiencing high disk utilization after upgrading to Amazon Linux 2 kernel 5.10 and above

Hi there :wave:

I’d like to hear whether anyone has run into a problem similar to what we are seeing after upgrading to AWS Amazon Linux 2 AMIs with kernel 5.10 (and also Amazon Linux 2023, kernel 6.1).

We are running an Aerospike Community Edition 6.3 cluster on i4i instances. After upgrading to AMIs with kernel 5.10, local SSD device utilization jumped significantly even for moderate workloads: it tops 100% where the old instances report ~15%. This has been triggering alerts on our side, but in reality there doesn’t seem to be any negative performance impact.

I ran a bunch of tests with the act tool, and while the observed behavior was the same, there was no impact on performance.

Can someone shed some light on why this is happening, or on whether there’s anything we need to tweak to get the utilization numbers back to realistic values?

We split each device into four partitions and use a device-backed namespace configuration:

# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1       259:0    0   1.7T  0 disk 
├─nvme1n1p1   259:4    0 436.6G  0 part 
├─nvme1n1p2   259:5    0 436.6G  0 part 
├─nvme1n1p3   259:6    0 436.6G  0 part 
└─nvme1n1p4   259:7    0 436.6G  0 part 
nvme0n1       259:1    0    20G  0 disk 
├─nvme0n1p1   259:2    0    20G  0 part /
└─nvme0n1p128 259:3    0     1M  0 part 

storage-engine device {
    device /dev/nvme1n1p1
    device /dev/nvme1n1p2
    device /dev/nvme1n1p3
    device /dev/nvme1n1p4
    write-block-size 1024K
    max-write-cache 1024M
}
# iostat -zxmty 3
Linux 5.10.210-201.852.amzn2.x86_64 (ip-10-30-101-32.ec2.internal)      03/20/2024      _x86_64_        (4 CPU)

03/20/2024 10:45:14 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.59    0.00    0.26   18.09    0.00   80.05

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00 20235.33  235.33    58.71    29.42     8.82     3.34    0.16    0.16    0.08   0.05 100.00

03/20/2024 10:45:17 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.67    0.00    0.53   18.05    0.00   79.75

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00 20235.00  235.33    58.71    29.42     8.82     3.34    0.16    0.16    0.08   0.05 100.00

03/20/2024 10:45:20 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.59    0.00    0.35   18.08    0.00   79.98

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00 20235.67  235.33    58.71    29.42     8.82     3.34    0.16    0.16    0.08   0.05 100.00
nvme0n1           0.00     0.00    0.00    2.00     0.00     0.01    14.33     0.00    0.83    0.00    0.83   0.67   0.13

Thanks, Zbynek

Checking in: is this still an issue, or has it settled out? Could initial migrations/defrag etc. have been occurring? Also note that %util on modern SSDs, which process I/O requests in parallel, can be highly misleading.

Hi Piyush, thank you for following up on my question.

The “issue” is consistent and reproducible simply by running the act tool (GitHub - aerospike/act: Aerospike Certification Tool). So I’m quite confident it doesn’t have anything to do with the actual workload (migrations etc.).

It indeed looks like misleading %util reporting on AL2 with kernels 5.10 and above.

I ran the act tool on otherwise identical machines with different kernels. The actual performance is the same, but the %util is dramatically different.

We have now accepted the high utilization and disabled our default I/O utilization alerts, which is not ideal but seems safe.

I just wonder why I haven’t found any other reports of this, since there must be others running the exact same combination of AL2 Linux on i3 and i4i instances on AWS.

With kind regards,

Zbynek

For illustration, I took screenshots while running the same tests with the act tool on the 4.14 and 5.10 kernels.

In the meantime, from the Linux man page, iostat(1):

    %util  Percentage of elapsed time during which I/O requests
           were issued to the device (bandwidth utilization for
           the device). Device saturation occurs when this value
           is close to 100% for devices serving requests
           serially. But for devices serving requests in
           parallel, such as RAID arrays and modern SSDs, this
           number does not reflect their performance limits.
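For anyone curious where that number comes from on Linux: iostat derives %util from the "io_ticks" counter in /proc/diskstats. A rough sketch of the calculation (the nvme1n1 device name is the one from this thread; the snippet falls back to whatever device /proc/diskstats lists first, so it runs anywhere):

```shell
# What iostat reports as %util comes from /proc/diskstats: per device,
# stats field 10 (overall column 13, "io_ticks") is the number of
# milliseconds the device spent with at least one request in flight.
DEV=nvme1n1
# fall back to the first listed device so the snippet runs anywhere
grep -qw "$DEV" /proc/diskstats || DEV=$(awk 'NR==1 { print $3 }' /proc/diskstats)

t1=$(awk -v d="$DEV" '$3 == d { print $13 }' /proc/diskstats)
sleep 1
t2=$(awk -v d="$DEV" '$3 == d { print $13 }' /proc/diskstats)

# busy milliseconds out of a ~1000 ms window, as a percentage
echo "$DEV util over 1s: $(( (t2 - t1) / 10 ))%"
```

Note that io_ticks only says the device was non-idle; it says nothing about how many requests were in flight during those milliseconds.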

Discussed with another engineer internally. Here is a summary:

%util is a useless metric. Always go by read/write IOPS and read/write MiB/s.

The reason is exactly as pointed out above: %util doesn’t take into account the parallelism that modern drives are capable of. What you typically see is that you push a device to 100%, but there’s still a lot of headroom to push it further and make it go faster. 100% can mean that 100% of the time there is 1 pending request, or that 100% of the time there are 8 pending requests. The only thing you get from %util is that for a given %util of x, your device is idle (100% - x) of the time.
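To put numbers on that: here is a toy calculation (all figures hypothetical: a fixed 0.1 ms per-request service time, a 1 s window, and a device that is never idle) comparing two queue depths that both report 100% util:

```shell
awk 'BEGIN {
    interval_ms = 1000   # sampling window
    svc_ms      = 0.1    # hypothetical per-request service time

    # Device A: always exactly 1 request in flight.
    # Device B: always 8 requests in flight.
    for (qd = 1; qd <= 8; qd *= 8) {
        busy_ms = interval_ms                 # never idle -> 100% util
        util    = 100 * busy_ms / interval_ms
        iops    = qd * 1000 / svc_ms          # throughput scales with queue depth
        printf "queue_depth=%d  util=%d%%  approx_iops=%d\n", qd, util, iops
    }
}'
```

Both devices report 100% util, yet device B sustains 8x the IOPS; that is the headroom %util hides.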

That does not explain why a kernel upgrade changes the value iostat reports, though. Then again, maybe it’s a bug fix and %util was previously underreporting, or something along those lines.

However, it’s best not to base your monitoring on %util. The suggestion: use fio to determine the available IOPS and bandwidth of your devices, then set your monitoring alerts based on that.
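For example, a read-only baseline run with fio might look like the sketch below. The device path, block size, queue depth, and runtime are placeholders to adapt to your setup; run it on a node taken out of the cluster, and never point a write workload at a partition holding live data.

```shell
# Hypothetical baseline: 4 KiB random reads at queue depth 64 for 60 s.
# Reads are non-destructive, but still run this on an idle node so the
# measurement (and the cluster) are not disturbed.
sudo fio --name=randread-baseline \
    --filename=/dev/nvme1n1p1 \
    --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k \
    --iodepth=64 --numjobs=1 \
    --runtime=60 --time_based \
    --group_reporting
```

The reported IOPS and bandwidth give you the real ceiling to alert against, e.g. warn at 70-80% of measured capacity instead of watching %util.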

Hope that helps!


Thanks a lot for the detailed explanation and the suggestions on how to improve our monitoring. I really appreciate that you took the time to answer my question and put my mind at ease!

With regards,

Zbynek