FAQ - How to interpret iostat output related to Aerospike

Context

Here is an example of the output of the following command:

iostat -x
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    1.00     0.00     4.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdj              0.00     0.00    0.00  352.00     0.00 45056.00   256.00     0.61    1.73    0.00    1.73   0.31  10.80
xvdl              0.00     0.00    0.00  304.00     0.00 38912.00   256.00     0.84    2.78    0.00    2.78   0.41  12.40
xvdk              0.00     0.00    0.00  328.00     0.00 41984.00   256.00     0.95    2.90    0.00    2.90   0.72  23.60
xvdm              0.00     0.00    0.00  328.00     0.00 41984.00   256.00     0.80    2.44    0.00    2.44   0.70  22.80
xvdn              0.00     0.00    0.00  336.00     0.00 43008.00   256.00     0.82    2.44    0.00    2.44   0.38  12.80
xvdo              0.00     0.00    0.00  344.00     0.00 44032.00   256.00     0.84    2.45    0.00    2.45   0.40  13.60
xvdq              0.00     0.00    0.00  312.00     0.00 39936.00   256.00     0.51    1.64    0.00    1.64   0.28   8.80
xvdp              0.00     0.00    0.00  232.00     0.00 29824.00   257.10     0.38    1.64    0.00    1.64   0.26   6.00
nvme0n1           0.00     0.00 3673.00  352.00 69467.00 45056.00    56.91     1.58    0.39    0.39    0.43   0.14  57.20
nvme4n1           0.00     0.00 3661.00  336.00 63436.50 43008.00    53.26     1.71    0.43    0.39    0.87   0.14  57.20
nvme3n1           0.00     0.00 3628.00  328.00 60616.00 41984.00    51.87     1.44    0.36    0.37    0.29   0.14  54.80
nvme5n1           0.00     0.00 3746.00  344.00 66855.50 44032.00    54.22     1.47    0.36    0.35    0.50   0.15  61.20
nvme6n1           0.00     0.00 3573.00  232.00 39589.00 29696.00    36.42     1.10    0.29    0.27    0.50   0.13  50.80
nvme7n1           0.00     0.00 3579.00  312.00 59083.00 39936.00    50.90     1.61    0.41    0.40    0.60   0.16  60.80
nvme1n1           0.00     0.00 3643.00  328.00 58824.50 41984.00    50.77     1.48    0.38    0.38    0.35   0.15  58.00
nvme2n1           0.00     0.00 3683.00  296.00 55940.00 37888.00    47.16     1.24    0.31    0.33    0.03   0.13  52.80
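
Note that the first report produced by iostat covers the interval since boot, so for live troubleshooting it is usually more useful to sample at a fixed interval and discard that first report. A minimal example (the interval and count here are arbitrary):

iostat -x 5 3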

How should such output be interpreted?

The storage engine configuration is like this:

 storage-engine device {
    device /dev/nvme0n1 /dev/xvdj
    device /dev/nvme1n1 /dev/xvdk
    device /dev/nvme2n1 /dev/xvdl
    device /dev/nvme3n1 /dev/xvdm
    device /dev/nvme4n1 /dev/xvdn
    device /dev/nvme5n1 /dev/xvdo
    device /dev/nvme6n1 /dev/xvdp
    device /dev/nvme7n1 /dev/xvdq

    write-block-size 1024K
  }
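
As a quick sanity check of such a layout, the block devices and their sizes can be listed (each shadow device is expected to be at least as large as its primary device). A minimal example, with the output columns chosen arbitrarily:

lsblk -o NAME,SIZE,MODEL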

Details

Here is a brief explanation of the column fields:

rrqm/s, wrqm/s  The number of read/write requests merged per second that were queued to the device.
                They are normally zero for devices associated with Aerospike namespace(s).

r/s, w/s        The number of read/write requests issued to the device per second.

rkB/s, wkB/s    The amount of data read from / written to the device, expressed in kilobytes per second.

avgrq-sz        The average size (in sectors) of the requests that were issued to the device.
                A logical sector is normally 512 bytes, but some devices have a sector size of 4K.

avgqu-sz        The average queue length of the requests that were issued to the device.

await           The average time (in milliseconds) for I/O requests issued to the device to be served.
                This includes the time spent by the requests in the queue and the time spent servicing them.

svctm           The average service time (in milliseconds). This field is not trusted and will be removed in a future sysstat version.

%util           Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device).
                Device saturation occurs when this value is close to 100%.
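
Since avgrq-sz is expressed in sectors, it can help to confirm the sector sizes the kernel reports for a given device. A minimal example (the device name is taken from the output above):

cat /sys/block/nvme0n1/queue/logical_block_size
cat /sys/block/nvme0n1/queue/physical_block_size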

For the above example, the system disk is xvda, the namespace primary devices are nvme0n1-nvme7n1 and the corresponding shadow devices are xvdj-xvdq.
The system disk should not be used by Aerospike namespace(s). It is recommended to use separate devices (physical storage devices or partitions) for namespace(s) storage, the XDR digestlog and All Flash/pmem mounts. Most system disk activity related to Aerospike comes from writing the logs (if the logs are stored on the system disk).

When using shadow devices, Aerospike only mirrors writes to those devices (client writes, migration writes, replication writes and defragmentation writes). All read-related statistics should be zero for such shadow devices. So for the devices xvdj-xvdq, the values are zero for r/s, rrqm/s, etc.

For those shadow devices, the avgrq-sz is 256, so each request is 256 * 512 bytes (i.e. 131072 bytes or 128K). In this example, this corresponds to the size at which the namespace's write blocks are submitted to the device (a single write-block-size write can be split into multiple requests by the kernel; see the Notes section below).
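
As a quick sanity check, the numbers reported for a shadow device are consistent with each other, for example for xvdj:

  256 sectors * 512 bytes = 131072 bytes = 128 KiB per request
  352 w/s * 128 KiB = 45056 kB/s, matching the reported wkB/s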

%util is usually not a concern until it goes over 90%. See the example below for details.

The throughput should be similar for each device. A non-uniform distribution could indicate a hot key. Configuring multiple devices per namespace can help performance by increasing the number of parallel write buffers. Refer to the Flash Devices Setup documentation for further details.
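
To check whether the load remains evenly spread across the devices over time, periodic sampling can be used. A minimal example (the 5 second interval and the count are arbitrary):

iostat -x 5
sar -d -p 5 12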

Example Device Overload

Here is an example of a device that is overloaded, where %util gets to 100%:

12:44:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
03:17:01 PM       vdb   6582.30 161842.52  77293.79     36.33      2.87      0.44      0.15     97.23
03:18:01 PM       vdb   8425.18 183806.43  68718.50     29.97      3.25      0.39      0.12     97.29
03:19:01 PM       vdb   6896.91 197475.42  59027.96     37.19      3.51      0.51      0.14     98.76
03:20:01 PM       vdb   7637.87 215375.50  62115.50     36.33      3.52      0.46      0.13     98.97
.
.
.
03:37:01 PM       vdb   7138.45 175430.52  31528.92     28.99      4.93      0.69      0.14    100.55
03:38:01 PM       vdb   7174.55 176466.58  31898.75     29.04      4.90      0.68      0.14    100.34
03:39:01 PM       vdb   6883.13 169833.23  30612.92     29.12      5.01      0.73      0.15    100.69

The corresponding line in the Aerospike log file is:

Jun 21 2019 22:38:57 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 9587417328 free-wblocks 31708 write-q 514 write (314397,110.7) defrag-q 3 defrag-read (184067,103.7) defrag-write (85084,47.8)

As the device is not able to keep up with the write load, the write-q goes over the configured max-write-cache, resulting in "queue too deep" warnings and device overload errors returned to the client.
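
Since each of those per-device log lines includes the current write-q value, its evolution can be followed directly from the log. A minimal sketch, assuming the default log location and the device path from this example:

grep '/dev/vdb' /var/log/aerospike/aerospike.log | grep -o 'write-q [0-9]*' | tail -20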

More details can be found in the Warning Write Fail Queue Too Deep article.

Defragmentation example

Here is another example where there was no client traffic, yet the storage devices were heavily utilized because of defragmentation.

The defrag-lwm-pct was increased from 50 to 75 and then to 90:

egrep 'write-q|lwm-pct' /var/log/aerospike/aerospike.log| grep lwm-pct -A 10 | grep nsSSD
Jul 01 2019 21:36:35 GMT: INFO (info): (thr_info.c:3031) Changing value of defrag-lwm-pct of ns nsSSD from 50 to 75
Jul 01 2019 21:36:35 GMT: INFO (info): (thr_info.c:3495) config-set command completed: params context=namespace;id=nsSSD;defrag-lwm-pct=75
Jul 01 2019 21:36:38 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 14896830208 free-wblocks 8 write-q 169 write (134917,20.4) defrag-q 24288 defrag-read (97986,1247.1) defrag-write (34055,19.9)
Jul 01 2019 21:36:58 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 14905374128 free-wblocks 96 write-q 592 write (136273,67.8) defrag-q 22233 defrag-read (97995,0.4) defrag-write (35336,64.1)
Jul 01 2019 21:37:18 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 14910792848 free-wblocks 158 write-q 988 write (137544,63.5) defrag-q 20261 defrag-read (98003,0.4) defrag-write (36562,61.3)
Jul 01 2019 21:37:38 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 14915133008 free-wblocks 127 write-q 1596 write (139190,82.3) defrag-q 17664 defrag-read (98007,0.2) defrag-write (38171,80.4)
--
Jul 01 2019 21:43:36 GMT: INFO (info): (thr_info.c:3031) Changing value of defrag-lwm-pct of ns nsSSD from 75 to 90
Jul 01 2019 21:43:36 GMT: INFO (info): (thr_info.c:3495) config-set command completed: params context=namespace;id=nsSSD;defrag-lwm-pct=90
Jul 01 2019 21:43:38 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 15028316368 free-wblocks 8299 write-q 62 write (151526,13.6) defrag-q 19195 defrag-read (117593,971.2) defrag-write (49554,9.8)
Jul 01 2019 21:43:58 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 15032046064 free-wblocks 7023 write-q 1613 write (155591,203.2) defrag-q 15111 defrag-read (118028,21.8) defrag-write (53589,201.8)
Jul 01 2019 21:44:18 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {nsSSD} /dev/vdb: used-bytes 15040502944 free-wblocks 5639 write-q 3244 write (159475,194.2) defrag-q 11133 defrag-read (118353,16.2) defrag-write (57401,190.6)
Jul 01 2019 21:44:18 GMT: INFO (drv_ssd): (drv_ssd.c:1911) {test} /opt/aerospike/test2.dat: used-bytes 0 free-wblocks 32704 write-q 0 write (0,0.0) defrag-q 0 defrag-read (0,0.0) defrag-write (0,0.0)

And the corresponding iostat output:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
vdb               0.00     0.00   40.12  337.72 10270.66 86355.69   255.73     1.00    2.66   2.59  97.70

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.38    0.00    0.58   45.68    5.76   47.60

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
vdb               0.00     0.00   67.00  341.00 17152.00 87296.00   256.00     1.00    2.46   2.39  97.48

The "sar -d -p" command is another option to show similar statistics, but with timestamps. The corresponding sar output was as follows (grep for vdb only; the columns are the same as in the sar output shown in the previous example):

02:36:01 PM       vdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
02:37:01 PM       vdb     71.18  12831.01   5383.85    255.88      0.73     10.19      5.91     42.10
02:38:01 PM       vdb    187.69  34096.16  13927.69    255.87      1.68      8.95      5.15     96.69
02:39:01 PM       vdb    300.39  53040.52  23827.33    255.89      1.55      5.16      3.07     92.10
02:40:01 PM       vdb     96.38   1677.96  22971.43    255.75      0.32      3.28      2.97     28.66
02:41:01 PM       vdb      2.60     41.08    599.94    246.56      0.01      3.40      3.38      0.88
02:42:01 PM       vdb      3.23     58.20    744.24    248.31      0.01      2.82      2.82      0.91
02:43:01 PM       vdb      3.18     49.05    739.93    248.31      0.01      2.35      2.32      0.74
02:44:01 PM       vdb    134.00  21730.33  12540.50    255.75      0.64      4.78      2.85     38.17
02:45:01 PM       vdb    337.50  57648.31  28727.62    255.93      1.54      4.56      2.73     92.18
02:46:01 PM       vdb    180.58   8614.03  37583.10    255.82      0.47      2.59      2.43     43.95
02:47:01 PM       vdb     66.05   8019.92   8865.72    255.63      0.11      1.64      1.63     10.79

There were two spikes (at 2:39:01 and 2:45:01) corresponding to the two increases of defrag-lwm-pct. As can be observed, even without client traffic, the impact of defragmentation can be significant.
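
For reference, the defrag-lwm-pct changes shown in the log excerpt above can be applied dynamically through asinfo, matching the config-set parameters logged by the server:

asinfo -v 'set-config:context=namespace;id=nsSSD;defrag-lwm-pct=75'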

Notes

The logical disk sector size is normally 512 bytes, but the physical IO size per request is limited by the following:

  • /sys/block/<device>/queue/max_sectors_kb
  • /sys/block/<device>/queue/max_hw_sectors_kb

For example, with a write-block-size of 1MB and max_sectors_kb at its default value of 512, the IO driver will split a single write request into two physical IO requests. With a write-block-size of 8MB, the IO driver will split it into sixteen physical IO requests. While it may make sense to set max_sectors_kb to match the write-block-size, it is best to test and benchmark against the specific workload desired.
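
The current limits can be inspected under sysfs and, within the limit imposed by max_hw_sectors_kb, max_sectors_kb can be changed at runtime. A minimal example (the device name and the value are illustrative only):

cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
cat /sys/block/nvme0n1/queue/max_sectors_kb
echo 1024 | sudo tee /sys/block/nvme0n1/queue/max_sectors_kb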

Keywords

IOSTAT SAR DISK SECTOR DEVICE OVERLOAD DEFRAGMENTATION

Timestamp

October 2019
