FAQ - Why is CPU being used by kworker and kswapd0, causing latencies

Detail

When an Aerospike system is experiencing latency, the following symptoms may be observed:

  • high load average observed via the uptime command
  • top output showing CPU utilization by one or more of kworker, kswapd0 and khugepaged, even though no swapping to disk is occurring (as observed with free)
  • mpstat showing one CPU core constantly at 100% utilization, all of it under %sys
  • iostat showing one or more disks at 100% %util or with very high w_await (in the hundreds of milliseconds)
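
When these symptoms are suspected, the signals above can be captured in one pass. The sketch below is illustrative (the helper name capture_snapshot is invented here; mpstat and iostat come from the sysstat package and are skipped if not installed):

```shell
# capture_snapshot: one-shot collection of the signals listed above.
# mpstat/iostat (sysstat package) are optional and skipped if missing.
capture_snapshot() {
    uptime                                              # load average
    free -m                                             # memory / swap usage
    top -b -n 1 | head -n 15                            # top CPU consumers
    command -v mpstat >/dev/null && mpstat -P ALL 1 1   # per-core breakdown
    command -v iostat >/dev/null && iostat -y -x 1 1    # per-disk latencies
    return 0
}

capture_snapshot
```

Running this periodically (for example from cron) during a latency incident makes it easier to correlate the individual outputs shown below.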

Example uptime output on a 16-core machine, showing load over 40:

$ uptime
 13:02:52 up 571 days,  1:01,  1 user,  load average: 44.84, 43.19, 41.05

Example top output showing high CPU utilization by kworker, kswapd0 and khugepaged:

$ top
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
116145 root      20   0  0.113t 0.100t 0.090t S  1209 85.2  34769,37 asd
  9203 root      20   0       0      0      0 R  97.7  0.0 198:27.23 kworker/u+
   120 root      20   0       0      0      0 R  48.8  0.0  51336:15 kswapd0
  1298 root      20   0   64816  23652   1048 S  18.3  0.0   2664:49 collectl
    95 root      39  19       0      0      0 S   6.1  0.0 414:40.23 khugepaged
 31230 root      20   0   23680   2872   2400 R   6.1  0.0   0:00.02 top

The free command shows no swap in use and a large amount of memory in the page cache:

$ free -m
             total       used       free     shared    buffers     cached
Mem:        122878     114948       7929      94433         71     101744
-/+ buffers/cache:      13132     109745
Swap:            0          0          0
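
The large cached figure is page cache, which the kernel can reclaim; the application's real footprint is roughly used minus buffers and cached, which this version of free prints on the -/+ buffers/cache line (newer versions report it as "available" instead). The same figure can be derived directly from /proc/meminfo:

```shell
# Derive "really used" memory (MB) from /proc/meminfo:
# MemTotal - MemFree - Buffers - Cached, matching free's -/+ buffers/cache line.
awk '/^MemTotal:/ {total=$2}
     /^MemFree:/  {freem=$2}
     /^Buffers:/  {buf=$2}
     /^Cached:/   {cache=$2}
     END {printf "really used: %d MB\n", (total-freem-buf-cache)/1024}' /proc/meminfo
```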

The mpstat command shows one core (here CPU 14) at 100% utilization, entirely under %sys:

$ mpstat -P ALL 2 3
01:02:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:02:50 PM  all   29.54    0.00   43.70    6.48    0.00   11.54    1.19    0.00    0.00    7.55
01:02:50 PM    0   29.90    0.00   39.69    3.09    0.00   21.65    1.03    0.00    0.00    4.64
01:02:50 PM    1   30.10    0.00   44.90    2.04    0.00   19.39    0.51    0.00    0.00    3.06
01:02:50 PM    2   26.13    0.00   46.23    1.51    0.00   22.11    1.01    0.00    0.00    3.02
01:02:50 PM    3   26.29    0.00   44.85    2.06    0.00   22.16    1.03    0.00    0.00    3.61
01:02:50 PM    4   28.93    0.00   36.55    3.05    0.00   24.87    1.02    0.00    0.00    5.58
01:02:50 PM    5   26.80    0.00   47.42    2.06    0.00   19.59    1.03    0.00    0.00    3.09
01:02:50 PM    6   21.50    0.00   48.00    2.00    0.00   23.00    1.00    0.00    0.00    4.50
01:02:50 PM    7   27.78    0.00   36.36    2.02    0.00   29.29    1.01    0.00    0.00    3.54
01:02:50 PM    8   31.91    0.00   35.64   15.43    0.00    0.00    1.60    0.00    0.00   15.43
01:02:50 PM    9   35.60    0.00   37.17   13.09    0.00    0.00    1.57    0.00    0.00   12.57
01:02:50 PM   10   49.48    0.00   34.38    7.29    0.00    0.00    1.04    0.00    0.00    7.81
01:02:50 PM   11   27.18    0.00   40.00   14.87    0.00    0.00    2.05    0.00    0.00   15.90
01:02:50 PM   12   35.42    0.00   33.85   14.58    0.00    0.00    1.56    0.00    0.00   14.58
01:02:50 PM   13   47.12    0.00   31.41    9.95    0.00    0.00    1.57    0.00    0.00    9.95
01:02:50 PM   14    0.00    0.00  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:02:50 PM   15   30.98    0.00   39.67   12.50    0.00    0.00    1.63    0.00    0.00   15.22
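
In a long mpstat capture the saturated core is easy to miss by eye. A small awk filter can flag any core whose %sys exceeds a threshold; the field positions below ($3 = CPU id, $6 = %sys) match the mpstat layout shown above and may differ between sysstat versions:

```shell
# Flag cores spending more than a given percentage of time in %sys.
# Field positions match the mpstat output above: $3 = CPU id, $6 = %sys.
flag_sys_hogs() {
    awk -v limit="$1" '$3 ~ /^[0-9]+$/ && $6+0 > limit \
        {print "core " $3 " at " $6 "% sys"}'
}

# Demonstration on two rows from the capture above:
printf '%s\n' \
  '01:02:50 PM   13   47.12  0.00  31.41  9.95  0.00  0.00  1.57  0.00  0.00  9.95' \
  '01:02:50 PM   14    0.00  0.00 100.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00' \
  | flag_sys_hogs 90
```

On a live system, pipe the output of mpstat -P ALL 2 1 through flag_sys_hogs 90 in the same way.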

The iostat command shows one or more disks (here xvdf) with constantly very high w_await:

$ iostat -y -x 5 4
Linux 4.4.0-92-generic (test-slow-disk)  06/10/2019      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.05    0.00   52.62    7.50    1.30    8.53

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    19.00    0.00    9.40     0.00   180.00    38.30     0.00    0.51    0.00    0.51   0.51   0.48
xvdf              0.00     1.60  198.40   13.80   820.80  1068.00    17.80     5.50   27.01    0.67  405.74   3.17  67.36
nvme1n1           0.00     0.00 8502.40  854.40 124980.90 109363.20    50.09     6.72    0.72    0.75    0.42   0.09  81.60
nvme0n1           0.00     0.00 8059.20  683.20 102825.00 87449.60    43.53     5.00    0.57    0.58    0.45   0.09  76.32

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.70    0.00   51.14    6.62    1.16    8.37

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    10.20    0.00    4.60     0.00   108.00    46.96     0.00    0.35    0.00    0.35   0.35   0.16
xvdf              0.00     0.60  195.40   19.00   781.60  2043.20    26.35     3.31   15.31    0.56  167.03   2.00  42.96
nvme1n1           0.00     0.00 8236.00  878.40 124959.90 112435.20    52.09     6.14    0.67    0.70    0.41   0.09  81.12
nvme0n1           0.00     0.00 7783.00  628.80 94047.50 80486.40    41.50     4.34    0.51    0.52    0.37   0.09  73.04

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.15    0.00   54.83    6.24    1.13    7.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    11.00    0.00    8.20     0.00    76.80    18.73     0.00    0.49    0.00    0.49   0.49   0.40
xvdf              0.00     1.20  192.00   19.20   768.00  1440.80    20.92     8.50   39.46    0.65  427.54   4.73 100.00
nvme1n1           0.00     0.00 8184.60  872.00 123810.10 111616.00    51.99     5.98    0.66    0.69    0.35   0.09  80.64
nvme0n1           0.00     0.00 7764.60  644.80 96174.10 82534.40    42.50     4.84    0.57    0.58    0.44   0.09  74.56

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.45    0.00   54.67    6.74    1.21    7.92

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00    10.40    0.00    4.80     0.00    86.40    36.00     0.00    0.17    0.00    0.17   0.17   0.08
xvdf              0.00     2.40  190.80   19.20   763.20  1474.40    21.31     8.65   40.66    0.51  439.62   4.76 100.00
nvme1n1           0.00     0.00 8051.80  884.80 128865.90 113254.40    54.19     6.72    0.75    0.79    0.37   0.09  82.16
nvme0n1           0.00     0.00 7658.00  646.40 97255.70 82739.20    43.35     4.42    0.53    0.54    0.42   0.08  70.24
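
The same filtering approach works for iostat: flag any device whose write await exceeds a threshold in milliseconds. The field positions below ($1 = device, $12 = w_await) match the iostat -x layout shown above and may differ between sysstat versions:

```shell
# Flag devices whose w_await exceeds a threshold (ms). Field positions
# match the iostat -x output above: $1 = device, $12 = w_await.
flag_slow_disks() {
    awk -v limit="$1" '$12 ~ /^[0-9.]+$/ && $12+0 > limit \
        {print $1 " w_await " $12 " ms"}'
}

# Demonstration on two rows from the capture above:
printf '%s\n' \
  'xvdf     0.00  2.40  190.80  19.20   763.20   1474.40 21.31 8.65 40.66 0.51 439.62 4.76 100.00' \
  'nvme1n1  0.00  0.00 8051.80 884.80 128865.90 113254.40 54.19 6.72  0.75 0.79   0.37 0.09  82.16' \
  | flag_slow_disks 100
```

On a live system, pipe the output of iostat -y -x 5 1 through flag_slow_disks 100 in the same way.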

Answer

Here, the issue is a slow disk: in the example shown, xvdf. Because the disk is slow, with write completion times (w_await) above 400 ms, writes back up in the dirty page cache. As the dirty page cache consumes RAM that was previously free, the kernel worker threads (kworker) constantly try to flush that memory to disk and the kernel swap daemon (kswapd0) keeps working to reclaim page cache. This overutilizes a single core, and any threads of the Aerospike daemon (asd) scheduled on that core fall behind as a consequence. At the application level this manifests as latency.
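
The dirty-page backlog can be confirmed directly on a live system by checking how much dirty data is waiting for writeback, along with the kernel thresholds that govern writeback behaviour (standard Linux procfs paths):

```shell
# Dirty data currently waiting to be written back (kB):
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Thresholds governing when background and blocking writeback kick in:
for p in dirty_background_ratio dirty_ratio dirty_expire_centisecs; do
    printf '%s = %s\n' "$p" "$(cat /proc/sys/vm/$p)"
done
```

A Dirty value that stays large and never drains is consistent with a disk that cannot keep up with the write rate.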

The purpose of the disk in question is not relevant; the issue is not isolated to data disks. It may be the root drive, a drive used for the XDR digest log, a data disk or any other drive.

The solution is to troubleshoot the disk in question. It may not have enough maximum write IOPS or throughput for the workload, or it may be faulty.

The hardware manufacturer should provide instructions for troubleshooting a faulty drive. A logical starting point is the drive's SMART status, if the drive supports it (via the Linux smartmontools package).
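
As a sketch of that starting point (the device name /dev/xvdf is taken from the example above; substitute the suspect drive, and note that smartctl typically needs root):

```shell
# Quick SMART health check of a suspect drive (requires smartmontools).
check_drive() {
    dev="$1"
    if command -v smartctl >/dev/null 2>&1; then
        smartctl -H "$dev" || true     # overall health verdict
        smartctl -A "$dev" \
            | grep -Ei 'realloc|pending|uncorrect|wear|media' || true
    else
        echo "smartmontools not installed (e.g. apt-get install smartmontools)"
    fi
}

# /dev/xvdf is the suspect disk from the example above; substitute your own.
check_drive /dev/xvdf
```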

Notes

  • Aerospike advises that Transparent Huge Pages (THP) be disabled, as their effect on high-performance databases has been observed to be negative. This article provides full details.
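
    Whether THP is active can be checked (and changed at runtime) via sysfs; the path below is standard on kernels built with THP support:

```shell
# Report Transparent Hugepage status; the sysfs path is standard on
# kernels built with THP support.
check_thp() {
    thp=/sys/kernel/mm/transparent_hugepage
    if [ -r "$thp/enabled" ]; then
        cat "$thp/enabled"   # active mode shown in [brackets], e.g. [always]
        cat "$thp/defrag"
        # To disable at runtime (as root), then persist via boot config:
        #   echo never > /sys/kernel/mm/transparent_hugepage/enabled
        #   echo never > /sys/kernel/mm/transparent_hugepage/defrag
    else
        echo "THP not supported on this kernel"
    fi
}

check_thp
```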

  • When mpstat shows 100% utilization under %irq or %soft, the issue is most likely IRQ balancing. See here for more details on how to resolve IRQ balancing issues.
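
    A quick way to see whether interrupts are concentrated on one core is to sum the columns of /proc/interrupts (standard Linux procfs; the header row names the CPUs):

```shell
# Sum interrupt counts per CPU from /proc/interrupts; a heavily skewed
# distribution suggests IRQ affinity needs rebalancing.
irq_per_cpu() {
    awk 'NR==1 { ncpu = NF; next }
         { for (i = 2; i <= ncpu + 1; i++)
               if ($i ~ /^[0-9]+$/) tot[i-1] += $i }
         END { for (c = 1; c <= ncpu; c++)
                   printf "CPU%d: %d\n", c-1, tot[c] }' /proc/interrupts
}

irq_per_cpu
```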

Keywords

IOWAIT MPSTAT KWORKER KSWAPD KHUGEPAGED LATENCY CPU

Timestamp

24 June 2019

© 2015 Copyright Aerospike, Inc. | All rights reserved. Creators of the Aerospike Database.