FAQ - Why is CPU being used by kworker and kswapd0 causing latencies?
Detail
When an Aerospike system is experiencing latency, the following symptoms may be observed:
- high system load observed via the `uptime` command
- `top` output showing CPU utilization by one or more of `kworker`, `kswapd0` and `khugepaged`, while no swapping to disk is happening (as observed from `free`)
- `mpstat` showing one CPU core highly utilized, at 100% at all times, under `%sys`
- `iostat` showing one or more disks with either 100% `%util` or very high `w_await` (in the hundreds of milliseconds)
Example `uptime` output on a 16-core machine, showing a load average above 40:
```
$ uptime
13:02:52 up 571 days, 1:01, 1 user, load average: 44.84, 43.19, 41.05
```
Example output from `top` showing high `kworker`, `khugepaged` and `kswapd0` utilization:
```
$ top
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM      TIME+ COMMAND
116145 root      20   0  0.113t 0.100t 0.090t S  1209 85.2   34769,37 asd
  9203 root      20   0       0      0      0 R  97.7  0.0  198:27.23 kworker/u+
   120 root      20   0       0      0      0 R  48.8  0.0   51336:15 kswapd0
  1298 root      20   0   64816  23652   1048 S  18.3  0.0    2664:49 collectl
    95 root      39  19       0      0      0 S   6.1  0.0  414:40.23 khugepaged
 31230 root      20   0   23680   2872   2400 R   6.1  0.0    0:00.02 top
```
The `free` command shows that no swapping is happening and that a large amount of memory is in the `cached` state:
```
$ free -m
              total       used       free     shared    buffers     cached
Mem:         122878     114948       7929      94433         71     101744
-/+ buffers/cache:       13132     109745
Swap:             0          0          0
```
The `mpstat` command shows one core (CPU 14 below) at 100% utilization under `%sys`:
```
$ mpstat -P ALL 2 3
01:02:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
01:02:50 PM all 29.54 0.00 43.70 6.48 0.00 11.54 1.19 0.00 0.00 7.55
01:02:50 PM 0 29.90 0.00 39.69 3.09 0.00 21.65 1.03 0.00 0.00 4.64
01:02:50 PM 1 30.10 0.00 44.90 2.04 0.00 19.39 0.51 0.00 0.00 3.06
01:02:50 PM 2 26.13 0.00 46.23 1.51 0.00 22.11 1.01 0.00 0.00 3.02
01:02:50 PM 3 26.29 0.00 44.85 2.06 0.00 22.16 1.03 0.00 0.00 3.61
01:02:50 PM 4 28.93 0.00 36.55 3.05 0.00 24.87 1.02 0.00 0.00 5.58
01:02:50 PM 5 26.80 0.00 47.42 2.06 0.00 19.59 1.03 0.00 0.00 3.09
01:02:50 PM 6 21.50 0.00 48.00 2.00 0.00 23.00 1.00 0.00 0.00 4.50
01:02:50 PM 7 27.78 0.00 36.36 2.02 0.00 29.29 1.01 0.00 0.00 3.54
01:02:50 PM 8 31.91 0.00 35.64 15.43 0.00 0.00 1.60 0.00 0.00 15.43
01:02:50 PM 9 35.60 0.00 37.17 13.09 0.00 0.00 1.57 0.00 0.00 12.57
01:02:50 PM 10 49.48 0.00 34.38 7.29 0.00 0.00 1.04 0.00 0.00 7.81
01:02:50 PM 11 27.18 0.00 40.00 14.87 0.00 0.00 2.05 0.00 0.00 15.90
01:02:50 PM 12 35.42 0.00 33.85 14.58 0.00 0.00 1.56 0.00 0.00 14.58
01:02:50 PM 13 47.12 0.00 31.41 9.95 0.00 0.00 1.57 0.00 0.00 9.95
01:02:50 PM 14 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:02:50 PM 15 30.98 0.00 39.67 12.50 0.00 0.00 1.63 0.00 0.00 15.22
```
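To see which threads are competing for the saturated core, the `psr` field of `ps` (the processor a thread last ran on) can be filtered on the hot core number. A minimal sketch, assuming core 14 as in the output above:

```
# List threads whose last-run core (psr) is 14, busiest first.
# Adjust the core number to match the saturated core on your system.
$ ps -eLo psr,pcpu,pid,tid,comm --sort=-pcpu | awk '$1 == 14' | head
```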
The `iostat` command shows one or more disks with constantly very high `w_await`:
```
$ iostat -y -x 5 4
Linux 4.4.0-92-generic (test-slow-disk) 06/10/2019 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
         30.05  0.00   52.62    7.50   1.30  8.53
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 19.00 0.00 9.40 0.00 180.00 38.30 0.00 0.51 0.00 0.51 0.51 0.48
xvdf 0.00 1.60 198.40 13.80 820.80 1068.00 17.80 5.50 27.01 0.67 405.74 3.17 67.36
nvme1n1 0.00 0.00 8502.40 854.40 124980.90 109363.20 50.09 6.72 0.72 0.75 0.42 0.09 81.60
nvme0n1 0.00 0.00 8059.20 683.20 102825.00 87449.60 43.53 5.00 0.57 0.58 0.45 0.09 76.32
avg-cpu: %user %nice %system %iowait %steal %idle
         32.70  0.00   51.14    6.62   1.16  8.37
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 10.20 0.00 4.60 0.00 108.00 46.96 0.00 0.35 0.00 0.35 0.35 0.16
xvdf 0.00 0.60 195.40 19.00 781.60 2043.20 26.35 3.31 15.31 0.56 167.03 2.00 42.96
nvme1n1 0.00 0.00 8236.00 878.40 124959.90 112435.20 52.09 6.14 0.67 0.70 0.41 0.09 81.12
nvme0n1 0.00 0.00 7783.00 628.80 94047.50 80486.40 41.50 4.34 0.51 0.52 0.37 0.09 73.04
avg-cpu: %user %nice %system %iowait %steal %idle
         30.15  0.00   54.83    6.24   1.13  7.64
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 11.00 0.00 8.20 0.00 76.80 18.73 0.00 0.49 0.00 0.49 0.49 0.40
xvdf 0.00 1.20 192.00 19.20 768.00 1440.80 20.92 8.50 39.46 0.65 427.54 4.73 100.00
nvme1n1 0.00 0.00 8184.60 872.00 123810.10 111616.00 51.99 5.98 0.66 0.69 0.35 0.09 80.64
nvme0n1 0.00 0.00 7764.60 644.80 96174.10 82534.40 42.50 4.84 0.57 0.58 0.44 0.09 74.56
avg-cpu: %user %nice %system %iowait %steal %idle
         29.45  0.00   54.67    6.74   1.21  7.92
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvda 0.00 10.40 0.00 4.80 0.00 86.40 36.00 0.00 0.17 0.00 0.17 0.17 0.08
xvdf 0.00 2.40 190.80 19.20 763.20 1474.40 21.31 8.65 40.66 0.51 439.62 4.76 100.00
nvme1n1 0.00 0.00 8051.80 884.80 128865.90 113254.40 54.19 6.72 0.75 0.79 0.37 0.09 82.16
nvme0n1 0.00 0.00 7658.00 646.40 97255.70 82739.20 43.35 4.42 0.53 0.54 0.42 0.08 70.24
```
Answer
Here, the issue is a slow disk: in the example shown, `xvdf`. Because the disk cannot keep up, with writes waiting over 400 ms (`w_await`) at all times, writes pile up in the dirty page cache. Since the dirty page cache is now consuming much of the RAM that was previously free, the kernel worker (`kworker`) constantly tries to flush that memory to disk and the kernel swap daemon (`kswapd0`) keeps working on reclaiming page cache. This results in a single core being overutilized, and any threads of the Aerospike daemon (`asd`) scheduled on that CPU core will fall behind as a consequence. At the application level this manifests as latency.
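To confirm this buildup, the kernel's writeback counters can be watched; a steadily growing `Dirty` value alongside persistent `Writeback` indicates that flushing cannot keep up with incoming writes. A minimal sketch using standard Linux interfaces:

```
# Watch dirty and writeback page totals once a second.
$ watch -n 1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"

# The sysctls that govern when background flushing starts and when
# writers are throttled, as a fraction of reclaimable memory.
$ sysctl vm.dirty_background_ratio vm.dirty_ratio
```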
The purpose of the disk in question is not relevant; this issue is not isolated to data disks. It may be the root drive, a drive used for the XDR digest log, a data disk or any other drive.
The solution is to troubleshoot the disk in question. The disk may not have enough maximum write IOPS or throughput for the workload, or it may be faulty.
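To measure what the disk can actually sustain, a synthetic write test can be run against it with `fio`. This is a sketch only, and it is destructive: writing directly to the device overwrites its contents, so only run it on a drive whose data can be safely discarded (the device name `/dev/xvdf` is taken from the example above):

```
# Destructive: overwrites /dev/xvdf. 60-second random-write test with
# direct I/O (bypassing the page cache) to measure raw device capability.
$ sudo fio --name=write-test --filename=/dev/xvdf --rw=randwrite \
    --bs=128k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based
```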
The hardware manufacturer should provide instructions on how to troubleshoot a faulty drive. A logical starting point is the drive's SMART status, should the drive in question support it (available via the Linux `smartmontools` package).
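For example, assuming `smartmontools` is installed, `smartctl` can report the drive's overall health verdict and its full attribute table:

```
# Quick overall health verdict (PASSED/FAILED).
$ sudo smartctl -H /dev/xvdf

# Full SMART output, including error logs and attribute counters
# such as reallocated sectors.
$ sudo smartctl -a /dev/xvdf
```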
Notes
- Aerospike advises that Transparent Huge Pages (THP) be disabled, as their effect on high-performance databases has been seen to be negative. This article provides full details; a sketch for checking and disabling THP follows this list.
- When mpstat shows 100% utilization under `%irq` or `%soft`, the issue is most likely IRQ balancing. See here for more details on how to resolve IRQ balancing issues.
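As referenced in the first note, a minimal sketch for checking and disabling THP via the standard sysfs interface (the currently active value is shown in square brackets, and a change made this way does not persist across reboots):

```
# Check the current THP setting; the active value is in brackets.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# Disable THP and its defragmentation until the next reboot.
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```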
Keywords
IOWAIT MPSTAT KWORKER KSWAPD KHUGEPAGED LATENCY CPU
Timestamp
24 June 2019