FAQ - Why is a single core running at 100% intermittently?

Detail

When running an Aerospike server, operating system monitoring tools show that a single CPU is periodically at or close to 100% utilization. What is the reason for this?

The following example shows how the CPU utilization looks when running mpstat -P ALL 2 3. CPU 2 is at 100% in both samples:

07:14:21 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:14:23 PM  all    4.26    0.00    0.94    3.95    0.00    0.44    0.00    0.00    0.00   90.40
07:14:23 PM    0    3.06    0.00    2.04    0.00    0.00    1.02    0.00    0.00    0.00   93.88
07:14:23 PM    1    2.06    0.00    1.55    0.00    0.00    1.55    0.00    0.00    0.00   94.85
07:14:23 PM    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
07:14:23 PM    3    2.00    0.00    1.50    0.00    0.00    2.50    0.00    0.00    0.00   94.00

07:14:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
07:14:25 PM  all    3.83    0.00    0.71    4.14    0.00    0.33    0.00    0.00    0.00   90.99
07:14:25 PM    0    1.01    0.00    2.02    0.51    0.00    1.52    0.00    0.00    0.00   94.95
07:14:25 PM    1    0.51    0.00    1.03    0.00    0.00    1.54    0.00    0.00    0.00   96.92
07:14:25 PM    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
07:14:25 PM    3    1.52    0.00    1.01    0.00    0.00    1.01    0.00    0.00    0.00   96.46
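To pick the saturated core out of longer mpstat output, a small filter can help. The following is a minimal sketch, not part of any Aerospike or sysstat tooling: busy_cpus is a hypothetical helper name, the 5% idle threshold is an arbitrary assumption, and the parsing assumes the column layout shown above (the CPU column followed by ten percentage columns ending in %idle).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the CPU numbers whose %idle is below a threshold.
# Assumes the layout shown above: CPU column followed by ten metric columns,
# the last of which is %idle. Other sysstat versions may differ.
busy_cpus() {
  local max_idle=${1:-5}
  awk -v max_idle="$max_idle" '
    # Skip short lines, headers and the "all" row; keep numeric CPU ids only.
    NF > 10 && $(NF-10) ~ /^[0-9]+$/ && $NF + 0 < max_idle { print $(NF-10) }
  ' | sort -un
}

# Demonstration using two rows from the sample output above.
busy_cpus 5 <<'EOF'
07:14:23 PM    1    2.06    0.00    1.55    0.00    0.00    1.55    0.00    0.00    0.00   94.85
07:14:23 PM    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
EOF
# prints: 2
```

On a live system the same function could be fed directly, for example with mpstat -P ALL 2 3 | busy_cpus 5.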

Answer

This is due to the Aerospike Namespace Supervisor process, nsup. The Namespace Supervisor is responsible for operations such as eviction and expiration. By default, nsup is single-threaded. As of Aerospike 4.5.1.5, this can be changed through the nsup-threads configuration parameter. Aerospike 4.5.1.5 also introduced a more efficient algorithm for expiring and evicting records. It does not rely on the fabric channel, and instead expires or evicts records directly as each partition is reduced by an nsup thread.

An nsup cycle can be time consuming, as it has to cycle through all records in a given namespace. The frequency with which nsup runs is controlled by nsup-period, which defines the time between nsup waking up for one run and waking up for the next. If the time taken for an nsup cycle is greater than the nsup-period then, in effect, nsup will be running continuously.

The behaviour of nsup can be observed using the following log lines:

{ns-name} nsup-done: non-expirable 42162 expired (576066,922) evicted (24000935,259985) evict-ttl 134000 total-ms 120583

In the example above, the time taken for the nsup cycle concerned is 120583 ms, which exceeds the default nsup-period of 120 s, so here nsup would appear to be running all the time. This is not a problem and is a normal part of Aerospike operation. In versions later than Aerospike 4.5.1.5 the behaviour is more obvious, because there is less context switching and nsup tends to occupy one particular CPU all the time.
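The comparison between total-ms and nsup-period can be scripted. The sketch below is illustrative only: nsup_total_ms and nsup_overruns are hypothetical helper names, and the parsing assumes the nsup-done line format shown above, which may differ between server versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and print the total-ms value
# of each nsup-done line (format assumed to match the example above).
nsup_total_ms() {
  sed -n 's/.*nsup-done:.* total-ms \([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: compare a cycle time (ms) with nsup-period (seconds).
nsup_overruns() {
  local total_ms=$1 period_s=$2
  if [ "$total_ms" -gt $((period_s * 1000)) ]; then
    echo "continuous"            # cycle is longer than the period
  else
    echo "idle-between-cycles"   # nsup sleeps before the next run
  fi
}

line='{ns-name} nsup-done: non-expirable 42162 expired (576066,922) evicted (24000935,259985) evict-ttl 134000 total-ms 120583'
ms=$(printf '%s\n' "$line" | nsup_total_ms)
echo "total-ms=$ms -> $(nsup_overruns "$ms" 120)"
# prints: total-ms=120583 -> continuous
```

Against a real log, the first function could be fed with grep nsup-done on the server log file. The path to that file depends on the logging configuration and is not assumed here.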

To validate that the 100% CPU usage is due to nsup running, nsup can be disabled on a temporary basis by setting nsup-period to 0. This can be done dynamically.

asinfo -v "set-config:context=namespace;id=namespaceName;nsup-period=0"
ok

Once nsup has been shown to be the reason for the CPU showing 100%, it should be re-enabled by setting a non-zero nsup-period. If nsup is not re-enabled, records cannot be expired or evicted. If records are not expected to expire, nsup can be left permanently disabled in this way.
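Re-enabling uses the same set-config command with a non-zero value. The wrapper below is a minimal sketch: set_nsup_period is a hypothetical function name, namespaceName is the same placeholder used above, and 120 is only the default mentioned in this article. The value that was configured before the test should be used instead if it differs.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the set-config command shown above.
# $1 = namespace name, $2 = nsup-period in seconds (0 disables nsup).
set_nsup_period() {
  local ns=$1 seconds=$2
  asinfo -v "set-config:context=namespace;id=${ns};nsup-period=${seconds}"
}

# Example (commented out so the snippet has no side effects when sourced):
# set_nsup_period namespaceName 0      # disable temporarily for the test
# set_nsup_period namespaceName 120    # re-enable afterwards
```

A dynamic change made this way is not written to the configuration file, so the file should also be checked if the setting needs to survive a restart.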

Notes

  • Some anecdotal differences have been observed between the previous versions (prior to 4.5.1) and the new ones (4.5.1 and above). Specifically, it seems that older versions may be more likely to switch between CPU cores across nsup runs than the new versions.

  • Should it be required, nsup-threads can be increased to decrease the time taken for an nsup cycle and spread the load across multiple cores. When making any of these changes, it is recommended to monitor normal read and write latencies to confirm that the change does not affect them.
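As an illustration of the second note, nsup-threads could be raised with the same set-config mechanism. This is a sketch under assumptions: it assumes nsup-threads can be set dynamically on the server version in use (4.5.1.5 or later), set_nsup_threads is a hypothetical function name, and the value 2 in the example is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: set nsup-threads for one namespace.
# Assumes the parameter is dynamic on the server version in use.
set_nsup_threads() {
  local ns=$1 threads=$2
  # Reject empty, non-numeric and zero values before calling asinfo.
  case $threads in
    ''|*[!0-9]*|0)
      echo "nsup-threads must be a positive integer" >&2
      return 1
      ;;
  esac
  asinfo -v "set-config:context=namespace;id=${ns};nsup-threads=${threads}"
}

# Example (commented out so the snippet has no side effects when sourced):
# set_nsup_threads namespaceName 2
```

After such a change, the total-ms value in subsequent nsup-done log lines indicates whether the cycle time has actually dropped.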

Keywords

NSUP 100% CPU CORE UTILISED

Timestamp