What does "interrupt took too long" mean?

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Background

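The Linux kernel gathers samples with the 'perf' performance monitor, which is designed to do so without affecting latencies. As part of this, it tracks how long interrupt handling takes. If that time exceeds the allowed budget, the kernel lowers the sampling rate and logs a message similar to this: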

kernel: [ 6491.061361] perf: interrupt took too long (6650 > 6452), lowering kernel.perf_event_max_sample_rate to 30000
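
To trend or alert on these messages, the numbers can be pulled out of each line. The following is a minimal sketch, not an official tool: the helper name parse_perf_throttle is hypothetical, and reading the two values in parentheses as "observed time > allowed time" (in nanoseconds, as far as the kernel source indicates) is an assumption.

```python
import re

# Matches the perf throttling message shown above.
PERF_RE = re.compile(
    r"perf: interrupt took too long \((\d+) > (\d+)\), "
    r"lowering kernel\.perf_event_max_sample_rate to (\d+)"
)


def parse_perf_throttle(line):
    """Return the three numbers from a perf throttling line, or None.

    Hypothetical helper: the key names are chosen for illustration only.
    """
    match = PERF_RE.search(line)
    if match is None:
        return None
    observed, allowed, new_rate = (int(value) for value in match.groups())
    return {"observed": observed, "allowed": allowed, "new_rate": new_rate}


sample = (
    "kernel: [ 6491.061361] perf: interrupt took too long (6650 > 6452), "
    "lowering kernel.perf_event_max_sample_rate to 30000"
)
print(parse_perf_throttle(sample))
```

A sample rate that keeps dropping over successive messages suggests the slow interrupts are recurring rather than a one-off.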

Meaning

This essentially means that the machine spent a long time servicing an interrupt. Possible causes include:

  1. Disk IO: an interrupt taking too long can be caused by a faulty, slow or overloaded disk. Alternatively, it can be caused by an issue with a disk or RAID controller.
  2. Network IO: an interrupt taking too long is most often caused by a suboptimal network driver. Alternatively, it can be caused by network issues, although protocol switching should, in theory, prevent this.

Troubleshooting

Disk IO can be checked and confirmed with disk IO statistics (sar from the sysstat package and/or iostat). If disk IO is not the reason for the slow interrupts, network IO is the likely remaining cause. In that case, the problem needs to be investigated on the network and/or kernel side.
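
As an illustration of what the disk IO statistics measure, the sketch below derives iostat-like figures (average wait per IO, average queue size, utilization) from two snapshots of /proc/diskstats. It assumes the standard Linux field layout of that file. The function names read_diskstats and disk_metrics are hypothetical, and no thresholds are implied: what counts as too high depends on the device, so compare against a baseline from a healthy period.

```python
def read_diskstats(path="/proc/diskstats"):
    """Return {device: [counters...]} parsed from /proc/diskstats.

    The counters are the integer fields that follow the device name.
    """
    stats = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 14:  # skip lines without the 11 classic counters
                continue
            stats[parts[2]] = [int(value) for value in parts[3:]]
    return stats


def disk_metrics(before, after, interval_s):
    """Compute iostat-like metrics for one device from two counter lists."""
    delta = [new - old for old, new in zip(before, after)]
    ios = delta[0] + delta[4]       # reads completed + writes completed
    wait_ms = delta[3] + delta[7]   # ms spent reading + ms spent writing
    interval_ms = interval_s * 1000.0
    return {
        # Average time per IO, comparable to iostat's await columns.
        "await_ms": wait_ms / ios if ios else 0.0,
        # Weighted IO time over wall time, comparable to avgqu-sz / aqu-sz.
        "avg_queue": delta[10] / interval_ms,
        # Share of the interval during which the device had IO in flight.
        "util_pct": 100.0 * delta[9] / interval_ms,
    }
```

To use it, call read_diskstats() twice a few seconds apart and pass each device's two counter lists, plus the elapsed seconds, to disk_metrics(). A sustained high await_ms or avg_queue, or util_pct near 100, on the data device points toward the disk as the source of the slow interrupts.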

The first troubleshooting step should be to check the kernel messages in /var/log/(messages|syslog) as well as dmesg. If these show tracebacks from the vmxnet driver, the slow interrupts are caused by a faulty network driver. In that case, contact the network card vendor or try upgrading to the latest available stable kernel.
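
The log check can be scripted. The sketch below groups kernel log lines by a few patterns of interest. The function name scan_kernel_log and the pattern set are assumptions made for illustration: "Call Trace:" is the marker the kernel normally prints at the start of a stack dump, and "vmxnet" matches any line that mentions that driver.

```python
import re

# Illustrative patterns only; extend for other drivers as needed.
PATTERNS = {
    "perf_throttle": re.compile(r"perf: interrupt took too long"),
    "vmxnet": re.compile(r"vmxnet", re.IGNORECASE),
    "call_trace": re.compile(r"Call Trace:"),
}


def scan_kernel_log(lines):
    """Group log lines by the pattern they match.

    `lines` can be any iterable of strings, for example an open log file
    or the output of dmesg split into lines.
    """
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

For example, scan_kernel_log(open("/var/log/messages")) returns the matching lines per category. Stack traces that mention vmxnet close in time to the perf messages would support the faulty driver explanation.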

If there is no issue with the kernel drivers, the network is most likely at fault, most probably at the first hop. This then needs to be checked on the network side.
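
Before escalating to the network team, one host-side check is whether the interface error and drop counters are increasing. This step is a suggestion and not part of the original procedure. The sketch below parses the contents of /proc/net/dev, assuming the standard Linux column layout (8 receive columns followed by 8 transmit columns). The function name parse_net_dev is hypothetical.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: {counter: value}}."""
    result = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        if len(fields) < 16:
            continue
        result[name.strip()] = {
            "rx_errs": fields[2],   # receive errors
            "rx_drop": fields[3],   # receive drops
            "tx_errs": fields[10],  # transmit errors
            "tx_drop": fields[11],  # transmit drops
        }
    return result
```

Calling parse_net_dev(open("/proc/net/dev").read()) twice, some time apart, shows whether any of these counters are growing. Counters that rise while the perf messages appear are worth passing on to whoever investigates the first hop.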

Keywords

network interrupt took too long latency

Timestamp

12/27/2018

I think it would be good if you covered what “easily check disk IO stats” covers - examples and such. Do we check for a certain q depth? Avgqu-sz? await? Also s/liekly/likely/