High Latency Spikes Occurring Every 2-3 Minutes: Persistent Lag After Stopping Writes - Is NSUP the Root Cause?

hensen_yoo · November 20, 2025, 10:19am

Hello Aerospike Community Team,

We are experiencing a persistent, periodic latency issue in our cluster and are seeking community insights, especially regarding the Namespace Supervisor (NSUP) process.

We observe significant latency/lag spikes occurring consistently every 2-3 minutes, even after taking troubleshooting steps that ruled out our primary write workload.

1. Environment and Configuration

Category	Specification
Hardware	CPU: 2 × Xeon Silver 4310 (12 cores each, 2.1 GHz) / RAM: 128 GB (8 × 16 GB, PC4-3200AA-R) / SSD: 8 × 480 GB, RAID 5
Aerospike Version	8.0.0.2

Cluster Configuration (`namespace aidev`)

namespace aidev {
    replication-factor 2

    stop-writes-sys-memory-pct 90
    evict-indexes-memory-pct 85

    nsup-period 120   # 2 minutes
    nsup-threads 2

    storage-engine device {
        evict-used-pct 50
        post-write-cache 256M

        filesize 180G
        file /home1/{userid}/data/namespace/aidev/aerospike-activation-data[1-10]
    }
}

2. Data Status and Observed Behavior

Metric	Value	Note
Problematic Set	`user_context`…	Primary set suspected of causing lag.
Object Count	481 Million Objects	Total count in the set.
Data Size on Disk	2.00 TB	Total size of the data on SSD/Device.
Write Pattern	`user_context` set is written to every 3 minutes.	Original write cycle.

Observed Problem: The cluster experiences noticeable lag during retrieval operations at intervals of approximately 3 minutes.

Total data bytes on the cluster: 2.4 TB.

3. Troubleshooting and Current Hypothesis

Our team initially suspected the large dataset or the regular Read/Write operations as the cause of the lag.

Action Taken: We completely stopped all Read/Write operations to the user_context set.
Result: The lag did not disappear. Retrieval operations still showed performance degradation every 3 minutes.
Monitoring: We checked general cluster monitoring and node-exporter resources (including the network traffic shown, which is high at ~681 Mb/s transmit on bond0), but no other resource or internal metric shows a corresponding spike at the exact 2-3 minute period.
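Since the interval is described as both "2-3 minutes" and "3 minutes", it may help to pin it down numerically before comparing it with any server-side period. A minimal sketch in Python, assuming the spike times have already been pulled out of client-side latency logs as seconds relative to some reference point. The function name and the sample values are hypothetical, not taken from the monitoring above.

```python
from statistics import median

def median_spike_interval(spike_times):
    """Return the median gap in seconds between consecutive spike timestamps."""
    ordered = sorted(spike_times)
    if len(ordered) < 2:
        raise ValueError("need at least two spike timestamps")
    # Gap between each spike and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps)

# Hypothetical spike times in seconds, for illustration only.
sample = [0, 180, 361, 540, 719]
print(median_spike_interval(sample))
```

The median is used rather than the mean so that a single missed or doubled spike does not skew the estimate much.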

Given the persistence of the lag and the lack of other correlating metrics, we are now highly suspicious of the NSUP (Namespace Supervisor) process, as its nsup-period is set to 120 seconds (2 minutes).

Our Core Question:

Could the TTL expiration processing, driven by the NSUP process running every 2-3 minutes, be the underlying cause of significant cluster latency/lag, even on a system utilizing the storage-engine device?

Any insights or recommended statistics to monitor during this 2-3 minute window would be greatly appreciated.
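As a starting point for that monitoring, one option is to snapshot the namespace statistics just before and just after a spike and see which counters moved. The sketch below is in Python and assumes the text comes from `asinfo -v "namespace/aidev"`, which returns a semicolon-separated list of `key=value` pairs. The helper names and the sample strings are hypothetical, and the exact statistic names (here `expired_objects` and `evicted_objects`) should be checked against the metrics reference for the server version in use.

```python
def parse_info_response(response):
    """Parse a semicolon-separated 'key=value' info response into a dict."""
    stats = {}
    for pair in response.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key] = value
    return stats

def changed_counters(before, after, names):
    """Return {name: delta} for the named integer stats that changed."""
    deltas = {}
    for name in names:
        if name in before and name in after:
            delta = int(after[name]) - int(before[name])
            if delta != 0:
                deltas[name] = delta
    return deltas

# Made-up responses for illustration; real output has many more fields.
snap_a = parse_info_response("objects=1000;expired_objects=10;evicted_objects=0")
snap_b = parse_info_response("objects=990;expired_objects=20;evicted_objects=0")
print(changed_counters(snap_a, snap_b, ["expired_objects", "evicted_objects"]))
```

If none of the expiration or eviction counters change across a spike, that would be a hint that the cause lies elsewhere.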

Thank you for your time.

hensen_yoo · November 21, 2025, 12:58am

To clarify the situation accurately: The user_context data has a 3-minute expiration cycle, whereas the nsup-period is set to 120 seconds.

My understanding is that if the lag were caused by expired data cleanup, the spikes should occur every 2 minutes to align with the NSUP cycle. However, we are currently observing lag on the server exactly every 3 minutes.
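The alignment argument above can be checked against recorded spike times. The following Python sketch, with a hypothetical function name and made-up timestamps, measures how far each spike falls from the nearest multiple of a candidate period, counted from the first spike. A period that matches the spikes should give an error near zero.

```python
def mean_phase_error(spike_times, period):
    """Mean distance in seconds from each spike to the nearest multiple
    of `period`, measured from the first spike."""
    ordered = sorted(spike_times)
    if not ordered:
        raise ValueError("need at least one spike timestamp")
    start = ordered[0]
    errors = []
    for t in ordered:
        offset = (t - start) % period
        # Distance to the closer of the previous and next multiple.
        errors.append(min(offset, period - offset))
    return sum(errors) / len(errors)

# Hypothetical spike times in seconds, for illustration only.
sample = [0, 180, 360, 540]
for period in (120, 180):
    print(period, mean_phase_error(sample, period))
```

Note that any divisor of the true period (60 seconds, for example) also scores zero, so this is only useful for comparing specific candidates such as 120 and 180 seconds.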

hensen_yoo · November 21, 2025, 5:13am

I increased the nsup-period to 300 and monitored the system, but the lag still persists at approximately 3-minute intervals. Therefore, I have concluded that nsup is not the cause. I would appreciate any further ideas on what else I should check.

system · February 13, 2026, 5:13am

This topic was automatically closed 84 days after the last reply. New replies are no longer allowed.

