High Latency Spikes Occurring Every 2-3 Minutes: Persistent Lag After Stopping Writes - Is NSUP the Root Cause?

Hello Aerospike Community Team,

We are experiencing a persistent, periodic latency issue in our cluster and are seeking community insights, especially regarding the Namespace Supervisor (NSUP) process.

We observe significant latency/lag spikes occurring consistently every 2-3 minutes, even after taking troubleshooting steps that ruled out our primary write workload.


1. Environment and Configuration

| Category | Specification |
|---|---|
| Hardware | CPU: Xeon Silver 4310, 2.1 GHz, 12 cores x 2 / RAM: 128 GB (PC4-3200AA-R, 16 GB x 8) / SSD: 480 GB x 8, RAID 5 |
| Aerospike Version | 8.0.0.2 |

Cluster Configuration (namespace aidev)

namespace aidev {
    replication-factor 2

    stop-writes-sys-memory-pct 90
    evict-indexes-memory-pct 85

    nsup-period 120   # 2 minutes
    nsup-threads 2

    storage-engine device {
        evict-used-pct 50
        post-write-cache 256M

        filesize 180G
        file /home1/{userid}/data/namespace/aidev/aerospike-activation-data[1-10]
    }
}


2. Data Status and Observed Behavior

| Metric | Value | Note |
|---|---|---|
| Problematic Set | user_context | Primary set suspected of causing lag. |
| Object Count | 481 million objects | Total count in the set. |
| Data Size on Disk | 2.00 TB | Total size of the data on SSD/device. |
| Write Pattern | user_context set is written to every 3 minutes. | Original write cycle. |

Observed Problem: The cluster experiences noticeable lag during retrieval operations at intervals of approximately 3 minutes.


Total data bytes across the cluster: 2.4 TB.

3. Troubleshooting and Current Hypothesis

Our team initially suspected the large dataset or the regular Read/Write operations as the cause of the lag.

  1. Action Taken: We completely stopped all Read/Write operations to the user_context set.

  2. Result: The lag did not disappear. Retrieval operations still showed performance degradation every 3 minutes.

  3. Monitoring: We checked general cluster monitoring and node-exporter resource metrics. Network traffic is high (~681 Mb/s transmit on bond0), but no resource or internal metric shows a spike that lines up with the 2-3 minute lag interval.

Given the persistence of the lag and the lack of other correlating metrics, we now strongly suspect the NSUP (Namespace Supervisor) process, since its nsup-period is set to 120 seconds (2 minutes).

Our Core Question:

Could the TTL expiration processing, driven by the NSUP process running every 2-3 minutes, be the underlying cause of significant cluster latency/lag, even on a system utilizing the storage-engine device?

Any insights or recommended statistics to monitor during this 2-3 minute window would be greatly appreciated.

Thank you for your time.


To clarify the situation accurately: The user_context data has a 3-minute expiration cycle, whereas the nsup-period is set to 120 seconds.

My understanding is that if the lag were caused by expired data cleanup, the spikes should occur every 2 minutes to align with the NSUP cycle. However, we are currently observing lag on the server exactly every 3 minutes.
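
The alignment argument above can be checked numerically. The following is only a minimal sketch: it takes a list of spike timestamps, computes the median gap between them, and reports which candidate period is closest. The function names and the sample timestamps are hypothetical, made up for illustration, and are not measurements from this cluster.

```python
from statistics import median


def estimate_period(spike_times):
    """Return the median gap (seconds) between consecutive spike timestamps."""
    ordered = sorted(spike_times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two spike timestamps")
    return median(gaps)


def closest_candidate(period, candidates):
    """Return the name of the candidate whose period is nearest to the observed one."""
    return min(candidates, key=lambda name: abs(candidates[name] - period))


# Hypothetical spike times in seconds since the start of observation.
spikes = [12, 193, 371, 552, 733]

# Candidate periods taken from the post: nsup-period 120 s, 3-minute write/TTL cycle.
candidates = {"nsup-period": 120, "user_context cycle": 180}

period = estimate_period(spikes)
print(period, closest_candidate(period, candidates))
```

If the observed gap stays near 180 seconds regardless of the nsup-period value, that points away from NSUP, which is the same reasoning as in the text.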

I increased the nsup-period to 300 and monitored the system, but the lag still persists at approximately 3-minute intervals. Therefore, I have concluded that nsup is not the cause. I would appreciate any further ideas on what else I should check.
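
One possible way to narrow this down is to sample the namespace statistics shortly before and during a lag window and see which counters move. The sketch below assumes the raw text comes from something like `asinfo -v "namespace/aidev"`, which returns semicolon-separated `key=value` pairs. The helper names are hypothetical, and the stat names and values in the sample strings are illustrative only; they should be checked against the metrics reference for the server version in use.

```python
def parse_info(raw):
    """Parse an info response of the form 'k1=v1;k2=v2' into a dict of strings."""
    stats = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key] = value
    return stats


def counter_deltas(before, after):
    """Return {name: delta} for integer stats whose value changed between samples."""
    deltas = {}
    for key, new in after.items():
        old = before.get(key)
        if old is None:
            continue
        try:
            diff = int(new) - int(old)
        except ValueError:
            continue  # skip non-numeric stats such as strings or booleans
        if diff != 0:
            deltas[key] = diff
    return deltas


# Illustrative samples, not real output from this cluster.
quiet = parse_info("expired_objects=100;nsup_cycle_duration=40;storage-engine=device")
laggy = parse_info("expired_objects=5100;nsup_cycle_duration=40;storage-engine=device")

print(counter_deltas(quiet, laggy))
```

Comparing a sample from a quiet period with one taken during a spike, on each node, would show whether expirations, evictions, or some other counter changes in step with the 3-minute lag.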
