Hello Aerospike Community Team,
We are experiencing a persistent, periodic latency issue in our cluster and are seeking community insights, especially regarding the Namespace Supervisor (NSUP) process.
We observe significant latency/lag spikes occurring consistently every 2-3 minutes, even after taking troubleshooting steps that ruled out our primary write workload.
1. Environment and Configuration
| Category | Specification |
|---|---|
| Hardware | CPU: Xeon Silver 4310, 2.1 GHz, 12 core * 2 / RAM: 128GB (PC4-3200AA-R, 16GB * 8) / SSD: 480GB * 8, RAID 5 |
| Aerospike Version | 8.0.0.2 |
Cluster Configuration (namespace aidev)
```
namespace aidev {
    replication-factor 2
    stop-writes-sys-memory-pct 90
    evict-indexes-memory-pct 85
    nsup-period 120    # 2 minutes
    nsup-threads 2

    storage-engine device {
        evict-used-pct 50
        post-write-cache 256M
        filesize 180G
        file /home1/{userid}/data/namespace/aidev/aerospike-activation-data[1-10]
    }
}
```
2. Data Status and Observed Behavior
| Metric | Value | Note |
|---|---|---|
| Problematic Set | user_context… | Primary set suspected of causing the lag. |
| Object Count | 481 million objects | Total count in the set. |
| Data Size on Disk | 2.00 TB | Total size of the data on SSD/device. |
| Write Pattern | Every 3 minutes | Original write cycle for the user_context set. |
Observed Problem: The cluster experiences noticeable lag during retrieval operations at intervals of approximately 3 minutes.
Total data on the cluster is 2.4 TB.
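For context, here is a rough back-of-envelope estimate of the primary-index scan work NSUP would do each cycle with the configuration above. This assumes NSUP walks the full primary index every `nsup-period`, which is a simplification; the 64-byte primary-index entry size is the documented Aerospike value:

```python
# Rough estimate of NSUP primary-index scan load per cycle.
# Assumption (hedged): NSUP walks the entire primary index each cycle.

objects = 481_000_000   # objects in the user_context set (from the table above)
nsup_period_s = 120     # nsup-period from the namespace config
nsup_threads = 2        # nsup-threads from the namespace config
index_entry_bytes = 64  # size of one Aerospike primary-index entry

records_per_sec = objects / nsup_period_s
per_thread = records_per_sec / nsup_threads
index_bytes = objects * index_entry_bytes

print(f"records scanned/sec: {records_per_sec:,.0f}")  # ~4 million/s
print(f"per nsup thread:     {per_thread:,.0f}")       # ~2 million/s
print(f"index size touched:  {index_bytes / 1e9:.1f} GB")
```

Even if nothing expires, that is roughly 30 GB of index walked every two minutes, which seemed worth quantifying before asking the question below.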
3. Troubleshooting and Current Hypothesis
Our team initially suspected the large dataset or the regular Read/Write operations as the cause of the lag.
- Action Taken: We completely stopped all read/write operations to the user_context set.
- Result: The lag did not disappear. Retrieval operations still showed performance degradation every 3 minutes.
- Monitoring: We checked general cluster monitoring and node-exporter resources (including network traffic, which is high at ~681 Mb/s transmit on bond0), but no other resource or internal metric shows a corresponding spike at the exact 2-3 minute interval.
Given the persistence of the lag and the absence of any other correlating metric, we now strongly suspect the NSUP (Namespace Supervisor) process, since its nsup-period is set to 120 seconds (2 minutes).
Our Core Question:
Could the TTL expiration processing, driven by the NSUP process running every 2-3 minutes, be the underlying cause of significant cluster latency/lag, even on a system utilizing the storage-engine device?
Any insights or recommended statistics to monitor during this 2-3 minute window would be greatly appreciated.
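In case it helps others reproduce the check, these are the NSUP-related namespace statistics we plan to watch during the suspect window (via `asinfo -v 'namespace/aidev'`). The sample line below is a fabricated stand-in for real asinfo output, since we cannot paste production values here; asinfo returns semicolon-separated key=value pairs:

```shell
# During the 2-3 minute window, pull namespace stats from a node, e.g.:
#   asinfo -v 'namespace/aidev' -h <node>
# and watch counters such as expired_objects, evicted_objects, and
# nsup_cycle_duration. The sample below is illustrative only, not real output.
sample='objects=481000000;expired_objects=1200345;evicted_objects=0;nsup_cycle_duration=95'

# Split the key=value pairs and filter for the NSUP-related ones:
echo "$sample" | tr ';' '\n' | grep -E 'nsup|expired|evicted'
```

If `nsup_cycle_duration` approaches the 120-second `nsup-period`, that would suggest NSUP is busy for essentially the whole interval.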
Thank you for your time.