I have gone through a lot of community forum threads, but I have not been able to figure out the issue that hit our Aerospike cluster last night.
We are running a 3-node cluster on GCP: each node has 8 CPU cores, 32 GB RAM, 2 local 350 GB NVMe SSDs, and 2 persistent SSDs used as shadow devices.
Our configuration includes:
- write-block-size: 128 KB
- defrag-lwm-pct: 5
- nsup-period: 120
- nsup-threads: 2
- max-write-cache: 256 MB
- replication-factor: 2
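For context, this is roughly how those settings sit in our aerospike.conf for the ssd namespace (trimmed sketch with placeholder device paths; parameter names as in recent server versions, where nsup-period / nsup-threads live in the namespace context):

```
namespace ssd {
    replication-factor 2
    nsup-period 120
    nsup-threads 2

    storage-engine device {
        # local NVMe SSD + persistent-disk shadow device (placeholder paths)
        device /dev/nvme0n1 /dev/sdb
        device /dev/nvme0n2 /dev/sdc

        write-block-size 128K
        defrag-lwm-pct 5
        max-write-cache 256M
    }
}
```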
We have a write-heavy system: we use Aerospike to create records and then read them, but not to update them, hence defrag-lwm-pct is set to 5.
We have 2 namespaces: mem and ssd.
On the mem namespace we are writing at 20k writes/second and reading/removing at the same rate.
On the ssd namespace the workload is:
- 2k writes/s with an average record size of 12 KB (99th percentile 32 KB); average TTL is 2.5 hours, varying between 1 min and 4 hours.
- 2k writes/s with an average record size of 4 KB (99th percentile 8 KB); average TTL is 3 hours, varying between 30 min and 4 hours 30 min.
- 4.5k reads/s, of which only about 400 req/s hit the two record sets above; the rest of the reads go to other persistent data with TTLs ranging from 7 days to over a month.
We were earlier running with defrag-lwm-pct 50, but we were hitting a lot of device overload errors, so we decreased it to 5 and the errors dropped to zero.
The system ran fine for 2-3 days, but after that we started getting device overload errors again: out of the 3 nodes, 2 had their write-q stuck at the maximum value of 2048.
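If I read the docs correctly, that 2048 ceiling is simply max-write-cache divided by write-block-size, i.e. the number of 128 KB buffers the write cache can hold before writes start failing with device overload:

```python
KiB, MiB = 1024, 1024 * 1024

max_write_cache = 256 * MiB    # max-write-cache
write_block_size = 128 * KiB   # write-block-size

# Buffers the write cache can hold before further writes are refused
print(max_write_cache // write_block_size)   # -> 2048
```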
The device-available-pct was 65, so it didn't seem to be an issue with availability of free write blocks.
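That matches a back-of-the-envelope capacity check too (my own estimate, ignoring defrag behaviour and the longer-lived persistent data): the steady-state footprint of the TTL'd records is roughly write rate x average lifetime, which lands around 30% of the raw capacity.

```python
KiB, GiB = 1024, 1024**3

def steady_state_bytes(writes_per_s, avg_record_kib, avg_ttl_s):
    # Data resident on disk ~= write rate * average record lifetime
    return writes_per_s * avg_record_kib * KiB * avg_ttl_s

footprint = (steady_state_bytes(2000, 12, 2.5 * 3600) +
             steady_state_bytes(2000, 4, 3 * 3600)) * 2   # replication-factor 2

capacity = 6 * 350 * 10**9                                # 6 local SSDs x 350 GB
print(round(footprint / GiB), "GiB,", round(100 * footprint / capacity), "% of raw capacity")
```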
Aerospike logs showed around 80 blocks written/s, i.e. roughly 10 MB/s, which we also verified with iostat and iotop. The write-q on one of the devices was 2048 and on the other it was hovering around 200-300.
If we calculate from the above specs as well:
12 KB * 2k + 4 KB * 2k → 32 MB writes/s
32 MB writes/s * 2 (replication factor) → 64 MB writes/s
64 MB writes/s divided over 6 local SSDs → ~10 MB/s per device
10 MB/s / 128 KB (block size) → ~80 blocks/s written per device
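The same arithmetic in a few lines, in case I am miscounting somewhere (this ignores defrag rewrites, overwrites, and the shadow-device copies, which land on the persistent SSDs):

```python
KiB, MiB = 1024, 1024 * 1024

# Client writes into the ssd namespace
client_bytes_per_s = 12 * KiB * 2000 + 4 * KiB * 2000   # ~31 MiB/s
replicated = client_bytes_per_s * 2                      # replication-factor 2

devices = 3 * 2                                          # 3 nodes x 2 local SSDs
per_device = replicated / devices                        # ~10.4 MiB/s per device

blocks_per_s = per_device / (128 * KiB)                  # ~83 blocks/s per device
print(round(per_device / MiB, 1), round(blocks_per_s))
```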
Out of the 3 nodes:
- Node 1: write-q of 0 on both devices, functioning without any issues
- Node 2: write-q = 2048 (max) on device 1 and write-q ~ 300 on device 2
- Node 3: write-q = 2048 (max) on device 2 and write-q ~ 300 on device 1
System load was under 6 and CPU utilization was around 500% (of the 8 cores).
We checked and there was no configuration difference between the 3 nodes.
This number was not going down, nor was it changing much. So my question here is: how does write-q work?
I am assuming an NVMe device should be able to sustain far more than 10 MB/s of writes. So ideally the device should have drained the queue by writing more than 80 blocks/s to disk, say 120 or more, until it emptied, but write-q stayed at 2048, which means that if 80 blocks/s were getting queued, only about 80 blocks/s were actually being written, keeping the queue size constant.
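My mental model (an assumption on my part, nothing Aerospike-specific) is that write-q is just a bounded producer/consumer queue of 128 KB buffers, so it only shrinks when the drain rate exceeds the enqueue rate; a toy simulation of the steady state we seemed to be stuck in:

```python
def simulate_write_q(enqueue_rate, drain_rate, q_max=2048, seconds=600, q0=2048):
    """Toy model of a per-device write queue: enqueue_rate buffers arrive per
    second, the device drains drain_rate per second, and anything that would
    push the queue past q_max is rejected (what the client sees as overload)."""
    q, rejected = q0, 0
    for _ in range(seconds):
        q = max(0, q - drain_rate)            # device drains some buffers
        accepted = min(enqueue_rate, q_max - q)
        rejected += enqueue_rate - accepted   # no room left in the write cache
        q += accepted
    return q, rejected

# Drain rate == enqueue rate: the queue stays pinned at 2048 and never empties.
print(simulate_write_q(enqueue_rate=80, drain_rate=80))
# Slightly faster drain: the queue empties within a few minutes.
print(simulate_write_q(enqueue_rate=80, drain_rate=90))
```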
We tried different remedies but nothing worked. To increase the available-pct we increased defrag-lwm-pct from 5 back to 50 (I understand that it was a big jump, but we wanted to see the behaviour).
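In hindsight the jump probably made things worse before it could help: defrag-lwm-pct also sets the defrag write amplification, since a block gets rewritten once it drains to the low-water mark, and those rewritten bytes are themselves rewritten later. Treating that as a geometric series (my own rough estimate, assuming our short TTLs let most blocks drain):

```python
def defrag_write_amplification(defrag_lwm_pct):
    """Rough steady-state estimate: each written block eventually drains to
    the low-water mark L and that fraction is rewritten by defrag, then the
    rewrite is rewritten again later: 1 + L + L^2 + ... = 1 / (1 - L)."""
    L = defrag_lwm_pct / 100.0
    return 1.0 / (1.0 - L)

print(defrag_write_amplification(5))    # ~1.05x client writes hit the device
print(defrag_write_amplification(50))   # ~2x client writes hit the device
```

At 50 that is roughly double the ~10 MB/s per device estimated above, which would at least explain why all three nodes started queueing right after the change.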
After doing the above change even Node 1 went into a queued state, and Node 2 and Node 3 also collected a lot more queued buffers: write-q went to around 20-30k on all 3 nodes.
The first node recovered in 15 minutes and its write-q went to 0; the other 2 nodes took 30 minutes to drain, but then came back to 2048 and got stuck there again.
Finally we increased nsup-threads from 2 to 4, and at the same time traffic scaled down from 2k to 2.3k, so we don't know exactly which of the two solved it, but things came back in line.
I am not able to fit this into any equation or work out the maths behind what happened, and I would like to know the best way to understand / debug this kind of issue.