Defrag queue fluctuates (~1.1M) after massive record deletion

Hello,

We are running an Aerospike cluster where we recently observed a large defrag queue (~1.1M write blocks).

This situation appeared after a massive record deletion operation in our application. A large amount of data was removed within a relatively short period of time, which we believe created significant fragmentation in the storage layer.

To help the system recover space, we temporarily reduced defrag-sleep to accelerate defragmentation.

The defrag queue continuously fluctuates (increasing and decreasing) around ~1.1M instead of steadily decreasing.

Aerospike version: 6.4

Deployment: Kubernetes

Nodes: 8

Replication factor: 4

namespace PROFILE {
    replication-factor 4
    memory-size 80G
    default-ttl 0
    nsup-period 600
    nsup-threads 4
    write-commit-level-override master

    storage-engine device {
        device /dev/sdb1
        max-write-cache 256M
    }
}

current defrag params:

defrag-lwm-pct: 50

defrag-sleep: 500

It looks like the defrag process is not converging after the mass deletion: the defrag queue hovers around ~1.1M instead of draining.

What steps can we take to allow the defragmentation to complete and reclaim space?

There isn’t enough data to say concretely what the cause is, but I can make some educated guesses.

Most likely: you’re IO-bottlenecked. What is sdb1 (device type)? Do you know its expected throughput? What does iostat look like, specifically aqu-sz, r_await, w_await, rMB/s, and wMB/s?
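If you don’t already have device-level monitoring, something like `iostat -xm 5 sdb1` will show those columns. A minimal sketch for pulling the relevant fields out of that output — the sample line below is fabricated, and real `iostat` headers vary slightly by sysstat version:

```python
# Parse `iostat -x` extended-statistics output and extract the columns
# relevant to spotting an IO bottleneck. SAMPLE is made up for
# illustration; in practice, feed this the real iostat output.
SAMPLE = """\
Device   r/s   rMB/s  r_await  w/s   wMB/s  w_await  aqu-sz  %util
sdb1     812.0 101.5  4.20     950.0 118.8  35.60    12.40   99.80
"""

def parse_iostat(text):
    lines = [l for l in text.strip().splitlines() if l.strip()]
    header = lines[0].split()
    stats = {}
    for line in lines[1:]:
        fields = line.split()
        dev, vals = fields[0], fields[1:]
        stats[dev] = dict(zip(header[1:], map(float, vals)))
    return stats

stats = parse_iostat(SAMPLE)["sdb1"]
# High w_await and aqu-sz with %util pinned near 100 would suggest the
# device cannot keep up with the combined client + defrag write load.
print(stats["w_await"], stats["aqu-sz"], stats["%util"])
```

A sustained high queue depth (aqu-sz) and write latency (w_await) on sdb1 while the defrag queue hovers would point strongly at the device, not the defrag settings.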

What did you reduce defrag-sleep from? 1000 (the default)?
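For a sense of scale: defrag-sleep is the number of microseconds the defrag thread sleeps after reading each write block, so it roughly caps the per-device defrag rate at 1e6 / defrag-sleep blocks per second. A back-of-envelope estimate, treating that cap as the actual rate and ignoring newly queued blocks:

```python
# Rough upper bound on how long a defrag queue of this size should take
# to drain, given the per-block sleep. Assumes one defrag thread per
# device and ignores new blocks being queued while defrag runs.
queue_len = 1_100_000                          # write blocks in the queue

for defrag_sleep_us in (1000, 500):            # default vs. reduced value
    max_rate = 1_000_000 / defrag_sleep_us     # blocks/sec per device
    drain_min = queue_len / max_rate / 60      # minutes to drain
    print(f"sleep={defrag_sleep_us}us -> <= {max_rate:.0f} blk/s, "
          f"~{drain_min:.0f} min to drain")
```

Even at the default sleep, ~1.1M blocks should drain in well under an hour per device. Since your queue has been hovering far longer than that, blocks must be entering the queue about as fast as defrag processes them — either the device can’t sustain the extra defrag write load, or defrag’s own rewrites at lwm 50% keep producing new eligible blocks.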

There are some other possible causes.