Hello,
We are running an Aerospike cluster where we recently observed a large defrag queue (~1.1M write blocks).
This situation appeared after a massive record deletion operation in our application. A large amount of data was removed within a relatively short period of time, which we believe created significant fragmentation in the storage layer.
To help the system recover space, we temporarily reduced defrag-sleep to accelerate defragmentation.
However, the defrag queue keeps fluctuating (rising and falling) around 1.1M write blocks instead of steadily decreasing.
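To put a number on "fluctuating instead of decreasing", here is a minimal sketch of how the trend could be quantified from periodic readings of the queue depth. It fits a least-squares slope to (seconds, queue depth) pairs. The function name `queue_trend` and the sample values are hypothetical and for illustration only. They are not measurements from our cluster, and the readings themselves would have to be collected separately (for example from the defrag queue figure the server reports per device).

```python
from statistics import mean


def queue_trend(samples):
    """Return the least-squares slope of (seconds, queue_depth) samples.

    The result is in write blocks per second: negative means the queue
    is draining, a value near zero means it is not converging.
    """
    times = [t for t, _ in samples]
    depths = [d for _, d in samples]
    t_bar, d_bar = mean(times), mean(depths)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        raise ValueError("need samples taken at two or more distinct times")
    return sum((t - t_bar) * (d - d_bar) for t, d in samples) / denom


# Hypothetical readings taken one minute apart (illustration only).
samples = [
    (0, 1_100_000),
    (60, 1_104_000),
    (120, 1_097_000),
    (180, 1_101_000),
    (240, 1_099_000),
]

slope = queue_trend(samples)
print(f"{slope:+.1f} write blocks/s")
```

A slope that stays close to zero over a long window, relative to a queue of about 1.1M blocks, is what we mean by the queue not converging.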
Aerospike version: 6.4
Deployment: Kubernetes
Nodes: 8
Replication factor: 4
Namespace configuration:
namespace PROFILE {
    replication-factor 4
    memory-size 80G
    default-ttl 0
    nsup-period 600
    nsup-threads 4
    write-commit-level-override master
    storage-engine device {
        device /dev/sdb1
        max-write-cache 256M
    }
}
Current defrag parameters:
defrag-lwm-pct: 50
defrag-sleep: 500
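As a rough sanity check on these values, the sketch below estimates the best-case drain time for the queue. It rests on several assumptions that are not confirmed by the configuration above: that defrag-sleep is the number of microseconds the defrag thread sleeps after each write block, that one defrag thread serves the single device, that the ~1.1M figure refers to one device, that no new blocks become eligible while draining, and that write-block-size is the 1M default (it is not set explicitly in our namespace). The function names are hypothetical, and the time spent actually reading and rewriting blocks is ignored, so the real rate will be lower.

```python
def max_defrag_rate(defrag_sleep_us):
    """Upper bound on write blocks processed per second by one defrag
    thread, counting only the sleep after each block (I/O time ignored)."""
    return 1_000_000 / defrag_sleep_us


def min_drain_seconds(queue_blocks, defrag_sleep_us):
    """Lower bound on the time to empty the queue, assuming no new blocks
    are added while it drains."""
    return queue_blocks / max_defrag_rate(defrag_sleep_us)


# Assumption: default write-block-size of 1M (not present in the posted config).
WRITE_BLOCK_BYTES = 1024 * 1024

queue_blocks = 1_100_000   # observed defrag queue, in write blocks
defrag_sleep_us = 500      # current defrag-sleep setting

rate = max_defrag_rate(defrag_sleep_us)
drain = min_drain_seconds(queue_blocks, defrag_sleep_us)
queued_gib = queue_blocks * WRITE_BLOCK_BYTES / 2**30

print(f"rate bound:  {rate:.0f} blocks/s")
print(f"drain bound: {drain:.0f} s")
print(f"queued data: {queued_gib:.0f} GiB")
```

Under these assumptions the sleep alone would allow the queue to drain in minutes, so a queue that stays near 1.1M suggests that blocks are being added about as fast as they are processed, or that the real per-block cost is dominated by I/O rather than by defrag-sleep.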
It looks like the defrag process is not converging after the massive record deletion: the defrag queue stays at roughly 1.1M write blocks.
What steps can we take to allow the defragmentation to complete and reclaim space?