Defrag_q and writes per second

Our test cluster has 3 nodes with SSDs. So far the performance seems good, but it probably needs a bit of tuning. On average we do 40k w/s (prepending to lists and running removeRange operations), which the cluster handles easily after some configuration. During peaks this number can grow much larger, so I am limiting it to ~80k w/s. Once the rate goes above ~70k w/s, defrag_q grows continuously until the peak is over. I have already tried playing around with defrag-sleep, but it didn't help.

[Graph: defrag_q vs. writes per second]

So the question is: am I reaching the limits of the current SSDs? Is there any configuration I can play around with?

Keep in mind that configuring defrag to be less aggressive can result in it not keeping up with the write load and eventually hitting stop-writes due to a lack of available ‘clean’ write blocks. Making it overly aggressive can also hurt your peak performance.
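For a sense of what “more or less aggressive” means in throughput terms, here is a rough sketch (my own approximation, not an official formula) of how defrag-sleep, the per-block sleep of a defrag thread in microseconds, caps how many write blocks defrag can process per second. It ignores the actual read/rewrite time, so it is only an upper bound.

```python
def max_defrag_blocks_per_sec(defrag_sleep_us: int) -> float:
    """Rough upper bound on write blocks a defrag thread can process per second.

    Only accounts for the configured sleep between blocks; real throughput is
    lower because each block also has to be read and its live records rewritten.
    """
    return 1_000_000 / defrag_sleep_us

print(max_defrag_blocks_per_sec(1000))  # 1000 us -> at most ~1000 blocks/s
print(max_defrag_blocks_per_sec(100))   # smaller sleep -> more aggressive defrag
```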

The primary parameter for tuning defrag is defrag-lwm-pct. By default this is 50%, which causes a 2x write amplification. The write amplification caused by this parameter grows non-linearly as you raise it.
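To illustrate the non-linear growth, here is a small sketch of the approximation WA ≈ 1 / (1 − defrag-lwm-pct/100), which is consistent with the 2x figure quoted above (and the 4x figure for 75% discussed below); treat it as an estimate, not an exact server formula.

```python
def defrag_write_amplification(defrag_lwm_pct: float) -> float:
    """Approximate total write amplification for a given defrag-lwm-pct.

    Approximation: each block of client writes causes roughly
    lwm / (100 - lwm) extra block writes from defrag, so the total
    is 1 / (1 - lwm/100).
    """
    return 1.0 / (1.0 - defrag_lwm_pct / 100.0)

for pct in (50, 60, 75, 90):
    print(f"defrag-lwm-pct {pct}: ~{defrag_write_amplification(pct):.1f}x writes")
# defrag-lwm-pct 50: ~2.0x writes
# defrag-lwm-pct 60: ~2.5x writes
# defrag-lwm-pct 75: ~4.0x writes
# defrag-lwm-pct 90: ~10.0x writes
```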

More information about defrag can be found here: Defragmentation

What do you mean by write amplification?

By write-amplification, I am referring to the additional writes required by the defrag process.


Got it, the 2x part confuses me a bit. You don’t mean I get twice the amount of actual writes, right? Is it even possible to estimate the amount of fragmentation just by looking at defrag-lwm-pct?

defrag-lwm-pct 50 means that when a block drops to only 50% utilization (because the other 50% of its records were updated and rewritten to a different block), it will be combined with another such block and rewritten as one new block, freeing up the two 50%-used blocks. That is the defrag process.

So write amplification means that a write block’s worth of incoming updates causes one more block’s worth of extra writes due to defrag (say 50% of the updated records come from one block and 50% from another; those two blocks then have to be rewritten by defrag). Each block’s worth of client writes therefore turns into two blocks’ worth of writes on disk. If you set defrag-lwm-pct to 75, each block’s worth of updates will cause 3 additional blocks’ worth of writes due to defrag, i.e. 4x write amplification. So a defrag-lwm-pct above 50 lets you run the disk at higher utilization, but with higher disk wear (and therefore shorter disk life) due to write amplification. 50 is therefore the recommended value for defrag-lwm-pct.
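A minimal sketch of that block accounting, under the same steady-state assumption (updates spread evenly across existing blocks): per block of client writes, defrag rewrites the live lwm% remaining in the blocks it drains, which works out to lwm/(100 − lwm) extra blocks.

```python
def extra_defrag_blocks(defrag_lwm_pct: float) -> float:
    """Extra blocks written by defrag per one block of client writes.

    Reasoning (steady state): one block of client updates invalidates one
    block's worth of old copies spread across existing blocks. Defrag drains
    those blocks once they fall to lwm% utilization, rewriting the remaining
    live lwm% for every (100 - lwm)% of space it reclaims.
    """
    lwm = defrag_lwm_pct
    return lwm / (100.0 - lwm)

print(extra_defrag_blocks(50))  # 1.0 -> one extra block, 2x total writes
print(extra_defrag_blocks(75))  # 3.0 -> three extra blocks, 4x total writes
```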

The amount of fragmentation depends purely on your expiration and update patterns. If I only created records with the exact same expiration time (say 5 days), at a rate of one block’s worth of records per second, and never updated them, I would fill blocks with records that all expire within a second of each other, and the entire block would become free with essentially zero need for defrag. If instead 50% of the records were updated, half of the original block would become a candidate for defrag.

Regarding expiration: if I filled a block with records where half were live-forever and the other half had a TTL of, say, 1 day, I would end up with 50% of the block eligible for defrag after a day.

So how much of a block ends up needing defrag depends entirely on your read/write/update and TTL patterns.
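As a toy illustration of that point (hypothetical numbers, not an Aerospike API), the fraction of a write block still holding live data is what determines whether it becomes a defrag candidate:

```python
def live_fraction(updated_pct: float, expired_pct: float) -> float:
    """Fraction of the original block still holding the latest live copy."""
    return max(0.0, 1.0 - (updated_pct + expired_pct) / 100.0)

# All records share one TTL and are never updated: block empties all at once.
print(live_fraction(updated_pct=0, expired_pct=100))  # 0.0 -> freed, no defrag needed
# Half the records were updated (rewritten into newer blocks):
print(live_fraction(updated_pct=50, expired_pct=0))   # 0.5 -> eligible at defrag-lwm-pct 50
# Half live-forever, half with a 1-day TTL, checked after a day:
print(live_fraction(updated_pct=0, expired_pct=50))   # 0.5 -> eligible after a day
```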


The write amplification refers to the system at equilibrium, where the amount of data being added equals the amount being removed. At equilibrium with the default defrag configuration, for every large-block write from the write path there is expected to be one large-block write from defrag.
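To tie this back to the original question about SSD limits, a rough back-of-the-envelope check: device write bandwidth at equilibrium is roughly client write rate × average record size × write amplification. The record size below is a placeholder assumption, not a number from this thread.

```python
# Back-of-the-envelope device write bandwidth at equilibrium.
writes_per_sec = 80_000        # peak client writes/s mentioned in the thread
avg_record_bytes = 1_024       # ASSUMPTION: average stored record size
write_amplification = 2.0      # defrag-lwm-pct 50 at equilibrium

device_mb_per_sec = writes_per_sec * avg_record_bytes * write_amplification / 1e6
print(f"~{device_mb_per_sec:.0f} MB/s of device writes across the cluster")
# Replication factor multiplies this further. Compare the per-node share
# against what the SSDs can sustain; if it is close, a growing defrag_q
# during peaks is expected.
```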


@pgupta @kporter That clears a lot up, thanks. I will try playing with defrag-lwm-pct. I will also try compression to take some load off the SSDs (I suspect it should help).
