How to recover tombstones faster


Background: Tombstones & Tomb-Raider


When a durable delete is issued, a tombstone is written. A tombstone has the following properties:

  • It continues to occupy an entry in the index, alongside the entries of other records.
  • It is persisted on disk.
  • It has the same meta-data as any other record:
    • last-update-time – just like a normal update.
    • generation – increments just like a normal update.
  • It is replicated at the same replication factor specified on the namespace.
  • It is migrated the same way current records are migrated.
  • It is conflict-resolved like any other record.
  • It is written without expiration.


A special background mechanism (“Tomb-Raider”) is used to remove no-longer-needed tombstones. The conditions for a tombstone to be removed are as follows:

  • There are no previous copies of the record on disk.
    • This condition assures that a cold start will not bring back any older copy.
  • The tombstone’s last-update-time is before the current time minus the configured tomb-raider-eligible-age (or, said differently, the tombstone is older than the tomb-raider-eligible-age).
    • This condition prevents a node that has been separated from the cluster for less than tomb-raider-eligible-age seconds from rejoining and re-introducing an older copy after the tombstone is gone.
  • The node is not waiting for any incoming migration.

The tombstone is reclaimed only when all of the above conditions are satisfied.
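
The removal conditions above can be read as a single predicate. The following is a minimal Python sketch of that reasoning, not the server's actual implementation. The `Tombstone` class, its fields and the `is_reclaimable` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tombstone:
    # Hypothetical model of a tombstone, for illustration only.
    last_update_time: float    # seconds since epoch
    older_copies_on_disk: int  # earlier versions of the record still on disk


def is_reclaimable(tombstone, now, eligible_age, awaiting_migrations):
    """Return True only when all three removal conditions hold."""
    # Condition 1: a cold start cannot bring back an older copy.
    no_older_copies = tombstone.older_copies_on_disk == 0
    # Condition 2: the tombstone is older than tomb-raider-eligible-age.
    old_enough = tombstone.last_update_time < now - eligible_age
    # Condition 3: the node is not waiting for any incoming migration.
    return no_older_copies and old_enough and not awaiting_migrations
```

A tombstone that is old enough but still has an earlier version of the record on disk fails the first condition, which is the case the rest of this article addresses.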

Tomb-Raider tuning:

  • tomb-raider-period - The minimum amount of time, in seconds, between runs. The default is 1 day (86400).
  • tomb-raider-eligible-age - The number of seconds to retain a tombstone, even after it has been found safe to remove. The default is 1 day (86400).
  • tomb-raider-sleep (storage-only) - The number of microseconds to sleep between large block reads on disk. The default is 1000 µs (1 ms).
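
As a rough illustration of how these parameters might be changed at runtime, the Python sketch below only builds command strings and does not execute anything. It assumes the usual asinfo dynamic-configuration form `set-config:context=namespace;id=<namespace>;<parameter>=<value>` and assumes that all three parameters can be set dynamically on your server version. Both assumptions should be checked against the configuration reference. The helper names and the namespace `namespace1` are illustrative.

```python
def set_config_command(namespace, param, value):
    # Assumed asinfo syntax for a dynamic namespace-level change;
    # verify against the configuration reference for your server version.
    return f'asinfo -v "set-config:context=namespace;id={namespace};{param}={value}"'


def tomb_raider_commands(namespace, settings):
    """Build one command string per tomb-raider setting, in the given order."""
    return [set_config_command(namespace, p, v) for p, v in settings.items()]


# The default values listed above, shown here only as an example input.
defaults = {
    "tomb-raider-period": 86400,        # seconds
    "tomb-raider-eligible-age": 86400,  # seconds
    "tomb-raider-sleep": 1000,          # microseconds
}

for command in tomb_raider_commands("namespace1", defaults):
    print(command)
```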

For more details on Tombstones & Tomb-Raider please refer to the Durable Deletes Guide.


Regardless of how the tomb-raider is tuned, a tombstone becomes eligible for removal only when no previous copies of the record remain on disk.

To illustrate, consider a scenario where a few million records are durably deleted. The object count on the namespace may decrease immediately, but the memory may not be freed right away. Even if the tomb-raider is made to run faster (by decreasing tomb-raider-sleep, tomb-raider-eligible-age and tomb-raider-period), the tombstones may not be eligible to be marked as cenotaphs, because one or more previous versions of the same records still exist on disk and have not yet been overwritten by incoming writes.

Potential solutions

Two procedures can be considered to speed up the recovery of the memory occupied by tombstones:

Solution 1: Force as many older versions of the tombstoned records to be overwritten

The idea here is to overwrite as many of the older versions of the records as possible so that the tombstones can be removed. To achieve this, the write blocks containing these previous copies have to be overwritten by new writes, which is only possible once those write blocks become free blocks. Free blocks are generated through the defragmentation process. Once the blocks holding older copies of the data are eligible for defragmentation, the defrag thread moves and combines all the active records in these blocks into new blocks, leaving the original blocks ‘free’ and available to be overwritten by new writes / updates coming in from the client.

Please note that, by design, defragmented write blocks are used by new writes (writes/updates) before any new (never before used) blocks on the disk. Incoming writes will therefore occupy these defragmented blocks first, overwriting the previous copies of the deleted records and eventually making the tombstones eligible to be deleted by the tomb-raider.

  1. Increasing the defrag-lwm-pct makes more blocks eligible for defragmentation. This has to be done incrementally to limit the performance impact of the increased defragmentation activity. Gradually increase the defrag-lwm-pct, which will build up the defrag-q. The defrag-q will be consumed at a rate dictated by the defrag-sleep configuration parameter. In the following example, the defrag-q jumps to over 800,000 after the defrag-lwm-pct is increased, with blocks queued at an average of about 40,000 per second over a 20 second interval:
Apr 18 2018 17:16:52 GMT: INFO (drv_ssd): (drv_ssd.c:2143) {namespace1} /dev/nvme1: used-bytes 158041644160 free-wblocks 2101709 write-q 0 write (16289610,302.1) defrag-q 810680 defrag-read (15531912,40701.8) defrag-write (7165227,300.7) tomb-raider-read (1292756,270.8)
Apr 18 2018 17:17:12 GMT: INFO (drv_ssd): (drv_ssd.c:2143) {namespace1} /dev/nvme1: used-bytes 158041861376 free-wblocks 2105305 write-q 0 write (16296266,332.8) defrag-q 800477 defrag-read (15531961,2.5) defrag-write (7171853,331.3) tomb-raider-read (1298110,267.7)
  2. Decreasing the defrag-sleep will speed up the rate at which the defrag-q is consumed but will add more load to the underlying storage device. Any change to the defrag-sleep should be monitored via iostat. If the metrics indicate that the system has reached full capacity, bring the defrag-sleep back to its previous value.

  3. The above steps make sure that the write blocks containing the previous copies of the records can now be overwritten by new writes/updates. To overwrite these defragmented blocks, either rely on the client traffic, or generate write/update traffic by populating a dummy set and then issuing a truncate to delete this dummy set. The benchmark tool could be used to populate such a dummy set, but the regular client traffic may be enough.

  4. Reset the defrag-lwm-pct and defrag-sleep to their original values (defaults are 50% and 1000 µs respectively).

  5. Tune the tomb-raider to scan the disks as quickly as the system can sustain, by reducing tomb-raider-sleep and tomb-raider-period and setting a low tomb-raider-eligible-age (for example 5 minutes).

  6. You can grep for cenotaphs to monitor how the tombstones are being reaped by the tomb-raider:

Apr 18 2018 14:30:49 GMT: INFO (drv_ssd): (drv_ssd_ee.c:1256) {namespace1} ... tomb raider done - removed 11496131 cenotaphs
Apr 18 2018 15:03:09 GMT: INFO (drv_ssd): (drv_ssd_ee.c:1256) {namespace1} ... tomb raider done - removed 20577 cenotaphs
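
To follow the progress of this procedure without reading the log by eye, the two kinds of log lines shown above can be parsed. The following Python sketch is one possible way to do it. The regular expressions are written against the exact line layouts quoted in this article and are an assumption for any other server version. The function names are illustrative.

```python
import re

# Patterns written against the log lines quoted in this article.
DEFRAG = re.compile(
    r"\{(?P<ns>[^}]+)\} (?P<dev>\S+): .*?"
    r"defrag-q (?P<q>\d+) defrag-read \((?P<total>\d+),(?P<rate>[\d.]+)\)"
)
CENOTAPHS = re.compile(
    r"\{(?P<ns>[^}]+)\} .*tomb raider done - removed (?P<n>\d+) cenotaphs"
)


def defrag_stats(line):
    """Return (namespace, device, defrag-q, defrag-read rate) or None."""
    m = DEFRAG.search(line)
    if m is None:
        return None
    return m["ns"], m["dev"], int(m["q"]), float(m["rate"])


def cenotaphs_removed(line):
    """Return the number of cenotaphs removed by one tomb-raider run, or None."""
    m = CENOTAPHS.search(line)
    return int(m["n"]) if m else None


sample = [
    "Apr 18 2018 17:16:52 GMT: INFO (drv_ssd): (drv_ssd.c:2143) {namespace1} "
    "/dev/nvme1: used-bytes 158041644160 free-wblocks 2101709 write-q 0 "
    "write (16289610,302.1) defrag-q 810680 defrag-read (15531912,40701.8) "
    "defrag-write (7165227,300.7) tomb-raider-read (1292756,270.8)",
    "Apr 18 2018 14:30:49 GMT: INFO (drv_ssd): (drv_ssd_ee.c:1256) {namespace1} "
    "... tomb raider done - removed 11496131 cenotaphs",
    "Apr 18 2018 15:03:09 GMT: INFO (drv_ssd): (drv_ssd_ee.c:1256) {namespace1} "
    "... tomb raider done - removed 20577 cenotaphs",
]

print(defrag_stats(sample[0]))
print(sum(n for n in map(cenotaphs_removed, sample) if n is not None))
```

For the first sample line, the defrag-read rate of 40701.8 blocks per second over the 20 second reporting interval gives roughly 814,000 blocks, which is consistent with the defrag-q of 810680.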

Solution 2: Take down a node, wipe out the disks on this node and restart the node.

This way, the incoming migrations from the other nodes bring in the tombstones but not any previous copies of the deleted records, since the wiped disks no longer hold them. Wait for the migrations to complete before taking down the next node. Repeat on all the nodes in a rolling fashion. Once all the nodes have been restarted, wait for the migrations to complete, then tune the tomb-raider to run fast so that it removes all the tombstones.
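
The ordering constraint in this procedure (one node at a time, with migrations complete before moving on) can be sketched as a simple loop. The Python below is only an outline under stated assumptions: `stop_node`, `wipe_disks`, `start_node` and `migrations_done` are hypothetical callables that you would supply for your own environment, and nothing here calls a real Aerospike tool.

```python
import time


def rolling_wipe(nodes, stop_node, wipe_disks, start_node, migrations_done,
                 poll_seconds=10.0, sleep=time.sleep):
    """Wipe and restart each node in turn, waiting for migrations in between.

    All four callables are hypothetical hooks provided by the operator.
    """
    for node in nodes:
        stop_node(node)
        wipe_disks(node)
        start_node(node)
        # Do not touch the next node until migrations have completed,
        # otherwise tombstones could be lost along with the wiped copies.
        while not migrations_done():
            sleep(poll_seconds)
```

Once the loop returns, migrations are complete on the whole cluster and the tomb-raider can be tuned to run fast as described in Solution 1.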