How to delete all records in set / Deleted set or records get restored on 'restart'

deletion
durable-deletion

#1

Post by arayappan » Thu Oct 24, 2013 9:37 pm

Hi All,

I want to delete all records in a set (table). Is there any command in Aerospike to delete all records in a set, like “DELETE FROM test.test-set”?

Thanks & Regards, Rayappan A


How to delete whole set from namespace?
#2

You will find information on how to delete records in a set here.
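
For reference, one common way to do this from a client is to scan the set and delete each record as the callback returns it. Below is a minimal sketch with the Aerospike Java client, assuming a local node at 127.0.0.1:3000; the namespace and set names are taken from the question above:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.ScanPolicy;

public class DeleteAllInSet {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // assumed local node
        try {
            // Scan every record in test / test-set and delete it by key.
            // A null write policy uses the client defaults (non-durable delete).
            client.scanAll(new ScanPolicy(), "test", "test-set",
                    (key, record) -> client.delete(null, key));
        } finally {
            client.close();
        }
    }
}
```

Note that, as discussed further down this thread, these deletes are not durable; a restart can bring the records back.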


#3

I followed this exact set of instructions, but as soon as there is a service restart the data and the set get restored. How can I avoid this?


#4

The following thread should provide you with the details on why this is happening:


#5

Aerospike 3.10.0 introduces durable deletes for Aerospike Enterprise. Learn more about how they work here: www.aerospike.com/docs/guide/durable_deletes.html.


#6

So, to clarify, community edition does not actually delete? Only marks stuff deleted without reclaiming disk?


#7

The following applies to both CE and EE. If data storage is in RAM, both the data and the primary index entry are deleted right away on delete. If data storage is on SSD, the primary index (PI) is always in RAM, and by default a delete just removes the record's primary index entry from RAM. The data on the SSD is eventually overwritten by a new record; Aerospike does not “erase” data on the SSD on delete. Likewise, when you update a record on SSD, it is not modified “in-situ” - the entire record is read into RAM, modified or updated, and written back to a new location on the SSD. The pointer in the PI is updated to point to the new location, and the data in the old location is eventually overwritten by a new record.

On EE only, the PI is stored in Linux shared memory instead of process RAM, which gives you the ability to fast restart a node. That aside, on EE you can delete with the write policy durable-delete=true. In this case, Aerospike writes a “tombstone” to the SSD - an update of the record carrying the generation, digest and last-update-time metadata, with TTL=live-forever but no bin data. This tombstone is eventually removed by the TombRaider thread once all previous versions of the record present on the SSD have been overwritten AND a configured amount of time (default 1 day) has elapsed AND there are no ongoing migrations in the cluster.
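
For illustration, here is a minimal sketch of requesting a durable delete from the Aerospike Java client (durable deletes are an EE feature, as noted above). The host, namespace, set and key are placeholders:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

public class DurableDeleteExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // placeholder host
        try {
            WritePolicy policy = new WritePolicy();
            policy.durableDelete = true; // EE: write a tombstone instead of only dropping the PI entry
            client.delete(policy, new Key("test", "test-set", "some-key")); // placeholder key
        } finally {
            client.close();
        }
    }
}
```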


#8

Thanks for this explanation, Piyush. Please could you (or someone) further clarify the following:

  • What does “eventually” mean in your first paragraph? How does Aerospike decide when records should be overwritten on disk? Is there any defragmentation process that runs in the background, Cassandra-style, that “compacts” the database periodically, thereby reclaiming space? Does the TombRaider process do anything like this?

  • If I restart a node whose data is replicated on another node, will the other node “inform” the restarted node of deleted data? Or will deleted data on disk from the restarted node find its way back into both nodes?

  • Is it your opinion that this Enterprise-only durable delete feature is a big advantage (conversely, a big disadvantage for Community)? Assuming I don’t restart nodes often, what are the practical implications of this feature disparity? The context is a startup business which is willing, medium term, to pay for Enterprise, but must be able to put Community into effective production until revenue permits an Enterprise license purchase. Until then, efficiency of the storage system is crucial, because we are creating and deleting hundreds, even thousands, of intermediate (that is, calculated) time series every day, and what we absolutely do not want is unbounded disk storage growth, because we may grow well beyond RAM in our use case. In other words, an effective disk compaction/deleted-space reuse strategy is very important for us, so I’d like to get a feeling for how much one is giving up on this in Community vs Enterprise.

  • Related to the above, since we want to use Aerospike for time series, what guarantees of contiguous storage at the record or bin level can we expect, if any?


#9

“Eventually” means when there is new data to be written and that space can be reclaimed. Aerospike writes in a log-structured fashion, in a block size you decide, with as many records as will fit in a block. A typical block size is 128KB for SSD, 1MB max. So the max record size on SSD is 1MB (data plus some negligible overhead), i.e. in that case you are fitting only one record in a write-block. A block is enqueued for defragmentation when its in-use size drops below defrag-lwm-pct; the drive’s defrag threads then coalesce partially used blocks - emptied by deletes and updates - by relocating the remaining good data into compactly filled blocks, freeing the partially used blocks for reuse.
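
As an illustration only (not from the original post), this is roughly where those knobs live in aerospike.conf; the device path and values shown are placeholders/defaults:

```
namespace test {
    replication-factor 2
    memory-size 4G
    storage-engine device {
        device /dev/sdb          # placeholder device
        write-block-size 128K    # block size, and therefore the max record size on this device
        defrag-lwm-pct 50        # blocks below 50% in-use are queued for defrag
    }
}
```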

Restarting a node: unless you use EE with durable deletes, a delete in CE can in certain cases turn out to be non-durable - the implication is as you correctly observe. There are no tombstones in CE, so the other node has no knowledge of a deleted record once it deletes it - the PI entry is gone. A tombstone in EE still has a PI entry.

Also note, deletion is different from expiration. If you configure a namespace with a default time-to-live (ttl) of one day, all records you add to that namespace will expire after 24 hours. In CE or EE, such naturally expired records cannot get resurrected. It’s only when you delete a record, or reduce its TTL and thereby force-expire it, that you can run into these restart corner cases.
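
As a hedged sketch of the per-record side of this (placeholder host, key, bin and values), the Java client exposes the write policy's expiration field, which is how you would set a TTL on write or later reduce it to force-expire a record:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

public class TtlExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // placeholder host
        try {
            Key key = new Key("test", "test-set", "some-key"); // placeholder key

            WritePolicy oneDay = new WritePolicy();
            oneDay.expiration = 86400;              // natural expiry: 24 hours after this write
            client.put(oneDay, key, new Bin("value", 1));

            WritePolicy shortTtl = new WritePolicy();
            shortTtl.expiration = 1;                // reducing the TTL like this force-expires the record
            client.touch(shortTtl, key);
        } finally {
            client.close();
        }
    }
}
```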

All of the above only applies if you are storing data on SSD. There are no write-blocks when storing in RAM, so there is no record size limit and no defrag issue if you configure your namespace to store all your data purely in RAM. BTW, you can have multiple (32 max) namespace definitions in a cluster.

EE vs CE for your use case and financials of startup, best discussed offline.

A record in Aerospike is always stored contiguously (the write-block is your record container). This is the essence of Aerospike - it does not store data in a file system (unless you configure your namespace to use rotational disks for storage). It effectively uses SSDs very much like RAM and is able to deliver RAM-like performance with persistence on SSDs through this “Hybrid Memory Architecture”. The price you pay is the 1MB max record size limitation on SSDs.

EDIT - Defrag is not periodic; it has been a continuously running background process since 3.3.17.


#10

OK, :slight_smile: . I wanted to keep it simple without introducing defrag-sleep!

defrag-sleep [dynamic]
Context: namespace
Subcontext: storage-engine device
Default: 1000
Introduced: 3.3.17
Number of microseconds to sleep after each wblock defragged.