How to delete all records in a set / Deleted set or records get restored on restart

Post by arayappan » Thu Oct 24, 2013 9:37 pm

Hi All,

I want to delete all records in a set (table). Is there any command in Aerospike to delete all records in a set, like “DELETE FROM test.test-set”?

Thanks & Regards, Rayappan A

You will find information on how to delete records in a set here.
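For illustration, one way to do this from a client is to scan the set and remove each record it returns. Below is a minimal sketch with the Python client; the host, namespace and set names are assumed from the question, and the linked page may describe other options as well.

```python
# Minimal sketch: remove every record in a set by scanning it and
# deleting each record individually. Host/port are assumed.
import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

scan = client.scan('test', 'test-set')   # namespace and set from the question

def remove_record(record):
    key, meta, bins = record              # the key tuple carries the record digest
    client.remove(key)                     # a plain (non-durable) delete

scan.foreach(remove_record)
client.close()
```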

I followed this exact set of instructions, but as soon as there is a service restart the data and the set get restored. How can I avoid this?

The following thread should provide you with the details for why this is happening:

Aerospike 3.10.0 introduces durable deletes for Aerospike Enterprise. Learn more about how they work here: www.aerospike.com/docs/guide/durable_deletes.html.

So, to clarify, Community Edition does not actually delete? It only marks records as deleted without reclaiming disk space?

Below applies to both CE and EE. If data storage is in RAM, both the data and the Primary Index entry are deleted right away on delete. If data storage is on SSD, the Primary Index (PI) is always in RAM, and by default a delete just removes the Primary Index entry in RAM. The data on SSD is eventually overwritten by a new record. Aerospike does not “erase” data on SSD on delete. Likewise, when you update a record on SSD, it is not modified “in-situ”: the entire record is read into RAM, modified or updated, and written back to a new location on the SSD. The pointer in the PI is updated to point to the new location. Data in the old location is eventually overwritten by a new record.

On EE only - PI is stored in Linux shared memory instead of process RAM, which gives you the ability to fast-restart a node. That aside, on EE you can delete with an available delete policy of durable-delete=true. In this case, Aerospike writes a “tombstone” - an update of the record with generation, digest and last-update-time metadata, TTL = live-forever, but no bin data on the SSD. This tombstone is eventually deleted by the tomb-raider thread when all previous versions of the record present on the SSD have been overwritten AND a configured amount of time (default 1 day) has elapsed AND there are no ongoing migrations in the cluster.
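As a concrete illustration, issuing a durable delete from a client comes down to setting the durable-delete flag on the remove policy. A rough sketch with the Python client (the key and connection details are made up, and this is EE-only per the above):

```python
# Sketch: durable delete (Enterprise Edition, server 3.10+).
# The key below is illustrative.
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

key = ('test', 'test-set', 'user-42')
# Writes a tombstone rather than only expunging the primary index entry.
client.remove(key, policy={'durable_delete': True})
client.close()
```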


Thanks for this explanation, Piyush. Please could you (or someone) further clarify the following:

  • what does “eventually” mean in your first paragraph? How does Aerospike decide when records should be overwritten on disk? Is there any defragmentation process that runs in the background, Cassandra-style, that “compacts” the database periodically, thereby reclaiming space? Does the TombRaider process do anything like this?

  • If I restart a node whose data is replicated on another node, will the other node “inform” the restarted node of deleted data? Or will deleted data on disk from the restarted node find its way back into both nodes?

  • Is it your opinion that this Enterprise-only durable delete feature is a big advantage (conversely - a big disadvantage for Community)? Assuming I don’t restart nodes often, what are the practical implications of this feature disparity? The context is a startup business which is willing medium term to pay for Enterprise, but must be able to put Community into effective production until revenue permits an Enterprise license purchase. Until then, efficiency of the storage system is crucial because we are creating and deleting hundreds, even thousands, of intermediate (that is, calculated) time series every day, and what we absolutely do not want is unbounded disk storage growth, because we may grow well beyond RAM in our use case. In other words, an effective disk compaction/delete re-usage strategy is very important for us, so I’d like to get a feeling for how much one is giving up on this in Community vs Enterprise.

  • Related to the above, since we want to use Aerospike for time series, what guarantees of contiguous storage at the record or bin level can we expect, if any?

“Eventually” means when there is new data to be written and the space can be reclaimed. Aerospike writes in a log-structured fashion, in a block size you decide, with as many records as will fit in a block. A typical block size is 128 KB for SSD, 1 MB max. So the max record size on SSD is 1 MB (data plus some negligible overhead), i.e. in that case you are fitting only one record in a write-block. A block is enqueued for defrag when its in-use size drops below defrag-lwm-pct, and the drive’s defrag threads then coalesce such partially used blocks (partially used due to deletes and updates) by relocating the remaining good data into compactly filled blocks, freeing the old blocks for reuse.

Restarting a node: unless you use EE with durable deletes, a CE delete is non-durable, and in certain cases the implication is exactly as you observe. There are no tombstones in CE, so once a node deletes a record its primary index entry is gone and the other node has no knowledge of the deleted record. A tombstone in EE still has a primary index entry.

Also note, deletion is different from expiration. If you configure a namespace with a default time-to-live (TTL) of one day, all records you add to that namespace will expire after 24 hours. In CE or EE, such naturally expired records cannot get resurrected. It’s only when you delete a record, or reduce its TTL and thereby force-expire it, that you can get into restart corner cases.

All of the above only applies if you are storing data on SSD. There are no write-blocks when storing in RAM, so there is no record size limit and no defrag issue if you configure your namespace to store all your data purely in RAM. BTW, you can have multiple (32 max) namespace definitions in a cluster.

EE vs CE for your use case and financials of startup, best discussed offline.

A record in Aerospike is always stored contiguously (the write-block is your record container). This is the essence of Aerospike - it does not store data in a file system (unless you configure your namespace to use rotational disks for storage); it effectively uses SSDs very much like RAM and is able to deliver RAM-like performance with persistence on SSDs with this “Hybrid Memory Architecture”. The price you pay is the 1 MB max record size limitation on SSDs.

EDIT - Defrag is not periodic; it has been an ongoing background process since 3.3.17.

OK :slight_smile: - I wanted to keep it simple without introducing defrag-sleep!

defrag-sleep [dynamic]
Context: namespace; Subcontext: storage-engine device; Default: 1000; Introduced: 3.3.17
Number of microseconds to sleep after each wblock is defragged.
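Since these defrag parameters are dynamic, they can be changed at runtime with a set-config info command instead of a config-file edit and restart. A rough sketch via the Python client’s info_all (the namespace name and values are illustrative, not recommendations; asinfo accepts the same command strings):

```python
# Sketch: adjust the dynamic defrag parameters at runtime through the info protocol.
# Equivalent asinfo form, for example:
#   asinfo -v "set-config:context=namespace;id=test;defrag-sleep=1000"
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

client.info_all("set-config:context=namespace;id=test;defrag-sleep=1000")   # microseconds to sleep after each wblock defragged
client.info_all("set-config:context=namespace;id=test;defrag-lwm-pct=50")   # threshold for enqueueing a block to defrag
client.close()
```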

Hello Mr. Gupta,

Using Community for development, is there some way to force Aerospike to write changes out to SSD to simulate durable deletes? Something like a commit or flush operation?

It is difficult to write or test logic that expects deleted data to stay deleted when the data isn’t deleted in a predictable manner.

Edit: I should add, your DB is excellent, the C client API is really excellent. Maybe the Community version could be kneecapped on total amount of data stored or such, rather than a fairly useful thing like delete meaning delete?

Flushing the writes to disk wouldn’t actually help. Aerospike would need to write zeros over the previous location of the record on the storage subsystem, and that would not be performant at all.

For straight-up deleted records not to come back when using the Community Edition, one would have to erase the storage upon restart.

The alternative would be to leverage expirations, while making sure the time when a record will expire is never shortened.
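For example, writing records with a TTL and never shortening it afterwards looks roughly like this with the Python client (the key, bin and TTL values are made up for illustration):

```python
# Sketch: rely on expiration instead of deletion. Per the discussion above,
# naturally expired records do not come back after a restart, as long as
# the TTL is never shortened later.
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

key = ('test', 'test-set', 'intermediate-series-2016-09-01')  # illustrative key
client.put(key, {'value': 42.0}, meta={'ttl': 86400})          # expire after 24 hours
client.close()
```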

This article explains the different situations when records are loaded upon a cold restart:

(Of course, the other option would be to use the Enterprise Edition for development. Hopefully something Aerospike will allow with a special license in the near future).

Thanks, Meher!

Erasing the storage would also erase the non-deleted records, it seems to me.

I had assumed a deleted disk record was just a flag bit or byte rather than an overwrite with zeros. But really that would be no less performant than an insert. Well if I were an expert on DB design I’d not be here, eh!

I suppose I will cook up a “deleted” flag to exclude deleted entries…
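Something like this, roughly - flag the record with a bin and have readers skip flagged records (bin and key names made up):

```python
# Rough sketch of the "deleted flag" workaround: instead of relying on a CE
# delete surviving a restart, flag the record and filter it out on read.
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
key = ('test', 'test-set', 'user-42')

# "Delete" by flagging the record.
client.put(key, {'deleted': 1})

# Readers treat flagged records as absent.
_, meta, bins = client.get(key)
if not bins.get('deleted'):
    print(bins)

client.close()
```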

Correct, erasing storage would erase everything, but assuming a replication factor of 2 or more, the data will be repopulated; one would have to wait for migrations to complete prior to taking the next node down.

A non-durable delete is actually simply the removal of the record’s primary index entry - it’s internally called an ‘expunge’. It is fast and releases memory from the primary index immediately. A durable delete is also fast, but doesn’t give the memory back until the tombstone is cleaned up, which is a bit nondeterministic, as it requires the underlying previous versions of the record on disk to have been overwritten.

Yeah, the durable delete would cause some memory or SSD space churn but you get that with any decent DBMS. The issue I ran into is that I expected deletes in the Community version to be committed to disk quicker than they actually are. If I end up with a product that looks marketable I might look at your Enterprise product.

Aerospike should look at kneecapping Community in some other way than disabling durable deletes - having deleted data sometimes come back after a restart surely puts a lot of devs off; it’s something that a lot of people would just react badly to. If you absolutely will not allow Community to do durable deletes, then I suggest you also remove the device storage engine from Community. As it is, people try Community with an SSD backing store and what they get is non-ACID deletes. If I have a backing store, then it’s not unreasonable to expect that a restart preserves changes - otherwise why would I have a backing store?

Well, that quirk aside, I like the product. I really, really like the C client API. Many key/value or NoSQL vendors seem to think only Python, Ruby or Java matter.

Thanks for the feedback. It does make sense to some extent, indeed. I have passed the feedback along internally.