We are looking at a couple of alternate strategies that will preserve:
deletes that were acknowledged before a crash
deletes that were acknowledged before a planned shutdown
There are some complexities that we need to deal with, but it's a key objective to provide a solution to both of the above cases. This should land in a release in the first half of 2016.
Can you please explain why you can't keep all the keys (including removed ones) in memory during server startup? That way the last generation for each key would be available and removed ones would not reappear.
This behaviour could force the server to run out of memory under certain circumstances (e.g. when the number of dead records is much larger than the number of live ones). But this feature could be really helpful for users who don't use delete/TTL so intensively (of course, it should be configurable).
This may tighten the hole but wouldn't close it. Deletes are not written to disk. If you set a short TTL instead, that is written to disk, but new writes and defrag can overwrite free blocks in arbitrary order, so the block containing the shorter TTL and higher generation may already have been overwritten before the block containing the record with a lower generation. At that point the disk no longer has any information about the record ever having been deleted. On a cold start the record would still return (on a warm start it will not).
When running in memory only, deletes work as you expect. When running Aerospike Enterprise with warm start, and with cold-start-empty set, deletes work as you expect (as long as multiple nodes don't need to cold start and you are running replication factor > 1).
If you are in neither of those situations and you need deletes, then the present solution is to implement tombstones from the application. Instead of issuing a delete, replace the record with a value that the application understands as a deleted record, and set the TTL to the original record's TTL plus a few seconds. If the record's TTL is set to -1/0 (never expire), then you will need to set the tombstone's TTL high enough that any other copies have surely been overwritten by the time it expires. Also understand that eviction can expire these records earlier, which, if left unchecked, may return you to the original problem. So you will need to adjust your sizing to account for the tombstones.
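For illustration, here is a minimal sketch of that application-level tombstone using the Aerospike Java client. The "deleted" marker bin and the extra-margin value are assumptions for the example, not anything Aerospike itself defines:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

public class AppTombstone {
    // Replace the record with a marker instead of issuing a delete.
    public static void tombstone(AerospikeClient client, Key key, int originalTtlSeconds) {
        WritePolicy policy = new WritePolicy();
        // REPLACE drops all existing bins so only the marker remains.
        policy.recordExistsAction = RecordExistsAction.REPLACE;
        // Outlive the original record by a small margin so the tombstone
        // expires only after any stale copy on disk would have.
        policy.expiration = originalTtlSeconds + 60;
        client.put(policy, key, new Bin("deleted", 1));
    }
}
```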
Thanks for the explanation, I didn't know that deletes are not written to disk. Sorry for bothering you, but this issue looks very serious to me. And all the workarounds are pretty insecure, except the cold-start-empty one, which at least doubles a node's start-up time.
How would this slow down the node's start-up time? Cold-start-empty doesn't read the disks when starting, meaning that if Aerospike has to perform a cold start (which is the only form of starting in the Community Edition) then it will come up empty. You couldn't hope to start faster ;).
The cold-start-empty option isn't as viable a solution on Community because the Community Edition does not support Fast Restart, which is an Enterprise feature.
Yes, deletes only delete the 64-byte in-memory index entry. It wouldn't help to write a delete to disk unless we also tracked that delete in memory as a tombstone.
Also, why would you say tombstoning is an insecure workaround? Of the workarounds it is the most robust IMO, but it comes with operational hurdles.
Currently the only reliable delete in Aerospike is expiration where the TTLs of all records have never been reduced. If a record expires in this way, it does not come back. Deletes have been a long running hot topic in Aerospike and there is a lot of momentum to resolve this issue, but we refuse to compromise performance, scalability, or ease of operations for deletes. Many of our users have confronted this very problem and many of those eventually found that they could use the expiration system and achieve a better workflow.
Perhaps if you discussed your current architecture requiring deletes someone may be able to point out an alternative?
I mean the node will be fully operational only after migrations complete. And migration takes twice as long to finish as a cold start (with reading data from disk) on my dataset.
Thanks for the information. The tombstone looks like a good solution, but it's affected by the same issues (defrag and the start-up data loading order). In general you just don't know when it is safe to remove the tombstone from the disk (during defrag).
Sorry, I forgot about it because it's not actually a delete. The whole record set can't be reduced this way. But anyway, it's a nice workaround if you have enough memory to hold all the records (including dead ones).
That's a good point. So if I have a constant record TTL and I just update my records, I will never face the issue.
IMO the broken data integrity is the most troublesome consequence of the cold-start (with no cold-start-empty).
In some cases I really need data integrity with deletes, so I'll probably try the cold-start-empty solution. User-handled tombstones are also suitable.
Only if your TTLs are set to never expire, and even then you can set a large TTL on tombstones so that the record is almost certainly gone from the drive by the time it expires. But as you say, the memory cost here is a bit high for data we, by definition, no longer want.
Thought about all this and would like to add further input. I like Vincent's proposal and cannot spot an edge case it wouldn't solve with some adjustments. It could even solve problems beyond the delete feature (e.g. overflow of the generation ID, which is very easy with just 16 bits for it!). Any solution should ensure that deleted or simply updated record versions cannot come back under any circumstances, IMHO; otherwise it's either a "bad DX" or causes data loss because the developer didn't expect/mind it. Even in complex cases like: update rec to gen 40, delete, create, update to gen 15 => a cold restart shouldn't bring any old gen 15-40 back just because their generation is the highest.
Vincent's proposal, a.k.a. the problem of building the index of records in a serialized way, can be reduced to serializing write-blocks within every single partition, which, in theory, should be possible even with migrations and failures. If occasional scans over SSD blocks can be afforded, tombstones can have a temporary life, take no space, and introduce no limit on the number of deletes in the long term.
But an incremented wblock id would be a better fit than timestamps, as it doesn't require synced clocks and needs no coordination other than transferring an integer during migrations.
But whatever concept you choose, anything would be a big step forward towards consistency.
@seeminglee - this feature is under current development. We expect a preview in Fall 2016.
@ManuelSchmidt - Some of the groundwork will be released as part of 3.9. We have moved from generation to time-last-modified as the metadata used to track and decide.
Hi, the feature of adding "tombstone" support (alongside "expunge" use) has been added to Aerospike in 3.10.
Here is a discussion about the feature in blog format:
I have already had discussions with customers who want a free lunch, that is, they want deletes to not take storage space and they want them to persist across reboots. There is no free lunch, and you have to pick the kind of delete you want: the kind that frees storage and memory immediately, or the kind that persists but takes storage.
The best way in Community Edition to avoid zombie records coming back to life is to think upfront about the default ttl for your namespace that works for your business use case. Size your cluster to hold the records for this default lifetime based on your workload. Then never reduce a record's ttl to force it to expire, and never delete it through your client/application. Let records expire naturally through their default-ttl. Simple and clean.
How does that work? What is stored with the record on the persistent store (HDD or SSD) is the future timestamp of the record's expiration. If the system clock is determined to be ahead of that timestamp during cold start, that record will not be brought to life.
This is why setting default-ttl in the namespace to live forever (i.e. 0), and then reducing it from the client to force-expire the record, or deleting the record from the client, exposes this vulnerability.
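As a hedged illustration of the advice above (names and values are made up for the example): a Java client write that passes expiration = 0 simply inherits the namespace default-ttl, so the client never shortens a record's life.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

public class DefaultTtlWrite {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        WritePolicy policy = new WritePolicy();
        policy.expiration = 0;  // 0 = use the namespace default-ttl; never reduce it from the client
        client.put(policy, new Key("test", "users", "user1"), new Bin("name", "Alice"));
        client.close();
    }
}
```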
There is another feature you can exploit to pseudo-delete. From the client, set the record ttl to -2 (minus two); this means update the record without changing its current TTL (available in version 3.10.1). Then you can add a bin that, to you, means this record is stale or should not be used. Something like bin "use" = 0 or 1. Then let the record expire naturally via its original ttl. You will have to test out the concept for yourself, though!
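A minimal sketch of that pseudo-delete with the Java client, assuming server 3.10.1+ and an illustrative bin name "use":

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

public class PseudoDelete {
    public static void markStale(AerospikeClient client, Key key) {
        WritePolicy policy = new WritePolicy();
        policy.expiration = -2;  // -2 = update bins without touching the record's current TTL
        client.put(policy, key, new Bin("use", 0));  // 0 = stale / do not use
    }
}
```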
I have a suggestion for a tombstone-based deletion, which I think works well:
To delete a record:
Update it as follows:
Use ttl=-2 so the TTL is not extended and, if TTL was 0 (never expire), it remains so.
Set all data bins to null.
Add a tombstone value (e.g. "T") to a secondary-indexed, dedicated tombstone bin (e.g. "tombstone").
Delete the record (i.e. remove its reference from the in-memory index; the record is still on disk until defrag). A Java sketch of these steps follows below.
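A minimal sketch of those steps with the Java client; the bin names ("name", "email", "tombstone") are illustrative, not part of the Aerospike API:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.WritePolicy;

public class TombstoneDelete {
    public static void delete(AerospikeClient client, Key key) {
        WritePolicy policy = new WritePolicy();
        policy.expiration = -2;  // keep the record's current TTL (0 stays 0)

        client.put(policy, key,
            Bin.asNull("name"),          // null out the data bins
            Bin.asNull("email"),
            new Bin("tombstone", "T"));  // secondary-indexed tombstone marker

        client.delete(null, key);        // remove the in-memory index entry
    }
}
```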
Analysis
The tombstone always trumps older data due to its more recent generation value, and will never expire before any of its older data.
So, upon server restart, the record is either (1) gone due to defrag/ttl, or (2) reappears only as a tombstone.
Operation Implications
Tombstones for records with ttl=0 will never expire; they need periodic, manual removal. Tombstones that resurface from a restart may live a long time before expiring due to their ttl; they also benefit from periodic, manual removal.
This is easy because we have a secondary index on the tombstone bin.
We simply set up a periodic secondary index query that filters on tombstone == "T", and delete the matching records.
For example, in Java:
Create a statement with a Filter.equal("tombstone", "T").
execute() the statement against a UDF that simply does aerospike:remove(rec)
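Sketched with the Java client, assuming a Lua UDF module and function (tombstone_udf / purge) that you have registered on the cluster yourself; those names and the namespace/set are illustrative:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.Statement;
import com.aerospike.client.task.ExecuteTask;

public class TombstonePurge {
    public static void purge(AerospikeClient client) {
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("users");
        stmt.setFilter(Filter.equal("tombstone", "T"));

        // Server-side apply of a Lua UDF such as:
        //   function purge(rec) aerospike:remove(rec) end
        ExecuteTask task = client.execute(null, stmt, "tombstone_udf", "purge");
        task.waitTillComplete();
    }
}
```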
Performance
Deletes will involve one disk write (tombstoning) and one memory write (index removal). Overall cost is low.
Periodic delete-by-secondary-index is more expensive, but can be done infrequently during off-peak hours.
If tombstones are very prevalent, avoid using a secondary index (which lives in RAM), and do a periodic scan instead.
Any feedback (comments, corrections, insights, etc.) on this suggestion is appreciated.
This wouldn't work; defrag doesn't necessarily reclaim the oldest data first. So your deleted tombstone may be reclaimed before the data it was covering, which would cause a restart to revive the original.
There is a finite window during which a record lives on the server with the tombstone bin set to "T" before your periodic removal runs.
So now your application always has to read the record first, check the tombstone bin for "T" to decide whether it is a tombstone, and only then do whatever record operation it was doing; otherwise it could update a tombstoned record and resurrect it as a good record.