Expired/Deleted data reappears after server is restarted

deletion
durable-deletion

#22

Greetings guys!

Can you please explain why the server can’t keep all the keys in memory (including removed ones) during startup? That way the last generation of each key would be available and removed ones would not reappear.

This behaviour could force the server to run out of memory under certain circumstances (e.g. when the number of dead records is much greater than the number of alive ones). But the feature could be really helpful for users who don’t use delete/TTL so intensively (of course it should be configurable).


#23

This may tighten the hole but wouldn’t solve it. Deletes are not written to disk. If you set a short TTL instead, that is written to disk, but new writes and defrag can overwrite free blocks in arbitrary order, so the block containing the shorter TTL and higher generation may already have been overwritten before the block containing the record with a lower generation. The disk then no longer has any information about the record ever having been deleted. On a cold start the record would still return (on a warm start it will not).
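A toy sketch of the mechanism (hypothetical names, not Aerospike internals): cold start scans whatever blocks survive on disk and keeps the highest generation it sees per key, so once defrag has reclaimed the block holding the newer short-TTL/delete version, the older generation wins and the record returns.

```java
import java.util.*;

// Toy model of index rebuild on cold start (not Aerospike internals).
public class ColdStartDemo {
    record DiskRecord(String key, int generation) {}

    // Rebuild the index by scanning surviving disk blocks in arbitrary
    // order, keeping the highest generation seen for each key.
    static Map<String, Integer> coldStart(List<DiskRecord> survivingBlocks) {
        Map<String, Integer> index = new HashMap<>();
        for (DiskRecord r : survivingBlocks) {
            index.merge(r.key(), r.generation(), Math::max);
        }
        return index;
    }

    public static void main(String[] args) {
        // gen 1: original write; gen 2: the short-TTL "delete" version.
        DiskRecord original = new DiskRecord("user:42", 1);
        DiskRecord shortTtl = new DiskRecord("user:42", 2);

        // Defrag and new writes reclaim free blocks in arbitrary order;
        // here the newer block (gen 2) happens to be overwritten first,
        // so only the original survives on disk.
        List<DiskRecord> surviving = List.of(original);

        Map<String, Integer> index = coldStart(surviving);
        // The disk has no memory of the delete, so the record returns.
        if (!index.containsKey("user:42")) throw new AssertionError();
        System.out.println("cold start sees user:42 at gen " + index.get("user:42"));
    }
}
```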

When running in memory only, deletes work as you expect. When running Aerospike Enterprise with warm start and cold-start-empty set, deletes work as you expect (as long as multiple nodes don’t need to cold start and you are running with replication factor > 1).

If you are in neither of those situations and you need deletes, then the present solution is to implement tombstones in the application. Instead of issuing a delete, replace the record with a value that the application understands as a deleted record, and set the TTL to be the same as the original record’s plus a few seconds. If the record’s TTL is set to -1/0 (never expire), then you will need to set the tombstone’s TTL high enough that any other copies have surely been overwritten by the time it expires. Also understand that eviction can expire these records earlier, which, if left unchecked, may return you to the original problem. So you will need to adjust your sizing to account for the tombstones.
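A minimal in-memory model of that application-level tombstoning scheme (names like `AppTombstones` and the sentinel value are mine, not an Aerospike API): a "delete" overwrites the record with a sentinel whose TTL covers the original’s plus a little slack, and readers treat the sentinel as not-found.

```java
import java.util.*;

// Toy model of application-side tombstones (not an Aerospike API).
public class AppTombstones {
    static final String TOMBSTONE = "__deleted__"; // hypothetical sentinel
    static final long SLACK_MS = 5_000;            // "plus a few seconds"

    record Entry(String value, long expiresAtMs) {}
    final Map<String, Entry> store = new HashMap<>();

    void put(String key, String value, long ttlMs, long nowMs) {
        store.put(key, new Entry(value, nowMs + ttlMs));
    }

    // "Delete" = overwrite with a tombstone whose TTL outlives the original.
    void delete(String key, long nowMs) {
        Entry e = store.get(key);
        long expires = (e == null) ? nowMs + SLACK_MS : e.expiresAtMs() + SLACK_MS;
        store.put(key, new Entry(TOMBSTONE, expires));
    }

    // Readers treat the sentinel as "record not found".
    String get(String key, long nowMs) {
        Entry e = store.get(key);
        if (e == null || e.expiresAtMs() <= nowMs || TOMBSTONE.equals(e.value())) return null;
        return e.value();
    }

    public static void main(String[] args) {
        AppTombstones db = new AppTombstones();
        db.put("user:42", "alice", 60_000, 0);
        db.delete("user:42", 1_000);
        if (db.get("user:42", 2_000) != null) throw new AssertionError();
        System.out.println("user:42 reads as deleted");
    }
}
```

The memory/storage cost the post mentions follows directly: the tombstone entry occupies space until its (longer) TTL passes.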


#25

Thanks for the explanation, I didn’t know that deletes are not written to disk. Sorry for bothering you, but this issue looks very serious to me. And all the workarounds are pretty insecure, except the cold-start-empty one, which at least doubles the node’s start-up time.


#26

How would this slow down the node’s start-up time? Cold-start-empty doesn’t read the disks when starting, meaning that if Aerospike has to perform a cold start (which is the only form of starting in the Community Edition), it will come up empty. You couldn’t hope to start faster ;).

The cold-start-empty option isn’t as viable a solution on Community because the Community Edition does not support Fast Restart, which is an Enterprise feature.

Yes, deletes only delete the 64-byte in-memory index entry. It wouldn’t help to do a write to disk unless we also tracked that delete in memory as a tombstone.

Also, why would you say tombstoning is an insecure workaround? Of the workarounds it is the most robust IMO, but it comes with operational hurdles.

Currently the only reliable delete in Aerospike is expiration, where the TTLs of the records have never been reduced. If a record expires in this way, it does not come back. Deletes have been a long-running hot topic in Aerospike and there is a lot of momentum to resolve this issue, but we refuse to compromise performance, scalability, or ease of operations for deletes. Many of our users have confronted this very problem, and many of those eventually found that they could use the expiration system and achieve a better workflow.

Perhaps if you discussed your current architecture requiring deletes someone may be able to point out an alternative?


#27

I mean the node will be fully operational only after the migration completes. And the migration takes twice as long to finish as a cold start (with reading data from disk) on my dataset.

Thanks for the information. The tombstone looks like a good solution, but it’s affected by the same issues (defrag and the start-up data-loading order). In general you just don’t know when it is safe to remove the tombstone from the disk during defrag.

Sorry, I forgot about it, because it’s not actually a delete. The whole record set can’t be reduced this way. But anyway, it’s a nice workaround if you have enough memory to hold all the records (including dead ones).

That’s a good point. So if I have a constant record TTL and I just update my records, I will never face the issue.

IMO the broken data integrity is the most troublesome consequence of the cold-start (with no cold-start-empty).

In some cases I really need the data integrity with deletes so probably I’ll try to use the cold-start-empty solution. User-handled tombstones are also suitable.


#28

Only if your TTLs are set to never expire, and even then you can set a large TTL on tombstones so that the record is almost certainly gone from the drive when it is expired. But as you say, the memory cost here is a bit high for data that by definition we no longer want.


Persisted Deletes as an Option (AER-1226) (3.10.0)
#29

Thought about all this and would like to add further input. I like Vincent’s proposal and cannot spot an edge case it wouldn’t solve with some adjustments. It could even solve problems beyond the delete feature (e.g. overflow of the generation ID, which is very easy with just 16 bits!). Any solution should ensure that deleted or simply updated record versions cannot come back under any circumstances IMHO; otherwise it’s either bad DX or causes data loss because the developer didn’t expect it. Even in complex cases like: update rec to gen 40, delete, create, update to gen 15 => a cold restart shouldn’t bring any old gen 15-40 back just because their generation is the highest.

Vincent’s proposal, a.k.a. the problem of building the index of records in a serialized way, can be reduced to serializing write blocks within every single partition, which, in theory, should be possible even with migrations and failures. If occasional scans over SSD blocks can be afforded, tombstones can have a temporary life, take no space, and introduce no limit on the number of deletes in the long term. But an incremented wblock ID would be a better fit than timestamps, as it doesn’t require synced clocks and needs no coordination other than transferring an integer during migrations.
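My reading of that idea as a toy sketch (names and structure are mine, not Aerospike code): if write blocks within a partition carry a monotonically increasing wblock ID, cold start can replay them in that order and let the latest entry win, including delete markers, regardless of generation values.

```java
import java.util.*;

// Toy model: rebuild the index by replaying write blocks in wblock-id order.
public class WblockReplay {
    record Entry(String key, int gen, boolean delete) {}
    record WBlock(long wblockId, List<Entry> entries) {}

    static Map<String, Integer> replay(List<WBlock> blocks) {
        List<WBlock> ordered = new ArrayList<>(blocks);
        ordered.sort(Comparator.comparingLong(WBlock::wblockId)); // serialize by wblock id
        Map<String, Integer> index = new HashMap<>();
        for (WBlock b : ordered) {
            for (Entry e : b.entries()) {
                if (e.delete()) index.remove(e.key());   // a later delete wins
                else index.put(e.key(), e.gen());        // a later write wins
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // The complex case from the post: write gen 40, delete, re-create at gen 15.
        List<WBlock> onDisk = List.of(
            new WBlock(7, List.of(new Entry("k", 40, false), new Entry("k", 0, true))),
            new WBlock(9, List.of(new Entry("k", 15, false)))
        );
        Map<String, Integer> idx = replay(onDisk);
        // Replay order, not generation, decides: gen 15 survives, gen 40 does not.
        if (idx.get("k") != 15) throw new AssertionError();
        System.out.println("k at gen " + idx.get("k"));
    }
}
```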

But whatever concept you choose, anything would be a big step forward towards consistency.


#30

Hi @meher and @kporter,

Is there any update about this issue? Thank you!


#31

@seeminglee - this feature is under current development. We expect a preview in Fall 2016.

@ManuelSchmidt - Some of the groundwork will be released as part of 3.9. We have moved from generation to time-last-modified for the metadata to track and decide.

Alvin Richards VP of Product Aerospike, Inc.


#32

Really glad to hear that. Thank you!


#33

Hi, which feature does this refer to?


#35

Aerospike 3.10.0 introduces durable deletes for Aerospike Enterprise. Learn more about how they work here: www.aerospike.com/docs/guide/durable_deletes.html.


#36

Hi, the feature adding “tombstone” support (alongside the existing “expunge” behavior) has been added to Aerospike in 3.10.

Here is a discussion about the feature in blog format:

I have already had discussions with customers who want a free lunch; that is, they want deletes that take no storage space and also persist across reboots. There is no free lunch, and you have to pick the kind of delete you want: the kind that frees storage and memory immediately, or the kind that persists but takes storage.

We welcome feedback about this.


#37

Since durable delete is only available in the Enterprise Edition, is there a reliable way of deleting in the Community Edition of Aerospike?

Thanks. Deepak.


#38

The best way in the Community Edition to avoid zombie records coming back to life is to think upfront about a default TTL for your namespace that works for your business use case. Size your cluster to hold the records for this default lifetime based on your workload. Then never reduce a record’s TTL to force it to expire, and never delete it through your client/application. Let records expire naturally through their default-ttl. Simple and clean.

How does that work? What is stored in the persistent store for the record (HDD or SSD) is the future timestamp of the record’s expiration. If the system clock is determined to be ahead of that timestamp during cold start, that record will not be brought back to life.

This is why setting default-ttl in the namespace to live forever (i.e. 0) and then reducing it from the client to force-expire the record, or deleting the record from the client, exposes this vulnerability.

There is another feature you can exploit to pseudo-delete. From the client, set the record TTL to -2 (minus two); this means update the record without changing its current TTL (available in version 3.10.1). Then you can add a bin that, to you, means this record is stale or not to be used; something like bin “use” = 0 or 1. Then let the record expire naturally via its original TTL. You will have to test out the concept for yourself, though!
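A toy sketch of the cold-start check described above (hypothetical names, not Aerospike internals): each on-disk record carries its absolute expiration timestamp, and records whose timestamp is already in the past are simply not loaded. This is why a never-reduced TTL is the one reliable "delete".

```java
// Toy model of the cold-start expiration check (not Aerospike internals).
public class ColdStartTtlCheck {
    record DiskRecord(String key, long expiresAtEpochSec) {}

    // Load a record only if it has not yet expired; 0 = never expires.
    static boolean loadOnColdStart(DiskRecord r, long nowEpochSec) {
        return r.expiresAtEpochSec() == 0
            || r.expiresAtEpochSec() > nowEpochSec;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        // Record whose natural TTL has passed: it stays dead on cold start.
        if (loadOnColdStart(new DiskRecord("a", now - 10), now)) throw new AssertionError();
        // Stale on-disk copy still carrying a future timestamp: it comes back,
        // which is exactly the vulnerability when a TTL was later reduced.
        if (!loadOnColdStart(new DiskRecord("b", now + 10), now)) throw new AssertionError();
        System.out.println("expired records are skipped at cold start");
    }
}
```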


#39

I have a suggestion for a tombstone-based deletion, which I think works well:

To delete a record:

  1. Update it as follows:
     • Use ttl=-2. Don’t extend the TTL, and, if TTL was 0 (never expire), it remains so.
     • Set all data bins to null.
     • Add a tombstone value (e.g. “T”) to a secondary-indexed, dedicated tombstone bin (e.g. “tombstone”).
  2. Delete the record (i.e. remove its reference from the in-memory index; the record is still on disk until defrag).

Analysis

  1. The tombstone always trumps older data due to its more recent generation value, and will never expire before any of its older data.
  2. So, upon server restart, the record is either (1) gone due to defrag/ttl, or (2) reappears only as a tombstone.

Operation Implications

  1. Tombstones for records with ttl=0 will never expire; they need periodic, manual removal. Tombstones that resurface from a restart may live long before expiring due to their TTL, so they also benefit from periodic, manual removal.
  2. This is easy because we have a secondary index on the tombstone bin.
  3. We simply set up a periodic secondary index query that filters on “tombstone”==“T”, and delete matching records.

For example, in Java:

  1. Create a statement with a Filter.equal("tombstone", "T").
  2. execute() the statement against a UDF that simply does aerospike:remove(rec)

Performance

  1. Deletes will involve one disk write (tombstoning) and one memory write (index removal). Overall cost is low.
  2. Periodic delete-by-secondary-index is more expensive, but can be done infrequently during off-peak hours.
  3. If tombstones are very prevalent, avoid using a secondary index (which lives in RAM), and do a periodic scan instead.

Any feedback (comments, corrections, insights, etc.) on this suggestion is appreciated.


#40

@ronny1204

This wouldn’t work: defrag doesn’t necessarily reclaim the oldest data first. So the block holding your tombstone may be reclaimed before the block holding the data it covers, which would cause a restart to revive the original record.


#41

There is a finite time when a record lives on the server with the Tombstone bin as ‘T’ before your periodic removal runs.

So now your application always has to read the record first, decide whether it is a tombstone by checking the tombstone bin for ‘T’, and only then do whatever record operation it was doing; otherwise it could update a tombstoned record and resurrect it as a good record.


#42

@pgupta yes but that is easy to handle

However, as @kporter points out, this implementation does not work.


#43

Sure, I was just pointing out the additional overhead that will occur. Instead of being able to just update a record, you have to read (first record lock), then update (second record lock), and then worry about whether you are overwriting some other client’s delete in between. To avoid that, you will have to use the check-and-set feature, check the generation, and retry if the generation does not match. It just starts slowing the transactions down… but good thinking and deep diving! Yes, it can be handled, but it costs cycles and therefore latency on the transaction. And you will have to do this for every transaction in your application.