Expired/Deleted data reappears after server is restarted

deletion
durable-deletion

#1

Aerospike 3.3.19 In-Memory + Disk Persistence, single node.

With conf:

  default-ttl 0 # Never expire/evict
  fsync-max-sec 1
  flush-max-ms 30000

In an empty DB, using C client to:

  1. Insert a record via aerospike_key_put().
  2. Update TTL = 10 (expires in 10s) for that record via as_operations_add_touch() and aerospike_key_operate().
  3. asmonitor / info shows 0 records after 10s.
  4. Restart server after 30s.
  5. asmonitor / info shows 1 record.

As you can see, the expired record reappears after the server is restarted.

Is this a bug in the index rebuild from the persistent data file, or is the TTL not written to the data file for persistence?

The same issue affects deleted data, which can also reappear after the server is restarted.


#2

Thanks for posting this question. Here are details around this topic based on the current implementation.

The general mechanism for deletes:

Records are first deleted from the index only. The data on disk is erased later, when the defrag process recombines blocks in order to create new empty blocks. This is done as a performance optimization.

Now when it comes to restarting the Aerospike server on a node, there are 2 modes for restarting a node:

  • In case of a cold start, the index is rebuilt from persistent storage, so deleted data that is still on disk (the part that defrag has not yet processed) will reappear.

  • In case of a fast start (available in Enterprise Edition), the index is in shared memory and will still be there after the restart, so the deleted data will not reappear. (A forced cold start, or a full box reboot, will still make it reappear in this case.)

Details on fast start can be found here: http://www.aerospike.com/docs/operations/manage/aerospike/fast_start

Now coming to your particular example, as you have data in memory, the server will go through a cold start.

  • When updating a record’s ttl to expire it (let’s say change it from 30 days to 10 seconds), the record will expire in 10 seconds and will be deleted from the index when the data is accessed by the client or when the ‘nsup’ thread goes over it to expire it (whichever happens first). Now, if the defrag thread has not reclaimed the blocks containing those records, 2 records are still on disk: the one with the 30 days ttl and the one with the 10 seconds ttl.

  • Upon cold start, those records will be scanned. If the 30 days ttl record is scanned first, it will be re-inserted in the index. When the 10 seconds ttl record is then scanned, its generation will be checked and, as it has a greater generation than the other record, it will overwrite it; but as its ttl has expired, the record will be deleted. Everything is fine in this case.

  • Now if those records were scanned in the reverse order, the 10 seconds ttl record would be thrown out as expired, but then, as the 30 days ttl record is scanned, it will re-appear in the index… with a 30 days ttl.
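The two scan orders described above can be reproduced with a small toy model. This is only a sketch of the behavior as described in this post, not Aerospike's actual code: the `DiskRecord` type, the `cold_start` function, and the use of an absolute expiry time (`void_time`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskRecord:
    key: str
    generation: int
    void_time: float  # assumed: absolute expiry time, in epoch seconds


def cold_start(disk_records, now):
    """Rebuild a toy index by scanning on-disk records in the given order."""
    index = {}
    for rec in disk_records:
        current = index.get(rec.key)
        if current is not None and rec.generation <= current.generation:
            continue  # not newer than the indexed version: throw it out
        if rec.void_time <= now:
            # Expired: remove any older indexed version and skip this record.
            index.pop(rec.key, None)
            continue
        index[rec.key] = rec
    return index


now = 1_000_000.0
# Original write with a 30 days ttl, then a touch that set a 10 seconds ttl.
old = DiskRecord("k", generation=1, void_time=now - 40 + 30 * 86400)
touched = DiskRecord("k", generation=2, void_time=now - 30)

in_order = cold_start([old, touched], now)       # 30 days record scanned first
reverse_order = cold_start([touched, old], now)  # 10 seconds record scanned first

print(in_order)       # empty index: the record stays expired
print(reverse_order)  # the 30 days ttl version is back in the index
```

In this model the only thing that differs between the two runs is the scan order, which is what makes the outcome depend on where the blocks happen to sit on disk.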


#3

I believe that my case is what you described:

Now if those records were scanned in the reverse order, the 10 seconds ttl record would be thrown out as expired, but then, as the 30 days ttl record is scanned, it will re-appear in the index… with a 30 days ttl.

Instead of throwing out the record with the smaller TTL, how about keeping it in the index, so that records with a longer TTL that are scanned later are thrown out during the index rebuild, and the record would finally:

be deleted from index when the data is accessed by the client or when the ‘nsup’ thread goes over it to expire it (which ever happens first).


#4

The reason why I update the record with small TTL is that:

I use that small TTL as a simulated tombstone to mark a record as deleted on disk, expecting it to be filtered out during the index rebuild on cold start.


#5

Tombstoning is a very hard problem to solve correctly. We are looking at the best way to address this issue.


#6

I have a similar problem. I store data in memory and on disk (not flash disk). I use Community Edition 3.2.0. My configuration is here:

namespace my_namespace {
  replication-factor 2
  high-water-memory-pct 90
  high-water-disk-pct 90
  stop-writes-pct 90
  memory-size 3G
  default-ttl 0

  storage-engine device {
    file /opt/aerospike/data/my_namespace.data
    filesize 20G
    data-in-memory true
    defrag-period 120
    defrag-lwm-pct 50
    defrag-max-blocks 4000
    defrag-startup-minimum 10
  }
}

I use the Java aerospike-client to access the Aerospike server. The problem occurs in the following case:

  1. Insert or update data using the Aerospike client.
  2. Delete data using the Aerospike client.
  3. Restart the aerospike service.

For now, we follow the procedure below:

  1. Backup data using our backup tool before restarting aerospike
  2. stop aerospike
  3. delete /opt/aerospike/data/my_namespace.data
  4. start aerospike
  5. restore data using our restoring tool

But, it is very annoying. Is there a better way?


#7

For data in memory persisted on disk, you could check the following new option as of 3.3.21: cold-start-empty

This has to be used carefully, and (assuming you are running a multi-node cluster) you would have to wait for migrations to finish before cold starting another node. (With your configuration, cold starting is the only option anyway.)


#8

I have almost the same problem. I tried the cold-start-empty option, and after migrations finished on all nodes, the deleted data did not reappear, as expected, which is good. I then set “cold-start-empty” back to false (as it is very risky to keep it enabled), but after I restarted one of the nodes, the old deleted data reappeared. Should I keep this option set to true forever?


#9

It depends on your use case. The options would be to either use the cold-start-empty option (which I agree with you is risky) or if this is a rare occurrence, you could erase (dd) your device(s) when restarting each node (and of course again wait for migrations to finish between each node).


#10

The cold-start-empty option does not meet our need:

We need the valid data to be loaded into memory from the persistence file after a restart, with the Expired / Deleted data removed.

In other words, the data in memory shall be the same before and after a restart.


#11

There are unfortunately no other options at this point. This is a very hard problem and we will continue to explore ways to address this, as mentioned earlier in this topic.

I also want to point out (just in case it is useful for your use case), Aerospike Enterprise Edition includes a Fast Start feature which keeps the index in shared memory across graceful restarts of nodes.


#12

This part I was curious about. Do you mean that the record with a 10 second ttl is scanned, then if it takes longer than 10 seconds to scan the second record then the first will already have been evicted?

Would the solution here be to just make sure the ttl on any record you want permanently removed is longer than the time it takes to rescan the entire index from disk?

So if that particular data set took 60 seconds to scan, a TTL of 120 seconds on the record you want removed would be enough: the latest record would be scanned with its 120 seconds ttl, and when the 30 day ttl record is scanned afterwards, it would be thrown out because it has an older generation, so the 120 second record would remain.

I guess it depends if the TTL is from when the record is loaded or based on a timestamp in which case the 120 seconds would be expired immediately no matter what.

We’ve resorted to setting a flag on records to mark them as deleted so we can go through and clean up records after a restart. Every now and again we stop instances, wipe the drives, and restart to force expired records to go away.

Since one of the ways we use Aerospike is as a queue, there are millions of deleted records that reappear if we restart any of the nodes in the cluster. It feels a bit misleading to have a ‘remove’ capability that doesn’t actually remove. It’s more like “mark as safe to remove” than anything.

This thread is a bit old so I’m hoping there’s been some progress or better strategies to ensure data is permanently removed.


#13

Sorry for not being clearer in the explanation. This has nothing to do with the ttl value itself. Let me try explaining differently.

  • Data on disk only gets ‘erased’ when written over.
  • At any point of time, if records have been updated multiple times and are still on the disk, you will have multiple versions of the same record on disk, with different ‘generation’ (version ID).
  • Upon cold start, all records on disk are scanned. As they are scanned, they are inserted in the index.
  • If a different version of the same record is scanned at a later time, its generation will be checked against the record that is already in the index and will be thrown out if older generation or will overwrite the record if newer generation.

Here is where this breaks down:

  • If a record that has already expired is scanned (it doesn’t matter whether the ttl was initially 10 seconds or 10 months), its generation is checked against any version of the record already in the index. If the index holds an older generation, that record in the index is deleted, and the expired record from disk is skipped.

  • Now, as scanning continues, if yet another version of this same record is found on the disk, there is nothing to be checked against anymore and if this record had not expired, it will find its way back into the index.

  • Yes, if we know that we have a lot of extra memory around, one way of solving this particular issue would be to keep track of all records scanned to make sure this doesn’t happen. But, this would not cover the case for a record deleted and then later on reinserted.
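The "keep track of all records scanned" idea from the last bullet can be sketched as follows. This is a toy model under stated assumptions, not Aerospike's implementation: `cold_start_tracked`, `DiskRecord`, and the `highest_seen` map are hypothetical names, and the map stands for the extra memory mentioned above.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class DiskRecord:
    key: str
    generation: int
    expired: bool


def cold_start_tracked(disk_records):
    """Toy cold start that remembers the highest generation scanned per key."""
    index = {}
    highest_seen = {}  # key -> highest generation scanned so far (extra memory)
    for rec in disk_records:
        if rec.generation < highest_seen.get(rec.key, -1):
            continue  # a newer version was already scanned, even if it expired
        highest_seen[rec.key] = rec.generation
        if rec.expired:
            index.pop(rec.key, None)
        else:
            index[rec.key] = rec
    return index


versions = [
    DiskRecord("k", 0, expired=False),
    DiskRecord("k", 1, expired=False),
    DiskRecord("k", 2, expired=True),  # latest version, already expired
]
results = [cold_start_tracked(order) for order in permutations(versions)]
print(results)  # every scan order leaves the index empty
```

The cost is one tracking entry per key ever seen on disk, including keys that end up with no index entry. As noted above, this still would not cover a record that was deleted and later reinserted.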

Hope this helps understand the problem.

We are definitely looking at addressing this; it is just complex to figure out the best way to do it with minimal impact on performance and in a way that would accommodate all use cases.


#14

Hello Hanson! Can you please share your C sample code for setting data in Aerospike? I am not able to set a String value in Aerospike using variables. With Integer I am able to do that.

Please help.

Thanks Asif


#15

Sorry to beat a dead horse here but I want to make sure I’m correct in understanding the ramifications of removing a record then.

If I have a record with the following history:

  • gen 0 - created
  • gen 1 - updated

Then no matter what after a restart I’ll always end up with gen 1. Now, if I have the following:

  • gen 0 - created
  • gen 1 - updated
  • gen 2 - remove or add a ttl

Now if I restart I could end up with gen 0 OR gen 1 depending on the order the records are scanned. Really, I could end up with any generation in the history of the object if I add a ttl or remove it. If the scan order is 1,2,0 we end up with gen 0; if the order is 0,2,1 we end up with gen 1.
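To double-check the orders above, here is a small enumeration of every scan order for the "add a ttl" case. It is a toy model of the cold start rules described earlier in this thread, with a hypothetical `cold_start` helper; gen 2 is assumed to be the expired version.

```python
from itertools import permutations


def cold_start(scan_order, expired_gens):
    """Return the generation left in the toy index, or None if nothing is left."""
    indexed = None
    for gen in scan_order:
        if indexed is not None and gen <= indexed:
            continue  # older than the indexed version: thrown out
        if gen in expired_gens:
            indexed = None  # expired: delete the index entry and skip the record
        else:
            indexed = gen
    return indexed


outcomes = {
    order: cold_start(order, expired_gens={2})
    for order in permutations([0, 1, 2])
}
for order, survivor in sorted(outcomes.items()):
    print(order, "->", survivor)
```

In this model, only the orders where gen 2 is scanned last leave the record gone; the other orders bring back gen 0 or gen 1.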

It seems the only way to ensure you have the latest generation of an object is to simply never remove it, since removing it might result in a restart giving you a random generation of the record from anywhere in its lifecycle.

Seems there should be some strong warnings around calling remove since it can result in pretty unpredictable behavior on a restart.


#16

You are absolutely correct.


#17

Thanks. All clear now.

For what it’s worth, in my opinion, I don’t think there’s a use case where unpredictable/inconsistent behavior is preferable to consistency, whether the consistent result is that deletes are always ignored on scan, or that ttls are ignored, or that they are enforced regardless of whether the record was re-inserted, or something else.

Even in a situation where you might have re-inserted the record, this behavior wouldn’t be great, since it may result in the record being deleted on restart, or in getting the generation from before the delete, or the one from after the reinsert. So the inconsistency makes the edge case you reference not really ok either.

I’d personally prefer any static rule, so that I can say the server will be in a predictable state after a restart, even if that predictable state is that deletes are always ignored and I have the latest generation of records regardless of any ttl or delete. Then at least it’s easier to come up with a consistent solution to consistent behavior.

I believe this is an Aerospike bug, but I understand you guys don’t have the same perspective. I think non-deterministic/unpredictable behavior in the way data is loaded from disk in any data store should be considered a bug.

Anyway we can work around this now that we are aware of the exact behavior and we will just never remove a record except when we’re prepared to zero disks to avoid inconsistent behavior.


#18

Thanks for the feedback. Operationally, the cold-start-empty configuration can help in some situations.

To be clear, I do agree with you that this is an issue (whether we call it a bug or a big miss on the spec/feature) and it is on a prioritized list of things to be addressed.


#19

It seems to be all about the scan order while loading. As write blocks are queued and flushed to SSD, I think we could set a timestamp on each write block and then scan the write blocks in timestamp order during cold start. That way, the deleted record would be deleted at the end, and a reinsert after a delete would stay. I hope this helps. It would be great to see Aerospike handle deleted records correctly on restart.
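A minimal sketch of this proposal, as a toy model only: `replay_by_block_time` is a hypothetical name, each block is a `(timestamp, records)` pair, and each record is a `(key, value, expired)` tuple. It assumes a block's timestamp reflects when every record in it was written.

```python
def replay_by_block_time(blocks):
    """Rebuild a toy index by replaying write blocks in timestamp order."""
    index = {}
    for _, records in sorted(blocks, key=lambda block: block[0]):
        for key, value, expired in records:
            if expired:
                index.pop(key, None)  # a later expired version removes the entry
            else:
                index[key] = value    # a later live version wins
    return index


# Blocks are listed out of order on purpose; the sort restores write order.
touched_then_expired = [
    (2, [("k", "v1-touched", True)]),  # touch with a short ttl, now expired
    (1, [("k", "v1", False)]),         # original write
]
reinserted_later = touched_then_expired + [(3, [("k", "v2", False)])]

print(replay_by_block_time(touched_then_expired))  # {}
print(replay_by_block_time(reinserted_later))      # {'k': 'v2'}
```

The result depends entirely on the stated assumption about block timestamps; if a block can hold records written at different times, the ordering guarantee no longer holds in this model.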


#20

Thanks for the input. I know the core team is brainstorming ideas. I am not sure your idea will fully work, though, as records get mixed up in write blocks during the defragmentation process…