Expired/Deleted data reappears after server is restarted

Aerospike 3.3.19 In-Memory + Disk Persistence, single node.

With conf:

  default-ttl 0        # Never expire/evict
  fsync-max-sec 1
  flush-max-ms 30000

In an empty DB, using the C client to:

  1. Insert a record via aerospike_key_put().
  2. Update the TTL to 10 seconds (so it expires in 10s) for that record via as_operations_add_touch() and aerospike_key_operate() (sketched below).
  3. asmonitor / info shows 0 records after 10s.
  4. Restart the server after 30s.
  5. asmonitor / info shows 1 record.

You can see that the expired record reappears after the server is restarted.
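For reference, a minimal sketch of steps 1 and 2 with the C client might look like this (host, namespace, set, key, and bin names are placeholders; error handling is trimmed):

/* Sketch only: put a record, then touch it with a 10-second TTL. */
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_key.h>
#include <aerospike/as_operations.h>
#include <aerospike/as_record.h>

int main(void) {
    as_config cfg;
    as_config_init(&cfg);
    as_config_add_host(&cfg, "127.0.0.1", 3000);    /* placeholder host */

    aerospike as;
    aerospike_init(&as, &cfg);
    as_error err;
    aerospike_connect(&as, &err);

    as_key key;
    as_key_init_str(&key, "test", "demo", "key1");  /* placeholder ns/set/key */

    /* Step 1: insert the record. */
    as_record rec;
    as_record_inita(&rec, 1);
    as_record_set_int64(&rec, "bin1", 42);
    aerospike_key_put(&as, &err, NULL, &key, &rec);

    /* Step 2: touch the record with TTL = 10 seconds. */
    as_operations ops;
    as_operations_inita(&ops, 1);
    ops.ttl = 10;                                   /* TTL applied by the touch */
    as_operations_add_touch(&ops);
    aerospike_key_operate(&as, &err, NULL, &key, &ops, NULL);

    as_operations_destroy(&ops);
    aerospike_close(&as, &err);
    aerospike_destroy(&as);
    return 0;
}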

Is this a bug in the index rebuild from the persistent data file, or is the TTL not written to the data file for persistence?

The same issue applies to deleted data, which can reappear after the server is restarted.

4 Likes

Thanks for posting this question. Here are details around this topic based on the current implementation.

The general mechanism for deletes:

Records are deleted from the index first. The data on disk is erased later, when the defrag process recombines partially used blocks to create new empty blocks. This is done as a performance optimization.
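For completeness, a client-side delete with the C client looks like the fragment below (it assumes an existing, connected aerospike handle named as, as in the sketch earlier in this thread). Per the explanation above, the call removes the index entry right away, while the copy on disk is only reclaimed later by defrag:

/* Sketch only: delete a record; 'as' is an existing connected client. */
as_key key;
as_key_init_str(&key, "test", "demo", "key1");   /* placeholder ns/set/key */

as_error err;
if (aerospike_key_remove(&as, &err, NULL, &key) != AEROSPIKE_OK) {
    fprintf(stderr, "delete failed: %s\n", err.message);
}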

Now, when it comes to restarting the Aerospike server on a node, there are two restart modes:

  • In the case of a cold start, the index is rebuilt from persistent storage, and hence data still on disk will reappear (for the portion that defrag has not yet processed).

  • In the case of a fast start (available in the Enterprise Edition), the index lives in shared memory and survives the restart, so the deleted data will not reappear. (A forced cold start, or a full box reboot, will make it reappear in this case as well.)

Details on fast start can be found here: http://www.aerospike.com/docs/operations/manage/aerospike/fast_start

Now, coming to your particular example: as you have data in memory, the server will go through a cold start.

  • When updating a record's ttl to expire it (let's say changing it from 30 days to 10 seconds), the record will expire in 10 seconds and will be deleted from the index when the data is accessed by the client or when the 'nsup' thread goes over it to expire it (whichever happens first). Now, if the defrag thread has not reclaimed the blocks containing those records, 2 records are still on disk: the one with the 30-day ttl and the one with the 10-second ttl.

  • Upon cold start, those records will be scanned. If the 30-day ttl record is scanned first, it will be re-inserted into the index. When the 10-second ttl record is then scanned, its generation will be checked and, as it has a greater generation than the other record, it will overwrite it; but as its ttl has expired, the record will be deleted. Everything is fine in this case.

  • Now, if those records were scanned in the reverse order, the 10-second ttl record would be thrown out as expired, but then, as the 30-day ttl record is scanned, it will reappear in the index… with a 30-day ttl (both orders are illustrated in the sketch below).
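To make the order dependence concrete, here is a toy model of the rebuild rule described in these bullets. It is purely illustrative, not Aerospike source code: copy A stands for the 30-day, generation-1 record and copy B for the already-expired 10-second, generation-2 record.

/* Toy model of the cold-start scan described above; illustrative only. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { const char *name; int gen; bool expired; } copy_t;

static void cold_start(copy_t first, copy_t second) {
    const copy_t *index = NULL;                      /* what the rebuilt index holds */

    if (!first.expired)
        index = &first;                              /* expired copies are skipped   */
    if (second.gen > (index ? index->gen : -1))
        index = second.expired ? NULL : &second;     /* newer gen overwrites/deletes */

    printf("scan %s then %s -> %s\n", first.name, second.name,
           index ? index->name : "record gone");
}

int main(void) {
    copy_t a = { "30-day copy (gen 1)",         1, false };
    copy_t b = { "expired 10-sec copy (gen 2)", 2, true  };
    cold_start(a, b);  /* fine: the expired copy, scanned last, deletes the record */
    cold_start(b, a);  /* broken: the 30-day copy resurfaces with its original ttl */
    return 0;
}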

2 Likes

I believe that my case is what you described:

Now, if those records were scanned in the reverse order, the 10-second ttl record would be thrown out as expired, but then, as the 30-day ttl record is scanned, it will reappear in the index… with a 30-day ttl.

Instead of the record with the smaller TTL being “thrown out”, how about keeping it in the index, so that the later records with a longer TTL are thrown out during index rebuilding, and finally it will:

be deleted from the index when the data is accessed by the client or when the 'nsup' thread goes over it to expire it (whichever happens first).

The reason I update the record with a small TTL is that:

I use that small TTL as a simulated tombstone to mark a record as deleted on disk, expecting it to be filtered out during index rebuilding on cold start.

1 Like

Tombstoning is a very hard problem to solve correctly. We are looking at the best way to address this issue.

I have a similar problem. I store data in memory and on disk (not a flash disk). I use Community Edition 3.2.0. My configuration is here:

namespace my_namespace {
  replication-factor 2
  high-water-memory-pct 90
  high-water-disk-pct 90
  stop-writes-pct 90
  memory-size 3G
  default-ttl 0

  storage-engine device {
    file /opt/aerospike/data/my_namespace.data
    filesize 20G
    data-in-memory true
    defrag-period 120
    defrag-lwm-pct 50
    defrag-max-blocks 4000
    defrag-startup-minimum 10
  }
}

I use the Java aerospike-client to access the Aerospike server. This problem occurs in the case below.

  1. Insert or update data using the Aerospike client.
  2. Delete data using the Aerospike client.
  3. Restart the Aerospike service.

Tentatively, we follow the procedure below:

  1. Back up data using our backup tool before restarting Aerospike.
  2. Stop Aerospike.
  3. Delete /opt/aerospike/data/my_namespace.data
  4. Start Aerospike.
  5. Restore data using our restore tool.

But it is very annoying. Is there a better way?

For data in memory persisted on disk, you could check the following new option as of 3.3.21: cold-start-empty

This has to be used carefully, and (assuming you are running a multi-node cluster) you would have to wait for migrations to finish before cold starting another node. (With your configuration, cold starting is the only option anyway.)
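For reference, a sketch of where this option would sit, reusing the namespace from the configuration posted above (please verify the exact context and behavior against the documentation for your server version):

namespace my_namespace {
  ...
  storage-engine device {
    file /opt/aerospike/data/my_namespace.data
    filesize 20G
    data-in-memory true
    cold-start-empty true   # ignore existing records on this file at cold start
  }
}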

I have almost the same problem. I tried using the cold-start-empty option, and after the migrations finished on all nodes, the deleted data did not reappear, as expected, which is good. Then I reset the “cold-start-empty” option to false again (as it is very risky to keep it activated), but after I restarted one of the nodes, the old deleted data reappeared again. Should I keep this option set to true forever?

1 Like

It depends on your use case. The options would be either to use the cold-start-empty option (which I agree with you is risky) or, if this is a rare occurrence, to erase (dd) your device(s) when restarting each node (and of course again wait for migrations to finish between nodes).

The cold-start-empty option does not meet our needs:

We need the valid data to be loaded into memory from the persistence file after a restart, with the expired/deleted data removed.

In other words, the data in memory should be the same before and after the restart.

3 Likes

There are unfortunately no other options at this point. This is a very hard problem and we will continue to explore ways to address this, as mentioned earlier in this topic.

I also want to point out (just in case it is useful for your use case) that Aerospike Enterprise Edition includes a Fast Start feature, which keeps the index in shared memory across graceful restarts of nodes.

1 Like

This is the part I was curious about. Do you mean that the record with a 10-second ttl is scanned, and then, if it takes longer than 10 seconds to reach the second record, the first will already have been evicted?

Would the solution here be to just make sure the ttl on any record you want permanently removed is longer than the time to rescan the entire index from disk?

So if that particular data set took 60 seconds to scan, then a 120-second TTL on the record you want removed would be enough: the latest record would be scanned with the 120-second ttl, then the 30-day ttl record would be scanned, and since it has an older generation, the 120-second record would remain.

I guess it depends on whether the TTL counts from when the record is loaded or is based on a timestamp, in which case the 120 seconds would have expired immediately no matter what.

We've resorted to setting a flag on records to mark them as deleted so we can go through and clean up records after a restart, and every now and again stopping instances, wiping the drives, and restarting to force expired records to go away.
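For illustration, such a soft-delete flag written with the C client might look like the fragment below (the "deleted" bin name is hypothetical, and 'as' / 'key' are an existing connected client handle and key, not code from the poster):

/* Sketch only: mark the record as logically deleted instead of removing it. */
as_record rec;
as_record_inita(&rec, 1);
as_record_set_int64(&rec, "deleted", 1);   /* hypothetical tombstone bin */

as_error err;
aerospike_key_put(&as, &err, NULL, &key, &rec);

/* Readers treat any record whose "deleted" bin is 1 as gone; a cleanup pass
 * can later call aerospike_key_remove() on them (and again after restarts). */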

Since one of the ways we use Aerospike is as a queue, there are millions of deleted records that reappear if we restart any of the nodes in the cluster. It feels a bit misleading to have a 'remove' capability that doesn't actually remove. It's more like “mark as safe to remove” than anything.

This thread is a bit old, so I'm hoping there's been some progress or better strategies to ensure data is permanently removed.

Sorry for not being clearer in the explanation. This has nothing to do with the ttl value itself. Let me try explaining differently.

  • Data on disk only gets 'erased' when written over.
  • At any point in time, if records have been updated multiple times and the older copies are still on disk, you will have multiple versions of the same record on disk, with different 'generations' (version IDs).
  • Upon cold start, all records on disk are scanned. As they are scanned, they are inserted into the index.
  • If a different version of the same record is scanned later, its generation will be checked against the record that is already in the index; it will be thrown out if it has an older generation, or it will overwrite the record in the index if it has a newer generation.

Here is where this breaks down:

  • If a record has already expired (it doesn't matter whether the ttl was initially 10 seconds or 10 months), when it is scanned its generation will be checked against any record already in the index; if one is found in the index with an older generation, that record in the index will be deleted, and the expired record from disk will be skipped.

  • Now, as scanning continues, if yet another version of this same record is found on disk, there is nothing to check it against anymore, and if this version had not expired, it will find its way back into the index.

  • Yes, if we know that we have a lot of extra memory around, one way of solving this particular issue would be to keep track of all records scanned to make sure this doesn't happen. But this would not cover the case of a record being deleted and then later reinserted.

I hope this helps in understanding the problem.

We are definitely looking at addressing this; it is just complex to figure out the best way to do it with minimal impact on performance and in a way that accommodates all use cases.

Hello Hanson! Can you please share your C sample code for setting data in Aerospike? I am not able to set a String-type bin in Aerospike using variables; with Integer I am able to do that.

Please help.

Thanks, Asif

Sorry to beat a dead horse here, but I want to make sure I'm correct in understanding the ramifications of removing a record, then.

If I have a record with the following history:

  • gen 0 - created
  • gen 1 - updated

Then, no matter what, after a restart I'll always end up with gen 1. Now, if I have the following:

  • gen 0 - created
  • gen 1 - updated
  • gen 2 - remove or add a ttl

now if I restart, I could end up with gen 0 OR gen 1, depending on the order the records are scanned. Really, I could end up with any generation in the history of the object if I add a ttl or remove it. If the scan order is 1,2,0 we end up with gen 0; if the order is 0,2,1 we end up with gen 1.

It seems the only way to ensure you have the latest generation of an object is to simply never remove it, since removing it might result in a restart giving you a random generation of the record from anywhere in its lifecycle (the sketch below walks through every scan order).
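To back this up, here is a small self-contained toy program (illustrative only, not Aerospike code) that replays the three versions above, gen 0 and gen 1 live and gen 2 being the expiry/remove, through every scan order under the rebuild rule described earlier in the thread:

/* Toy model of the cold-start rebuild rule from this thread; illustrative only. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int gen; bool expired; } version_t;

/* Apply the rule from the thread to one scan order; return the surviving
 * generation, or -1 if the record ends up absent from the index. */
static int rebuild(const version_t *order, int n) {
    int idx_gen = -1;                              /* -1 = not in the index        */
    for (int i = 0; i < n; i++) {
        version_t v = order[i];
        if (idx_gen == -1) {
            if (!v.expired) idx_gen = v.gen;       /* expired versions are skipped */
        } else if (v.gen > idx_gen) {
            idx_gen = v.expired ? -1 : v.gen;      /* newer gen overwrites/deletes */
        }
    }
    return idx_gen;
}

int main(void) {
    version_t v[3] = { {0, false}, {1, false}, {2, true} };
    int perms[6][3] = { {0,1,2}, {0,2,1}, {1,0,2}, {1,2,0}, {2,0,1}, {2,1,0} };

    for (int p = 0; p < 6; p++) {
        version_t order[3];
        for (int i = 0; i < 3; i++) order[i] = v[perms[p][i]];
        int survivor = rebuild(order, 3);
        printf("scan order %d,%d,%d -> ", perms[p][0], perms[p][1], perms[p][2]);
        if (survivor == -1) printf("record stays deleted\n");
        else                printf("gen %d survives\n", survivor);
    }
    return 0;
}

Depending on the scan order, the record ends up at gen 0, at gen 1, or actually deleted, which matches the manual traces above.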

It seems there should be some strong warnings around calling remove, since it can result in pretty unpredictable behavior on a restart.

You are absolutely correct.

Thanks. All clear now.

For what it's worth, in my opinion, I don't think there's a use case where unpredictable/inconsistent behavior is preferable to consistency, whether that means deletes are always ignored on scan, or ttls are ignored, or they are enforced even if the record was re-inserted, or something else.

In any situation, even when you might have re-inserted the record, this behavior wouldn't be great, since it may result in the record being deleted on restart, or left at the generation before the delete, or at the one after the reinsert. So the inconsistency makes the edge case you reference not really OK either.

I'd personally prefer any static rule, so that I can say the server will be in a predictable state after a restart, even if that predictable state is that deletes are always ignored and I get the latest generation of records regardless of any ttl or delete. Then at least it's easier to come up with a consistent solution around consistent behavior.

I believe this is an Aerospike bug, but I understand you guys don't have the same perspective. I think non-deterministic/unpredictable behavior in the way data is loaded from disk in any data store should be considered a bug.

Anyway, we can work around this now that we are aware of the exact behavior: we will just never remove a record except when we're prepared to zero the disks to avoid inconsistent behavior.

3 Likes

Thanks for the feedback. Operationally, the cold-start-empty configuration can help in some situations.

To be clear, I do agree with you that this is an issue (whether we call it a bug or a big miss on the spec/feature) and it is on a prioritized list of things to be addressed.

It seems to be all about the scan order while loading. Since write blocks are queued and flushed to SSD, I think a timestamp could be set on each write block, and then the write blocks could be scanned in timestamp order during cold start. In this way, a deleted record stays deleted at the end, and a reinsert after a delete survives. Hope this can be of any help. It would be great to see Aerospike handle deleted records correctly across restarts.

3 Likes

Thanks for the input. I know the core team is brainstorming ideas. I am not sure your idea will fully work, though, as records get mixed up in write blocks during the defragmentation process…