Thanks for posting this question. Here are details around this topic based on the current implementation.
The general mechanism for deletes:
Records are deleted from the index first (and only from the index at that point). The data on disk is erased only later, when the defrag process recombines partially used blocks to create new empty blocks that can be rewritten. This is done for performance.
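To make this concrete, here is a minimal sketch of a client-side delete with the Aerospike C client (namespace, set, and key names are placeholders): the call only drops the primary-index entry, while the record's copies on disk remain until defrag reclaims those blocks.

```c
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_error.h>
#include <aerospike/as_key.h>

// Sketch only: removes the record's entry from the primary index.
// The on-disk copies are untouched until the defrag process recycles
// the blocks that contain them.
static as_status delete_record(aerospike* as, as_error* err)
{
    as_key key;
    as_key_init_str(&key, "test", "demo", "user-123");  // placeholder names
    return aerospike_key_remove(as, err, NULL, &key);
}
```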
Now when it comes to restarting the aerospike server on a node, there are 2 modes for restarting a node:
In the case of a cold start, the index is rebuilt by scanning persistent storage, so deleted data that is still on disk (any part the defrag process has not yet reclaimed) will reappear.
In the case of a fast start (available in the Enterprise Edition), the index is kept in shared memory and is still there after the restart, so the deleted data will not reappear. (A forced cold start, or a full box reboot, will make it reappear in this case as well.)
Now coming to your particular example, as you have data in memory, the server will go through a cold start.
When you update a record's TTL to expire it (let's say you change it from 30 days to 10 seconds), the record will expire in 10 seconds and will be deleted from the index when the data is accessed by the client or when the 'nsup' thread goes over it to expire it (whichever happens first). Now, if the defrag thread has not reclaimed the blocks containing those records, two versions of the record are still on disk: the one with the 30-day TTL and the one with the 10-second TTL.
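As an illustration, a minimal sketch of that TTL change with the Aerospike C client (namespace, set, bin, and key names are placeholders): the put writes a new, higher-generation copy into a new block, leaving the 30-day copy on disk until defrag reclaims it.

```c
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_error.h>
#include <aerospike/as_key.h>
#include <aerospike/as_record.h>

// Sketch only: rewrite the record with a 10-second TTL. The previous
// 30-day version is still sitting in an older block on disk.
static as_status expire_soon(aerospike* as, as_error* err)
{
    as_key key;
    as_key_init_str(&key, "test", "demo", "user-123");  // placeholder names

    as_record rec;
    as_record_inita(&rec, 1);
    as_record_set_str(&rec, "status", "expiring");      // placeholder bin
    rec.ttl = 10;                                        // new TTL in seconds

    return aerospike_key_put(as, err, NULL, &key, &rec);
}
```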
Upon cold start, those records will be scanned. If the 30-day TTL record is scanned first, it will be re-inserted into the index. When the 10-second TTL record is then scanned, its generation will be checked and, as it is greater than the other record's, it will overwrite it; but as its TTL has expired, the record will then be deleted. Everything is fine in this case.
Now, if those records were scanned in the reverse order, the 10-second TTL record would be thrown out as expired, but then, as the 30-day TTL record is scanned, it would reappear in the index… with a 30-day TTL.
Instead of the record with the smaller TTL being 'thrown out', how about keeping it in the index, so that the later records with the longer TTL get thrown out during index rebuilding, and finally it will:
be deleted from the index when the data is accessed by the client or when the 'nsup' thread goes over it to expire it (whichever happens first).
The reason I update the record with a small TTL is that I use that small TTL as a simulated tombstone to mark the record as deleted on disk, expecting it to be filtered out during index rebuilding on cold start.
For data in memory persisted on disk, you could check the following new option, available as of 3.3.21: cold-start-empty
This has to be used carefully, and (assuming you are running a multi-node cluster) you would have to wait for migrations to finish before cold starting another node. (With your configuration, cold starting is the only option anyway.)
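For reference, a minimal aerospike.conf sketch of where this option goes, assuming a data-in-memory namespace persisted on a device (names, sizes, and the device path are placeholders):

```
namespace test {
    replication-factor 2
    memory-size 4G

    storage-engine device {
        device /dev/sdb          # placeholder device
        data-in-memory true      # data in memory, persisted on disk
        cold-start-empty true    # ignore existing device data at cold start
    }
}
```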
I have almost the same problem, and I tried using the cold-start-empty option: after migrations on all nodes finished, the deleted data did not reappear, as expected, which is good. Then I set the cold-start-empty option back to false (as it is very risky to keep it activated), but after I restarted one of the nodes, the old deleted data reappeared again. Should I keep this option set to true forever?
It depends on your use case. The options would be to either use the cold-start-empty option (which, I agree with you, is risky) or, if this is a rare occurrence, erase (dd) your device(s) when restarting each node (and, of course, again wait for migrations to finish between each node).
There are unfortunately no other options at this point. This is a very hard problem and we will continue to explore ways to address this, as mentioned earlier in this topic.
I also want to point out (just in case it is useful for your use case), Aerospike Enterprise Edition includes a Fast Start feature which keeps the index in shared memory across graceful restarts of nodes.
This part I was curious about. Do you mean that the record with a 10-second TTL is scanned, and then, if it takes longer than 10 seconds to scan the second record, the first one will already have been evicted?
Would the solution here be to just make sure the TTL on any record you want permanently removed is longer than the time to rescan the entire index from disk?
So if that particular data set took 60 seconds to scan, then a 120-second TTL on the record you want removed would be enough: the latest record would be scanned with the 120-second TTL, then the 30-day TTL record would be scanned, and since it has an older generation, the 120-second record would remain.
I guess it depends on whether the TTL counts from when the record is loaded or is based on a stored timestamp, in which case the 120 seconds would already have expired no matter what.
We've resorted to setting a flag on records to mark them as deleted, so we can go through and clean up records after a restart, and every now and again stopping instances, wiping the drives, and restarting to force expired records to go away.
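For what it's worth, a minimal sketch of that flag-based workaround with the C client (namespace, set, and bin names are placeholders): instead of calling remove, the record is rewritten with a "deleted" bin that readers filter on, so a cold start cannot resurrect anything the application still treats as live.

```c
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_error.h>
#include <aerospike/as_key.h>
#include <aerospike/as_record.h>

// Sketch only: application-level tombstone. The record (and its flag)
// survives a cold start, so readers can keep filtering it out, unlike a
// real delete that may be undone by the restart.
static as_status mark_deleted(aerospike* as, as_error* err, const char* user_key)
{
    as_key key;
    as_key_init_str(&key, "test", "queue", user_key);   // placeholder names

    as_record rec;
    as_record_inita(&rec, 1);
    as_record_set_int64(&rec, "deleted", 1);             // tombstone flag bin

    return aerospike_key_put(as, err, NULL, &key, &rec);
}
```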
Since one of the ways we use Aerospike is as a queue, there are millions of deleted records that reappear if we restart any of the nodes in the cluster. It feels a bit misleading to have a 'remove' capability that doesn't actually remove. It's more like 'mark as safe to remove' than anything.
This thread is a bit old, so I'm hoping there's been some progress or better strategies to ensure data is permanently removed.
Sorry for not being clearer in the explanation. This has nothing to do with the TTL value itself. Let me try explaining it differently.
Data on disk only gets 'erased' when it is written over.
At any point in time, if records have been updated multiple times and are still on the disk, you will have multiple versions of the same record on disk, each with a different 'generation' (version ID).
Upon cold start, all records on disk are scanned. As they are scanned, they are inserted in the index.
If a different version of the same record is scanned later, its generation is checked against the record already in the index: the scanned version is thrown out if its generation is older, or it overwrites the indexed record if its generation is newer.
Here is where this breaks down:
If a record has already expired (it doesn't matter whether the TTL was initially 10 seconds or 10 months), then when it is scanned, its generation is checked against any record already in the index; if one is found with an older generation, that indexed record is deleted, and the expired record from disk is skipped.
Now, as scanning continues, if yet another version of this same record is found on disk, there is nothing left in the index to check it against, and if that version has not expired, it will find its way back into the index.
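To illustrate the rules above, here is a small, self-contained toy simulation (not Aerospike code) of one record digest with three on-disk versions, generations 1 and 2 live and generation 3 expired; it shows how the scan order decides what, if anything, ends up back in the index:

```c
#include <stdbool.h>
#include <stdio.h>

// Toy model of the cold-start scan rules described above, for a single digest.
typedef struct { int gen; bool expired; } version;

// indexed_gen == 0 means "not in the index". Returns the generation left
// in the index after scanning one on-disk version.
static int scan_one(int indexed_gen, version v)
{
    if (v.expired) {
        // An expired version evicts an older indexed version but is never
        // inserted itself, leaving nothing to compare later versions against.
        return (indexed_gen != 0 && indexed_gen < v.gen) ? 0 : indexed_gen;
    }
    if (indexed_gen == 0 || indexed_gen < v.gen) {
        return v.gen;       // insert, or overwrite an older indexed version
    }
    return indexed_gen;     // an older on-disk version is simply skipped
}

int main(void)
{
    version v1 = {1, false}, v2 = {2, false}, v3 = {3, true};

    // Order v2, v3, v1: v3 evicts v2, then the old v1 is resurrected.
    int idx = 0;
    idx = scan_one(idx, v2);
    idx = scan_one(idx, v3);
    idx = scan_one(idx, v1);
    printf("scan order v2,v3,v1 -> indexed gen %d (0 = none)\n", idx);  // 1

    // Order v1, v2, v3: v3 evicts v2 and nothing is left, as intended.
    idx = 0;
    idx = scan_one(idx, v1);
    idx = scan_one(idx, v2);
    idx = scan_one(idx, v3);
    printf("scan order v1,v2,v3 -> indexed gen %d (0 = none)\n", idx);  // 0
    return 0;
}
```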
Yes, if we know we have a lot of extra memory around, one way of solving this particular issue would be to keep track of all records scanned to make sure this doesn't happen. But this would not cover the case of a record being deleted and then later reinserted.
Hope this helps understand the problem.
We are definitely looking at addressing this; it is just complex to figure out the best way to do it with minimal impact on performance and in a way that would accommodate all use cases.
Hello Hanson,
Can you please share your C sample code for setting data in Aerospike?
I am not able to set a string-type value in Aerospike using variables; with integers I am able to do that.
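For reference, a minimal sketch of setting a string bin from a C variable with the Aerospike C client (namespace, set, and bin names are placeholders); one common pitfall is that as_record_set_str does not copy the buffer, so the variable must remain valid until the put returns.

```c
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_error.h>
#include <aerospike/as_key.h>
#include <aerospike/as_record.h>

// Sketch only: writes one string bin (from a variable) and one integer bin.
// The buffer behind name_value must stay valid until aerospike_key_put returns,
// because as_record_set_str stores the pointer rather than copying the string.
static as_status put_string(aerospike* as, as_error* err, const char* name_value)
{
    as_key key;
    as_key_init_str(&key, "test", "demo", "user-123");   // placeholder names

    as_record rec;
    as_record_inita(&rec, 2);
    as_record_set_str(&rec, "name", name_value);          // string bin from a variable
    as_record_set_int64(&rec, "age", 42);                 // integer bin for comparison

    return aerospike_key_put(as, err, NULL, &key, &rec);
}
```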
Sorry to beat a dead horse here, but I want to make sure I correctly understand the ramifications of removing a record, then.
If I have a record with the following history:
gen 0 - created
gen 1 - updated
Then, no matter what, after a restart I'll always end up with gen 1. Now, if I have the following:
gen 0 - created
gen 1 - updated
gen 2 - remove or add a ttl
Now, if I restart, I could end up with gen 0 OR gen 1 depending on the order in which the records are scanned. Really, I could end up with any generation in the history of the object if I add a TTL or remove it. If the scan order is 1, 2, 0 we end up with gen 0; if the order is 0, 2, 1 we end up with gen 1.
It seems the only way to ensure you have the latest generation of an object is to simply never remove it, since removing it might result in a restart giving you a random generation of the record from anywhere in its lifecycle.
It seems there should be some strong warnings around calling remove, since it can result in pretty unpredictable behavior on a restart.
For what it's worth, in my opinion, I don't think there's a use case where unpredictable/inconsistent behavior is preferable to consistency; whether the rule is that deletes are always ignored on scan, or that TTLs are ignored, or that they are enforced even if the record was re-inserted, or something else.
In any situation, even when you might have re-inserted the record, this behavior wouldn't be great, since a restart may leave the record deleted, or at the generation before the delete, or at the one after the reinsert. So the inconsistency makes the edge case you reference not really OK either.
I'd personally prefer any static rule, so that I can say the server will be in a predictable state after a restart, even if that predictable state is that deletes are always ignored and I get the latest generation of records regardless of any TTL or delete. Then at least it's easier to come up with a consistent solution on top of consistent behavior.
I believe this is an Aerospike bug, but I understand you guys don't share that perspective. I think non-deterministic/unpredictable behavior in the way data is loaded from disk should be considered a bug in any data store.
Anyway, we can work around this now that we are aware of the exact behavior; we will just never remove a record except when we're prepared to zero the disks to avoid inconsistent behavior.
Thanks for the feedback. Operationally, the cold-start-empty configuration can help in some situations.
To be clear, I do agree with you that this is an issue (whether we call it a bug or a big miss on the spec/feature) and it is on a prioritized list of things to be addressed.
It seems to be all about the scan order while loading. As write blocks are queued and flushed to SSD, I think a timestamp could be set on each write block, and the blocks could then be scanned in timestamp order during cold start. That way, a deleted record would end up deleted at the end, and a reinsert after a delete would stay.
Hope this helps. It would be great to see Aerospike handle deleted records correctly on restart.
Thanks for the input. I know the core team is brainstorming ideas. I am not sure your idea will fully work, though, as records get mixed up across write blocks during the defragmentation process…