The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.
Synopsis:
This article describes various scenarios in which records that were deleted non-durably (expunged) are resurrected upon cold restart.
Background:
Aerospike supports two methods of deleting records:
- Expunge - The original (and default) delete, sometimes referred to as expunge, where records are removed from the index while the corresponding entry is left on the storage layer (if the namespace is persisted). Such deletes free up the memory immediately (64 bytes per record). The block on the disk containing the value of the record will eventually be defragmented (once its used capacity falls under the defrag-lwm-pct threshold) and made available for new write transactions. Only when new write transactions overwrite such defragmented blocks are the old record values removed from the storage layer.
- Durable delete - Starting with version 3.10, a client policy allows records to be durably deleted, preventing older versions of such records from reappearing upon cold restarts (or upon addition/removal of nodes within a specific period of time). Refer to the following page for details: http://www.aerospike.com/docs/guide/durable_deletes.html.
This document refers to cold restart scenarios where records have been expunged (and not durably deleted).
Record values de-referencing from the primary index
When the server does a cold restart, the storage layer is scanned in order to rebuild the primary index. There are scenarios where records that were dereferenced from the index can get indexed again. Refer to the following page for details on cold restarts: http://www.aerospike.com/docs/operations/manage/aerospike/cold_start.
Here are different ways for a record’s value to be obsoleted on the persisted storage layer in Aerospike (dereferenced from the primary index):
i. Application deletes (including removal of the last bin of a record).
ii. Expirations (For records with ttl set).
iii. Evictions (records deleted because either the disk or the memory high water mark was breached). A record needs to have a ttl set to be evicted; a record without a ttl will neither expire nor be evicted.
iv. Updates to existing records, since Aerospike does not do in-place updates (it always writes the new record as a whole into the current streaming write buffer (swb), which always starts in a completely empty block).
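The effect of these operations can be pictured with a toy model: an append-only device plus an in-memory index. This is only an illustrative sketch; the names `device`, `index`, `write` and `expunge` are hypothetical and do not correspond to server internals.

```python
# Toy model (hypothetical names): an append-only device and a primary index.
device = []   # every version written, until its block is overwritten
index = {}    # key -> position of the current version on the device

def write(key, value):
    # Updates are never in place: a new copy is appended.
    device.append((key, value))
    index[key] = len(device) - 1

def expunge(key):
    # Only the index entry is dropped; the device is left untouched.
    index.pop(key, None)

write("k1", "v1")
write("k1", "v2")   # the "v1" copy is now dereferenced but still on the device
expunge("k1")       # the "v2" copy is now dereferenced as well

# Copies a cold restart scan would still encounter for this key.
stale_copies = [value for key, value in device if key == "k1"]
```

In this model the index is empty after the expunge, yet both copies remain on the device, which is the precondition for every resurrection scenario below.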
Last Update Time
Starting with version 3.8.3, Aerospike added the last update time of a record to its metadata, to be used for conflict resolution during cold restart. Before the introduction of last-update-time, conflict resolution during cold restart was based on generation.
In versions prior to 3.8.3
1- Record updated several times
- Record created (gen-1)
- Record updated (gen-2)
- Record updated (gen-3)
In the worst case, at the time of a cold restart all 3 versions of the record still exist on the disk (if the write transaction traffic didn’t lead to defragmentation and overwriting of the blocks the 3 values belonged to).
Upon cold restart, the persisted storage layer is scanned, but the order in which the records are encountered depends on how the records ended up distributed when initially written, and therefore, in general, the records can be encountered in any order. Let’s assume the gen-1 version of the record is scanned and re-indexed first, followed by the gen-3 version. When gen-3 is read, the already indexed version’s generation is compared and the higher generation version will replace the older generation in the index.
In this case, no matter the order in which the records are scanned, the version with the higher generation (gen-3) will ultimately make it to the index.
Note that the generation of a record is limited to 65535 and will then wrap around back to 1. Therefore, for records which are frequently updated, an older version of a record could have a higher generation than a more recent one. Depending on the different versions still present on the persisted layer, older versions can again reappear in place of newer ones.
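The generation-based resolution and the wrap-around caveat can be sketched as follows. This is a simplified model of the behavior described above, not the server's code; `cold_start_by_generation` is a hypothetical helper.

```python
from itertools import permutations

def cold_start_by_generation(versions):
    """Return the value left in the index after scanning `versions` in order.

    Each version is a (generation, value) tuple for the same key.
    """
    winner = None
    for gen, value in versions:
        # A higher generation replaces whatever is already indexed.
        if winner is None or gen > winner[0]:
            winner = (gen, value)
    return winner[1]

on_disk = [(1, "created"), (2, "first update"), (3, "second update")]

# Try every possible scan order: gen-3 wins in all of them.
results = {cold_start_by_generation(order) for order in permutations(on_disk)}

# Wrap-around: the update that follows generation 65535 gets generation 1,
# so the older copy carries the higher generation and wins.
wrapped = [(65535, "older"), (1, "newer, after wrap")]
stale_winner = cold_start_by_generation(wrapped)
```

Here `results` contains only the gen-3 value, while `stale_winner` is the older copy.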
2- Record deleted and re-created
- Record created (gen-1)
- Record deleted
- Record re-created (gen-1)
The previous example had only updates (no deletes). Here, a record is created (gen-1), deleted by the application and re-created (gen-1 again as it is now a new record from the index’s perspective). In this example, we will assume that the original version of the record still exists on the persisted storage layer (not overwritten).
When the server cold restarts, records get scanned from the disk and either version of the record (both with gen-1) could be scanned first. If these records were written with a TTL, the TTL would be used to break the tie with the void time and keep the record with the furthest void time. Since in this example both records were written without a TTL the tie cannot be broken and the record that is scanned first prevails.
Therefore, in this case, the wrong record could end up re-indexed upon cold restart.
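A minimal sketch of this tie-breaking, assuming versions are represented as `(generation, void_time, value)` tuples with `None` standing for "no TTL". The helpers `resolve` and `scan` are hypothetical. The mixed case (one version with a TTL, one without) is not covered by this article, so the sketch simply leaves the already indexed version in place.

```python
def resolve(indexed, scanned):
    """Pick the version to keep between the indexed one and a newly scanned one."""
    if scanned[0] != indexed[0]:
        return scanned if scanned[0] > indexed[0] else indexed
    both_have_ttl = indexed[1] is not None and scanned[1] is not None
    if both_have_ttl and scanned[1] != indexed[1]:
        # Same generation: the furthest void time wins.
        return scanned if scanned[1] > indexed[1] else indexed
    # Full tie (or mixed case, an assumption): the first scanned version stays.
    return indexed

def scan(order):
    indexed = order[0]
    for version in order[1:]:
        indexed = resolve(indexed, version)
    return indexed[2]

original = (1, None, "original")
recreated = (1, None, "re-created")
order_a = scan([original, recreated])   # original scanned first
order_b = scan([recreated, original])   # re-created scanned first

# With TTLs, the later void time breaks the tie regardless of scan order.
with_ttl = [(1, 100, "original"), (1, 200, "re-created")]
ttl_a = scan(with_ttl)
ttl_b = scan(with_ttl[::-1])
```

Without TTLs the outcome follows the scan order (`order_a` and `order_b` differ); with TTLs both orders give the same result.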
3- Record updated several times and then deleted
- Record created (gen-1)
- Record updated (gen-2)
- Record updated (gen-3)
- Record deleted
If none of the different versions of this record is overwritten on the persistence layer, the version of the record with the highest generation (gen-3) will end up re-appearing.
If some of the versions of the record get overwritten by new write transactions, then the version with the highest generation among those still present on the persisted layer will be re-indexed upon cold restart.
If all versions are overwritten, then this record will not re-appear.
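The three outcomes above reduce to taking the highest generation among the versions that survived on the device. As a small sketch (the helper name is hypothetical, and generation wrap-around is ignored):

```python
def resurrected_generation(surviving_generations):
    """Generation re-indexed for an expunged key, or None if nothing survived."""
    return max(surviving_generations, default=None)

all_present = resurrected_generation([1, 2, 3])     # nothing overwritten
gen3_overwritten = resurrected_generation([1, 2])   # gen-3 block was reused
nothing_left = resurrected_generation([])           # every version overwritten
```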
4- Record created without a ttl but then updated with a ttl
- Record created without a ttl (gen-1 / no ttl)
- Record updated with a ttl (gen-2 / ttl set)
- Record expires
In this case if the version with gen-2 is read first, it will be skipped as it has expired, but if the version with generation 1 is then encountered, it will be re-indexed as there is nothing in the index at this point for this record to be compared against and this older version will re-appear upon the cold restart.
If gen-1 is scanned first, it will get indexed, but if the version with gen-2 is still present, the version with gen-1 will be removed from the index as this higher generation version has expired. The correct state is then preserved in this specific case.
5- Record created with a ttl but then updated with a ttl that would make it expire sooner
- Record created with ttl1 (gen-1 / ttl1 - void time t1)
- Record updated with ttl2 (gen-2 / ttl2 - void time t2 < t1)
- Record expires
If the gen-1 version is still on the disk and a cold restart happens after the gen-2 version has expired, the gen-1 version will be resurrected if the gen-2 version is either no longer on disk (overwritten by new records after defragmentation) or is scanned first (in which case it is skipped since it has expired).
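Examples 4 and 5 can be sketched with one small model of the scan. It is a simplification under stated assumptions: versions are `(generation, void_time, value)` tuples, `None` means "no TTL", the timestamps are made-up numbers, and `cold_start` is a hypothetical helper rather than the server's logic.

```python
NOW = 1000  # hypothetical time of the cold restart

def cold_start(versions, now=NOW):
    """Return the value indexed after scanning `versions` in order, or None."""
    indexed = None
    for gen, void_time, value in versions:
        expired = void_time is not None and void_time <= now
        if indexed is None:
            # Nothing to compare against: an expired version is skipped.
            if not expired:
                indexed = (gen, void_time, value)
        elif gen > indexed[0]:
            # A higher generation wins; if it has expired, the entry is removed.
            indexed = None if expired else (gen, void_time, value)
    return indexed[2] if indexed else None

# Example 4: created without a ttl, then updated with a ttl that has passed.
ex4_gen1 = (1, None, "no ttl")
ex4_gen2 = (2, 500, "ttl set")
ex4_gen2_first = cold_start([ex4_gen2, ex4_gen1])   # gen-1 resurrected
ex4_gen1_first = cold_start([ex4_gen1, ex4_gen2])   # correct: nothing indexed

# Example 5: t2 (500) < now < t1 (2000).
ex5_gen1 = (1, 2000, "ttl1")
ex5_gen2 = (2, 500, "ttl2")
ex5_gen2_first = cold_start([ex5_gen2, ex5_gen1])   # gen-1 resurrected
ex5_gen2_gone = cold_start([ex5_gen1])              # gen-1 resurrected
ex5_gen1_first = cold_start([ex5_gen1, ex5_gen2])   # correct: nothing indexed
```

In both examples the correct state is preserved only when the older version is scanned before the expired one.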
In versions 3.8.3 and later (with last-update-time)
As mentioned earlier in this article, Aerospike introduced the last update time as part of a record’s metadata in version 3.8.3. This replaces the generation for conflict resolution during cold restart.
Let’s go over the same examples.
1- Record updated
- Record created (gen-1)
- Record updated (gen-2)
- Record updated (gen-3)
Since the version with gen-3 will be the one with the latest last-update-time it will be the one prevailing. In case of generation wrap around, the correct version of the record will still prevail given the last update time which is absolute and guarantees the most recent version of the record to win any conflict resolution.
2- Record deleted and re-created
- Record created (gen-1)
- Record deleted
- Record re-created (gen-1)
In this case, the last update time based conflict resolution guarantees that the most recent version will be re-indexed, despite potentially having 2 versions of the record with the same generation (if the initial one had not been overwritten by new write transactions).
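Examples 1 and 2 can be sketched with a resolution rule keyed on last update time. As before, this is an illustrative model: the tuples are `(last_update_time, generation, value)`, the timestamps are made up, and `cold_start_by_lut` is a hypothetical helper.

```python
from itertools import permutations

def cold_start_by_lut(versions):
    """Return the value indexed after scanning `versions` in order."""
    winner = None
    for lut, gen, value in versions:
        # Generation is ignored: the most recent last update time wins.
        if winner is None or lut > winner[0]:
            winner = (lut, gen, value)
    return winner[2]

# Example 1 with generation wrap-around: the newer copy has generation 1.
wrapped = [(100, 65535, "older"), (101, 1, "newer, after wrap")]

# Example 2: both copies have generation 1.
recreated = [(100, 1, "original"), (200, 1, "re-created")]

wrap_results = {cold_start_by_lut(p) for p in permutations(wrapped)}
recreate_results = {cold_start_by_lut(p) for p in permutations(recreated)}
```

Each result set holds a single value, the most recent version, whatever the scan order.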
3- Record updated several times and then deleted
- Record created (gen-1)
- Record updated (gen-2)
- Record updated (gen-3)
- Record deleted
This is very similar to example 3 prior to version 3.8.3: depending on which versions of the record are still present on the persisted layer, the version with the most recent last update time will be re-indexed upon cold restart.
4- Record created without a ttl but then updated with a ttl
- Record created without a ttl (gen-1 / no ttl)
- Record updated with a ttl (gen-2 / ttl set)
- Record expires
Again, this is very similar to example 4 prior to version 3.8.3. The order in which the different versions are scanned determines the version that will be re-indexed, if any.
5- Record created with a ttl but then updated with a ttl that would make it expire sooner
- Record created with ttl1 (gen-1 / ttl1 - void time t1)
- Record updated with ttl2 (gen-2 / ttl2 - void time t2 < t1)
- Record expires
If the gen-1 version is still on the disk and a cold restart happens after the gen-2 version has expired, the gen-1 version will be resurrected if the gen-2 version is either no longer on disk (overwritten by new records after defragmentation) or is scanned first (in which case it is skipped since it has expired).
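Last update time does not help once the most recent version has expired, as the following sketch of examples 4 and 5 suggests. It rests on the same kind of assumptions as the rest of this article's scenarios: versions are `(last_update_time, void_time, value)` tuples, `None` means "no TTL", all timestamps are made up, and `cold_start` is a hypothetical helper.

```python
def cold_start(versions, now):
    """Return the value indexed after scanning `versions` in order, or None."""
    indexed = None
    for lut, void_time, value in versions:
        expired = void_time is not None and void_time <= now
        if indexed is not None and lut <= indexed[0]:
            continue  # the indexed version is more recent
        # Either nothing is indexed yet or this version is more recent.
        indexed = None if expired else (lut, void_time, value)
    return indexed[2] if indexed else None

NOW = 1000

# Example 4: no ttl at creation, then an update with a ttl that has passed.
ex4_old = (10, None, "no ttl")
ex4_new = (20, 500, "ttl set")
ex4_new_first = cold_start([ex4_new, ex4_old], NOW)   # old version resurrected
ex4_old_first = cold_start([ex4_old, ex4_new], NOW)   # nothing indexed

# Example 5: t2 (500) < NOW < t1 (2000).
ex5_old = (10, 2000, "ttl1")
ex5_new = (20, 500, "ttl2")
ex5_new_first = cold_start([ex5_new, ex5_old], NOW)   # old version resurrected
ex5_new_gone = cold_start([ex5_old], NOW)             # old version resurrected
```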
XDR consideration
In an XDR setup, resurrected records, even if migrated to another node, will not be shipped to any destination cluster, as XDR only ships records resulting from direct client write transactions (the client potentially being another source XDR cluster). Those are the transactions logged in the digest log.
Keywords
COLD RESTART DELETE RESTART ZOMBIE RESURRECTED
Timestamp
03/04/2017