Issues with cold-start resurrecting deleted records

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Synopsis:

This article describes scenarios in which records that were deleted non-durably (expunged) are resurrected by a cold restart.

Background:

Aerospike supports two methods of deleting records:

  1. Expunge - The original (and default) delete. The record is removed from the index, but its entry is left on the storage layer (if the namespace is persisted). Such deletes free up the index memory immediately (64 bytes per record). The block on disk containing the record’s value is eventually defragmented (once its used capacity falls under the defrag-lwm-pct threshold) and made available for new write transactions. The old record values are removed from the storage layer only when new write transactions overwrite such defragmented blocks.

  2. Durable delete - Starting with version 3.10, a client policy allows records to be durably deleted, which prevents older versions of such records from reappearing upon cold restarts (or upon addition/removal of nodes within a specific period of time). Refer to the following page for details: http://www.aerospike.com/docs/guide/durable_deletes.html.

This document refers to cold restart scenarios where records have been expunged (and not durably deleted).

Record values de-referencing from the primary index

When the server does a cold restart, the storage layer is scanned in order to rebuild the primary index. There are scenarios in which records that had been dereferenced from the index get indexed again. Refer to the following page for details on cold restarts: http://www.aerospike.com/docs/operations/manage/aerospike/cold_start.
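The mechanism can be illustrated with a toy model. This is only a sketch: the class and method names below are hypothetical, the "device" is a plain Python list, and the rebuild ignores conflict resolution entirely (the last version found on the device wins). It only shows why dropping an index entry does not protect against a rescan of the storage layer.

```python
# Toy model only: an in-memory index plus an append-only "device".
# None of these names come from Aerospike itself.

class ToyNamespace:
    def __init__(self):
        self.device = []   # every version ever written, in write order
        self.index = {}    # key -> position of the live version on the device

    def write(self, key, value):
        self.device.append((key, value))
        self.index[key] = len(self.device) - 1

    def expunge(self, key):
        # Non-durable delete: only the index entry is dropped,
        # the value stays on the device until it is overwritten.
        self.index.pop(key, None)

    def cold_restart(self):
        # Rebuild the index purely from what is still on the device.
        self.index = {}
        for pos, (key, _value) in enumerate(self.device):
            self.index[key] = pos

    def read(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.device[pos][1]


ns = ToyNamespace()
ns.write("user1", "v1")
ns.expunge("user1")
before = ns.read("user1")   # the record is gone from the index
ns.cold_restart()
after = ns.read("user1")    # the old value is indexed again
print(before, after)
```

In this model the deleted record is unreadable before the restart and readable again afterwards, which is the resurrection this article is about.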

Here are different ways for a record’s value to be obsoleted on the persisted storage layer in Aerospike (dereferenced from the primary index):

i. Application deletes (including removal of the last bin of a record).

ii. Expirations (for records with a ttl set).

iii. Evictions (records deleted because either the disk or the memory high water mark was breached). A record needs a ttl to be evicted; a record with no ttl will neither expire nor be evicted.

iv. Updates to existing records, since Aerospike does not do in-place updates (it always writes the whole record anew into the current streaming write buffer (swb), which always starts in a completely empty block).

Last Update Time

Starting with version 3.8.3, Aerospike stores the last update time of a record as part of its metadata, to be used for conflict resolution during cold restart. Before the introduction of last-update-time, conflict resolution during cold restart was based on generation.

In versions prior to 3.8.3

1- Record updated several times

  • Record created (gen-1)
  • Record updated (gen-2)
  • Record updated (gen-3)

In the worst case, all 3 versions of the record still exist on the disk at the time of a cold restart (if the write traffic didn’t lead to defragmentation and overwriting of the blocks the 3 versions belonged to).

Upon cold restart, the persisted storage layer is scanned, but the order in which the records are encountered depends on how the records ended up distributed when initially written. In general, therefore, the records can be encountered in any order. Let’s assume the gen-1 version of the record is scanned and re-indexed first, followed by the gen-3 version. When gen-3 is read, its generation is compared with that of the already indexed version, and the higher-generation version replaces the lower one in the index.

In this case, no matter the order in which the records are scanned, the version with the higher generation (gen-3) will ultimately make it to the index.
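The order-independence claim can be checked with a small simulation. This is a simplified sketch of the comparison described above, not Aerospike’s implementation; the function name and the dictionary layout are hypothetical.

```python
from itertools import permutations

def rebuild_by_generation(scanned_versions):
    """Return the version left in the index after scanning in the given order."""
    indexed = None
    for version in scanned_versions:
        # A scanned version replaces the indexed one only if its generation is higher.
        if indexed is None or version["gen"] > indexed["gen"]:
            indexed = version
    return indexed

versions = [
    {"gen": 1, "value": "a"},
    {"gen": 2, "value": "b"},
    {"gen": 3, "value": "c"},
]

# Try every possible scan order and collect the generation that ends up indexed.
winners = {rebuild_by_generation(order)["gen"] for order in permutations(versions)}
print(winners)
```

Every one of the six scan orders leaves gen-3 in the index, so the set of winners contains a single element.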

Note that the generation of a record is limited to 65535 and will then wrap around back to 1. Therefore, for records which are frequently updated, an older version of a record could have a higher generation than a more recent one. Depending on the different versions still present on the persisted layer, older versions can again reappear in place of newer ones.
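The wrap-around problem can be shown in a few lines. The limit of 65535 and the wrap back to 1 are taken from the paragraph above; the helper name and record layout are hypothetical, and the comparison is the same simplified generation check used for illustration elsewhere.

```python
MAX_GEN = 65535  # generation limit stated in the text above

def next_generation(gen):
    # After the limit is reached, the generation wraps around back to 1.
    return 1 if gen >= MAX_GEN else gen + 1

older = {"gen": MAX_GEN, "value": "older"}
newer = {"gen": next_generation(older["gen"]), "value": "newer"}

# Generation-based comparison: the higher generation is kept.
winner = older if older["gen"] > newer["gen"] else newer
print(newer["gen"], winner["value"])
```

Here the more recent write carries generation 1, so a purely generation-based comparison keeps the older version.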

2- Record deleted and re-created

  • Record created (gen-1)
  • Record deleted
  • Record re-created (gen-1)

The previous example had only updates (no deletes). Here, a record is created (gen-1), deleted by the application and re-created (gen-1 again as it is now a new record from the index’s perspective). In this example, we will assume that the original version of the record still exists on the persisted storage layer (not overwritten).

When the server cold restarts, records get scanned from the disk and either version of the record (both with gen-1) could be scanned first. If these records had been written with a TTL, the tie would be broken using the void time, keeping the record with the furthest void time. Since in this example both records were written without a TTL, the tie cannot be broken and the record that is scanned first prevails.

Therefore, in this case, the wrong record could end up re-indexed upon cold restart.
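A sketch of this tie can make the scan-order dependency concrete. It assumes the comparison order described above (generation first, then void time) and uses `None` to mean "no TTL"; the function names and record fields are hypothetical, and mixed TTL/no-TTL ties are deliberately left unresolved since the text does not cover them.

```python
def keep(indexed, candidate):
    """Decide which of two versions stays in the index (simplified model)."""
    if candidate["gen"] != indexed["gen"]:
        return max(indexed, candidate, key=lambda v: v["gen"])
    both_have_ttl = indexed["void_time"] is not None and candidate["void_time"] is not None
    if both_have_ttl and candidate["void_time"] != indexed["void_time"]:
        # Same generation: keep the version with the furthest void time.
        return max(indexed, candidate, key=lambda v: v["void_time"])
    return indexed  # tie cannot be broken: the version scanned first stays

def rebuild(scan_order):
    indexed = scan_order[0]
    for version in scan_order[1:]:
        indexed = keep(indexed, version)
    return indexed

# No TTL on either version: the result depends on the scan order.
original = {"gen": 1, "void_time": None, "value": "original"}
recreated = {"gen": 1, "void_time": None, "value": "re-created"}
no_ttl_results = {rebuild(order)["value"] for order in ([original, recreated], [recreated, original])}

# Both versions written with a TTL: the furthest void time wins in either order.
original_ttl = {"gen": 1, "void_time": 1000, "value": "original"}
recreated_ttl = {"gen": 1, "void_time": 2000, "value": "re-created"}
ttl_results = {rebuild(order)["value"] for order in ([original_ttl, recreated_ttl], [recreated_ttl, original_ttl])}
print(no_ttl_results, ttl_results)
```

Without a TTL both outcomes are possible; with void times set, the same version is kept regardless of scan order.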

3- Record updated several times and then deleted

  • Record created (gen-1)
  • Record updated (gen-2)
  • Record updated (gen-3)
  • Record deleted

If none of the versions of this record is overwritten on the persistence layer, the version of the record with the highest generation (gen-3) will end up re-appearing.

If some of the versions of the record get overwritten by new write transactions, then the version with the highest generation among those still present on the persisted layer will be re-indexed upon cold start.

If all versions are overwritten, then this record will not re-appear.
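The three outcomes above reduce to taking the highest surviving generation. The helper below is a hypothetical illustration of that rule, not server code.

```python
def resurrected_generation(surviving_generations):
    """Generation re-indexed for a deleted record, or None if nothing survived on disk."""
    return max(surviving_generations, default=None)

all_survive = resurrected_generation([1, 2, 3])     # nothing overwritten
gen3_overwritten = resurrected_generation([1, 2])   # only gen-3 was overwritten
none_survive = resurrected_generation([])           # every version overwritten
print(all_survive, gen3_overwritten, none_survive)
```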

4- Record created without a ttl but then updated with a ttl

  • Record created without a ttl (gen-1 / no ttl)
  • Record updated with a ttl (gen-2 / ttl set)
  • Record expires

In this case, if the gen-2 version is read first, it is skipped because it has expired. If the gen-1 version is then encountered, it is re-indexed, as there is nothing in the index at this point for it to be compared against, and this older version re-appears upon the cold restart.

If gen-1 is scanned first, it gets indexed. If the gen-2 version is still present and is scanned afterwards, the gen-1 version is removed from the index, since the higher-generation version has expired. The correct state is then preserved in this specific case.
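The two scan orders can be compared with a small simulation. It is a simplified model of the behaviour described above: an expired version is never indexed, but it still displaces a lower-generation version that is already in the index. The function name, the record fields, and the use of `None` for "no ttl" are assumptions made for illustration.

```python
def rebuild(scan_order, now):
    """Return the version left in the index after a cold-restart scan (toy model)."""
    indexed = None
    for version in scan_order:
        if indexed is not None and version["gen"] <= indexed["gen"]:
            continue  # not newer than what is already indexed
        expired = version["void_time"] is not None and version["void_time"] <= now
        if expired:
            # A newer but expired version removes the older one and is not indexed itself.
            indexed = None
        else:
            indexed = version
    return indexed

gen1 = {"gen": 1, "void_time": None}   # created without a ttl
gen2 = {"gen": 2, "void_time": 500}    # updated with a ttl, expired by the time of the restart
now = 1000

expired_first = rebuild([gen2, gen1], now)   # gen-2 skipped, gen-1 resurrected
expired_last = rebuild([gen1, gen2], now)    # gen-1 indexed, then removed
print(expired_first, expired_last)
```

The same model reproduces example 5 if gen-1 is given a void time later than `now`.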

5- Record created with a ttl but then updated with a ttl that would make it expire sooner

  • Record created with ttl1 (gen-1 / ttl1 - void time t1)
  • Record updated with ttl2 (gen-2 / ttl2 - void time t2 < t1)
  • Record expires

Suppose the gen-1 version is still on the disk and a cold restart happens after the gen-2 version has expired. If the gen-2 version is no longer on disk (overwritten by new records after defragmentation), or if it is scanned first (and skipped since it has expired), the gen-1 version will be resurrected.

In versions 3.8.3 and later (with last-update-time)

As mentioned earlier in this article, Aerospike introduced the last update time as part of a record’s metadata in version 3.8.3. This replaces the generation for conflict resolution during cold restart.

Let’s go over the same examples.

1- Record updated

  • Record created (gen-1)
  • Record updated (gen-2)
  • Record updated (gen-3)

Since the version with gen-3 has the latest last-update-time, it prevails. In case of generation wrap-around, the correct version still prevails, because the last update time is absolute and guarantees that the most recent version of the record wins any conflict resolution.
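A sketch of last-update-time based resolution, using a wrapped generation as the test case. As before, this is a simplified model with hypothetical names; `lut` is just an integer timestamp here.

```python
def rebuild_by_lut(scan_order):
    """Return the version left in the index when the latest last-update-time wins."""
    indexed = None
    for version in scan_order:
        if indexed is None or version["lut"] > indexed["lut"]:
            indexed = version
    return indexed

# The newer write has a lower generation because the generation wrapped around.
older = {"gen": 65535, "lut": 1000, "value": "older"}
newer = {"gen": 1, "lut": 2000, "value": "newer"}

results = {rebuild_by_lut(order)["value"] for order in ([older, newer], [newer, older])}
print(results)
```

The generation is ignored altogether, so the more recent write is kept in both scan orders. The same comparison settles the delete-and-re-create case, where both versions carry gen-1.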

2- Record deleted and re-created

  • Record created (gen-1)
  • Record deleted
  • Record re-created (gen-1)

In this case, the last update time based conflict resolution guarantees that the most recent version will be re-indexed, despite potentially having 2 versions of the record with the same generation (if the initial one had not been overwritten by new write transactions).

3- Record updated several times and then deleted

  • Record created (gen-1)
  • Record updated (gen-2)
  • Record updated (gen-3)
  • Record deleted

This is very similar to example 3 for versions prior to 3.8.3: of the versions of the record still present on the persisted layer, the one with the most recent last update time is re-indexed upon cold restart.

4- Record created without a ttl but then updated with a ttl

  • Record created without a ttl (gen-1 / no ttl)
  • Record updated with a ttl (gen-2 / ttl set)
  • Record expires

Again, this is very similar to example 4 for versions prior to 3.8.3. The order in which the different versions are scanned determines the version that will be re-indexed, if any.

5- Record created with a ttl but then updated with a ttl that would make it expire sooner

  • Record created with ttl1 (gen-1 / ttl1 - void time t1)
  • Record updated with ttl2 (gen-2 / ttl2 - void time t2 < t1)
  • Record expires

Suppose the gen-1 version is still on the disk and a cold restart happens after the gen-2 version has expired. If the gen-2 version is no longer on disk (overwritten by new records after defragmentation), or if it is scanned first (and skipped since it has expired), the gen-1 version will be resurrected.

XDR consideration

In an XDR setup, resurrected records, even if migrated to another node, are not shipped to any destination cluster, as XDR only ships records resulting from direct client write transactions (the client potentially being another source XDR cluster). Those are the transactions logged in the digest log.

Keywords

COLD RESTART DELETE RESTART ZOMBIE RESURRECTED

Timestamp

03/04/2017

I need information on the following:

Say an Aerospike (Community Edition) cluster has 4 nodes named A, B, C & D. A replication factor of 2 is configured. Record X is stored on node A as master and its replica is on nodes B & C. Now:

Record X updated several times

  • Record created (gen-1)
  • Record updated (gen-2)
  • Record updated (gen-3)

During replication, will all of the above versions of record X be replicated to nodes B & C one by one? That is, will nodes B & C have all of these versions (gen-1, gen-2 & gen-3)?

Let’s say master node A, which contains the master copy of record X, becomes unavailable. Then, as per

https://docs.aerospike.com/docs/architecture/data-distribution.html

  • Say node B becomes the master for record X after node A becomes unavailable.

  • Will node B contain all the versions of record X after replication, so that gen-3 is indexed after a cold restart of node B?

  • Can this happen? gen-1 and gen-2 of record X were replicated only to node A, and gen-3 to B. If so, say node A cold starts and does not have gen-3; then gen-2 will be indexed, which is wrong.

Only a single version would be retained. As a record is replicated from the master to the replica the generation is the same between both of them. Older versions of the records may still exist on disk until the blocks they are contained within are defragmented and subsequently overwritten.

Only 1 version of the record would ever be valid within the cluster. If you have the cold start resurrection issue as described above, when the cold starting node returns to the cluster, we’ll migrate and a ‘winner’ will be chosen. We use either LUT or generation to decide which ‘wins’.

So, if node A has generation 2 and shuts down. Node B is the master and C the replica. Record is still generation 2.

We take updates and go to generation 5 (let’s say).

A cold starts and comes back with record X generation 2. Even if we then lose B, we still do conflict resolution and the gen 5 record present on node C wins; it would then migrate to A.

Does that answer your question?
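The scenario in the reply above can be sketched as picking a winner among the copies held by different nodes. This is a toy illustration: the function, the record layout, and the timestamps are hypothetical, and the two policy strings simply stand for the two choices mentioned in this thread (generation or last update time).

```python
def resolve(copies, policy="generation"):
    """Pick the winning copy of a record among nodes (toy model of conflict resolution)."""
    field = "gen" if policy == "generation" else "lut"
    return max(copies, key=lambda copy: copy[field])

# Node A cold starts and comes back with the stale generation-2 copy;
# node C holds the copy that was updated to generation 5 in the meantime.
node_a = {"node": "A", "gen": 2, "lut": 100}
node_c = {"node": "C", "gen": 5, "lut": 400}

winner_by_gen = resolve([node_a, node_c], policy="generation")
winner_by_lut = resolve([node_a, node_c], policy="last-update-time")
print(winner_by_gen["node"], winner_by_lut["node"])
```

Under either policy the copy on node C wins here, and that is the version that would then be migrated to node A.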

Thanks @BenBates for the quick reply. So in a nutshell, for all updates of data (no deletes), only 1 version of the record would ever be valid within the cluster, and it will be the latest generation (latest updated), right? That is, only the latest updated version will be valid in the cluster?

You can define conflict resolution policy to use either generation or last update time.

The risk here is less about reverting versions (though I could think of an edge case where cold start resurrection might look like it’s reverted a version) and more about a record that was actually deleted coming back into the cluster.

If you’re concerned, then durable deletes guard against this phenomenon 100%.

Currently we are only updating records, i.e. using put to override the value (not deleting). We won’t be deleting records. We will override the value by calling put multiple times on the same key, as per our scenarios.

So in that case, will conflict resolution using last update time work fine in all scenarios for the above use case?

If it will work fine for any cold start scenario, can you give me a reference link or example on configuring conflict resolution to use last update time? We are using the Java client.

If you are not deleting any records, then a cold restart will always bring back the latest version of the record (regardless of the conflict resolution policy configured).

If records were updated while the node was down, then during migrations, after the node is restarted, the conflict resolution policy will be used. In that case, using last update time will make sure that in the edge case of generation wrap-around (at 64Ki) you will get the most recent version.

The conflict resolution policy will also impact split brain situations, though, in AP mode.

Thank you @meher & @BenBates for help to get my queries answered. Appreciate the quick response for my queries.