KEY_NOT_FOUND (Error 2) on put() with UPDATE + EXPECT_GEN_EQUAL when record is deleted or expired concurrently

Summary

We are observing KEY_NOT_FOUND (ResultCode 2) errors on AerospikeClient.put() calls when using RecordExistsAction.UPDATE combined with GenerationPolicy.EXPECT_GEN_EQUAL. Per the Aerospike documentation, RecordExistsAction.UPDATE should “Create or update record. Merge write command bins with existing bins.”

The error occurs when a record ceases to exist between our application’s read and subsequent write — either because a concurrent cleanup process deleted it, or because the record’s TTL expired. Both the initial write attempt and our application-level retry fail with the same error, resulting in the incoming data being silently dropped.

We are seeking clarification on whether this is expected behavior and guidance on the recommended approach.

Architecture & Scenario

We have two independent processes operating on the same Aerospike namespace/set:

1. Writer Process (UPC) — Performs a read-modify-write cycle on user profile records using selective bin-level updates:

- Reads the record (captures record.generation for optimistic concurrency)

- Modifies specific bins (adds/updates audience data)

- Writes back using RecordExistsAction.UPDATE + GenerationPolicy.EXPECT_GEN_EQUAL with the generation from the read

2. Cleanup Process — Periodically scans for stale or empty records and deletes them using GenerationPolicy.EXPECT_GEN_EQUAL to avoid deleting records modified since the scan. Durable deletes are enabled (durableDelete = true).
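The cleanup process's generation guard can be pictured with a small in-memory model. This is a sketch under stated assumptions, not Aerospike code: `GuardedDeleteModel` and its string result names are hypothetical, a record starts at generation 1 on its first write, and only the generation check on delete is modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of generation-guarded deletes (not the Aerospike server).
final class GuardedDeleteModel {
    // Live records only: key -> current generation.
    final Map<String, Integer> gens = new HashMap<>();

    // Every successful write bumps the generation; a new record starts at generation 1.
    int write(String key) {
        return gens.merge(key, 1, Integer::sum);
    }

    // Cleanup delete guarded by the generation captured during the scan.
    String deleteIfUnchanged(String key, int scannedGen) {
        Integer current = gens.get(key);
        if (current == null) {
            return "KEY_NOT_FOUND";
        }
        if (current != scannedGen) {
            return "GENERATION_ERROR"; // modified after the scan: keep the record
        }
        gens.remove(key);
        return "OK";
    }
}
```

The guard only protects in one direction: it keeps the cleanup from deleting a record that the writer changed after the scan, but nothing stops the cleanup from deleting a record that the writer has already read and is about to update, which is the race in Scenario A.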

The race condition occurs in two scenarios:

Scenario A — Concurrent delete:

T1: Writer reads record (generation=N)

T2: Cleanup process deletes the record (generation check passes, record removed)

T3: Writer writes with generation=N → KEY_NOT_FOUND

Scenario B — TTL expiration:

T1: Writer reads record (generation=N, TTL is near expiry)

T2: Record TTL expires, Aerospike removes the record

T3: Writer writes with generation=N → KEY_NOT_FOUND
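From the writer's side, the two scenarios are indistinguishable: at T3 the record is simply gone. The hypothetical `RaceModel` below replays that timeline in memory. It returns KEY_NOT_FOUND for a generation-checked update of a missing record only because that is the result reported in this post; it models the observed behavior, not the server's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of Scenarios A and B. It reproduces the result code reported
// in this post; it is not the Aerospike server implementation.
final class RaceModel {
    final Map<String, Integer> gens = new HashMap<>(); // live records only: key -> generation

    Integer read(String key) {          // current generation, or null if absent
        return gens.get(key);
    }

    void removeExternally(String key) { // cleanup delete (A) or TTL expiry (B)
        gens.remove(key);
    }

    // UPDATE + EXPECT_GEN_EQUAL, with the missing-record result observed in this post.
    String updateExpectGen(String key, int expectedGen) {
        Integer current = gens.get(key);
        if (current == null) {
            return "KEY_NOT_FOUND";     // observed ResultCode 2
        }
        if (current != expectedGen) {
            return "GENERATION_ERROR";  // modified by another writer since the read
        }
        gens.put(key, current + 1);
        return "OK";
    }

    // T0..T3 from the timelines: read at generation n, record disappears, stale write.
    static String replay(String key, int n) {
        RaceModel db = new RaceModel();
        db.gens.put(key, n);                      // T0: record exists at generation N
        int readGen = db.read(key);               // T1: writer captures generation N
        db.removeExternally(key);                 // T2: delete or TTL expiry
        return db.updateExpectGen(key, readGen);  // T3: write with generation N
    }
}
```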


Relevant Write Policy Configuration (Writer Process)

  WritePolicy writePolicy = new WritePolicy();
  writePolicy.recordExistsAction  = RecordExistsAction.UPDATE;
  writePolicy.generationPolicy    = GenerationPolicy.EXPECT_GEN_EQUAL;
  writePolicy.generation          = readGeneration;  // readGeneration: generation from the prior read, or 0 if the record was not found
  writePolicy.socketTimeout       = 250;   // milliseconds
  writePolicy.totalTimeout        = 300;   // milliseconds
  writePolicy.maxRetries          = 2;
  writePolicy.sleepBetweenRetries = 0;
  writePolicy.commitLevel         = CommitLevel.COMMIT_ALL;
  writePolicy.durableDelete       = false;
  writePolicy.sendKey             = true;

We can provide additional configuration details (read policy, cleanup process policy, batch policy, etc.) if needed.


Application-Level Retry Logic

When the first write fails (any failure, not just KEY_NOT_FOUND), the application retries once:

1. Re-reads the record from Aerospike

2. Reapplies the modifications to the fresh data

3. Writes again using UPDATE + EXPECT_GEN_EQUAL with the new generation

However, when the record no longer exists:

1. The re-read returns null (no record found)

2. The application constructs a new profile with version = null, which maps to generation = 0

3. The write is issued with RecordExistsAction.UPDATE + GenerationPolicy.EXPECT_GEN_EQUAL + generation = 0

  • This also fails with KEY_NOT_FOUND

  • Both attempts fail, and the incoming data is silently dropped.
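The retry flow described in this section reduces to a few lines, which makes the failure mode visible: when the re-read returns nothing, the second write still carries a generation check, so it can fail the same way as the first. In this sketch, `OneRetryFlow`, `readGen`, and `writeWithGen` are hypothetical stand-ins for the application's re-read and its UPDATE + EXPECT_GEN_EQUAL write.

```java
import java.util.function.IntFunction;
import java.util.function.Supplier;

// Hypothetical outline of the application-level retry described above.
final class OneRetryFlow {
    // readGen:      re-reads the record; returns its generation, or null if it is absent
    // writeWithGen: UPDATE + EXPECT_GEN_EQUAL write; returns "OK" or an error name
    static boolean writeWithOneRetry(Supplier<Integer> readGen,
                                     IntFunction<String> writeWithGen,
                                     int firstGen) {
        if ("OK".equals(writeWithGen.apply(firstGen))) {
            return true;                               // first attempt succeeded
        }
        Integer version = readGen.get();               // re-read after any failure
        int gen = (version == null) ? 0 : version;     // version = null maps to generation 0
        return "OK".equals(writeWithGen.apply(gen));   // still generation-checked
    }
}
```

A `false` return corresponds to the point where the incoming data is currently dropped.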


Error Details

Exception from production logs:

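  com.aerospike.client.AerospikeException: Error 2,1,0,250,300,2,BB9<redacted> <node-ip>:<port>

  Error field breakdown: Error <resultCode>,<iteration>,<connectTimeout>,<socketTimeout>,<totalTimeout>,<maxRetries>[,inDoubt],<node-name> <node-ip>:<port>
  - resultCode = 2 (KEY_NOT_FOUND)
  - connectTimeout = 0 (default), socketTimeout = 250, totalTimeout = 300, maxRetries = 2, matching the write policy
  - inDoubt is absent (the client appends ",inDoubt" only when the write may have been applied), so the server definitively responded; this is not a timeout ambiguity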

**Observed error rate:** The errors have been occurring for some time, with peaks of ~930 errors/sec and recurring bursts in the 100-600 errors/sec range. The pattern aligns with the cleanup process's scan cycles.
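For deciding what to do after a failed write, the result code and the in-doubt flag carry most of the information; with the Java client they are available as `AerospikeException.getResultCode()` and `AerospikeException.getInDoubt()`. The sketch below is a hypothetical classifier (`WriteFailure` and its `Action` values are illustrative names). Its constants mirror `ResultCode.KEY_NOT_FOUND_ERROR` (2), `ResultCode.GENERATION_ERROR` (3), and `ResultCode.KEY_EXISTS_ERROR` (5), and it assumes the incoming data may be applied to an empty profile when the record is gone.

```java
// Hypothetical classifier; the constants mirror com.aerospike.client.ResultCode values.
final class WriteFailure {
    static final int KEY_NOT_FOUND_ERROR = 2;
    static final int GENERATION_ERROR = 3;
    static final int KEY_EXISTS_ERROR = 5;

    enum Action { RETRY_AS_CREATE, REREAD_AND_RETRY, VERIFY_BEFORE_RETRY, GIVE_UP }

    static Action classify(int resultCode, boolean inDoubt) {
        if (inDoubt) {
            return Action.VERIFY_BEFORE_RETRY; // the write may have been applied; re-read first
        }
        switch (resultCode) {
            case KEY_NOT_FOUND_ERROR:
                return Action.RETRY_AS_CREATE;  // record deleted or expired since the read
            case GENERATION_ERROR:
            case KEY_EXISTS_ERROR:
                return Action.REREAD_AND_RETRY; // another writer got there first
            default:
                return Action.GIVE_UP;
        }
    }
}
```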

Questions

1. When RecordExistsAction.UPDATE is used with GenerationPolicy.EXPECT_GEN_EQUAL and the target record does not exist (either deleted or TTL-expired), is KEY_NOT_FOUND the expected server response? The documentation describes UPDATE as “Create or update record” — we want to confirm whether the “create” semantic is suppressed when a generation policy is set.

2. Specifically for generation = 0 on a non-existent key: is there any combination of generation value and policy that would allow UPDATE to create the record, or does EXPECT_GEN_EQUAL unconditionally require an existing record to compare against?

3. Given that durable deletes are enabled on the cleanup process, does the resulting tombstone interact with the generation check in any way? (i.e., does a tombstone retain a generation that could be matched, or is it transparent to subsequent writes?)

4. What is the recommended approach for maintaining optimistic concurrency control (generation checks) on writes while gracefully handling the case where the record was legitimately removed between read and write?

Environment

  • Storage: data on disk, indexes in memory

  • Replication factor: 2

  • Client: com.aerospike:aerospike-client-jdk8:10.1.0

Any guidance from the Aerospike team or community would be greatly appreciated.

Thank you.

No, I would expect a GENERATION_ERROR (ResultCode 3).

No, this seems unusual. This policy is usually used when an update is based on information in the prior generation. If the record no longer exists, then the information it was based on seems to have been invalidated.

When transitioning from a tombstone to a live record, the generation check treats the generation as though it were 0, but the stored generation actually continues to increment:

Action          r->generation  r->tombstone  gen_check sees  Client sees
─────────────── ────────────── ───────────── ─────────────── ───────────
(record created)     0              0              -              -
                     │
write #1             1              0              1              1
                     │
write #2             2              0              2              2
                     │
durable delete       3              1              0*             -
                     │              ▲
                     │    ┌─────────┘
                     │    │ tombstone=1, so
                     │    │ gen_check pretends
                     │    │ generation is 0
                     │
write #3             4              0              4              4
                     │
write #4             5              0              5              5
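The table can be replayed with a tiny state machine. `TombstoneGenModel` is a hypothetical illustration of the two fields shown (`r->generation` and `r->tombstone`), not the server source: writes and durable deletes both increment the stored generation, and a generation check against a tombstone compares with 0.

```java
// Hypothetical model of the table above; not the Aerospike server source.
final class TombstoneGenModel {
    int generation;     // r->generation: keeps incrementing across a durable delete
    boolean tombstone;  // r->tombstone: set by a durable delete, cleared by a write

    void write() {
        generation++;
        tombstone = false;
    }

    void durableDelete() {
        generation++;
        tombstone = true;
    }

    // Value the generation check compares against: 0 for a tombstone.
    int genCheckSees() {
        return tombstone ? 0 : generation;
    }

    boolean genCheckPasses(int expected) {
        return genCheckSees() == expected;
    }
}
```

In this model, a writer holding the pre-delete generation (2) fails the check against the tombstone, while an expected generation of 0 passes.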

Currently, you would need to retry with CREATE_ONLY, or with EXPECT_GEN_EQUAL and a generation of 0.
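The CREATE_ONLY fallback can be sketched as a bounded retry loop that re-reads on every attempt and picks the write guard from what the read found: CREATE_ONLY when the record is absent, UPDATE + EXPECT_GEN_EQUAL when it is present. `GuardedUpsert` and its in-memory `Store` are hypothetical stand-ins for the client calls. The sketch assumes, as the application already does when it builds a profile with version = null, that the incoming change can be applied to an empty profile, and it does not model tombstone generations.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: Store is an in-memory stand-in, not the Aerospike client.
final class GuardedUpsert {
    static final class Store {
        final Map<String, Integer> gens = new HashMap<>();   // key -> generation
        final Map<String, String> values = new HashMap<>();  // key -> profile data

        // Behaves like RecordExistsAction.CREATE_ONLY: fails if the record exists.
        String createOnly(String key, String value) {
            if (gens.containsKey(key)) {
                return "KEY_EXISTS_ERROR";
            }
            gens.put(key, 1);
            values.put(key, value);
            return "OK";
        }

        // Behaves like UPDATE + EXPECT_GEN_EQUAL, using the result reported in this
        // post when the record is missing.
        String updateIfGen(String key, String value, int expectedGen) {
            Integer current = gens.get(key);
            if (current == null) {
                return "KEY_NOT_FOUND";
            }
            if (current != expectedGen) {
                return "GENERATION_ERROR";
            }
            gens.put(key, current + 1);
            values.put(key, value);
            return "OK";
        }
    }

    // Re-read, reapply the change, and write with the guard that matches the read.
    static boolean apply(Store db, String key, UnaryOperator<String> change, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Integer gen = db.gens.get(key);                         // fresh read each attempt
            String base = (gen == null) ? "" : db.values.get(key);  // empty profile if absent
            String updated = change.apply(base);
            String rc = (gen == null)
                    ? db.createOnly(key, updated)                   // record gone: create it
                    : db.updateIfGen(key, updated, gen);            // record present: check generation
            if (rc.equals("OK")) {
                return true;
            }
        }
        return false; // every attempt conflicted; the caller must not drop the data silently
    }
}
```

A concurrent create between the re-read and the CREATE_ONLY write surfaces as KEY_EXISTS_ERROR in this model, and the loop treats it like a generation conflict: re-read and try again.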

Can you please describe what you’re trying to do (your application logic)? I’m not tracking your intent.

RecordExistsAction.UPDATE means “upsert this record”, but you’re indicating that you previously read the record (otherwise you wouldn’t have a generation to compare against using GenerationPolicy.EXPECT_GEN_EQUAL), so the write comes across as a RecordExistsAction.UPDATE_ONLY.

There’s no contradiction there: you are telling the server to only apply the write if the generation of the record you previously read hasn’t changed. A deleted record has no generation (more precisely, its generation is 0). The idea behind the GenerationPolicy is for the server to refuse a write when the record has been modified by another write in between your reading it and writing it. A delete is such a write.