I have a record UDF that is supposed to delete only records with a specified number of bins. I have a record that the Python API shows as having two bins, but when I process the record via Lua I get:
record.numbins(rec) → 3
and when I display the values of the bins I see (this is the unexpected third bin):
Sep 15 2022 21:14:20 GMT: DEBUG (udf): (/opt/aerospike/usr/udf/lua/prune_fsegs.lua:7) bin 1 name = N nil
This bin does not get returned by the Python client, and I was under the impression that it was a core Aerospike principle that bins cannot contain "nil". Is there something I am missing, and do I need to iterate over the bins to check whether the count includes empty bins?
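Since the original prune_fsegs.lua is not shown here, the following is only a sketch of the kind of loop that could produce the log line above; the function name and message format are assumptions, not the actual UDF:

```lua
-- Sketch only: the real prune_fsegs.lua is not shown in this thread.
-- Logs each bin's name and value; a bin with no value renders as "nil".
function show_bins(rec)
    local names = record.bin_names(rec)           -- bin names as a Lua list
    for i = 1, list.size(names) do
        local name = names[i]
        debug("bin %d name = %s %s", i, tostring(name), tostring(rec[name]))
    end
    return record.numbins(rec)                    -- may count "empty" bins too
end
```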
There is a question as to whether the record once had an “N” bin and the simple answer is I don’t know!
One possibility is that the states that Python and Lua are seeing are not the same. For example, the Lua has:
- `rec[bin] = nil`
- `n = record.numbins(rec)` (the Lua log shows this in-UDF state)
- (the Python client sees the state after the UDF completes)
As always, a reproducible case will be great to have.
I have a suspicion but I do not have a good way to replicate the case. Fortunately this affects only a small percentage of records, and even more fortunately it leads to "false negatives", i.e., the UDF fails to delete records that we wanted deleted, as opposed to false positives, which would have been a showstopper.
What is odd is that when I have a record in this state the problem is reproducible, in the sense that a "get" issued by AQL or the Python client shows two bins while manually applying the UDF shows three, including the "empty" bin, which rules out a race.
We recently simplified our writes so that we simply overwrite bins, but earlier we were using operations on CDTs that modified the map in place. I had a theory that this possibly left the bin in a state that Lua reads as "present but empty" and the other APIs as non-existent. However, I am not 100% sure I buy this, as it affects our dev environment, which we regularly truncate, so there is a limit to how old any record can be.
Are you using XDR 5.0 feature: bi-directional XDR with bin-convergence?
No, in fact this test cluster does not actually replicate (we have XDR enabled just to be able to test dynamic configuration changes).
In the XDR config, do you have ship-bin-luts set to true? (It may not matter if you are actually shipping to a destination.) What about conflict-resolve-writes true for the namespace?
Okay, it does appear to be XDR-related. I tested 100,000 entries on four clusters: the two that do not have XDR enabled showed no examples, and the two that did had examples. The background of the UDF in question was that it was deciding which records it could delete, and since the "failure" rate was < 0.2% it was not worth tracking down, although I could have tested the bin for nil.
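Given the root cause, a cheap workaround is to count only bins whose value is non-nil before deciding to delete. A hedged sketch (the function name and threshold parameter are invented here for illustration, not taken from the original UDF):

```lua
-- Sketch: skip nil-valued bins when counting, so the record is still
-- deleted even if a lingering bin tombstone inflates numbins().
function prune_if_nbins(rec, wanted)
    local names = record.bin_names(rec)
    local live = 0
    for i = 1, list.size(names) do
        if rec[names[i]] ~= nil then              -- tombstoned bins read as nil
            live = live + 1
        end
    end
    if live == wanted then
        aerospike:remove(rec)                     -- delete the record
    end
end
```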
To give more context, our XDR config does specify bin-policy=changed-and-specified, which implies that there was a bin tombstone that some clients interpreted as no value (e.g., Python, and I am guessing that AQL is based on the C client), whereas the Lua UDF saw the bin as present.
To answer the previous question in the interest of completeness:
"In the XDR config - do you have ship-bin-luts set to true? (It may not matter if you are actually shipping to destination.) what about conflict-resolve-writes true for the namespace?"
No and no.
OK, that makes sense. When you want the server to ship "changed" bins and you set a bin to nil to delete it, the server has to keep a bin tombstone around so it can ship the bin deletion. The tombstone is removed on the next record update after it has been shipped and its default life of one day has elapsed. If there is no subsequent record update after xdr-bin-tombstone-ttl (1 day), the tombstone hangs around in the record.
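To tie the parameters discussed in this thread together, here is an illustrative configuration fragment (the DC and namespace names are placeholders; check your server version's documentation for the exact contexts and defaults):

```
xdr {
    dc dc1 {
        namespace test {
            # ship only changed bins (plus any explicitly specified ones)
            bin-policy changed-and-specified
            ship-bin-luts false
        }
    }
}

namespace test {
    # how long a shipped bin tombstone lingers before it can be reaped
    # on a subsequent record update (default 86400 s = 1 day)
    xdr-bin-tombstone-ttl 86400
}
```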