@lucien, is this problem now fixed? We are observing a similar (?) issue where we are getting duplicate records:
aql> select key from test.context where flag between 0 and 1
+-------------------+
| key               |
+-------------------+
| "Device000000033" |
| "Device000000030" |
| "Device000000031" |
| "Device000000031" |
| "Device000000030" |
| "Device000000034" |
| "Device000000033" |
+-------------------+
This doesn’t happen if we use asbackup to back up the database and do a clean import on e.g. a different machine with a clean install (and a new CREATE INDEX call). This leads us to believe that somehow the index got corrupted, but we have no errors or other indications of this. The only time this issue manifests itself is when we use BETWEEN.
We have inserted the keys through aerospike_key_put (using the C client library).
The operations we typically do (apart from the initial bulk insert of the records using aerospike_key_put) are:
1. We run a query that selects on as_integer_equals(0) for the in_use field (with a corresponding index). The callback for this query reads the string value in the key bin, returns it to the caller through the user-data parameter, and returns false to stop the query.
2. After the foreach query returns, we do an aerospike_key_get using the returned key to fetch that record.
3. Next, we re-write that record with the in_use field set to (integer) 1 and with AS_POLICY_GEN_EQ to ensure the record wasn’t modified in between.
4. If the latter operation fails, we retry the whole procedure (up to a defined maximum number of attempts).
Additionally, we have an operation to empty a set (having an API call for this would be highly appreciated, by the way!). It works by iterating over all the records in a set and, in its callback, re-writing each record with a TTL of 1 (we had a tombstoning bug with this previously) and subsequently calling aerospike_key_remove. We have been using this code for quite a while and have verified that it indeed removes the records. It might be that this operation has been called once (or more) on the mentioned set.
@pratyyy - note: dropping the index and recreating it resolves the issue. The question is, of course, how this issue can occur in the first place and what can be done to avoid it (preferably in the server). In the meantime, are there any means to validate the correctness of an index, or to have indexes rebuilt automatically?
We have identified the issue.
You have some un-cleanable garbage in your secondary index.
This generally happens when a user deletes and re-inserts a record in an on-disk namespace within a short interval.
And I can see that your namespace’s storage type is device.
So, can you confirm that when you say you re-insert a record, you mean a “delete + insert” operation?
To make deletes efficient in an on-disk namespace, we do not delete the sindex entry synchronously; a separate background thread cleans such entries. But if you insert the record with an updated value before this gc-thread has cleaned the old entry, that garbage cannot be cleaned unless the record is deleted again.
You will not see such results in equality queries, but you may see these records in range queries. We have identified this possibility now, and it will be fixed in a future release.
To avoid such a situation, one simple approach is to not delete the record just to re-insert it.
You can directly overwrite the record with the aerospike_key_put() API; it will automatically replace the previous value.
That way, no garbage is generated in the secondary index either.
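The overwrite-in-place approach might look roughly like this with the C client (a sketch, not a complete program: it assumes an already initialized and connected `aerospike as` instance, uses the namespace/set/bin names from the original post, and trims error handling):

```c
/* Sketch: overwrite the record in place instead of delete + insert. */
as_key key;
as_key_init_str(&key, "test", "context", "Device000000033");

as_record rec;
as_record_inita(&rec, 2);
as_record_set_str(&rec, "key", "Device000000033");
as_record_set_int64(&rec, "in_use", 0);

as_error err;
if (aerospike_key_put(&as, &err, NULL, &key, &rec) != AEROSPIKE_OK) {
    fprintf(stderr, "put failed: %s\n", err.message);
}
```

Because the record is rewritten rather than deleted first, there is no delete-then-insert window for the deferred sindex gc to fall into.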
The best way to get out of the current situation is to recreate the index, which will work well since you have very little data in your secondary indexes.
Thanks for getting back. To clarify, we have identified two operations that might be contributing to the issue we’re observing:
1. Marking as in-use - we use the procedure described under point 3 in my post above (Query by range does not return all data - #7 by bbavn). Note that this operation never deletes any records; it only does a put.
2. Deleting records - since we ran into some tombstoning problems (records reappearing after having been removed), we “fixed” this by first issuing a write with a 1-second TTL, immediately followed by an aerospike_key_remove. This seems to have stopped the records from reappearing but, if I read you correctly, may have caused the index corruption?
Note: these two operations are distinct and do not happen in the same flow; they might, however, run at the same time (although the record-deletion path is not very common in our system).
That said, I’m not quite sure how to move forward with your advice. Is there any trick we can pull using the C client library?
Apologies for the late reply.
As I said earlier, we have found the issue causing wrong results in secondary index queries. It will be fixed in a future release. Thanks again for bringing this to our notice.
Until then, you can apply some workarounds to avoid getting into this situation:
1. Do not keep the window between deleting and re-inserting the same primary key very small. For example:
   - The user deletes a record with key ‘Car’, bin ‘Ford’.
   - The user inserts a record with key ‘Car’, bin ‘Toyota’.
   I understand if this is hard to do; it is very application-specific.
2. You can increase the gc speed of the secondary index.
An update regarding the bug -
Unfortunately, you will not be able to get the required fix in the next release.
But if you are blocked on this, we can provide you with a dev build containing the fix.
Hi @pratyyy, thanks for circling back. We haven’t been able to try the suggestion yet as we’ve been focusing on a few other points so there is no urgency at this moment for a dev build (thanks for the offer, though!).
I’ll update this ticket as soon as we have more results.