We saw data loss when we added two new nodes (at the same time) to a 3-node cluster during live traffic. Our Aerospike cluster is using shadow devices.
We started these two new nodes from an 8-hour-old backup (EBS snapshot). After the migrations completed, we observed that a few records had lost the last 8 hours of data. I suspect that older records (from the backup) replaced newer records during migration, but I don't know how.
Did this happen because we added both boxes at the same time?
Hi kporter, I was not upgrading the Aerospike version. I only added two new boxes with an 8-hour-old backup, which led to the loss of a few new records (actually, these new records were replaced by older records from the backup).
I'm using the default conflict-resolution-policy, which is generation.
What is your server version? What generation values did you have? Take a few suspect records, read them in aql with record_print_metadata set to true, and check the metadata. Were you on a generation rollover cusp, wrapping from 65K back to 0?
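For reference, a minimal aql sketch of that check (the namespace test, set demo, and the key are placeholders; substitute your own):

```
aql> SET RECORD_PRINT_METADATA true
aql> SELECT * FROM test.demo WHERE PK = 'suspect-key'
```

With record_print_metadata enabled, the output includes each record's generation and TTL alongside the bin values, so you can see how close the generation is to the 65K wrap.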
From the conflict resolution policy link:
The generation value can wrap back to 0 on a record with a high update rate (maximum of 65K generations per record). On cold start, a previous copy with a higher generation number may be re-indexed, leading to stale data being available. last-update-time is the recommended value.
It makes sense to change this setting to last-update-time. Thanks for the suggestion.
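For anyone following along, this is a per-namespace setting in the server configuration; a minimal sketch, assuming a namespace named test (the rest of the stanza stays as-is):

```
namespace test {
    # existing storage, replication, and memory settings unchanged
    conflict-resolution-policy last-update-time
}
```

With this policy, conflicts during migrations and cold start are resolved in favor of the copy with the most recent last-update-time rather than the higher generation.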
In our case, records are updated more than 64K times, yet the generation stays below 64K. I guess that is because of the generation wrapping behavior.
Now we have another issue related to this in our Java client, where we try to determine whether a write was an insert or an update based on the generation number returned by the client. Per your suggestion, we may end up seeing a generation of 1 multiple times for the same record. In that case, how should we rely on the Java client to tell whether a transaction was an insert or an update? Our counters went wrong with this client logic.
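For context, here is a hedged sketch, assuming the standard Aerospike Java client (the class and helper names are illustrative), of detecting insert vs. update from a create-only write instead of from the generation value, so it keeps working across the wrap:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.ResultCode;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

public class InsertOrUpdateCounter {

    // Returns true if this write created the record, false if it updated an
    // existing one. The decision comes from the server's create-only check,
    // not from generation == 1, so it is unaffected by generation wrap.
    static boolean writeAndDetectInsert(AerospikeClient client, Key key, Bin... bins) {
        WritePolicy createOnly = new WritePolicy(client.writePolicyDefault);
        createOnly.recordExistsAction = RecordExistsAction.CREATE_ONLY;
        try {
            client.put(createOnly, key, bins);
            return true; // record did not exist before: count it as an insert
        } catch (AerospikeException ae) {
            if (ae.getResultCode() != ResultCode.KEY_EXISTS_ERROR) {
                throw ae; // a real failure, not "record already exists"
            }
        }
        // Record already existed: write it as a plain update.
        WritePolicy update = new WritePolicy(client.writePolicyDefault);
        update.recordExistsAction = RecordExistsAction.UPDATE_ONLY;
        client.put(update, key, bins);
        return false;
    }
}
```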
It will wrap at 65K. Using a conflict-resolution-policy of last-update-time will resolve conflicts based on the record's last update time instead. Read-modify-write transactions use a generation-equal policy, which works fine across the wrap.
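A minimal sketch of such a read-modify-write with the Java client's generation-equal check (the "count" bin and retry loop are assumptions for illustration):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.ResultCode;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

public class GenerationCheckedIncrement {

    // Read the record, compute the new value, and write it back only if the
    // generation is still the same. Equality (not greater/less than) is what
    // matters here, so the check keeps working across the 65K wrap.
    static void increment(AerospikeClient client, Key key) {
        while (true) {
            Record record = client.get(null, key);
            long current = (record != null) ? record.getLong("count") : 0L;

            WritePolicy wp = new WritePolicy(client.writePolicyDefault);
            if (record != null) {
                wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
                wp.generation = record.generation;
            }
            try {
                client.put(wp, key, new Bin("count", current + 1));
                return;
            } catch (AerospikeException ae) {
                if (ae.getResultCode() != ResultCode.GENERATION_ERROR) {
                    throw ae; // not a concurrent-write conflict
                }
                // Someone else wrote in between; re-read and retry.
            }
        }
    }
}
```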
By the way, I overlooked the fact that the asrestore utility will still resolve based on generation. For asrestore, you should use the create-only policy, ‘--unique’, to avoid this scenario.
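For example (the host and backup path are placeholders):

```
asrestore --host 127.0.0.1 --directory /path/to/backup --unique
```

With --unique, asrestore only creates records that do not already exist, so it will not overwrite records written after the snapshot was taken.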