Data loss when adding new boxes to a cluster


#1

We saw data loss when we added two new nodes (at the same time) to a 3-node cluster during live traffic. Our Aerospike cluster uses shadow devices.

We started these two new nodes from an 8-hour-old backup (EBS snapshot). After migrations completed, we observed that a few records had lost the last 8 hours of data. I suspect the older records (from the backup) replaced newer records during migration, but I don’t know how.

Did this happen because we added both boxes at the same time?


#2

What version were you upgrading from and to?

Which conflict-resolution-policy are you using?


#3

Hi kporter, I was not upgrading the Aerospike version. I only added two new boxes with an 8-hour-old backup, which led to the loss of a few new records (actually, these new records were replaced by older records from the backup).

I’m using the default conflict-resolution-policy, which is generation.


#4

What is your server version? What generation values did you have? Take a few suspect records, set record_print_metadata to true in aql, and read them back. Were you on the generation rollover cusp, where the value wraps from 65K back to 0?
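An aql session for that check would look roughly like this (namespace, set, and key names are placeholders):

```
aql> set record_print_metadata true
aql> select * from myns.myset where PK = 'suspect-key-1'
```

With record_print_metadata enabled, aql prints the record’s metadata, including its generation and TTL, alongside the bins.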

From the conflict resolution policy link:

The generation value could wrap back to 0 on a record with a high update rate (max of 65K generations per record). On cold start, a previous copy with a higher generation number may be re-indexed and lead to stale data being available. last-update-time is the recommended value.


#5

We are using 3.15.1.3.

I went through this link https://www.aerospike.com/docs/reference/configuration#conflict-resolution-policy and it makes sense to change this field to last-update-time. Thanks for the suggestion.
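For reference, this setting goes in the namespace stanza of aerospike.conf (the namespace name here is a placeholder):

```
namespace myns {
    # Prefer the copy with the most recent last-update-time during migration
    conflict-resolution-policy last-update-time
    # ... other namespace settings ...
}
```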

In our case, records are updated more than 64K times, yet the generation stays below 64K. I guess that is because of the generation wrapping behavior.

Now we have a related issue in our Java client, where we try to determine whether a transaction was an insert or an update based on the generation number returned by the client. As per your suggestion, we may end up getting a record generation number of 1 multiple times. In this case, how should we rely on the Java client to tell whether the transaction was an insert or an update? Our counters went bad with this client logic.
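To illustrate why that heuristic breaks, here is a minimal sketch of the problem. This is my own model of the counter, not Aerospike client code; the wrap limit and wrap-to-1 behavior are assumptions based on the 65K figure discussed in this thread.

```java
// Sketch: why "generation == 1 means insert" is unreliable once the
// generation counter wraps. GEN_MAX and wrapping back to 1 are
// illustrative assumptions, not verified server behavior.
public class GenerationWrap {
    static final int GEN_MAX = 65535; // ~65K, per the thread

    // Hypothetical client-side heuristic described in this thread.
    static boolean looksLikeInsert(int generation) {
        return generation == 1;
    }

    // Model of the server advancing the generation on each update.
    static int nextGeneration(int generation) {
        return generation >= GEN_MAX ? 1 : generation + 1;
    }

    public static void main(String[] args) {
        int gen = 1; // record created: the first write yields generation 1
        for (int i = 0; i < GEN_MAX; i++) {
            gen = nextGeneration(gen); // 65,535 updates later...
        }
        // The counter has wrapped back to 1, so this update is
        // indistinguishable from an insert by the heuristic.
        System.out.println(looksLikeInsert(gen)); // prints "true"
    }
}
```

In other words, generation alone cannot distinguish an insert from a heavily updated record that has wrapped; a separate marker bin or the client’s create-only write policies are more reliable signals.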


#6

The record generation will continue to advance on update when using ‘last-update-time’.


#7

Thanks @kporter. In that case, up to what maximum can the record generation increase? Does it ever get reset to 1?


#8

It will wrap at 65K. Using a conflict-resolution-policy of last-update-time will resolve conflicts based on that timestamp. Read-modify-write transactions use a generation-equal policy, which will work fine across the wrap.
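A generation-checked read-modify-write with the Java client looks roughly like this sketch (it assumes the Aerospike Java client on the classpath and a reachable server; the host, namespace, set, key, and bin names are placeholders):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

public class GenEqualRmw {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("myns", "myset", "some-key");

        Record record = client.get(null, key);        // read
        long counter = record.getLong("counter") + 1; // modify

        WritePolicy wp = new WritePolicy();
        // Write only if the generation still equals what we read;
        // equality holds across the 65K wrap, so this stays safe.
        wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
        wp.generation = record.generation;
        client.put(wp, key, new Bin("counter", counter)); // write

        client.close();
    }
}
```

If another writer updates the record in between, the put fails with a generation error and the read-modify-write can be retried.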

By the way, I overlooked the fact that the asrestore utility will still resolve conflicts based on generation. For asrestore, you should use the create-only policy, ‘--unique’, to avoid this scenario.
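An invocation would look something like this (the backup directory path and namespace are placeholders):

```
asrestore --directory /backups/myns-backup --namespace myns --unique
```

With --unique, asrestore only creates records that do not already exist, so restored records can never overwrite records written since the backup was taken.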


#9

@kporter I got it. ‘--unique’ will work in the case of asrestore.

But I still didn’t get an answer to the question asked here: Data loss when added new box in cluster

As per your suggestion, we may end up getting a record generation number of 1 multiple times. In this case, how should we rely on the Java client to tell whether the transaction was an insert or an update? Our counters went bad with this client logic.


#10

I thought my reply above was answering that question.

If not, could you try to clarify what is being asked?


#11

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.