Data loss when adding new boxes to a cluster


#1

We saw data loss when we added two new nodes (at the same time) to a 3-node cluster during live traffic. Our Aerospike cluster uses shadow devices.

We started these two new nodes from an 8-hour-old backup (EBS snapshot). After migrations completed, we observed that a few records had lost the last 8 hours of data. I suspect the older records (from the backup) replaced newer records during migration, but I don’t know how.

Did this happen because we added both boxes at the same time?


#2

What version were you upgrading from and to?

Which conflict-resolution-policy are you using?


#3

Hi kporter, I was not upgrading the Aerospike version. I only added two new boxes with an 8-hour-old backup, which led to the loss of a few new records (actually, these new records were replaced by older records from the backup).

I’m using the default conflict-resolution-policy, which is generation.


#4

What is your server version? What generation values did you have? Take a few suspect records, set record_print_metadata to true in aql, and read them back. Were you on the generation rollover cusp, where the value wraps from 65K back to 0?
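An aql session for that check would look roughly like this (namespace, set, and key names are placeholders):

```
aql> set record_print_metadata true
aql> select * from myns.myset where PK = 'suspect-key-1'
```

With record_print_metadata enabled, aql prints the record’s metadata, including its generation and TTL, alongside the bins.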

From the conflict resolution policy link:

The generation value could wrap back to 0 on a record with a high update rate (max of 65K generations per record). On cold start, a previous copy with a higher generation number may be re-indexed and lead to stale data being available. last-update-time is the recommended value.


#5

We are using 3.15.1.3.

I went through this link https://www.aerospike.com/docs/reference/configuration#conflict-resolution-policy and it makes sense to change this field to last-update-time. Thanks for the suggestion.
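For reference, this setting goes in the namespace stanza of aerospike.conf (the namespace name here is a placeholder):

```
namespace myns {
    # Prefer the copy with the most recent last-update-time during migration
    conflict-resolution-policy last-update-time
    # ... other namespace settings ...
}
```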

In our case, records are updated more than 64K times, yet the generation stays below 64K. I guess that is because of the generation wrapping behavior.

Now we have a related issue in our Java client, where we try to determine whether a transaction was an insert or an update based on the generation number returned by the client. As per your suggestion, we may end up getting a record generation number of 1 multiple times. In this case, how should we rely on the Java client to tell whether the transaction was an insert or an update? Our counters went bad with this client logic.
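To illustrate why that heuristic breaks, here is a minimal sketch of the problem. This is my own model of the counter, not Aerospike client code; the wrap limit and wrap-to-1 behavior are assumptions based on the 65K figure discussed in this thread.

```java
// Sketch: why "generation == 1 means insert" is unreliable once the
// generation counter wraps. GEN_MAX and wrapping back to 1 are
// illustrative assumptions, not verified server behavior.
public class GenerationWrap {
    static final int GEN_MAX = 65535; // ~65K, per the thread

    // Hypothetical client-side heuristic described in this thread.
    static boolean looksLikeInsert(int generation) {
        return generation == 1;
    }

    // Model of the server advancing the generation on each update.
    static int nextGeneration(int generation) {
        return generation >= GEN_MAX ? 1 : generation + 1;
    }

    public static void main(String[] args) {
        int gen = 1; // record created: the first write yields generation 1
        for (int i = 0; i < GEN_MAX; i++) {
            gen = nextGeneration(gen); // 65,535 updates later...
        }
        // The counter has wrapped back to 1, so this update is
        // indistinguishable from an insert by the heuristic.
        System.out.println(looksLikeInsert(gen)); // prints "true"
    }
}
```

In other words, generation alone cannot distinguish an insert from a heavily updated record that has wrapped; a separate marker bin or the client’s create-only write policies are more reliable signals.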


#6

The record generation will continue to advance on update when using ‘last-update-time’.


#7

Thanks @kporter. In that case, up to what maximum can the record generation increase? Does it ever get reset to 1?


#8

It will wrap at 65K. Using a conflict-resolution-policy of last-update-time will resolve conflicts based on that timestamp. Read-modify-write transactions use a generation-equal policy, which will work fine across the wrap.
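A generation-checked read-modify-write with the Java client looks roughly like this sketch (it assumes the Aerospike Java client on the classpath and a reachable server; the host, namespace, set, key, and bin names are placeholders):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.GenerationPolicy;
import com.aerospike.client.policy.WritePolicy;

public class GenEqualRmw {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        Key key = new Key("myns", "myset", "some-key");

        Record record = client.get(null, key);        // read
        long counter = record.getLong("counter") + 1; // modify

        WritePolicy wp = new WritePolicy();
        // Write only if the generation still equals what we read;
        // equality holds across the 65K wrap, so this stays safe.
        wp.generationPolicy = GenerationPolicy.EXPECT_GEN_EQUAL;
        wp.generation = record.generation;
        client.put(wp, key, new Bin("counter", counter)); // write

        client.close();
    }
}
```

If another writer updates the record in between, the put fails with a generation error and the read-modify-write can be retried.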

By the way, I overlooked the fact that the asrestore utility will still resolve conflicts based on generation. For asrestore, you should use the create-only policy, ‘--unique’, to avoid this scenario.
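An invocation would look something like this (the backup directory path and namespace are placeholders):

```
asrestore --directory /backups/myns-backup --namespace myns --unique
```

With --unique, asrestore only creates records that do not already exist, so restored records can never overwrite records written since the backup was taken.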


#9

@kporter I got it. ‘--unique’ will work in the case of asrestore.

But I still didn’t get an answer to the question asked here: Data loss when added new box in cluster

As per your suggestion, we may end up getting a record generation number of 1 multiple times. In this case, how should we rely on the Java client to tell whether the transaction was an insert or an update? Our counters went bad with this client logic.


#10

I thought my reply above was answering that question.

If not, could you try to clarify what is being asked?


#11

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.