Losing records after node fails


#1

Hi guys! We have been using Aerospike in our company for quite a while now. Currently it is a cluster of 11 nodes holding more than 4,000,000,000 records.

Yet we’ve encountered a strange problem: when one node fails (an SSD breaks, or the node hangs) we lose a portion of our records. Last time we lost approximately 20% of them. We estimate the loss by comparing a backup taken after migrations complete against the previous daily backup. We are using Aerospike 3.3.21 with replication-factor=2. So, what’s wrong?
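For what it's worth, the comparison described above can be sketched as a simple set difference over record keys. This is a hypothetical illustration (it assumes you can export one key per record from each backup, e.g. with a small script over the backup files; the helper name and sample keys are made up):

```python
# Hypothetical sketch: count records present in the pre-failure backup
# but absent from the post-migration backup, given iterables of keys.

def missing_count(before_keys, after_keys):
    # Build a set of the surviving keys, then count what disappeared.
    after = set(after_keys)
    return sum(1 for key in before_keys if key not in after)

# Illustrative sample data (not real Aerospike keys):
before = ["k1", "k2", "k3", "k4", "k5"]
after = ["k1", "k3", "k5"]
print(missing_count(before, after))  # 2
```

Dividing the result by the size of the pre-failure key set gives the loss percentage quoted above.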


#2

I see that you are comparing with backups taken after migrations complete, but I am not sure what you are comparing them to. If you are comparing them to backups taken while migrations were in progress, it is expected that those backups will contain missing and duplicated records.

Does your data have an expiration set? If so, did your nodes exceed a high-water threshold and start evicting?
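One way to check this: `asinfo -v "namespace/<ns>"` returns a semicolon-separated `key=value` string of namespace statistics. A minimal sketch of parsing it to inspect the eviction-related fields (the sample string and exact field names are illustrative and vary by server version):

```python
# Hedged sketch: parse the semicolon-separated key=value output of
# asinfo -v "namespace/<ns>" into a dict and look at eviction counters.

def parse_info(raw):
    return dict(pair.split("=", 1) for pair in raw.strip().split(";") if pair)

# Illustrative sample output, not captured from a real cluster:
sample = "objects=4000000000;evicted_objects=0;hwm-disk-pct=50;free-pct-disk=45"
stats = parse_info(sample)
print(stats["evicted_objects"])  # 0
```

A non-zero `evicted_objects` after the incident would point at eviction rather than the node failure itself.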


#3
  1. I am comparing them to backups taken the day before the accident (migration was not in progress)
  2. Our data have no expiration. Avail pct = 40…45% in cluster

#4

Could you provide your aerospike.conf file?

Also which OS are you running on?

Which Aerospike features are you using?

  • Key/value reads and writes?
  • Scans?
  • Queries/secondary indexes?
  • Complex data types such as list and map?
  • LDTs (LList)?

Anything you can tell me about the data that was missing?
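For reference, the parts of the config most relevant here are the namespace stanza's replication and expiration settings. A hypothetical stanza with the values mentioned in this thread (replication-factor 2, no expiration); device paths and sizes are made-up placeholders:

```
namespace example {
    replication-factor 2
    memory-size 8G
    default-ttl 0          # 0 = records never expire
    high-water-disk-pct 50 # eviction threshold on disk usage
    storage-engine device {
        device /dev/sdb    # placeholder device path
        write-block-size 128K
    }
}
```

Posting your actual stanza (with any sensitive values redacted) would let us confirm the replication and eviction settings.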