Updates disappear after recovering nodes from a crash

We have a 6-node cluster with replication factor 3 (Aerospike CE 3.6.0). The namespace is configured to store data in memory with a data persistence file. Data is updated once a day at midnight. During a simulation of a massive hardware failure we killed 3 nodes (approx. 10 hours after the last update). When we (re)started the nodes, old data reappeared for some keys. This affects approx. 3% of keys and persists even after all migrations have finished. We're using the Python client and only simple puts/gets.

We are working on a more detailed test (logging the generation number and storing the timestamp of each update). Could you give me some hints on how to track down this issue?
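For reference, a minimal sketch of the kind of logging we have in mind with the Python client (host, namespace, set, bin and key names below are placeholders, not our real ones):

```python
import time
import aerospike

# Placeholder cluster address; adjust for the real nodes.
config = {'hosts': [('10.0.0.1', 3000)]}
client = aerospike.client(config).connect()

def put_with_audit(key_tuple, bins):
    """Write the record, then log the server generation, TTL and a client-side timestamp."""
    client.put(key_tuple, bins)
    _, meta, _ = client.get(key_tuple)   # meta is {'gen': ..., 'ttl': ...}
    print('%s wrote gen=%d ttl=%d at %s' % (
        key_tuple[2], meta['gen'], meta['ttl'],
        time.strftime('%Y-%m-%dT%H:%M:%S')))

def audit(key_tuple):
    """Read the record back after recovery and log its metadata for comparison."""
    _, meta, bins = client.get(key_tuple)
    print('%s read gen=%d ttl=%d bins=%r' % (key_tuple[2], meta['gen'], meta['ttl'], bins))

key = ('test', 'daily', 'some-key')      # placeholder namespace/set/key
put_with_audit(key, {'value': 42, 'updated_at': int(time.time())})
audit(key)
client.close()
```

The idea is that after the restart, any key whose generation or stored timestamp has gone backwards is one of the affected keys.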

  1. Does your application always set the same TTL?

    As is well known, Aerospike currently cannot persist deletes: see How to ensure that deleted data from set does not come back?

    Setting a lower TTL, so that the current copy may expire before an older invalidated copy on disk does, has very much the same characteristics as a delete (a minimal example of setting a per-record TTL follows this list).

  2. Though I would expect the system to have fsynced within those 10 hours, you may want to try setting fsync-max-sec to 600 (a namespace configuration sketch is included at the end of this reply).
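For point 1, a per-record TTL is set on each write through the Python client's meta argument; a minimal sketch (namespace, set, bin names and the 12-hour value are assumptions for illustration, not taken from the original post):

```python
import aerospike

config = {'hosts': [('10.0.0.1', 3000)]}   # placeholder cluster address
client = aerospike.client(config).connect()

key = ('test', 'daily', 'some-key')        # placeholder namespace/set/key
# meta['ttl'] is the record expiration in seconds; 12 hours here means a stale
# copy revived from the persistence file would expire before the next daily load.
client.put(key, {'value': 42}, meta={'ttl': 12 * 3600})

_, meta, _ = client.get(key)
print('remaining ttl: %d seconds' % meta['ttl'])
client.close()
```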

Look forward to seeing the analysis of the test.
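For point 2, fsync-max-sec belongs in the namespace's storage-engine stanza of aerospike.conf; a hedged sketch (namespace name, file path and sizes are placeholders, not your actual configuration):

```
namespace test {
    replication-factor 3
    memory-size 8G
    default-ttl 0                           # 0 = records never expire

    storage-engine device {
        file /opt/aerospike/data/test.dat   # persistence file backing the in-memory data
        filesize 16G
        data-in-memory true
        fsync-max-sec 600                   # force an fsync of the file at least every 10 minutes
    }
}
```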

Thanks for the answer, @kporter.

  1. The namespace has default-ttl 0 on the server side. In the application we don't use any other TTL settings. We know the behaviour of deletes, but we don't perform any delete operations on these keys (at least not for a few months).
  2. We will try setting fsync-max-sec explicitly - thanks for the hint.

I will post the results of the next round of the test.

We aren’t able to replicate this issue now. Similar behaviour can be seen when there isn’t enough space and some records have been evicted. Feel free to delete this topic - I don’t want to scare other users :wink: Anyway, thanks for your hints.