Aerospike behavior when node dies


#1

Hello!

We are thinking about replacing MongoDB on Aerospike. We want to have different installation types with replication factor 1 and 2.

The question is about behavior of Aerospike when one node from the cluster goes down. And then come back in few minutes (15-40 for e. g.).

  • What we will see on client side if we will request data stored on alive nodes?
  • What we will see on client side if we will request data stored on dead node?
  • How cluster will be reconfigured and how it will be restored after restoring node?
  • What happened if we try to write data which should be stored on dead node?

#2

Thanks for posting to on our forum!

You can find details about Aerospike’s data distribution and rebalancing mechanism on our site, specifically on this page.

To your questions:

  • With replication factor 1, if a node goes down the data stored on it will not be available (roughly 1/N of the data, with N the number of nodes in the cluster). With replication factor 2, the data will be available from the node holding the replica copy of the data that was held by the node holding the master copy. The data will also rebalance across all the nodes and you would still end up with 2 copies (assuming you have enough storage capacity on the remaining nodes of course). I encourage you to read also the per-transaction consistency guarantees.

  • The client fetches a partition map every second from all nodes in the cluster (a partition map indicates, for each partition, which node holds the master copy and which node holds the replica). So when a node goes down, the client will within a second max get the new partition map and will not issue any request to a dead node. If the data has not rebalanced yet, request will be proxied to the node holding the data.

  • When a node leaves or joins a cluster, a new partition map is generated and the data rebalances accordingly across the nodes (we call this ‘migrations’). During migrations, data may not ‘yet’ be on the node it’s supposed to be, which will cause such requests to ‘proxy’ to the right node.

  • When a node ‘dies’ or leaves the cluster, a new partition map is generated which the client will get within 1 second and will then write the data to the new node holding the master copy for that data. You can tune the number of retries and the time out per request on the client side.

Hope this helps, let us know otherwise! –meher


#3

Thanks for answers!

I read documents you recommend and still have a question.

Lets say I have cluster with 4 nodes and replication factor 1. One node died. As you explain, in a second a new partition map will be generated and distributed across clients. But what it really mean? Does it mean that I lose all data from this node? What if it just power outage and I restore node in an hour with all old data? What will happened? Cluster clean data on this node or understand that it is the old node from the same cluster with previous data and restore data from it? Something else?

Why I’m asking is because on Mongo, which we currently use, partitions are store on separate servers and when one node died we just can’t read and write this portion of data. But after restoring (from backup for e. g.) we restore full functional cluster with whole data set.


#4

Does it mean that I lose all data from this node?

No !!

What if it just power outage and I restore node in an hour with all old data? What will happened? Cluster clean data on this node or understand that it is the old node from the same cluster with previous data and restore data from it? Something else?

It will see the data in the old node as a version of partition. What this means is, the cluster which is newly formed will have two versions of the same partition: one with the node which died, and second one which was created when the node died. The process of migration would cause these partitions to get merged. At the end of the migration, you will have data from both partitions in the final copy; in case same record was found in both versions, the resolution happens based on generation or TTL. Generation.

– R


#5

Thank you for the answer!

Process of migration would cause these partitions to get merged. At the end of the migration, you will have data from both partitions in the final copy; in case same record was found in both versions, the resolution happens based on generation or TTL. Generation.

Let me clarify conflict resolution process.

I have node n1 and n2 and write a record with key k1. Lets assume a partition, where the record was written was on n1 node. Then I update it two times and have generation 3 in records metadata. Then n1 node died and all partitions were re-created on n2 node with no data. Then I try to update k1 again and because of empty partitions this operation will create a new record with generation 1. Later I restore n1 node with all data and Aerospike will find two records with the same key. As I understand in this case Aerospike will prefer old record because of bigger generation (3 vs. 1), but if old record expired by TTL than choose second one. Is my understanding correct?

Do you have merging process description in documentation?


#6

Your understanding is correct.

Even if none of the records may expire, if the resolution policy is TTL then the records which would expire later is kept. Say copy1 expires tomorrow at 11:00 am and copy2 expires at 1:00pm. The second copy will win.

Fundamental semantics is generation is picked if need is more information (higher generation translates to more number of updates). TTL is picked if need is latest information (given a condition that record written later has higher TTL)

You may check the following article on how to set this:

– R