Preventing data loss during data migration


#1

hi,

I found the statement below at http://www.aerospike.com/docs/operations/upgrade/aerospike/

Interrupting a cluster during data migration may cause data loss. To be fully ACID-compliant, you should wait for all data migration to complete

What’s the exact meaning of “interrupting”? Does it mean adding or removing nodes during data migration?

If I have a 3-node cluster with a replication factor of 2, what’s the best policy for adding or removing a node without data loss?

thx


#2

Adding a node isn’t a problem. Removing a node while there are ongoing migrations can result in data loss.

  • Adding a node is safe.
  • Before removing a node, you should always ensure there aren’t any ongoing migrations. To expand on that, if you plan to remove multiple nodes, you should wait for migrations to complete between each node removal.
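
The removal procedure above can be sketched as a simple polling loop. This is only an illustration under stated assumptions, not Aerospike tooling: `pending_migrations` and `remove_node` are hypothetical callables you would have to supply yourself (for example, by reading each node's migration statistics and by stopping the service on the node).

```python
import time
from typing import Callable, Iterable, List


def wait_for_migrations(
    nodes: Iterable[str],
    pending_migrations: Callable[[str], int],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until every node reports zero pending migrations.

    `pending_migrations` is a hypothetical hook: given a node address, it
    returns how many migrations that node still has outstanding.
    """
    nodes = list(nodes)
    waited = 0.0
    while True:
        busy = [n for n in nodes if pending_migrations(n) > 0]
        if not busy:
            return
        if waited >= timeout_seconds:
            raise TimeoutError(f"migrations still running on: {busy}")
        sleep(poll_seconds)
        waited += poll_seconds


def remove_nodes_safely(
    cluster: List[str],
    to_remove: Iterable[str],
    pending_migrations: Callable[[str], int],
    remove_node: Callable[[str], None],
    **wait_kwargs,
) -> List[str]:
    """Remove nodes one at a time, waiting for migrations around each removal."""
    remaining = list(cluster)
    for node in to_remove:
        # Never take a node out while data is still being rebalanced.
        wait_for_migrations(remaining, pending_migrations, **wait_kwargs)
        remove_node(node)  # placeholder: stop the service on that node, etc.
        remaining.remove(node)
    # Let the migrations triggered by the last removal finish as well.
    wait_for_migrations(remaining, pending_migrations, **wait_kwargs)
    return remaining
```

The key design point is that the wait happens before every single removal, not once for the whole batch, which matches the "wait between each node removal" advice.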

#3

@kporter Thanks.

According to your comments, data will be lost if any 2 nodes crash, even if there are 10 nodes in total?


#4

Yes, for replication factor 2, losing 2 random nodes will result in data loss. The amount of data lost decreases as the cluster size increases. There are replication models used in other systems that decrease the probability of data loss as the cluster size increases, but increase the amount of data lost in an event that does result in data loss.

For our replication model, the fraction of data lost with a 2-node failure and replication factor 2 can be calculated as 2/(n(n-1)), where n is the number of nodes in the cluster.
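
To make the formula concrete, here is a small sketch that evaluates it for a few cluster sizes. It assumes (my assumption, not stated above) that partitions are spread evenly over all pairs of nodes, in which case 2/(n(n-1)) is simply 1 divided by the number of node pairs. The function name is made up for illustration.

```python
from fractions import Fraction
from math import comb


def fraction_lost_rf2(n: int) -> Fraction:
    """Fraction of data lost when 2 nodes fail, with replication factor 2."""
    if n < 2:
        raise ValueError("need at least 2 nodes")
    return Fraction(2, n * (n - 1))


for n in (3, 5, 10):
    f = fraction_lost_rf2(n)
    # Same value as "one out of all possible node pairs".
    assert f == Fraction(1, comb(n, 2))
    print(f"n={n}: {f} of the data ({float(f):.2%})")
```

So a 3-node cluster would lose 1/3 of its data in that event, while a 10-node cluster would lose 1/45, roughly 2.2%.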

Hope this helps.


#5

I have an additional question: suppose we have a cluster with a persistent namespace (RAM+HDD) under a read-only load. Can data loss happen when node(s) or the network fail during migration?


#6

@manana With replication factor 2, if 2 nodes are unable to respond to requests for whatever reason, the data shared by those nodes will not be reachable (2/(n(n-1)) of the data).


#7

@kporter, yes, this is absolutely clear… I am asking about a scenario where the whole cluster is restarted node by node without waiting for migrations to complete. At any time at most one node is down, and the data is read-only and persisted on HDD. I’ve heard that in a 6-node read-only cluster with replication factor 3, this caused permanent data loss (part of the data was not reachable even when all nodes were up and migrations were complete).

Unfortunately I have no precise details about this issue. I will try to reproduce it.


#8

Hi guys, I wanted to understand what will happen in a scenario where I have 3 nodes with replication factor 3, and cold-start-empty configured in all sets of all nodes.

The timeline is:

  1. All 3 nodes are up
  2. Taking 2 nodes down, so only 1 remains up.
  3. Starting one of the nodes, so now 2 are up and a migration is ongoing.
  4. Starting the third node before the migration has finished.

Can I experience data loss in this scenario?


#9

And another scenario:

  1. 2 Nodes are up.
  2. Adding a third node - causing migration
  3. During the ongoing migration, Aerospike node number 2 got disconnected from the network, causing the cluster not to see it anymore.

Is there a data loss problem in this scenario?


#10

I would not expect any issues with this scenario, nor do our tests indicate any.


#11

I wouldn’t expect data loss from either scenario.


#12

Because of the replication factor?


#13

Yes, with replication factor N, I wouldn’t expect loss until N nodes are lost.
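
As a rough illustration of that rule, the sketch below places each partition's replicas on a random set of distinct nodes and counts how many partitions lose every copy when a given set of nodes fails. It is a toy model under loud assumptions: random placement is not Aerospike's real partition assignment, all function names are made up, and it only describes a cluster whose migrations have already completed. The default of 4096 mirrors Aerospike's partition count per namespace, but any number works.

```python
import random
from itertools import combinations
from typing import FrozenSet, Iterable, List


def place_replicas(n_nodes: int, rf: int, n_partitions: int = 4096,
                   seed: int = 0) -> List[FrozenSet[int]]:
    """Assign each partition to `rf` distinct nodes chosen at random (toy model)."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(n_nodes), rf)) for _ in range(n_partitions)]


def lost_partitions(placement: List[FrozenSet[int]], failed: Iterable[int]) -> int:
    """Count partitions whose replicas all live on failed nodes."""
    failed = set(failed)
    return sum(1 for replicas in placement if replicas <= failed)


def worst_case_loss(placement: List[FrozenSet[int]], n_nodes: int, k: int) -> int:
    """Largest number of partitions lost over every possible set of k failed nodes."""
    return max(lost_partitions(placement, failed)
               for failed in combinations(range(n_nodes), k))


placement = place_replicas(n_nodes=6, rf=3)
for k in range(1, 4):
    print(f"{k} node(s) down: worst case {worst_case_loss(placement, 6, k)} partitions lost")
```

In this model, fewer than N simultaneous failures never lose a partition, because at least one replica always sits on a surviving node. It says nothing about failures that happen while migrations are still in flight, which is the case the documentation warns about.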