Adding a node isn’t a problem. Removing a node while there are ongoing migrations can result in data loss.
Adding a node is safe.
Before removing a node, you should always ensure there aren’t any ongoing migrations. To expand on that, if you plan to remove multiple nodes, you should wait for migrations to complete after each node removal before removing the next one.
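A minimal sketch of that procedure in Python. It assumes two hypothetical helpers that you would supply yourself: `remove_node(node)` takes a node out of the cluster, and `migrations_pending()` reports whether partitions are still migrating (for example by polling whatever migration statistics your monitoring exposes). Neither is an Aerospike API. The only point of the sketch is the ordering: never remove the next node while migrations are still running.

```python
import time
from typing import Callable, Iterable, List


def remove_nodes_safely(
    nodes: Iterable[str],
    remove_node: Callable[[str], None],        # hypothetical: takes one node out
    migrations_pending: Callable[[], bool],    # hypothetical: True while migrating
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Remove nodes one at a time, waiting for migrations before each removal."""
    removed = []
    for node in nodes:
        # Never take a node out while partitions are still migrating.
        while migrations_pending():
            sleep(poll_interval)
        remove_node(node)
        removed.append(node)
    # Wait once more so the cluster settles after the last removal.
    while migrations_pending():
        sleep(poll_interval)
    return removed
```

Injecting `sleep` lets you test the ordering with fake helpers instead of a real cluster.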
Yes, with replication factor 2, losing any 2 nodes will result in data loss. The amount of data lost decreases as the cluster size increases. Some other systems use replication models that decrease the probability of data loss as the cluster size increases, but increase the amount of data lost when such an event does occur.
For our replication model, the fraction of data lost when 2 nodes fail with replication factor 2 is 2/(n(n-1)), where n is the number of nodes in the cluster.
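One way to see where 2/(n(n-1)) comes from, sketched in Python under an idealized assumption: every partition's (master, replica) placement is spread evenly over all n(n-1) ordered pairs of distinct nodes. Only 2 of those pairs, (A, B) and (B, A), place both copies on the two failed nodes A and B. The sketch enumerates the pairs and checks the count against the formula; real partition distribution is only approximately even.

```python
from fractions import Fraction
from itertools import permutations


def loss_fraction_formula(n: int) -> Fraction:
    """Fraction of data lost when 2 of n nodes fail with replication factor 2."""
    return Fraction(2, n * (n - 1))


def loss_fraction_enumerated(n: int, failed=(0, 1)) -> Fraction:
    """Fraction of ordered (master, replica) node pairs whose copies are all on
    failed nodes, assuming partitions are spread evenly over every such pair."""
    pairs = list(permutations(range(n), 2))  # n * (n - 1) ordered pairs
    lost = sum(1 for master, replica in pairs if master in failed and replica in failed)
    return Fraction(lost, len(pairs))


for n in (3, 4, 6, 10):
    assert loss_fraction_formula(n) == loss_fraction_enumerated(n)
    print(n, loss_fraction_formula(n), float(loss_fraction_formula(n)))
```

For a 10-node cluster this gives 2/90 = 1/45, roughly 2.2% of the data, and the fraction keeps shrinking as n grows.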
I have an additional question: let’s say we have a cluster with a persistent namespace (RAM+HDD) under a read-only load. Can data loss happen when node(s) or the network fail during migration?
@manana with replication factor 2, if 2 nodes are unable to respond to requests for whatever reason, the data shared by those nodes (a fraction of 2/(n(n-1)) of the total) will not be reachable.
@kporter, yes, this is absolutely clear… I am asking about the scenario where the whole cluster is restarted node by node without waiting for migrations to finish. At any time at most one node is down, and the data is read-only and persisted on HDD. I’ve heard that in a 6-node read-only cluster with replication factor 3, this caused permanent data loss (part of the data wasn’t reachable even when all nodes were up and migrations had completed).
Unfortunately I have no precise details about this issue. I will try to reproduce it.
Hi guys, I wanted to understand what will happen in a scenario where I have 3 nodes with replication factor 3, and cold-start-empty configured for all namespaces on all nodes.
The timeline is:
1. All 3 nodes are up.
2. Taking 2 nodes down, so only 1 remains up.
3. Starting one of the nodes, so now 2 are up and a migration is ongoing.
4. Starting the third node before the migration has finished.