I am working on a lightweight EMR (Electronic Medical Record) system to monitor Covid patients whose symptoms are relatively mild, so that they can be monitored at home cost-effectively. The planned system would have IoT devices frequently sending each patient's vital signs (heart rate, blood pressure, oxygen level) to a cloud database on a national scale (potentially millions of patients and devices).
For this system durability is very important, but availability and latency are not a concern, because outpatients can be treated based on vital signs that are 5-10 minutes old. I have gone through the Aerospike documentation. In very simple terms, it seems to me like a "massive implementation of RAID 10 in the cloud, with Paxos and a gossip protocol acting as a software-based RAID controller".
When I hear that most companies run with a replication factor of 2, even at petabyte scale, I think I must be missing something. What if a second node fails while the data from the first failed node is still being re-replicated (repair/transfer)? To me it looks like a replication factor of 3 or more is needed for critical data.
Am I correct in my assumption? How long does it take to re-replicate the data after a single node failure? (Assume SSD-based data and DRAM-based indexes; low availability is acceptable.) How long does re-replication take if an entire data center goes down? Are there other methods of improving data durability besides increasing the replication factor?
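
My naive way of estimating the re-replication time is simply data volume divided by transfer throughput, as in the sketch below. This is only an assumption on my part: the function name, the per-stream throughput and the number of parallel streams are hypothetical, and I do not know how the real system throttles or parallelises this work.

```python
def repair_hours(data_tb, mb_per_s, parallel_streams=1):
    """Naive estimate of the time to re-replicate a failed node's data.

    data_tb:          data held by the failed node, in terabytes (assumed)
    mb_per_s:         sustained throughput of one transfer stream (assumed)
    parallel_streams: how many streams copy data at once (assumed)
    """
    total_bytes = data_tb * 1e12
    bytes_per_second = mb_per_s * 1e6 * parallel_streams
    return total_bytes / bytes_per_second / 3600.0
```

Calling `repair_hours(2.0, 100.0)` gives the estimate for a node holding 2 TB copied over a single 100 MB/s stream; passing a larger `parallel_streams` shows how much spreading the work across surviving nodes would shorten the window. I would like to know whether this kind of estimate is anywhere near realistic.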