Hello there,
We built a large HPC environment (60 petaFLOPS single precision) in our data center with Kubernetes and Slurm (ephemeral Slurm on top of Kubernetes). Our Kubernetes engine is SUSE Rancher, and given its tight integration with Longhorn, we also use Longhorn for our EBS-style block volumes.
- Storage Replication:
We use Longhorn in our Kubernetes clusters for block volumes, configured with a replica factor of 3. It is our understanding that this works as full replication (complete copies of each volume), not parity or erasure coding with partial data.
Now if we deploy our StatefulSets for Aerospike on these persistent volumes with another replication factor of 3,
are we using about 9 times more storage?
If two engines are both trying to heal a failed node, does that cause problems?
We use NVMe for all our storage.
I am not sure whether such a combination has been tested with Aerospike. Having said that, Aerospike does maintain and manage its own replicas through the
replication-factor configuration option, so I would assume you will end up with 9 copies: I don't know Longhorn, but if it keeps full copies of each volume, the two factors simply multiply (3 Longhorn copies of each of Aerospike's 3 copies).
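A quick sketch of that arithmetic, assuming both layers keep full replicas (the replica counts are the ones from your post; the data size is a made-up example):

```python
# Hedged sketch: when two independent layers each keep full replicas,
# the replica counts multiply rather than add.
longhorn_replicas = 3    # Longhorn replica count per volume (from the post)
aerospike_rf = 3         # Aerospike replication-factor (from the post)
logical_gb = 100         # hypothetical amount of unique data

total_copies = longhorn_replicas * aerospike_rf
raw_gb = logical_gb * total_copies

print(total_copies)  # copies of every record across both layers
print(raw_gb)        # raw NVMe consumed for logical_gb of unique data
```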
From Aerospike's perspective, healing from a failed node means the data is automatically redistributed across the remaining nodes: Aerospike shards the data across 4096 partitions, and the partitions owned by the failed node migrate to the survivors. The storage subsystem will then simply see data being added to and removed from each node. When the failed node returns, a similar rebalance redistributes the data again, and the storage layer again sees the corresponding writes and deletes.
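To make the redistribution concrete, here is a toy model of partition ownership shifting when a node drops out. This uses a simple modulo assignment purely for illustration; it is not Aerospike's actual partition-map algorithm, only the 4096-partition count is real:

```python
# Illustrative model only: assign 4096 partitions to nodes round-robin,
# then see how many partitions must move when one node fails.
PARTITIONS = 4096

def assign(nodes):
    # Map each partition id to a node (toy round-robin placement).
    return {p: nodes[p % len(nodes)] for p in range(PARTITIONS)}

before = assign(["node-a", "node-b", "node-c"])
after = assign(["node-a", "node-b"])  # node-c has failed

# Partitions whose owner changed are the ones the storage layer
# will see being written to (on the new owner) and dropped (elsewhere).
moved = sum(1 for p in range(PARTITIONS) if before[p] != after[p])
print(moved)
```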
Hope this helps a bit at least.
Thanks for the message.
This would be an issue with any of the newer centralized storage solutions. Ceph, Rook, and Longhorn all have their own replication engines for recovering from failed nodes and disks.
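One common way to sidestep the duplicate replication, assuming Longhorn's standard StorageClass parameters, is to provision the database's volumes from a StorageClass with `numberOfReplicas` set to `"1"` and let the database handle redundancy itself. A sketch (names are illustrative):

```yaml
# Hedged sketch: Longhorn StorageClass with storage-level replication
# effectively disabled, leaving redundancy to the database's own
# replication-factor.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica   # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"           # one copy per volume; the DB replicates data
  staleReplicaTimeout: "2880"
```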
With SUSE Rancher, a lot of database Helm charts are available with the basic installation, including some NewSQL systems like YugabyteDB. That is where I started thinking about this issue of duplicate recovery by both the storage engine and the DB engine.
Either way, I will give it a try and let you know. Glad to know that Aerospike has added graph support.