Kubernetes - LongHorn Storage related

Prasad_Reddy · July 21, 2023, 6:47pm

Hello there:

We built a large HPC environment (60 petaflops single precision) in our data center with Kubernetes and Slurm (ephemeral Slurm on top of Kubernetes). Our Kubernetes engine is from SUSE Rancher, and given its tight integration with Longhorn, we also use Longhorn for our EBS volumes.

Storage replication: We use Longhorn in our Kubernetes clusters for EBS volumes, with a replica factor of 3 for storage. Our understanding is that it keeps full replicas of each volume rather than using a parity scheme that stores only partial data.

Now, if we deploy our StatefulSets for Aerospike on these persistent volumes with a replication factor of 3 as well:

Are we using about 9 times more storage?
If there are two engines that are trying to heal a failed node, does it cause problems?

We use NVMe for all our storage.

meher · July 24, 2023, 9:46pm

I am not sure whether such a combination has been tested with Aerospike. That said, Aerospike maintains and manages its own replicas through the replication-factor configuration option, so I would assume that you will end up with 9 copies. I don't know Longhorn, but if it does its own replication on each volume or something similar, the two factors will simply multiply.
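To make the arithmetic concrete, here is a minimal sketch. It assumes both layers keep full copies (no parity or erasure coding) and ignores overhead such as snapshots or filesystem metadata. The function names and the 900 TB raw capacity are hypothetical, chosen only for illustration.

```python
def effective_copies(db_replication_factor: int, storage_replica_count: int) -> int:
    """Physical copies of each record when both layers keep full replicas.

    Each database replica lives on its own volume, and the storage layer
    copies every volume, so the two factors multiply.
    """
    return db_replication_factor * storage_replica_count


def usable_capacity_tb(raw_tb: float, db_replication_factor: int,
                       storage_replica_count: int) -> float:
    """Unique data that fits in raw_tb of disk under stacked replication."""
    return raw_tb / effective_copies(db_replication_factor, storage_replica_count)


# Replication factor 3 in the database on top of 3 storage replicas.
copies = effective_copies(3, 3)
# Hypothetical 900 TB of raw NVMe capacity.
usable = usable_capacity_tb(900.0, 3, 3)
print(copies, usable)  # prints: 9 100.0
```

Under the same assumptions, dropping the storage-layer replica count to 1 while keeping the database replication factor at 3 would leave 3 copies in total.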

From Aerospike's perspective, healing from a failed node typically means that the data is automatically redistributed across the remaining nodes. Aerospike shards the data across 4096 partitions, so partitions will move between the remaining nodes. I would guess the storage subsystem then simply reacts to data being added to or removed from each node's volume. When the failed node returns, a similar process rebalances and redistributes the data again, and the storage subsystem again does whatever it does when data is added or removed.
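As a rough illustration of partitions moving when a node fails, here is a toy sketch. It is not Aerospike's actual partition assignment algorithm: it uses rendezvous hashing as a stand-in, and the node names, the `score` helper, and the replication factor of 2 are assumptions made for the example. Only the 4096-partition count comes from the description above.

```python
import hashlib

N_PARTITIONS = 4096  # partition count mentioned above


def score(node: str, pid: int) -> int:
    """Deterministic pseudo-random score for a (node, partition) pair."""
    digest = hashlib.sha256(f"{node}:{pid}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owners(pid: int, nodes: list[str], rf: int) -> list[str]:
    """The rf highest-scoring nodes hold the copies of partition pid."""
    ranked = sorted(nodes, key=lambda n: score(n, pid), reverse=True)
    return ranked[:rf]


def partition_map(nodes: list[str], rf: int) -> dict[int, list[str]]:
    return {pid: owners(pid, nodes, rf) for pid in range(N_PARTITIONS)}


# Hypothetical four-node cluster, then the same cluster after node "c" fails.
before = partition_map(["a", "b", "c", "d"], rf=2)
after = partition_map(["a", "b", "d"], rf=2)

# Partitions whose owner list changed, and partitions that had a copy on "c".
moved = sum(1 for pid in range(N_PARTITIONS) if before[pid] != after[pid])
affected = sum(1 for pid in range(N_PARTITIONS) if "c" in before[pid])
print(f"partitions with a copy on c: {affected}, partitions re-homed: {moved}")
```

In this toy model only the partitions that had a copy on the failed node get a new owner. Each such move writes data to a surviving node's volume, which a replicating storage layer would in turn copy to its own replicas.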

Hope this helps a bit at least.

Prasad_Reddy · July 24, 2023, 10:18pm

Thanks for the message.

This would be an issue with any of the newer centralized storage solutions. Ceph, Rook, and Longhorn all have their own replication engines for recovering from failed nodes and disks.

With SUSE Rancher, a lot of database Helm charts show up with the basic installation. Some NewSQL databases like Yugabyte are there too. That is what got me thinking about this issue of duplicate recovery by both the storage engine and the DB engine.

Either way, I will give it a try and let you know. Glad to know that Aerospike has added graph support.

Prasad
