Guideline on the rationale behind shadow EBS in Aerospike 3.5.14


With the new Aerospike 3.5.14 comes the shadow device functionality on EC2 as read here:

I am still unclear as to why one would want that. We are already using fast direct-attached instance store SSD with a replication factor of 2. We don’t use EBS. If we lose a node then the cluster rebalances itself.

Or would you recommend the shadow EBS when using a replication factor of 1 and accepting downtime on part of the data when a node goes down? A replication factor of 1 would decrease cost, as we would need to run fewer nodes. The tradeoff would be some downtime on part of the data set while the node is down, but since the data would be copied to the shadow EBS, there would be no data loss once we restore the node and reattach the volume?


The reason for the shadow device is that local SSDs are ephemeral in AWS. Because of this, if multiple Amazon instances stop suddenly, they may come back with empty SSDs. Many users initially chose BCache to restore the lost durability, but found problems with BCache under load and noticed that a significant number of reads were unnecessarily delivered to the backing EBS device.

Shadow devices were developed to address this concern. But you are correct, there are other ways that shadow devices could be used. In your example you would trade durability for cost.
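For reference, a shadow device is declared by listing it after the primary device on the same `device` line of the namespace's storage-engine block. The fragment below is only a minimal sketch: the namespace name, device paths, and sizes are placeholders I have assumed for illustration, so substitute the values for your own instance store SSD and EBS volume.

```
namespace test {
    replication-factor 2
    memory-size 4G                    # placeholder value

    storage-engine device {
        # first path: local instance store SSD (primary)
        # second path: EBS volume acting as the shadow device
        device /dev/xvdb /dev/xvdf    # placeholder device paths
        write-block-size 128K
    }
}
```

With this layout, writes go to both devices while reads are served from the primary, which is how it avoids the extra EBS reads seen with BCache.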


Thank you for your answer. With your response, I have a follow-up question:

Let’s say that I want a replication factor of 2 so that data is available at all times if one node goes down, but I do not want the cluster to rebalance itself, because if I use the shadow EBS then I know I can bring that node back to life. This way I can increase memory-size, high-water-disk-pct, and high-water-memory-pct on all nodes, because I don’t need the extra space for the case where one node goes down and the other nodes need to accommodate the data coming from the cluster rebalancing itself.

Hence my question: Can I disable the cluster rebalancing capability?


Presently there isn’t a way to achieve this goal. Rebalancing can be halted but there are ripple effects to be considered which will affect not only the cluster but also the clients. This mode of operation isn’t supported and in the current state would be very problematic for operations.
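Since rebalancing stays on, it may help to put a number on the headroom mentioned in the question. The Python sketch below is a rough estimate only. It assumes data is spread evenly across nodes of identical capacity and that the replication factor stays the same after the failure. The function names are made up for this illustration and are not part of any Aerospike tooling.

```python
def fill_after_failure(nodes: int, fill: float, failures: int = 1) -> float:
    """Per-node fill fraction once the survivors have absorbed the lost nodes' data.

    Assumes even data distribution and identical node capacity.
    """
    if nodes <= failures:
        raise ValueError("at least one node must survive")
    # The same total data is now spread over fewer nodes.
    return fill * nodes / (nodes - failures)


def max_safe_fill(nodes: int, failures: int = 1) -> float:
    """Largest per-node fill fraction that still fits after `failures` nodes are lost."""
    if nodes <= failures:
        raise ValueError("at least one node must survive")
    return (nodes - failures) / nodes


if __name__ == "__main__":
    # Example: a 5-node cluster running at 50% per-node usage.
    print(fill_after_failure(5, 0.5))
    print(max_safe_fill(5))
```

Under these assumptions, the smaller the cluster, the more headroom a single node failure consumes, so the high-water settings should be sized against the post-failure fill and not the steady-state one.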