How does Aerospike protect my data in case of an SSD failure?


Summary

What should I do when one of a namespace's SSDs fails? Does Aerospike copy the data across the SSDs associated with a namespace? How do I replace the failed SSD and restore the data?

Resolution

How the data is recovered depends on your configuration. Aerospike does allow an unreplicated namespace, but normally you should set your replication factor to at least 2, which means the cluster holds two copies of your data at any time (assuming everything is balanced). With replication factor 2, the cluster automatically re-replicates the data among the remaining nodes. No commands are necessary to rebalance the data.
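To make the effect of the replication factor concrete, here is a toy Python model. It is not Aerospike's actual partition-assignment algorithm: the node names, the helper functions, and the round-robin placement are assumptions made purely for illustration. The only Aerospike-specific detail is that a namespace is divided into 4096 partitions. The sketch shows that with two copies placed on distinct nodes, every partition still has a surviving copy even if one node lost all of its local data, whereas an unreplicated namespace would lose partitions.

```python
N_PARTITIONS = 4096  # Aerospike divides each namespace into 4096 partitions


def place_partitions(nodes, n_partitions, replication_factor):
    """Toy placement: put each partition on `replication_factor` distinct nodes.

    Round-robin is an illustrative assumption, not Aerospike's real algorithm.
    """
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        pid: [nodes[(pid + i) % len(nodes)] for i in range(replication_factor)]
        for pid in range(n_partitions)
    }


def partitions_without_copy(placement, failed_node):
    """Return the partitions whose only copies lived on `failed_node`."""
    return [
        pid
        for pid, owners in placement.items()
        if all(owner == failed_node for owner in owners)
    ]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
rf2 = place_partitions(nodes, N_PARTITIONS, 2)
rf1 = place_partitions(nodes, N_PARTITIONS, 1)

# With replication factor 2, no partition is left without a copy.
print(len(partitions_without_copy(rf2, "node-a")))
# With replication factor 1, the partitions held only by node-a are gone.
print(len(partitions_without_copy(rf1, "node-a")))
```

In the replication factor 2 case, the surviving copies are what the cluster uses to repopulate the replaced drive.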

Keeping the above in mind, let’s say that you have 3 SSDs on each node for 1 namespace and one SSD fails. You need to physically replace only the failed SSD: stop the Aerospike server, swap out the failed SSD (hot-swap it if it is hot-swappable), and bring the Aerospike server back up. You do not need to clear the other SSDs by re-initializing them.

For Aerospike Enterprise Edition, nodes have the added feature of a fast restart, which still works when an SSD has been replaced after a failure. When a disk is replaced, records from the failed drive are cleaned out of the index during the fast restart and repopulated through migrations, so all of the data is restored once migrations complete. For Community Edition, you restart the server with the replaced SSD through a cold start.
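One way to confirm that the data has been restored is to watch migration progress. The sketch below parses a semicolon-separated `key=value` statistics string, the format returned by `asinfo -v statistics`, and checks the `migrate_partitions_remaining` counter. The helper names are hypothetical and the sample strings are made up for illustration; check the statistic name against the documentation for your server version.

```python
def parse_stats(raw):
    """Parse a 'key=value;key=value' statistics string into a dict of strings."""
    stats = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key] = value
    return stats


def migrations_complete(raw):
    """Return True when the node reports no partitions left to migrate."""
    stats = parse_stats(raw)
    remaining = stats.get("migrate_partitions_remaining")
    if remaining is None:
        raise KeyError("migrate_partitions_remaining not found in statistics")
    return int(remaining) == 0


# Made-up sample outputs, not captured from a real cluster.
sample_in_progress = "cluster_size=3;migrate_partitions_remaining=812;objects=1000"
sample_done = "cluster_size=3;migrate_partitions_remaining=0;objects=1000"

print(migrations_complete(sample_in_progress))  # False
print(migrations_complete(sample_done))         # True
```

In practice you would feed this the statistics output from each node and wait until every node reports completion.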

Instructions on adding or upgrading a device are explained in the following documentation:

For more information on SSD initialization, see: https://www.aerospike.com/docs/operations/plan/ssd/ssd_init.html