Are there recommendations for deploying on Amazon EC2?
One of the concerns companies have with deploying on Amazon EC2 is that sometimes an availability zone may go down in its entirety. They have asked what is the best way to deploy redundant clusters to ensure availability of the cluster when they lose an Availability Zone.
Aerospike recommends placing a cluster entirely within one Availability Zone rather than spanning two. The reason is that network latency between Availability Zones tends to be high and unpredictable. Latency can also fluctuate within an Availability Zone, but the fluctuations are much smaller in scale.
There are 3 options for deploying in 2 different Availability Zones. In each of these options the clusters should use Aerospike XDR (Cross Datacenter Replication) between the Availability Zones, and XDR should be configured with active-active replication. That way, if you must use the secondary cluster for writes, any changes written to the backup cluster will be tracked and restored to the primary when it comes back up.
Option 1 - 2 clusters in 2 Availability Zones, both with replication factor of 2
In this configuration, you will have a pair of replicated clusters. This ensures that even if an Availability Zone goes down, you will still have a fully replicated backup. The downside is that you will have 4 copies of the data, so the hardware costs are higher.
Option 2 - 2 clusters in 2 Availability Zones, both with replication factor of 1 (Not recommended)
In this configuration, you will have a pair of unreplicated clusters. One cluster typically acts as the primary and the other as a hot backup. This architecture has two downsides. First, if a node goes down, the data for that node will not be available. While a copy of the data will exist in the second cluster, retrieving it is a manual process. This is a big problem for most customers and alone makes this strategy undesirable. Second, because taking a node out of an unreplicated cluster makes its data unavailable, rolling upgrades require you to first add a new server to the existing cluster. At the end of the upgrades, the last node can be taken offline.
Option 3 - 2 clusters in 2 Availability Zones with replication factor of 2 in the primary Availability Zone and replication factor of 1 in the secondary
In this hybrid approach, the primary Availability Zone has a fully replicated cluster. This should handle node failures and just about all cases short of a catastrophic failure of the Availability Zone (which has happened). In that event, the cluster in the second Availability Zone acts as a backup. While this option is cheaper than option 1, sometimes the difference in cost is negligible and maintaining a full replication factor of 2 in the secondary is worthwhile.
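To make the cost trade-off between the three options concrete, here is a minimal Python sketch that counts the total copies of the data each option keeps across both Availability Zones. The class and function names and the 500 GB figure are illustrative assumptions, not part of any Aerospike tooling; only the replication factors come from the options above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentOption:
    """One of the two-cluster layouts described above (names are illustrative)."""
    name: str
    primary_rf: int    # replication factor of the cluster in the primary AZ
    secondary_rf: int  # replication factor of the cluster in the secondary AZ

    @property
    def total_copies(self) -> int:
        # Each cluster holds `replication factor` copies of every record.
        return self.primary_rf + self.secondary_rf

    def storage_needed_gb(self, unique_data_gb: float) -> float:
        # Raw storage across both clusters, ignoring overhead and headroom.
        return unique_data_gb * self.total_copies


OPTIONS = [
    DeploymentOption("Option 1", primary_rf=2, secondary_rf=2),
    DeploymentOption("Option 2", primary_rf=1, secondary_rf=1),
    DeploymentOption("Option 3", primary_rf=2, secondary_rf=1),
]


def summarize(unique_data_gb: float) -> dict:
    """Map option name -> (total copies, raw storage in GB)."""
    return {
        o.name: (o.total_copies, o.storage_needed_gb(unique_data_gb))
        for o in OPTIONS
    }


if __name__ == "__main__":
    # 500 GB of unique data is a made-up figure for illustration.
    for name, (copies, gb) in summarize(500).items():
        print(f"{name}: {copies} copies, {gb} GB raw storage")
```

The count only covers raw data copies; actual hardware cost also depends on instance types and capacity headroom, which this sketch does not model.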
Interesting insight - would be useful to have that here http://www.aerospike.com/docs/deploy_guides/aws/recommendations/
How do you recommend deploying on multiple AZs with the community edition (aka without XDR)?
To be honest, we do not recommend it, even if it’s possible through rack-awareness. The main problem is that the latency between AZs varies wildly by region, and we’ve also seen it be inconsistent within a single region. Amazon makes no promise about latency between AZs. Since Aerospike is a distributed database, and since the cluster tracks its nodes through heartbeats, you are inviting lots of re-clustering and migrations due to network events. The problems you seem to be seeing are caused by your decision to go cross-AZ.
https://www.aerospike.com/docs/deploy_guides/aws/recommendations/ explains the advantage of being in a single AZ. The correct way to use Aerospike on EC2 is to build a cluster in a single AZ, better yet in a single placement group within it, and to replicate to another AZ or region using either application-level synchronization with something like Kafka, or cross-datacenter replication (XDR), a feature of our enterprise edition.
Are there examples anywhere on how to do this with Kafka?
@naoum that’s a question to ask the community here or on StackOverflow, I believe. The main takeaway is to keep each cluster in a single AZ, and either plan failover for the community edition to a hot standby cluster in a different AZ (or region), or use the enterprise edition’s XDR feature.
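As a rough illustration of what client-side failover to a hot standby could look like, here is a minimal Python sketch. It is not an Aerospike API: `FailoverStore` and `InMemoryCluster` are hypothetical names, the in-memory class merely stands in for a real cluster client, and the sketch assumes the standby is already kept in sync by some other mechanism (application-level synchronization or XDR).

```python
class InMemoryCluster:
    """Hypothetical stand-in for a cluster client, used only for this sketch."""

    def __init__(self):
        self.data = {}
        self.available = True  # flip to False to simulate an AZ outage

    def _check(self):
        if not self.available:
            raise ConnectionError("cluster unreachable")

    def get(self, key):
        self._check()
        return self.data[key]

    def put(self, key, value):
        self._check()
        self.data[key] = value


class FailoverStore:
    """Send requests to the primary; switch to the standby on connection errors."""

    def __init__(self, primary, standby, errors=(ConnectionError, TimeoutError)):
        self._primary = primary
        self._standby = standby
        self._errors = errors
        self.using_standby = False

    def _call(self, method, *args):
        if not self.using_standby:
            try:
                return getattr(self._primary, method)(*args)
            except self._errors:
                # Stay on the standby until an operator explicitly fails back.
                self.using_standby = True
        return getattr(self._standby, method)(*args)

    def get(self, key):
        return self._call("get", key)

    def put(self, key, value):
        return self._call("put", key, value)

    def fail_back(self):
        """Resume using the primary once it has been restored and re-synced."""
        self.using_standby = False
```

The sketch deliberately stays on the standby after the first failure rather than retrying the primary on every call, so that writes do not bounce between two clusters. Re-synchronizing the primary before calling `fail_back()` is left out.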
Assuming we can tolerate a small downtime in the case of a full AZ failure, can EBS shadow devices be used for full AZ downtime resolution? I.e. Set up cluster with SSD + EBS shadow, then in case of full cluster failure launch instances in different AZ linked to the same EBS drives? Are you aware of anyone doing that?
Yes, you can use EBS to recreate a cluster on a different AZ. Each node in your current cluster will have to have one or more EBS drives, depending on how many namespaces you have storing their data on SSD.
If the cluster goes down you can bring up nodes in a different AZ and reattach the EBS drives to them, bringing them up one by one. You’ll use the same configuration. Since the local ephemeral disk will not have data on it, Aerospike will read the data from the shadow device to populate the database. That’s pretty much it. The difference is that you’ll have downtime (how long depends on how much data you have), as opposed to the hot failover you can get with the Enterprise Edition using XDR to update a cluster in a separate AZ. This is a cold restart situation where the primary index is rebuilt from the EBS, just as the ephemeral disk is filled from it.
You can do the same thing for data-in-memory that is persisted to an ephemeral SSD and EBS as a shadow device.
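Since the downtime of a cold restart from EBS depends on how much data has to be read back, a back-of-the-envelope estimate can help with planning. The Python sketch below only divides data size by read throughput, so it gives an optimistic lower bound: it ignores index rebuild time, instance launch time, and volume reattachment. The function names and all figures are hypothetical, not measured values.

```python
def estimate_load_seconds(data_gb: float, read_throughput_mb_s: float) -> float:
    """Lower bound on the time to read one node's data back from its EBS volume."""
    if read_throughput_mb_s <= 0:
        raise ValueError("read_throughput_mb_s must be positive")
    return data_gb * 1024 / read_throughput_mb_s


def estimate_cluster_seconds(per_node_data_gb, read_throughput_mb_s: float) -> float:
    """Total time if nodes are brought up one by one, as described above."""
    return sum(
        estimate_load_seconds(gb, read_throughput_mb_s) for gb in per_node_data_gb
    )


if __name__ == "__main__":
    # Hypothetical: 3 nodes with 200 GB each, 250 MB/s sustained read per volume.
    seconds = estimate_cluster_seconds([200, 200, 200], 250)
    print(f"At least {seconds / 60:.0f} minutes of load time")
```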