Why is a cluster so fragile when its nodes are spread across AZs?

I launched a cluster of four r3.8xlarge servers, each with four network cards as suggested here, running with data in memory. Each node is located in a different AWS US East availability zone (a, d, b, e). Every hour or so a server crashes, killing the asd process, and servers come and go from the AMC console every few minutes.

I’m using mesh heartbeat, and the network configuration lists every server by the internal IP addresses of all of its cards (16 addresses in total). The heartbeat timeout is set to 20 and the interval to 150.
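For reference, a minimal sketch of what the heartbeat stanza in my aerospike.conf looks like (the IP addresses here are placeholders and the port is assumed to be the default 3002):

```
heartbeat {
    mode mesh            # unicast (mesh) heartbeats
    port 3002            # default heartbeat port (assumed)

    # one entry per internal IP of every card on every node,
    # 16 entries in total; placeholder addresses shown
    mesh-seed-address-port 10.0.1.10 3002
    mesh-seed-address-port 10.0.1.11 3002
    # ... 14 more entries ...

    interval 150         # ms between heartbeats
    timeout 20           # missed intervals before a node is considered gone
}
```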

The load is quite light at the moment: about 4k write TPS and 1k read TPS.

Any idea why the cluster is so fragile?


Hey Rafi. This is mentioned in several other places on the discussion forum, and in our documentation, but basically it comes down to unpredictable inter-zone latencies in AWS. Aerospike requires consistently low latency between the nodes; otherwise the cluster may conclude that certain nodes have dropped off, a new cluster forms automatically, and migrations begin. With your settings, a heartbeat interval of 150 ms and a timeout of 20 missed intervals, a node is declared gone after roughly 3 seconds, so any latency spike longer than that will split the cluster. Migrations then add load to each node and consume network resources as partitions move.

The thing is that Amazon’s inter-zone latencies vary wildly by region. In some places the zones are physically close together; in others they are not. There are also network spikes that you cannot control. Amazon does not provide an SLA for this behavior, and we therefore do not recommend spreading the nodes of a single cluster across AZs. In fact, our Amazon guide recommends not only the same AZ but also the same placement group; see the sketch below. Going across AZs, or across regions, should be done with XDR (in the Enterprise Edition), or with a mechanism you build yourself, such as Kafka-based queueing.
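For example, a cluster placement group can be created and the nodes launched into it with the AWS CLI. This is a minimal sketch: the group name, AMI, key pair, and security group are placeholders, and attaching the additional network interfaces is omitted.

```
# Create a cluster placement group: all instances land on low-latency
# hardware within a single AZ.
aws ec2 create-placement-group --group-name aerospike-pg --strategy cluster

# Launch the four nodes into that placement group.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type r3.8xlarge \
    --count 4 \
    --placement GroupName=aerospike-pg \
    --key-name my-key \
    --security-group-ids sg-xxxxxxxx
```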