Why is a cluster so fragile when its nodes are spread across AZs?

I launched a cluster of four r3.8xlarge servers, each with four network cards as suggested here, running with data in memory. Each node is located in a different AWS US East availability zone (a, d, b, e). Every hour or so a server crashes, killing the asd process, and servers come and go from the AMC console every few minutes.

I’m using mesh heartbeat, and the network configuration lists every server by the internal IP addresses of all of its cards (16 addresses in total). The heartbeat timeout is set to 20 and the interval to 150.
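For reference, a minimal sketch of what the heartbeat stanza in my aerospike.conf looks like (the IP addresses here are placeholders and the port is assumed to be the default 3002):

```
heartbeat {
    mode mesh            # unicast (mesh) heartbeats
    port 3002            # default heartbeat port (assumed)

    # one entry per internal IP of every card on every node,
    # 16 entries in total; placeholder addresses shown
    mesh-seed-address-port 10.0.1.10 3002
    mesh-seed-address-port 10.0.1.11 3002
    # ... 14 more entries ...

    interval 150         # ms between heartbeats
    timeout 20           # missed intervals before a node is considered gone
}
```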

The load is quite light at the moment: about 4k write TPS and 1k read TPS.

Any idea why the cluster is so fragile?


Hey Rafi. This is mentioned in several other places on the discussion forum, and in our documentation, but basically it comes down to unpredictable inter-zone latencies in AWS. Aerospike requires consistently low latency between the nodes; otherwise the cluster may conclude that certain nodes have dropped off, a new cluster forms automatically, and migrations begin. With your settings, a heartbeat interval of 150 ms and a timeout of 20 missed intervals, a node is declared gone after roughly 3 seconds, so any latency spike longer than that will split the cluster. Migrations then add load to each node and consume network resources as partitions move.

The thing is that Amazon’s inter-zone latencies vary wildly by region. In some places the zones are physically close together; in others they are not. There are also network spikes that you cannot control. Amazon does not provide an SLA for this behavior, and we therefore do not recommend spreading the nodes of a single cluster across AZs. In fact, our Amazon guide recommends not only the same AZ but also the same placement group; see the sketch below. Going across AZs, or across regions, should be done with XDR (in the Enterprise Edition), or with a mechanism you build yourself, such as Kafka-based queueing.
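For example, a cluster placement group can be created and the nodes launched into it with the AWS CLI. This is a minimal sketch: the group name, AMI, key pair, and security group are placeholders, and attaching the additional network interfaces is omitted.

```
# Create a cluster placement group: all instances land on low-latency
# hardware within a single AZ.
aws ec2 create-placement-group --group-name aerospike-pg --strategy cluster

# Launch the four nodes into that placement group.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type r3.8xlarge \
    --count 4 \
    --placement GroupName=aerospike-pg \
    --key-name my-key \
    --security-group-ids sg-xxxxxxxx
```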