I launched a cluster of four r3.8xlarge servers, each with 4 network cards as suggested here, holding all data in memory. Each node is in a different AWS US East zone (a, d, b, e). Every hour or so a server crashes, killing the asd process. Servers also appear and disappear in the AMC console every few minutes.
I’m using mesh heartbeat, and the network configuration lists every server with the internal IP addresses of all its cards (16 addresses in total). The heartbeat timeout is set to 20 and the interval to 150.
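For reference, the heartbeat stanza is roughly of the following shape. This is only a sketch: the addresses are placeholders rather than the real ones, port 3002 is assumed (the default heartbeat port), and only the first few of the 16 seed lines are shown.

```
network {
    heartbeat {
        mode mesh
        port 3002                              # assumed: default heartbeat port

        # One seed line per internal IP of every card on every node.
        # Placeholder addresses; the real file has 16 such lines.
        mesh-seed-address-port 10.0.0.11 3002
        mesh-seed-address-port 10.0.0.12 3002
        mesh-seed-address-port 10.0.1.11 3002

        interval 150                           # milliseconds between heartbeats
        timeout 20                             # missed heartbeats before a node is dropped
    }
}
```

If I read these settings correctly, a node should only be dropped after 20 missed heartbeats of 150 ms each, so about 3 seconds without contact.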
The load is quite light at the moment with about 4k TPS write and 1k TPS read.
Any idea why the cluster is so fragile?