XDR or Rack Awareness for geo cluster


#1

Hi,

We plan to deploy a solution with Aerospike as the database. There are two datacenters (in different locations), and each handles its own traffic. Each location has its own application servers, which communicate with Aerospike through the smart client. If one location is down, the second one must be able to handle all the traffic, which means both locations must hold all the data. In addition, if only Aerospike is down in one location, the second one must be able to handle traffic from the application servers in both locations.

We are now designing the infrastructure and I would like to ask which approach is more suitable. We see two options:

  • Two clusters (3 nodes in each location – same configuration) with XDR (active – active)
  • One cluster across both locations using rack awareness (3 + 3 servers)

At first sight the rack awareness approach seems much easier, because with the smart client we don't have to care whether Aerospike is down in one location. If the first location is down, the second one is used and it has all the data available. But we are not sure whether this approach is suitable for geo clusters. The XDR approach is probably the recommended one. It works if a whole location (application and Aerospike servers) is down: the second one has all the data and handles all the traffic. As soon as the first location is back up, the data is replicated from the second location (XDR replication) and everything works. But we don't know how to handle the situation where only the Aerospike cluster in one location is down (e.g. in the first location). In that case we need to somehow route the traffic from the application servers in the first location to the Aerospike cluster in the second location. Is the smart client able to handle such a situation automatically, or are we supposed to do that at the application level? Or what would you recommend?

Thank you in advance, Jan


#2

As you guessed, rack awareness is not a good idea for geo clusters. We expect the latency between cluster nodes to be sub-millisecond. The interconnect should also be reliable, to avoid network flapping that leads to cluster integrity issues. This is hard to achieve over WAN links.

Over WAN links, XDR is definitely the recommended way. However, the smart client will not switch from one cluster to the other: it tracks and handles changes in only one cluster. So you need to build a failover mechanism at the application level, based on the health of the cluster. You can check the state of each node either by checking the process state in the OS or by making info calls to the nodes. The former may be more reliable, because info calls to the nodes may be rejected when things go bad (for example, when the connection limit is reached).
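
To illustrate the kind of application-level failover described above, here is a minimal Python sketch. It is not an Aerospike API: the `Cluster` type, the `choose_cluster` function, the `min_healthy` threshold and the host names in the usage note are all hypothetical. The default probe is a plain TCP connect to the node's service port (3000 is the Aerospike default), which is only a crude stand-in for the OS process check or info call mentioned above; you would pass in your own probe in practice.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Host = Tuple[str, int]  # (address, service port)


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of one Aerospike cluster (one per location)."""
    name: str
    hosts: Sequence[Host]


def tcp_reachable(address: str, port: int, timeout: float = 0.5) -> bool:
    """Stand-in health probe: can we open a TCP connection to the node?

    Assumption: replace this with a real check (process state reported by
    an agent on the node, or an info call through your client library).
    """
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_fraction(cluster: Cluster, probe: Callable[[str, int], bool]) -> float:
    """Fraction of the cluster's nodes for which the probe succeeds."""
    if not cluster.hosts:
        return 0.0
    up = sum(1 for address, port in cluster.hosts if probe(address, port))
    return up / len(cluster.hosts)


def choose_cluster(
    local: Cluster,
    remote: Cluster,
    probe: Callable[[str, int], bool] = tcp_reachable,
    min_healthy: float = 0.5,  # assumed threshold, tune for your setup
) -> Cluster:
    """Prefer the local cluster; fail over to the remote one if it looks down."""
    if healthy_fraction(local, probe) >= min_healthy:
        return local
    if healthy_fraction(remote, probe) >= min_healthy:
        return remote
    raise RuntimeError("no healthy Aerospike cluster found")
```

The application would call something like `choose_cluster(Cluster("dc1", [("dc1-a", 3000), ...]), Cluster("dc2", [("dc2-a", 3000), ...]))` periodically and, when the result changes, create a new smart client using the chosen cluster's hosts as seed nodes. A real implementation would probably also want some hysteresis (several consecutive failed checks before switching, and a delay before switching back) so that a brief glitch does not move traffic back and forth.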