Aerospike node seeding with AWS Route 53 DNS, and why an ELB (load balancer) doesn't work on all clients



Update on ELB support

We have added load balancer detection and support for node seeding as a feature of our clients. As long as the access-address is properly configured on each server, and the client can contact every node directly on its access-address, an ELB can be used for node seeding.

The clients make an initial connection to a random seed node via the load balancer, request its access-address, and reconnect to the node on that access-address before continuing. The load balancer is therefore only used to find the seed node's real IP.
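
The reconnect step can be pictured with a toy model. This is only an illustrative sketch, not the clients' actual implementation: `FakeNode`, `FakeLoadBalancer`, and `resolve_seed` are hypothetical names, the addresses are made up, and the network is simulated with plain Python objects.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for an Aerospike node."""
    name: str
    access_address: str  # the address the node advertises to clients


class FakeLoadBalancer:
    """Hypothetical load balancer: each connection lands on a random node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def connect(self):
        return random.choice(self.nodes)


def resolve_seed(load_balancer, direct_connect):
    """Find a seed node through the load balancer, then reconnect directly.

    direct_connect(address) must return the node reachable at that address.
    """
    node_via_lb = load_balancer.connect()      # 1. initial connection via the LB
    address = node_via_lb.access_address       # 2. ask the node for its access-address
    return address, direct_connect(address)    # 3. reconnect on that address


nodes = [FakeNode("A", "10.0.0.1"), FakeNode("B", "10.0.0.2")]
by_address = {node.access_address: node for node in nodes}
seed_address, seed_node = resolve_seed(FakeLoadBalancer(nodes), by_address.__getitem__)
print(seed_address, seed_node.name)
```

Whichever node the load balancer picks, the client ends up holding a direct address for it, so later connections no longer depend on where the load balancer routes.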

This feature is supported as of the following versions and clients:

Client     Version
Java       4.1.7+
C#         3.6.4+
C          4.3.14+
Go         1.35.0
Node.js    3.5.0
Python     3.4.1

Adding ELB support to the C client also adds it to the C-based clients, such as Python and Node.js.

Problem Description

When deploying Aerospike on AWS, the instances' public IPs are not predictable. They can be made more predictable by pre-allocating EIPs (elastic IPs) and assigning them to the instances. Unfortunately, that does not solve the problem of relying on a single seed node (or a fixed set of seed nodes), as any node can disappear or be replaced. This is especially true when using autoscaling groups.

This problem affects both clients and XDR.

Explanation

The easiest solution that comes to mind is using an ELB (Elastic Load Balancer) with the autoscaling group. Unfortunately, this does not work with clients older than the versions listed above, which lack load balancer detection. On those clients, the flow when using a load balancer for node seeding is more or less as follows:

  1. The client connects to the ELB and gets routed to node A. It associates the ELB IP with node A.
  2. The client gets (from A) the list of nodes in the cluster and their IPs, and connects to them directly (say nodes B, C, D). So far so good.
  3. The client makes a second connection to the ELB IP, expecting to hit node A. It connects and is told it has hit node B (as it is a load balancer).
  4. The client gets confused, as it is now hitting node B on 2 different IPs (the ELB IP and directly), as well as node A on the same IP as node B (the 2 connections it made to the ELB).
  5. The client assumes something is wrong and starts disconnecting.

As you can see, neither an ELB nor any other load balancer works for node seeding on clients without load balancer detection. Aerospike clients and XDR must connect to each node directly (or be routed via NAT on a 1:1 basis). These connections cannot be load balanced.
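
The conflict in steps 3 to 5 can be reproduced with a toy model. This is a simplified illustration, not the client's real bookkeeping: `NaiveClient`, `AddressConflict`, and the addresses are hypothetical, and the only assumption modelled is that the client expects one address to belong to exactly one node.

```python
class AddressConflict(Exception):
    """Raised when one address turns out to map to two different nodes."""


class NaiveClient:
    """Toy client that assumes each address belongs to exactly one node."""

    def __init__(self):
        self.node_by_address = {}

    def register(self, address, node_name):
        # Remember the first node seen on this address; complain if it changes.
        known = self.node_by_address.setdefault(address, node_name)
        if known != node_name:
            raise AddressConflict(
                f"{address} was node {known}, now answers as node {node_name}"
            )


client = NaiveClient()
client.register("elb.example.internal", "A")      # step 1: ELB routes to node A
client.register("10.0.0.2", "B")                  # step 2: direct connection to B
try:
    client.register("elb.example.internal", "B")  # step 3: ELB now routes to B
    conflict = None
except AddressConflict as exc:
    conflict = str(exc)
print(conflict)
```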

Solution

There are a number of possible solutions, some of which are outlined below. Which one you choose will depend on your use case and requirements. Note that all of these solutions revolve around adding and maintaining a list of IPs in DNS (e.g. AWS Route 53) to make use of the round-robin DNS feature in Aerospike clients.

You may, for example, find that putting the clients in AWS and allowing them to talk to the Aerospike servers on private IPs is preferable, in which case you can use one of the methods for private IPs.

Or you may find that setting up a VPN from your VPC to the local network where the clients live is easier than coding a Lambda function for the “public IP approach” (or vice versa). The options, while probably not exhaustive, are there to provide information and ideas so you can choose the best approach for your use case.

If using VPCs with internal subnets only (i.e. client can access service via internal IPs)

Note: For this to be true, the clients must live inside the VPC, have routing to that VPC configured inside AWS, or be connected to the VPC using AWS VPN so as to appear to be on the local network.

Using Route 53 with health checks
  1. Configure a small VPC / subnet to be used with the autoscaling group, say the expected size of the cluster + 20 IPs.
  2. Create a health check in Route 53 for each IP in the VPC / subnet, testing TCP port 3000.
  3. Add all the IPs to a DNS record in a Route 53 hosted zone, set the record to be of type “Weighted” and enable the health check for the record.
  4. Use the DNS record as the seed node.

The result is a DNS record containing all the IPs in your pool, of which only those reachable on TCP port 3000 are advertised in DNS queries (the health check takes care of this). Any client querying the DNS record for a seed node will therefore only ever get the current list of active IPs.

Full instructions on Route 53 DNS and health checks can be found in the AWS documentation, starting with the Route 53 Developer Guide.
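
As a rough illustration of steps 2 and 3, the boto3 sketch below creates one TCP health check and one weighted A record per IP. The helper names, the hosted zone ID, the record name, the weight, and the TTL are all assumptions chosen for the example. It also assumes the Route 53 health checkers can reach the addresses on port 3000, which should be verified for your network layout.

```python
import uuid


def health_check_config(ip, port=3000):
    """HealthCheckConfig for a TCP check against one IP."""
    return {"IPAddress": ip, "Port": port, "Type": "TCP"}


def weighted_record_change(record_name, ip, health_check_id, ttl=60):
    """One UPSERT change for a weighted A record tied to a health check."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": ip,   # must be unique per weighted record
            "Weight": 1,           # equal weight for every node (assumption)
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
            "HealthCheckId": health_check_id,
        },
    }


def publish_pool(hosted_zone_id, record_name, ips):
    """Create one health check and one weighted record per IP (calls AWS)."""
    import boto3  # imported here so the helpers above work without boto3

    route53 = boto3.client("route53")
    changes = []
    for ip in ips:
        check = route53.create_health_check(
            CallerReference=str(uuid.uuid4()),
            HealthCheckConfig=health_check_config(ip),
        )
        changes.append(
            weighted_record_change(record_name, ip, check["HealthCheck"]["Id"])
        )
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch={"Changes": changes}
    )
```

A call might look like `publish_pool("ZEXAMPLE", "seed.example.com.", ["10.0.1.10", "10.0.1.11"])`, where both the zone ID and the record name are placeholders.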

Using a static IP list in DNS, bearing first connection timeouts

This method is very similar to the one above, except that no health checks are used. In this case, the client tries to connect to each IP in the order presented by the DNS server until it finds one where it can connect to port 3000. It then uses that IP as the seed.

The drawback of this method is that the client may sit through a long timeout period when first connecting. Once it finds a seed node, it works as expected, but the initial connection from a disconnected client might take a while. The upside is that this removes the complexity of adding Route 53 health checks to the DNS records.

This is made possible by Aerospike supporting round-robin DNS for finding the seed node. More on this is available in this knowledge base article.
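
To get a feel for the cost of dead addresses, the sketch below walks a DNS answer the same way: it tries each IP in turn and returns the first one that accepts a TCP connection on port 3000. It is a stand-alone approximation using Python's `socket` module, not the Aerospike client's code; `resolve_all`, `first_reachable`, and the 1-second timeout are assumptions made for the example.

```python
import socket


def resolve_all(hostname, port=3000):
    """Return every IPv4 address the DNS record resolves to, in answer order."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for info in infos:
        ip = info[4][0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def first_reachable(addresses, port=3000, timeout=1.0,
                    connect=socket.create_connection):
    """Try each address in turn; return the first that accepts a TCP connection."""
    for ip in addresses:
        try:
            conn = connect((ip, port), timeout)
        except OSError:
            continue  # dead address: costs up to `timeout` seconds, then move on
        conn.close()
        return ip
    return None
```

With N stale addresses ahead of the first live one, the worst-case delay before a seed is found is roughly N times the timeout, which is the waiting period described above.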

If client must access service via public IPs (EIP - elastic IP)

Using CloudWatch alarms with Lambda
  1. Create a CloudWatch alarm on autoscaling::instance_state_changes, with your Lambda code as the target.
  2. Create Lambda code that checks the autoscaling group's instance list and updates the Route 53 DNS record with the list of public IPs for the Aerospike server autoscaling group.
  3. As an added bonus, the Lambda code may also add and remove health checks, so that there is a health check for every IP. This ensures that only nodes with a live port 3000 (Aerospike running) are presented to the querying client.

Pseudo-code example for point 2 above, using boto3 references:

For each instance in boto3.client('autoscaling').describe_auto_scaling_instances()['AutoScalingInstances']:
	if instance['AutoScalingGroupName'] matches "aerospike":
		Add boto3.resource('ec2').Instance(instance['InstanceId']).public_ip_address to pubIPlist[]
Run boto3.client('route53').change_resource_record_sets() with pubIPlist[] to update the record set
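
The pseudo-code above might translate into a Lambda handler along the following lines. This is a minimal sketch under stated assumptions: the hosted zone ID `ZEXAMPLE`, the record name `seed.example.com.`, the substring match on the group name, and the helper names are all placeholders, and pagination of the autoscaling response is ignored for brevity.

```python
def group_public_ips(autoscaling, ec2, group_match="aerospike"):
    """Collect public IPs of instances whose Auto Scaling group name matches."""
    response = autoscaling.describe_auto_scaling_instances()
    instance_ids = [
        item["InstanceId"]
        for item in response["AutoScalingInstances"]
        if group_match in item["AutoScalingGroupName"]
    ]
    if not instance_ids:
        return []
    ips = []
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            ip = instance.get("PublicIpAddress")  # absent if no public IP
            if ip:
                ips.append(ip)
    return sorted(ips)


def seed_record_batch(record_name, ips, ttl=60):
    """ChangeBatch that replaces the seed A record with the given IP list."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip} for ip in ips],
            },
        }]
    }


def lambda_handler(event, context):
    """Entry point: refresh the DNS record from the current group membership."""
    import boto3  # available in the AWS Lambda Python runtime

    ips = group_public_ips(boto3.client("autoscaling"), boto3.client("ec2"))
    if ips:  # skip the update when there is nothing to publish
        boto3.client("route53").change_resource_record_sets(
            HostedZoneId="ZEXAMPLE",  # hypothetical zone ID
            ChangeBatch=seed_record_batch("seed.example.com.", ips),
        )
    return ips
```

Passing the AWS clients into `group_public_ips` as arguments keeps the lookup logic testable without AWS credentials.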

More information on this solution can be found on the AWS website at: https://aws.amazon.com/blogs/compute/building-a-dynamic-dns-for-route-53-using-cloudwatch-events-and-lambda/

Source code from awslabs for this solution (not maintained nor supported by Aerospike) can be found here: https://github.com/awslabs/route53-dynamic-dns-with-lambda

Using a startup/shutdown script in the instances themselves

If you are using public IPs with AWS, you probably already have a customization script that runs on image creation and/or at startup to configure the access-address of the node, which makes this approach fit well into your current flow. If you do not have such a customization, proceed as follows:

When you create the master image for the autoscaling group to use, with all software preinstalled, also create a startup/shutdown script that does the following:

  1. On start:
    • get the instance's own public IP from the instance metadata address (e.g. curl http://169.254.169.254/latest/meta-data/public-ipv4 from the instance itself). Save the public IP to a file (e.g. /tmp/public_ip)
    • wait for port 3000 to be bound (bash example: ret=1; while [ $ret -ne 0 ]; do nc -z -w1 localhost 3000; ret=$?; sleep 1; done)
    • once the port is open, add the public IP to the Route 53 DNS record (either by calling Lambda, using boto3 or simply using the aws cli)
  2. On stop:
    • read the public IP from the file (e.g. /tmp/public_ip)
    • remove the public IP from the Route 53 DNS record (either by calling Lambda, using boto3 or simply using the aws cli)
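
The start and stop steps could be scripted roughly as follows. This is a sketch rather than a tested production script: `register_ip` and `deregister_ip` are hypothetical callbacks standing in for whichever DNS update mechanism you choose, and the metadata request assumes the instance answers plain (IMDSv1-style) requests; instances that enforce session tokens would need an extra step.

```python
import socket
import time
import urllib.request

# EC2 instance metadata endpoint for the public IPv4 address (IMDSv1-style request).
METADATA_URL = "http://169.254.169.254/latest/meta-data/public-ipv4"


def fetch_public_ip(url=METADATA_URL, timeout=2.0):
    """Ask the instance metadata service for this instance's public IP."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode().strip()


def wait_for_port(host="localhost", port=3000, interval=1.0, max_attempts=None):
    """Poll until the port accepts TCP connections; True once it does."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            socket.create_connection((host, port), timeout=1.0).close()
            return True
        except OSError:
            time.sleep(interval)
    return False


def on_start(register_ip, ip_file="/tmp/public_ip"):
    """Start hook: save the public IP, wait for Aerospike, then publish the IP."""
    ip = fetch_public_ip()
    with open(ip_file, "w") as handle:
        handle.write(ip)
    wait_for_port()
    register_ip(ip)  # hypothetical callback: Lambda call, boto3, or aws cli


def on_stop(deregister_ip, ip_file="/tmp/public_ip"):
    """Stop hook: read the saved IP back and remove it from DNS."""
    with open(ip_file) as handle:
        deregister_ip(handle.read().strip())
```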

Combine the 2 methods above for best results when using EIPs:
  1. Use a startup script to add the IP to DNS (this way the IP is only added once Aerospike is running, not when the instance starts).
  2. Monitor port 3000 availability and remove the IP from DNS should port 3000 be unavailable for X seconds (either by adding health checks in the step above or with a simple looped script running in the background every X seconds/minutes).
  3. Use a CloudWatch alarm on autoscaling::instance_state_changes with Lambda to remove the public IPs of nodes which have been terminated (as this covers force-termination as well).
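
The "unavailable for X seconds" rule in step 2 can be captured in a small helper that a background loop calls on every pass. The class below is one possible sketch: `PortWatch`, the injected `probe` callable, and the grace period are assumptions made for the example, and the actual DNS removal is left to the caller.

```python
import time


class PortWatch:
    """Tracks how long a port has been continuously unavailable."""

    def __init__(self, probe, grace_seconds, clock=time.monotonic):
        self.probe = probe                  # callable returning True if the port is open
        self.grace_seconds = grace_seconds  # the "X seconds" from the step above
        self.clock = clock
        self.down_since = None

    def should_remove(self):
        """Run one probe; True once the port has been down for the grace period."""
        if self.probe():
            self.down_since = None          # healthy again: reset the timer
            return False
        now = self.clock()
        if self.down_since is None:
            self.down_since = now           # first failed probe starts the timer
        return now - self.down_since >= self.grace_seconds
```

A background script would call `should_remove()` every few seconds and remove the IP from the DNS record the first time it returns True. Resetting the timer on any successful probe avoids removing a node over a single missed check.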

Upcoming client feature

Aerospike client libraries will allow using an ELB as a seed node in upcoming releases (now available; see “Update on ELB support” above).

Keywords

SEED ELB LOADBALANCER ROUTE53 AUTOSCALING AWS DNS RDS CLOUDWATCH LAMBDA

Timestamp

5/9/18