Why is tip-clear not working when DNS is used in the mesh seed list configuration?

Problem Description

I still see the warning "could not create heartbeat connection to node" for the IP address of a node that is no longer in the cluster, even though I issued the tip-clear command.

Details

Removing a node can trigger the following messages in the Aerospike server logs:

Aug 07 2019 01:45:34 GMT: WARNING (socket): (socket.c:959) (repeated:20) Error while connecting socket to 10.0.88.238:3002
Aug 07 2019 01:45:34 GMT: WARNING (hb): (hb.c:4882) (repeated:20) could not create heartbeat connection to node {10.0.88.238:3002}

In such cases, the recommended fix is to run tip-clear to remove the IP address from the mesh-mode heartbeats.
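To see which addresses the heartbeat subsystem is still trying to reach, the warnings can be pulled out of the log. The following is a minimal sketch, not an Aerospike tool: the function name stale_heartbeat_targets is hypothetical, and the log path in the usage comment is an assumption that depends on your logging configuration. It relies only on the message format shown above.

```shell
# Print each distinct "IP:port" that appears in a
# "could not create heartbeat connection to node {...}" warning.
# $1 = path to the Aerospike log file.
stale_heartbeat_targets() {
    sed -n 's/.*could not create heartbeat connection to node {\([^}]*\)}.*/\1/p' "$1" | sort -u
}

# Usage (the log path is an assumption, adjust to your setup):
#   stale_heartbeat_targets /var/log/aerospike/aerospike.log
```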

However, there are situations in which the tip-clear command does not get rid of those messages, for example:

asinfo -v 'tip-clear:host-port-list=10.0.88.238:3002'
error: 0 cleared, 1 not found

The server log then contains the following messages, indicating that the IP address does not exist as a seed:

Aug 07 2019 01:57:11 GMT: INFO (info): (thr_info.c:827) tip clear command received: params host-port-list=10.0.88.238:3002
Aug 07 2019 01:57:11 GMT: WARNING (info): (thr_info.c:883) seed node 10.0.88.238:3002 does not exist

In such situations, check whether the mesh-seed-address-port configuration uses an IP address or a hostname. If it uses a DNS name, try that name in the tip-clear command:

asinfo -v 'tip-clear:host-port-list=node3.example.com:3002'
ok
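A quick way to check how the seeds were specified is to list them from the configuration file. This is only a sketch: the function name list_mesh_seeds is hypothetical, and the configuration path in the usage comment is an assumption (the usual default location).

```shell
# Print "HOST PORT" for every mesh-seed-address-port line in a config file.
# Commented-out lines are skipped because their first field is "#".
# $1 = path to the Aerospike configuration file.
list_mesh_seeds() {
    awk '$1 == "mesh-seed-address-port" { print $2, $3 }' "$1"
}

# Usage (the path is an assumption, adjust to your installation):
#   list_mesh_seeds /etc/aerospike/aerospike.conf
```

Any entry whose host part is a name rather than an IP address should be passed to tip-clear by that name.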

However, if the server log still contains the warnings after the tip-clear command succeeds, the IP address behind the DNS name might have changed:

Aug 07 2019 01:58:07 GMT: INFO (info): (thr_info.c:827) tip clear command received: params host-port-list=node3.example.com:3002
Aug 07 2019 01:58:07 GMT: INFO (info): (thr_info.c:900) tip clear command executed: cleared 1, params host-port-list=node3.example.com:3002
Aug 07 2019 01:58:09 GMT: WARNING (socket): (socket.c:959) (repeated:20) Error while connecting socket to 10.0.88.238:3002

Explanation

This happens when the heartbeat seed nodes were specified by DNS name (FQDN) and the IP address behind the name has since changed, either because the removed node's address changed or because a new node was added with the same DNS name but a different IP. Because Aerospike currently resolves the DNS name only once, the old IP address remains in use for heartbeats and tip-clear does not behave as expected.
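To confirm this diagnosis, compare what the name resolves to now with the IP address in the warnings. The helper below is a sketch under assumptions: has_address is a hypothetical name, and it expects input in the "ADDRESS NAME..." format printed by getent hosts on Linux.

```shell
# Succeed if any line on stdin has the given address in its first column.
# Input format: "ADDRESS NAME [ALIAS...]", as printed by `getent hosts NAME`.
# $1 = IP address to look for (for example, the one in the log warnings).
has_address() {
    awk -v ip="$1" '$1 == ip { found = 1 } END { exit !found }'
}

# Usage: a non-zero exit status means the name no longer maps to the old IP.
#   getent hosts node3.example.com | has_address 10.0.88.238
```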

Solution

There is a workaround to stop the warnings for the old IP address in aerospike.log. On the host that logs the warnings, temporarily add an entry to /etc/hosts that maps the DNS name to the old IP address, and re-run tip-clear using that DNS name.

Then clean up the hosts file after the tip-clear command has run successfully.

Example:

  1. Save a copy of /etc/hosts:
cp /etc/hosts /etc/hosts.orig
  2. Add the following line to /etc/hosts:
10.0.88.238  node3.example.com
  3. Then run on each node:
asinfo -v 'tip-clear:host-port-list=node3.example.com:3002'
  4. Restore the original /etc/hosts:
cp /etc/hosts.orig /etc/hosts
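The backup, edit and restore steps above can be wrapped in two small helpers so the hosts file is never left in a modified state by accident. This is a sketch, not an Aerospike-provided script: the function names are hypothetical, and the helpers take the hosts file path as a parameter so they can be tried on a copy first.

```shell
# Back up a hosts file to FILE.orig and append a temporary "IP  NAME" entry.
# $1 = hosts file, $2 = old IP address, $3 = DNS name
add_temp_hosts_entry() {
    cp -p "$1" "$1.orig" || return 1
    printf '%s  %s\n' "$2" "$3" >> "$1"
}

# Copy the backup made by add_temp_hosts_entry back over the hosts file.
# $1 = hosts file
restore_hosts() {
    cp -p "$1.orig" "$1"
}

# Intended sequence on a node that logs the warnings (run as root):
#   add_temp_hosts_entry /etc/hosts 10.0.88.238 node3.example.com
#   asinfo -v 'tip-clear:host-port-list=node3.example.com:3002'
#   restore_hosts /etc/hosts
```

The append assumes the hosts file ends with a newline, which is the normal case.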

Internal Jira AER-5852 tracks enhancements intended to make this workaround unnecessary. Check the release notes of upcoming releases for any change.

Notes

A related risk is that a new server is later assigned the old IP address and then accidentally joins the wrong cluster. Setting cluster-name in the Aerospike configuration file (/etc/aerospike/aerospike.conf by default) is recommended to prevent a node from joining the wrong cluster. For further details, refer to the article "How to divide a mesh cluster".
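As a quick check that a node has cluster-name set, the value can be read from its configuration file. This is a sketch only: configured_cluster_name is a hypothetical helper, and the path in the usage comment is an assumption.

```shell
# Print the cluster-name value set in an Aerospike config file.
# Exits non-zero when no cluster-name line is present.
# $1 = path to the Aerospike configuration file.
configured_cluster_name() {
    awk '$1 == "cluster-name" { print $2; found = 1 } END { exit !found }' "$1"
}

# Usage (the path is an assumption, adjust to your installation):
#   configured_cluster_name /etc/aerospike/aerospike.conf || echo "cluster-name is not set"
```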

Keywords

TIP-CLEAR DNS

Timestamp

September 2019