Minimal Heartbeat

Hi, I am using Aerospike Server 3.7.2.

This is my heartbeat configuration:

heartbeat {
    mode mesh
    address 192.168.231.16
    port 3002 # Heartbeat port for this node.

    mesh-seed-address-port 192.168.231.18 3002

    interval 100 # Interval in milliseconds at which heartbeats are sent.
    timeout 5 # Number of missing heartbeats after which the remote node will be declared dead.
}

In my scenario I have 2 nodes, and I’m taking one of them down. It seems to take about 1.2 seconds until the other node’s “friend-list” is updated and the downed node is evicted. I suspect that during this time the heartbeat socket is hung and is released only after a socket timeout. If that is correct, what timeout is configured on the heartbeat sockets? Is it configurable?

It depends on where you’re measuring this change. The time it takes the client to notice a cluster change depends on cluster tending, the thread that checks the state of the cluster every second. Is that how you’re getting the information?

The heartbeat socket timeout is calculated based on your interval. It ends up being shorter than the interval.

One other thing to consider is how you’re ‘taking down the node’. Shutting down the daemon (asd) is one way. If you’re simply getting in the way of the mesh heartbeat, be aware that once a node is added to the cluster there is also a fabric back channel established, and even if the heartbeats are missed the cluster still tries to talk to the node through the fabric. This may extend the time until the node is out of the cluster.

First of all, my client’s tend thread is running every 250ms.

The configuration of the heartbeat is what I attached:

interval 100
timeout 5

When I take down a node, it takes the client about 1.5 seconds until it is aware of that.

Using some debugging tools, I found that I get 4-5 “failed” tend cycles before the surviving node sends me a friend-list without the other node.

The client’s code will not remove the bad node until the good node sends it a friend-list without the bad node. So, to sum up, I want to detect that a node went down in less than 1 second.
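To illustrate how the tend interval and the friend-list update time combine, here is a rough sketch in Python. It is only a toy model, not the client's actual code: the function name and the `first_tend_offset_ms` parameter are made up for illustration, and it assumes the node goes down at t = 0, the good node starts sending the updated friend-list at a fixed time, and tend cycles run at a perfectly regular interval.

```python
def failed_tend_cycles(tend_interval_ms, friend_list_update_ms, first_tend_offset_ms):
    """Toy model (hypothetical helper, not part of any Aerospike client).

    The bad node goes down at t = 0. Tend cycles run at
    first_tend_offset_ms, first_tend_offset_ms + tend_interval_ms, ...
    A cycle "fails" if it runs before the good node's friend-list
    has dropped the bad node.

    Returns (number_of_failed_cycles, time_of_first_successful_cycle_ms).
    """
    remaining = friend_list_update_ms - first_tend_offset_ms
    if remaining <= 0:
        # The very first tend already sees the updated friend-list.
        return 0, first_tend_offset_ms
    # Integer ceiling of remaining / tend_interval_ms.
    failed = -(-remaining // tend_interval_ms)
    return failed, first_tend_offset_ms + failed * tend_interval_ms


# Tend every 250 ms, friend-list updated ~1200 ms after the node went down.
print(failed_tend_cycles(250, 1200, 250))  # (4, 1250)
print(failed_tend_cycles(250, 1200, 50))   # (5, 1300)
```

Depending on where the failure falls within the tend cycle, this model gives 4 or 5 failed cycles, which is consistent with what I see. It also suggests that the delay comes from when the friend-list changes, not from the tend interval itself.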

For some reason, a friend-list without the bad node is sent to me only after ~1.2 seconds, even though the heartbeat timeout is configured for 500 ms (100 * 5).
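For reference, the arithmetic I am basing my expectation on can be written out as a small Python sketch. The helper names are hypothetical, and the model is deliberately naive: it assumes the node is declared dead exactly `interval * timeout` milliseconds after it stops sending heartbeats, and that the client learns about it on its next tend cycle at the latest. It ignores anything else the server may do before publishing the new friend-list, such as the fabric communication mentioned above.

```python
def heartbeat_window_ms(interval_ms, timeout_count):
    # Time for `timeout_count` consecutive heartbeats to be missed,
    # assuming one heartbeat every `interval_ms` milliseconds.
    return interval_ms * timeout_count


def naive_client_detection_ms(interval_ms, timeout_count, tend_interval_ms):
    # Naive upper bound: heartbeat window plus one full tend interval.
    return heartbeat_window_ms(interval_ms, timeout_count) + tend_interval_ms


expected_server_ms = heartbeat_window_ms(100, 5)             # 500
expected_client_ms = naive_client_detection_ms(100, 5, 250)  # 750
observed_server_ms = 1200  # approximate time until the friend-list changes

print(expected_server_ms, expected_client_ms)
print("unexplained gap (ms):", observed_server_ms - expected_server_ms)  # 700
```

Under these assumptions I would expect the client to notice within roughly 750 ms, so about 700 ms on the server side is not accounted for by the heartbeat settings alone.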